
CN117546248A - Base detection using multiple base detector model - Google Patents


Info

Publication number
CN117546248A
Authority
CN
China
Prior art keywords
base
classification information
score
detector
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280044106.4A
Other languages
Chinese (zh)
Inventor
G·D·帕纳比
M·D·哈姆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inmair Ltd
Original Assignee
Inmair Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/876,528 external-priority patent/US20230041989A1/en
Application filed by Inmair Ltd filed Critical Inmair Ltd
Priority claimed from PCT/US2022/039208 external-priority patent/WO2023014741A1/en
Publication of CN117546248A publication Critical patent/CN117546248A/en

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method of base detection using at least two base detectors is disclosed. The method comprises the following steps: executing at least a first base detector and a second base detector on sensor data generated for a sensing cycle of a series of sensing cycles; generating, by the first base detector, first classification information associated with the sensor data based on executing the first base detector on the sensor data; and generating, by the second base detector, second classification information associated with the sensor data based on executing the second base detector on the sensor data. In one example, final classification information is generated based on the first classification information and the second classification information, wherein the final classification information includes one or more base detections for the sensor data.
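The combination step described in the abstract can be sketched as follows: two base detectors each produce per-base classification scores for the same sensor data, and a final call is derived from both. The function and variable names and the weighted-average rule are illustrative assumptions, not taken from the disclosure, which also contemplates richer, context-dependent combination schemes.

```python
# Hypothetical sketch: combine classification information from two base
# detectors into a final base call via a weighted average of their scores.

BASES = ("A", "C", "G", "T")

def combine_classifications(scores_1, scores_2, weight_1=0.5):
    """scores_1, scores_2: dicts mapping each base to a confidence score
    from the first and second base detector, respectively.
    Returns (final_base, final_score) from the weighted combination."""
    combined = {b: weight_1 * scores_1[b] + (1.0 - weight_1) * scores_2[b]
                for b in BASES}
    final_base = max(combined, key=combined.get)  # argmax over combined scores
    return final_base, combined[final_base]
```

For example, if the first detector favors A (0.7) and the second favors C (0.6), equal weighting still calls A, because A's combined score (0.45) exceeds C's (0.35).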

Description

Base detection using multiple base detector model
Priority application
The present application claims priority from U.S. non-provisional patent application No. 17/876,528 (attorney docket No. ILLM 1021-2/IP-1856-US), entitled "Base Calling Using Multiple Base Caller Models", filed on July 28, 2022, which in turn claims the benefit of U.S. provisional patent application No. 63/228,954 (attorney docket No. ILLM 1021-1/IP-1856-PRV), entitled "Base Calling Using Multiple Base Caller Models", filed on August 3, 2021. The priority applications are hereby incorporated by reference for all purposes.
Technical Field
The disclosed technology relates to artificial intelligence type computers and digital data processing systems, as well as corresponding data processing methods and products for emulating intelligence (i.e., knowledge-based systems, inference systems, and knowledge acquisition systems), including systems for reasoning under uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the disclosed technology relates to using deep neural networks, such as deep convolutional neural networks, for analyzing data.
Incorporation of documents
The following documents are incorporated by reference as if fully set forth herein:
U.S. provisional patent application No. 62/979,384 (attorney docket No. ILLM 1015-1/IP-1857-PRV), filed on February 20, 2020, entitled "Artificial Intelligence-Based Base Calling of Index Sequences";
U.S. provisional patent application No. 62/979,414 (attorney docket No. ILLM 1016-1/IP-1858-PRV), filed on February 20, 2020, entitled "Artificial Intelligence-Based Many-to-Many Base Calling";
U.S. non-provisional patent application No. 16/825,987 (attorney docket No. ILLM 1008-16/IP-1693-US), filed on March 20, 2020, entitled "Training Data Generation for Artificial Intelligence-Based Sequencing";
U.S. non-provisional patent application No. 16/825,991 (attorney docket No. ILLM 1008-17/IP-1741-US), filed on March 20, 2020, entitled "Artificial Intelligence-Based Generation of Sequencing Metadata";
U.S. non-provisional patent application No. 16/826,126 (attorney docket No. ILLM 1008-18/IP-1744-US), filed on March 20, 2020, entitled "Artificial Intelligence-Based Base Calling";
U.S. non-provisional patent application No. 16/826,134 (attorney docket No. ILLM 1008-19/IP-1747-US), filed on March 20, 2020, entitled "Artificial Intelligence-Based Quality Scoring"; and
U.S. non-provisional patent application No. 16/826,168 (attorney docket No. ILLM 1008-20/IP-1752-PRV-US), filed on March 21, 2020, entitled "Artificial Intelligence-Based Sequencing".
Background
The subject matter discussed in this section should not be assumed to be prior art merely because it is mentioned in this section. Similarly, problems mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents various approaches, which may themselves also correspond to implementations of the claimed technology.
In recent years, rapid increases in computing power have allowed deep Convolutional Neural Networks (CNNs) to achieve great success in many computer vision tasks with significantly improved accuracy. In the inference phase, many applications demand low-latency processing of an image under stringent power consumption requirements, which reduces the efficiency of Graphics Processing Units (GPUs) and other general-purpose platforms and presents opportunities for specific acceleration hardware (e.g., Field Programmable Gate Arrays (FPGAs)) that tailors digital circuitry to deep learning algorithm inference. However, deploying CNNs on portable and embedded systems remains challenging due to large data volumes, intensive computation, varying algorithmic structures, and frequent memory accesses.
Since convolution contributes most of the operations in a CNN, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves Multiply-and-Accumulate (MAC) operations organized as four loop stages sliding along the kernel and the feature maps. The first loop stage computes the MAC of pixels within a kernel window. The second loop stage accumulates the sums of products of the MACs across the different input feature maps. After the first and second loop stages are completed, the final output element in the output feature map is obtained by adding a bias. The third loop stage slides the kernel window within an input feature map. The fourth loop stage generates a different output feature map.
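The four loop stages above can be written out directly as a reference loop nest. This is a minimal sketch for illustration (no padding, stride 1); the shapes and names are assumptions, not from the patent, and a hardware accelerator would tile, pipeline, and parallelize these loops rather than execute them naively.

```python
# Reference implementation of the four-stage convolution loop nest.
def conv2d(inputs, kernels, biases):
    """inputs:  [C_in][H][W] input feature maps
       kernels: [C_out][C_in][K][K] weights
       biases:  [C_out] per-output-map bias
       returns: [C_out][H-K+1][W-K+1] output feature maps"""
    c_in, h, w = len(inputs), len(inputs[0]), len(inputs[0][0])
    c_out, k = len(kernels), len(kernels[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co in range(c_out):                 # loop stage 4: each output feature map
        for oy in range(oh):                # loop stage 3: slide the kernel window
            for ox in range(ow):
                acc = 0.0
                for ci in range(c_in):      # loop stage 2: accumulate across input maps
                    for ky in range(k):     # loop stage 1: MAC within the kernel window
                        for kx in range(k):
                            acc += (inputs[ci][oy + ky][ox + kx]
                                    * kernels[co][ci][ky][kx])
                out[co][oy][ox] = acc + biases[co]  # bias added after stages 1 and 2
    return out
```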
FPGAs have gained increasing attention and popularity, especially for accelerating inference tasks, due to (1) their high reconfigurability, (2) faster development times compared to Application Specific Integrated Circuits (ASICs), keeping pace with the rapid evolution of CNNs, (3) good performance, and (4) superior energy efficiency compared to GPUs. The high performance and efficiency of FPGAs can be achieved by synthesizing circuits tailored for specific computations to directly handle billions of operations with a custom memory system. For example, hundreds to thousands of Digital Signal Processing (DSP) blocks on modern FPGAs support core convolution operations, such as multiplication and addition, with high parallelism. Dedicated data buffers between external off-chip memory and on-chip Processing Engines (PEs) can be designed to achieve a preferred data flow by configuring tens of megabytes of on-chip Block Random Access Memory (BRAM) on the FPGA chip.
An efficient data flow and hardware architecture for CNN acceleration are needed to minimize data communication while maximizing resource utilization for high performance. There is therefore an opportunity to devise methods and frameworks that accelerate the inference process of various CNN algorithms on acceleration hardware with high performance, efficiency, and flexibility.
Drawings
In the drawings, like reference characters generally refer to like parts throughout the different views. In addition, the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosed technology. In the following description, various implementations of the disclosed technology are described with reference to the following drawings, in which:
FIG. 1 illustrates a cross section of a biosensor that can be used in various embodiments.
Figure 2 shows one implementation of a flow cell containing clusters in its blocks.
Fig. 3 shows an exemplary flow cell with eight channels, and also shows an enlarged view of one block and its clusters and their surrounding background.
FIG. 4 is a simplified block diagram of a system for analyzing sensor data (such as base detection sensor output) from a sequencing system.
FIG. 5 is a simplified diagram illustrating aspects of a base detection operation that includes the functionality of a runtime program executed by a host processor.
Fig. 6 is a simplified diagram of a configuration of a configurable processor, such as the configurable processor of fig. 4.
Fig. 6A shows a system for performing a base detection operation on an original image output by a biosensor using two or more base detectors.
Fig. 7 is a diagram of a neural network architecture that may be implemented using a configurable or reconfigurable array configured as described herein.
Fig. 8A is a simplified illustration of the organization of blocks of sensor data used by the neural network architecture as in fig. 7.
Fig. 8B is a simplified illustration of a patch of a block of sensor data used by the neural network architecture as in fig. 7.
Fig. 9 shows a portion of the configuration of a neural network as in fig. 7 on a configurable or reconfigurable array, such as a Field Programmable Gate Array (FPGA).
Fig. 10 is a diagram of another alternative neural network architecture that may be implemented using a configurable or reconfigurable array configured as described herein.
FIG. 11 illustrates one implementation of a specialized architecture of a neural network-based base detector for isolating processing of data for different sequencing cycles.
FIG. 12 depicts one implementation of barrier layers, each of which may include convolutions.
FIG. 13A depicts one implementation of combined layers, each of which may include convolutions.
Fig. 13B depicts another implementation of combined layers, each of which may include a convolution.
FIG. 14 shows a base detection system including a plurality of base detectors to predict base detection of an unknown analyte including a base sequence.
FIGS. 15A, 15B, 15C, 15D and 15E illustrate corresponding flowcharts depicting various operations of the base detection system of FIG. 14 for a corresponding set of sensor data.
FIG. 16 illustrates a context information generation module of the base detection system of FIG. 14 that generates context information for an exemplary set of sensor data.
Fig. 17A shows a flow-through cell comprising blocks, which are categorized based on the spatial position of the blocks.
Fig. 17B shows a block of a flow cell comprising clusters, which clusters are categorized based on the spatial location of the clusters.
Fig. 17C shows an example of fading, in which signal intensity decreases as a function of cycle number over the course of a sequencing run in a base detection operation.
Figure 17D conceptually illustrates the decreasing signal-to-noise ratio as the sequencing cycle progresses.
FIG. 18 shows base detection accuracy (1-base detection error rate) for base detection homopolymers (e.g., GGGGG) and near homopolymers (e.g., GGTGG) for different exemplary base detector configurations.
FIG. 19A shows generation of a final base detection for a set of sensor data based on a function of first base detection classification information from the first base detector and second base detection classification information from the second base detector of the base detection system of FIG. 14.
Fig. 19A1 illustrates a look-up table (LUT) indicating an exemplary weighting scheme to be used for the final confidence score based on the temporal context information.
FIG. 19B shows a LUT indicating a base detector to be used when the detected base includes a specific base sequence.
FIG. 19C shows a LUT indicating a weight given to a confidence score of each base detector when the detected base includes a particular base sequence.
Fig. 19D shows a LUT indicating the operation of the base detection combination module of fig. 14 according to the detection of one or more bubbles in a cluster of flow cells.
Fig. 19D1 shows a LUT indicating the operation of the base detection combination module of fig. 14 according to detection of out-of-focus images from clusters of flow cells.
Fig. 19E shows a LUT indicating weights given to confidence scores of the respective base detectors based on the reagent sets used.
Fig. 19F shows a LUT indicating the operation of the base detection combining module of fig. 14 according to the spatial classification of the block.
Fig. 19G shows a LUT indicating the operation of the base detection combining module of fig. 14 according to the spatial classification of clusters.
FIG. 20A shows the LUT indicating the operation of the base detection combination module of FIG. 14 when (i) a particular base sequence is detected and (ii) a first detected base from a first base detector does not match a second detected base from a second base detector.
FIG. 20B shows a LUT indicating the operation of the base detection combination module of FIG. 14 when (i) a bubble is detected in a cluster and (ii) a first detected base from a first base detector does not match a second detected base from a second base detector.
FIG. 20C shows a LUT indicating the operation of the base detection combination module of FIG. 14 when (i) one or more out-of-focus images are detected from at least one cluster and (ii) a first detected base from a first base detector does not match a second detected base from a second base detector.
FIG. 20D shows a LUT indicating the operation of the base detection combination module of FIG. 14 when (i) the sensor data is from an edge cluster and (ii) the first detected base from the first base detector does not match the second detected base from the second base detector.
FIG. 21 shows a base detection system including a plurality of base detectors to predict base detection of an unknown analyte including a base sequence, wherein a neural network-based final base detection determination module determines a final base detection based on the output of one or more of the plurality of base detectors.
FIG. 22 is a block diagram of a base detection system according to one implementation.
Fig. 23 is a block diagram of a system controller that may be used in the system of fig. 22.
FIG. 24 is a simplified block diagram of a computer system that may be used to implement the disclosed techniques.
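The LUT-driven combination schemes illustrated by Figs. 19A1 through 19G can be sketched as a context key (e.g., a detected sequence motif, a bubble or out-of-focus flag, or a tile/cluster spatial classification) that selects per-detector weights for the confidence scores. The context keys and weight values below are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical sketch of a LUT mapping context information to per-detector
# weights, and of a final confidence score computed from those weights.

WEIGHT_LUT = {
    # context key -> (weight for detector 1, weight for detector 2)
    "homopolymer":  (0.3, 0.7),   # assumed: detector 2 handles e.g. GGGGG better
    "edge_cluster": (0.8, 0.2),   # assumed: detector 1 handles edge clusters better
    "default":      (0.5, 0.5),
}

def lut_weights(context):
    """Look up per-detector weights for a context key, with a fallback."""
    return WEIGHT_LUT.get(context, WEIGHT_LUT["default"])

def final_score(context, score_1, score_2):
    """Weighted combination of the two detectors' confidence scores."""
    w1, w2 = lut_weights(context)
    return w1 * score_1 + w2 * score_2
```

A table-driven design like this keeps the combination policy separate from the detectors themselves, so the weighting rules can be tuned per instrument, reagent set, or flow cell region without retraining either detector.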
Detailed Description
As used herein, the term "polynucleotide" or "nucleic acid" refers to deoxyribonucleic acid (DNA); however, the skilled artisan will recognize that the systems and devices herein are also applicable to ribonucleic acid (RNA), where appropriate. It is understood that the term includes as equivalents analogs of DNA or RNA formed from nucleotide analogs. As used herein, the term also encompasses cDNA, i.e., complementary DNA or copy DNA produced from an RNA template, e.g., by the action of reverse transcriptase.
Single-stranded polynucleotide molecules sequenced by the systems and apparatus herein may originate in single-stranded form, such as DNA or RNA, or in double-stranded DNA (dsDNA) form (e.g., genomic DNA fragments, PCR and amplification products, etc.). Thus, a single stranded polynucleotide may be the sense or antisense strand of a double helix of the polynucleotide. Methods for preparing single stranded polynucleotide molecules suitable for use in the methods of the present disclosure using standard techniques are well known in the art. The precise sequence of the primary polynucleotide molecule is generally not critical to the present disclosure and may be known or unknown. Single-stranded polynucleotide molecules may represent genomic DNA molecules (e.g., human genomic DNA) that include introns and exon sequences (coding sequences), as well as non-coding regulatory sequences such as promoter and enhancer sequences.
In certain embodiments, the nucleic acid to be sequenced by using the present disclosure is immobilized on a substrate (e.g., a substrate within a flow cell or a substrate such as one or more beads on a flow cell, etc.). The term "immobilized" as used herein is intended to encompass direct or indirect, covalent or non-covalent bonding, unless otherwise indicated or clearly indicated by context. In certain embodiments, covalent attachment may be preferred, but generally all that is required is that the molecule (e.g., nucleic acid) remain immobilized or attached to the carrier under the conditions in which the carrier is intended to be used (e.g., in applications requiring nucleic acid sequencing).
As used herein, the term "solid support" (or "substrate" in some uses) refers to any inert substrate or matrix to which nucleic acids can be attached, such as, for example, glass surfaces, plastic surfaces, latex, dextran, polystyrene surfaces, polypropylene surfaces, polyacrylamide gels, gold surfaces, and silicon wafers. In many embodiments, the solid support is a glass surface (e.g., a planar surface of a flow cell channel). In certain embodiments, the solid support may comprise an inert substrate or matrix that has been "functionalized" for example by applying a layer or coating of an intermediate material that contains reactive groups that allow covalent attachment to molecules such as polynucleotides. By way of non-limiting example, such carriers may include polyacrylamide hydrogels supported on an inert substrate such as glass. In such embodiments, the molecule (polynucleotide) may be directly covalently attached to an intermediate material (e.g., a hydrogel), but the intermediate material itself may be non-covalently attached to a substrate or matrix (e.g., a glass substrate). Covalent attachment to a solid support should accordingly be construed as encompassing this type of arrangement.
As noted above, the present disclosure includes novel systems and devices for sequencing nucleic acids. It will be apparent to those skilled in the art that, depending on the context, reference herein to a particular nucleic acid sequence also refers to a nucleic acid molecule comprising such a nucleic acid sequence. Sequencing of the target fragment means that a time-sequential reading of the bases is established. The bases that are read need not be contiguous, although this is preferred, nor is it necessary to sequence every base on the entire fragment during sequencing. Sequencing can be performed using any suitable sequencing technique in which nucleotides or oligonucleotides are added sequentially to the free 3' hydroxyl groups, resulting in synthesis of the polynucleotide strand in the 5' to 3' direction. The nature of the added nucleotide is preferably determined after each nucleotide addition. Sequencing techniques using sequencing by ligation (where not every consecutive base is sequenced) and techniques such as massively parallel feature sequencing (MPSS) where bases are removed from the strand on the surface rather than added to it are also suitable for use with the systems and devices of the present disclosure.
In certain embodiments, the present disclosure employs sequencing-by-synthesis (SBS). In SBS, four fluorescently labeled modified nucleotides are used to sequence dense clusters (potentially millions of clusters) of amplified DNA present on the surface of a substrate (e.g., a flow cell). Various additional aspects regarding SBS processes and methods that may be used with the systems and devices herein are disclosed, for example, in WO 04018497, WO 04018493, and U.S. patent No. 7,057,026 (nucleotides), WO 05024010 and WO 06120433 (polymerase), WO 05065814 (surface attachment technology), and WO 9844151, WO 06064199, and WO 07010251, the contents of each of which are incorporated herein by reference in their entirety.
In a particular use of the systems/devices herein, a flow cell containing a nucleic acid sample for sequencing is placed in a suitable flow cell holder. The sample for sequencing may take the form of single molecules, single molecules amplified into clusters, or beads comprising nucleic acid molecules. The nucleic acid is prepared such that it comprises an oligonucleotide primer adjacent to an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides and DNA polymerase, etc., are flowed into/through the flow cell by a fluid flow subsystem (various embodiments of which are described herein). A single nucleotide may be added at a time, or the nucleotides used in the sequencing process may be specially designed with reversible termination properties so that each cycle of the sequencing reaction occurs simultaneously in the presence of all four labeled nucleotides (A, C, T, G). Where the four nucleotides are mixed together, the polymerase is able to select the correct base to incorporate, and each sequence is extended by a single base. In such uses of the system, the natural competition among all four alternatives yields higher accuracy than methods in which only one nucleotide is present in the reaction mixture (where most of the sequences are therefore not exposed to the correct nucleotide). Sequences in which a particular base is repeated one after another (e.g., homopolymers) are addressed like any other sequence and with high accuracy.
The fluid flow subsystem also flows the appropriate reagents to remove the blocked 3' end (if appropriate) and fluorophore from each incorporated base. The substrate may be exposed to a second round of four blocked nucleotides, or optionally to a second round with a different single nucleotide. Such a cycle is then repeated and the sequence of each cluster is read over multiple chemical cycles. The computer aspects of the present disclosure may optionally align sequences collected from each single molecule, cluster, or bead to determine the sequence of longer polymers, and the like. Alternatively, the image processing and alignment may be performed on separate computers.
The heating/cooling components of the system regulate the reaction conditions within the flow cell channels and reagent storage areas/containers (and optionally cameras, optics, and/or other components) while the fluid flow components allow the substrate surface to be exposed to the appropriate reagents for incorporation (e.g., the appropriate fluorescently labeled nucleotides to be incorporated) while the unincorporated reagents are rinsed away. An optional movable stage on which the flow cell is placed allows the flow cell to enter the correct orientation for laser (or other light) excitation of the substrate and optionally to move relative to the lens objective to allow different areas of the substrate to be read. In addition, other components of the system (e.g., camera, lens objective, heater/cooler, etc.) are also optionally movable/adjustable. During laser excitation, an image/position of the fluorescence emitted from the nucleic acid on the substrate is captured by the camera component, thereby recording the kind of first base of each individual molecule, cluster or bead in the computer component.
The embodiments described herein may be used in a variety of biological processes and systems or chemical processes and systems for academic or commercial analysis. More specifically, the embodiments described herein may be used in a variety of processes and systems where it is desirable to detect events, attributes, qualities, or features indicative of a desired reaction. For example, embodiments described herein include cartridges, biosensors and their components, as well as bioassay systems operating with cartridges and biosensors. In certain embodiments, the cartridge and biosensor comprise a flow cell and one or more sensors, pixels, photodetectors, or photodiodes coupled together in a substantially unitary structure.
The following detailed description of certain embodiments will be better understood when read in conjunction with the following drawings. To the extent that the figures illustrate diagrams of the functional blocks of various embodiments, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., processors or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or random access memory, hard disk, or the like). Similarly, the program may be a stand alone program, may be incorporated as a subroutine into an operating system, may be a function in an installed software package, or the like. It should be understood that the various embodiments are not limited to the arrangements and instrumentality shown in the drawings.
As used herein, an element or step recited in the singular and proceeded with the word "a" or "an" should be understood as not excluding plural said elements or steps, unless such exclusion is explicitly stated. Furthermore, references to "one embodiment" are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. Furthermore, unless explicitly stated to the contrary, embodiments "comprising" or "having" or "including" one or more elements having a particular property may include additional elements whether or not they have that property.
As used herein, a "desired reaction" includes a change in at least one of a chemical, electrical, physical, or optical property (or mass) of an analyte of interest. In certain embodiments, the desired response is a positive binding event (e.g., a fluorescently labeled biomolecule binds to an analyte of interest). More generally, the desired reaction may be a chemical transformation, a chemical change, or a chemical interaction. The desired response may also be a change in an electrical property. For example, the desired reaction may be a change in the concentration of ions in a solution. Exemplary reactions include, but are not limited to, chemical reactions such as reduction, oxidation, addition, elimination, rearrangement, esterification, amidation, etherification, cyclization, or substitution; binding interactions of the first chemical species with the second chemical species; a dissociation reaction in which two or more chemical substances are separated from each other; fluorescence; emitting light; bioluminescence; chemiluminescence; and biological reactions such as nucleic acid replication, nucleic acid amplification, nucleic acid hybridization, nucleic acid ligation, phosphorylation, enzyme catalysis, receptor binding or ligand binding. The desired reaction may also be the addition or elimination of protons, for example, detectable as a pH change of the surrounding solution or environment. An additional desired reaction may be to detect ion flow across a membrane (e.g., a natural or synthetic bilayer membrane), for example, when ions flow through the membrane, the current is interrupted and the interruption may be detected.
In certain embodiments, the desired reaction comprises binding a fluorescently labeled molecule to the analyte. The analyte may be an oligonucleotide and the fluorescently labeled molecule may be a nucleotide. When excitation light is directed to an oligonucleotide having a labeled nucleotide and the fluorophore emits a detectable fluorescent signal, the desired reaction can be detected. In alternative embodiments, the fluorescence detected is a result of chemiluminescence or bioluminescence. The desired reaction may also be, for example, an increase in fluorescence (or Förster) resonance energy transfer (FRET), a decrease in FRET by separating the donor and acceptor fluorophores, an increase in fluorescence by separating a quencher from a fluorophore, or a decrease in fluorescence by co-locating a quencher and a fluorophore.
As used herein, "reaction component" or "reactant" includes any substance that can be used to obtain the desired reaction. For example, reaction components include reagents, enzymes, samples, other biomolecules, and buffers. The reactive components may be generally delivered to and/or immobilized at reactive sites in the solution. The reaction component may interact directly or indirectly with another substance, such as an analyte of interest.
As used herein, the term "reaction site" is a localized region where a desired reaction can occur. The reaction sites may include a support surface of a substrate on which the substance may be immobilized. For example, the reaction site may comprise a substantially planar surface in a channel of a flow cell, the surface having a population of nucleic acids thereon. Typically, but not always, the nucleic acids in the population have identical sequences, such as cloned copies of single-stranded or double-stranded templates. However, in some embodiments, the reaction site may comprise only a single nucleic acid molecule, e.g., single-stranded or double-stranded. Furthermore, the plurality of reaction sites may be unevenly distributed along the support surface or arranged in a predetermined manner (e.g., arranged side-by-side in a matrix, such as in a microarray). The reaction sites may also include reaction chambers (or wells) that at least partially define a spatial region or volume configured to separate a desired reaction.
The terms "reaction chamber" and "well" are used interchangeably herein. As used herein, the term "reaction chamber" or "well" includes a region of space in fluid communication with a flow channel. The reaction chamber may be at least partially isolated from the surrounding environment or other spatial regions. For example, a plurality of reaction chambers may be separated from one another by a common wall. As a more specific example, the reaction chamber may include a cavity defined by an inner surface of the well, and may have an opening or aperture such that the cavity may be in fluid communication with the flow channel. Biosensors comprising such reaction chambers are described in more detail in international application No. PCT/US2011/057111, filed on October 20, 2011, which is incorporated herein by reference in its entirety.
In some embodiments, the reaction chamber is sized and shaped relative to a solid (including semi-solid) such that the solid may be fully or partially inserted therein. For example, the reaction chamber may be sized and shaped to accommodate only one capture bead. The capture beads may have clonally amplified DNA or other material thereon. Alternatively, the reaction chamber may be sized and shaped to receive an approximate number of beads or solid substrates. As another example, the reaction chamber may also be filled with a porous gel or substance configured to control diffusion or filter fluid that may flow into the reaction chamber.
In some embodiments, a sensor (e.g., a photodetector or photodiode) is associated with a corresponding pixel region of the sample surface of the biosensor. Thus, a pixel region is a geometric configuration that represents the area of one sensor (or pixel) on the sample surface of the biosensor. The sensor associated with a pixel region detects the light emissions collected from that pixel region when a desired reaction occurs at a reaction site or reaction chamber overlying the pixel region. In a planar surface implementation, the pixel regions may overlap. In some cases, multiple sensors may be associated with a single reaction site or a single reaction chamber. In other cases, a single sensor may be associated with a set of reaction sites or a set of reaction chambers.
As used herein, a "biosensor" includes a structure having multiple reaction sites and/or reaction chambers (or wells). The biosensor may comprise a solid state imaging device (e.g., a CCD or CMOS imaging device) and optionally a flow cell mounted thereto. The flow cell may comprise at least one flow channel in fluid communication with the reaction sites and/or the reaction chamber. As a specific example, the biosensor is configured to be fluidly and electrically coupled to a biometric system. The bioassay system may deliver reactants to the reaction sites and/or reaction chambers according to a predetermined protocol (e.g., sequencing-by-synthesis) and perform a plurality of imaging events. For example, the bioassay system may direct the flow of solution along the reaction sites and/or the reaction chambers. At least one of the solutions may include four types of nucleotides having the same or different fluorescent labels. The nucleotides may bind to corresponding oligonucleotides located at the reaction site and/or the reaction chamber. The bioassay system may then illuminate the reaction sites and/or reaction chambers using an excitation light source (e.g., a solid state light source such as a Light Emitting Diode (LED)). The excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths. The excited fluorescent tag provides an emission signal that can be captured by the sensor.
In alternative embodiments, the biosensor may include electrodes or other types of sensors configured to detect other identifiable properties. For example, the sensor may be configured to detect a change in ion concentration. In another example, the sensor may be configured to detect ion current across the membrane.
As used herein, a "cluster" is a population of similar or identical molecules or nucleotide sequences or DNA strands. For example, a cluster may be an amplified oligonucleotide or any other set of polynucleotides or polypeptides having the same or similar sequence. In other embodiments, a cluster may be any element or group of elements that occupy a physical area on the sample surface. In embodiments, clusters are immobilized to reaction sites and/or reaction chambers during the base detection cycle.
As used herein, the term "immobilized" when used in reference to a biomolecule or biological or chemical substance includes substantially attaching the biomolecule or biological or chemical substance to a surface at the molecular level. For example, biomolecules or biological or chemical substances may be immobilized to the surface of a substrate material using adsorption techniques, including non-covalent interactions (e.g., electrostatic forces, van der Waals forces, and dehydration of hydrophobic interfaces) and covalent bonding techniques, wherein functional groups or linkers facilitate the attachment of the biomolecules to the surface. The immobilization of biomolecules or biological or chemical substances to the surface of a substrate material may be based on properties of the substrate surface, the liquid medium carrying the biomolecules or biological or chemical substances, and properties of the biomolecules or biological or chemical substances themselves. In some cases, the substrate surface may be functionalized (e.g., chemically or physically modified) to facilitate immobilization of biomolecules (or biological or chemical species) to the substrate surface. The substrate surface may first be modified to allow functional groups to bind to the surface. The functional group may then be bound to a biomolecule or biological or chemical substance to immobilize it thereon. The substance may be immobilized on the surface via a gel, for example, as in U.S. patent publication No. US2011/0059865 A1, which is incorporated herein by reference.
In some embodiments, the nucleic acid may be attached to a surface and amplified using bridge amplification. Useful bridge amplification methods are described, for example, in U.S. Patent No. 5,641,658; WO 2007/010251; U.S. Patent No. 6,090,592; U.S. Patent Publication No. 2002/0055100 A1; U.S. Patent No. 7,115,400; U.S. Patent Publication No. 2004/0096853 A1; U.S. Patent Publication No. 2004/0002090 A1; U.S. Patent Publication No. 2007/0128624 A1; and U.S. Patent Publication No. 2008/0009420 A1, each of which is incorporated herein in its entirety. Another useful method for amplifying nucleic acids on a surface is Rolling Circle Amplification (RCA), for example, using the methods set forth in further detail below. In some embodiments, the nucleic acid may be attached to a surface and amplified using one or more primer pairs. For example, one of the primers may be in solution and the other primer may be immobilized on the surface (e.g., 5'-attached). By way of example, a nucleic acid molecule may hybridize to one of the primers on the surface, after which the immobilized primer is extended to produce a first copy of the nucleic acid. The primer in solution then hybridizes to the first copy of the nucleic acid, and can be extended using the first copy as a template. Optionally, after the first copy of the nucleic acid is produced, the original nucleic acid molecule may hybridize to a second immobilized primer on the surface, and may be extended simultaneously with, or after, the primer extension in solution. In any embodiment, repeated rounds of extension (e.g., amplification) using the immobilized primers and the primers in solution provide multiple copies of the nucleic acid.
In certain embodiments, the assay protocols performed by the systems and methods described herein include the use of natural nucleotides and enzymes configured to interact with the natural nucleotides. Natural nucleotides include, for example, ribonucleotides (RNA) or Deoxyribonucleotides (DNA). The natural nucleotide may be in the form of a monophosphate, a diphosphate or a triphosphate, and may have a base selected from adenine (a), thymine (T), uracil (U), guanine (G) or cytosine (C). However, it is understood that non-natural nucleotides, modified nucleotides or analogs of the foregoing may be used. With respect to reversible terminator-based sequencing by synthetic methods, some examples of useful non-natural nucleotides are listed below.
In embodiments that include a reaction chamber, an article or solid substance (including semi-solid substances) may be disposed within the reaction chamber. When disposed, the article or solid may be physically held or secured within the reaction chamber by an interference fit, adhesion, or entrapment. Exemplary articles or solids that may be disposed within the reaction chamber include polymer beads, pellets, agarose gels, powders, quantum dots, or other solids that may be compressed and/or held within the reaction chamber. In particular embodiments, a nucleic acid superstructure (such as a DNA sphere) may be disposed in or at the reaction chamber, for example, by being attached to an inner surface of the reaction chamber or by resting in a liquid within the reaction chamber. DNA spheres or other nucleic acid superstructures can be preformed and then placed in or at the reaction chamber. Alternatively, the DNA spheres may be synthesized in situ at the reaction chamber. DNA spheres can be synthesized by rolling circle amplification to produce concatemers of specific nucleic acid sequences, and the concatemers can be subjected to conditions that form relatively compact spheres. DNA spheres and methods for their synthesis are described, for example, in U.S. Patent Publication Nos. 2008/0243360 A1 or 2008/0234136 A1, each of which is incorporated herein in its entirety. The substance held or disposed in the reaction chamber may be solid, liquid, or gaseous.
As used herein, "base detection" identifies a nucleotide base in a nucleic acid sequence. Base detection refers to the process of determining a base detection (A, C, G, T) for each cluster in a specific cycle. As an example, base detection may be performed using the four-channel, two-channel, or one-channel methods and systems described in the incorporated material of U.S. Patent Application Publication No. 2013/0079232. In certain embodiments, a base detection cycle is referred to as a "sampling event". In a one-dye, two-channel sequencing protocol, a sampling event includes two illumination phases in a time series, such that a pixel signal is generated at each phase. The first illumination phase induces illumination from a given cluster indicative of nucleotide bases A and T in an AT pixel signal, and the second illumination phase induces illumination from a given cluster indicative of nucleotide bases C and T in a CT pixel signal.
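The one-dye, two-channel encoding described above implies a simple decoding rule: T emits in both illumination phases, A only in the AT phase, C only in the CT phase, and G in neither. The sketch below illustrates that rule only; the function name, the hard threshold, and the use of raw scalar intensities are hypothetical simplifications (a production base detector operates on normalized image intensities and emits quality scores):

```python
def decode_base(at_signal: float, ct_signal: float, threshold: float = 0.5) -> str:
    """Toy decoder for the one-dye, two-channel scheme sketched above.

    T emits in both phases, A only in the AT phase, C only in the CT
    phase, and G in neither. Threshold is a hypothetical cutoff.
    """
    at_on = at_signal > threshold
    ct_on = ct_signal > threshold
    if at_on and ct_on:
        return "T"
    if at_on:
        return "A"
    if ct_on:
        return "C"
    return "G"

# A bright AT signal with a dark CT signal decodes as A.
assert decode_base(0.9, 0.1) == "A"
```

A real pipeline would replace the fixed threshold with per-cluster intensity normalization, but the two-bit structure of the encoding is the same.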
The disclosed techniques (e.g., the disclosed base detector) may be implemented on a processor such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Coarse Grain Reconfigurable Architecture (CGRA), an Application Specific Integrated Circuit (ASIC), a special instruction set processor (ASIP), and a Digital Signal Processor (DSP).
Biosensor
Fig. 1 shows a cross section of a biosensor 100 that may be used in various embodiments. The biosensor 100 has pixel regions 106', 108', 110', 112', and 114', which may each hold more than one cluster (e.g., 2 clusters per pixel region) during a base detection cycle. As shown, the biosensor 100 may include a flow cell 102 mounted to a sampling device 104. In the illustrated embodiment, the flow cell 102 is directly attached to the sampling device 104. However, in alternative embodiments, the flow cell 102 may be removably coupled to the sampling device 104. The sampling device 104 has a sample surface 134 that can be functionalized (e.g., chemically or physically modified in a manner suitable for performing the desired reactions). For example, the sample surface 134 can be functionalized and can include a plurality of pixel regions 106', 108', 110', 112', and 114', which can each hold more than one cluster during a base detection cycle (e.g., each pixel region has a corresponding cluster pair 106A, 106B; 108A, 108B; 110A, 110B; 112A, 112B; and 114A, 114B immobilized thereon). Each pixel region is associated with a corresponding sensor (or pixel or photodiode) 106, 108, 110, 112, and 114, such that light received by the pixel region is captured by the corresponding sensor. The pixel region 106' may also be associated with a corresponding reaction site 106" that holds a cluster pair on the sample surface 134, such that light emitted from the reaction site 106" is received by the pixel region 106' and captured by the corresponding sensor 106. Because of this sensing structure, when two or more clusters are present in the pixel region of a particular sensor during a base detection cycle (e.g., each pixel region has a corresponding cluster pair), the pixel signals in that base detection cycle carry information based on all of those clusters.
Thus, signal processing as described herein is used to distinguish each cluster, where there are more clusters than pixel signals in a given sampling event for a particular base detection cycle.
In the illustrated embodiment, the flow cell 102 includes sidewalls 138, 125 and a flow cap 136 supported by the sidewalls 138, 125. The sidewalls 138, 125 are coupled to the sample surface 134 and extend between the flow cap 136 and the sample surface 134. In some embodiments, the sidewalls 138, 125 are formed from a curable adhesive layer that bonds the flow cap 136 to the sampling device 104.
The sidewalls 138, 125 are sized and shaped such that a flow channel 144 exists between the flow cap 136 and the sampling device 104. The flow cap 136 may comprise a material that is transparent to the excitation light 101 propagating into the flow channel 144 from outside the biosensor 100. In one example, the excitation light 101 approaches the flow cap 136 at a non-orthogonal angle.
Additionally, as shown, the flow cap 136 may include inlet and outlet ports 142, 146 configured to fluidly engage other ports (not shown). For example, the other ports may come from a cartridge or workstation. The flow channel 144 is sized and shaped to direct fluid along the sample surface 134. The height H1 of the flow channel 144 and other dimensions may be configured to maintain a substantially uniform flow of fluid along the sample surface 134. The dimensions of the flow channel 144 may also be configured to control bubble formation.
By way of example, the flow cap 136 (or flow cell 102) may comprise a transparent material, such as glass or plastic. The flow cap 136 may be formed as a substantially rectangular block having a planar outer surface and a planar inner surface defining the flow channel 144. The block may be mounted to the side walls 138, 125. Alternatively, the flow cell 102 may be etched to define the flow cap 136 and the sidewalls 138, 125. For example, grooves may be etched into the transparent material. The grooves may become flow channels 144 when the etching material is mounted to the sampling device 104.
The sampling device 104 may be similar to, for example, an integrated circuit including a plurality of stacked substrate layers 120-126. The substrate layers 120 through 126 may include a base substrate 120, a solid-state imaging device 122 (e.g., a CMOS image sensor), a filter or light control layer 124, and a passivation layer 126. It should be noted that the above is merely illustrative, and that other embodiments may include fewer or additional layers. Further, each of the substrate layers 120 to 126 may include a plurality of sub-layers. The sampling device 104 may be fabricated using processes similar to those used in fabricating integrated circuits such as CMOS image sensors and CCDs. For example, the substrate layers 120-126, or portions thereof, may be grown, deposited, etched, etc. to form the sampling device 104.
The passivation layer 126 is configured to shield the filter layer 124 from the fluid environment of the flow channel 144. In some cases, the passivation layer 126 is also configured to provide a solid surface (i.e., the sample surface 134) that allows biomolecules or other analytes of interest to be immobilized thereon. For example, each of the reaction sites may comprise a cluster of biomolecules immobilized to the sample surface 134. Thus, the passivation layer 126 may be formed of a material that allows the reaction sites to be immobilized thereon. The passivation layer 126 may also comprise a material that is at least transparent to the desired fluorescent light. By way of example, the passivation layer 126 may include silicon nitride (Si3N4) and/or silicon dioxide (SiO2). However, other suitable materials may be used. In the illustrated embodiment, the passivation layer 126 may be substantially planar. However, in alternative embodiments, the passivation layer 126 may include recesses, such as pits, holes, grooves, and the like. In the illustrated embodiment, the passivation layer 126 has a thickness of about 150 nm to 200 nm, and more particularly about 170 nm.
The filter layer 124 may include various features that affect the transmission of light. In some implementations, the filter layer 124 may perform a number of functions. For example, the filter layer 124 may be configured to (a) filter unwanted optical signals, such as optical signals from an excitation light source; (b) Directing the emitted signals from the reaction sites to corresponding sensors 106, 108, 110, 112 and 114, which are configured to detect the emitted signals from the reaction sites; or (c) prevent or inhibit detection of unwanted emission signals from adjacent reaction sites. Thus, the filter layer 124 may also be referred to as a light management layer. In the illustrated embodiment, the filter layer 124 has a thickness of about 1 μm to 5 μm and more specifically about 2 μm to 4 μm. In alternative embodiments, the filter layer 124 may include an array of microlenses or other optical elements. Each of the microlenses may be configured to direct an emission signal from an associated reaction site to the sensor.
In some embodiments, the solid-state imaging device 122 and the base substrate 120 may be provided together as a previously configured solid-state imaging apparatus (e.g., CMOS chip). For example, the base substrate 120 may be a silicon wafer, and the solid-state imaging device 122 may be mounted thereon. The solid-state imaging device 122 includes a layer of semiconductor material (e.g., silicon) and sensors 106, 108, 110, 112, and 114. In the illustrated embodiment, the sensor is a photodiode configured to detect light. In other embodiments, the sensor comprises a photodetector. The solid-state imaging device 122 may be manufactured as a single chip through a CMOS-based manufacturing process.
The solid-state imaging device 122 may include a dense array of sensors 106, 108, 110, 112, and 114 configured to detect activity indicative of a desired reaction within or along the flow channel 144. In some embodiments, each sensor has a pixel region (or detection region) of about 1 to 2 square microns (μm2). The array may include one million sensors, five million sensors, one hundred million sensors, or even two hundred million sensors. The sensors 106, 108, 110, 112, and 114 may be configured to detect light of a predetermined wavelength indicative of a desired reaction.
In some embodiments, sampling device 104 includes a microcircuit arrangement, such as that described in U.S. patent No. 7,595,882, which is incorporated herein by reference in its entirety. More specifically, sampling device 104 may include an integrated circuit having a planar array of sensors 106, 108, 110, 112, and 114. The circuitry formed within sampling device 104 may be configured for at least one of signal amplification, digitizing, storage, and processing. The circuit may collect and analyze the detected fluorescence and generate a pixel signal (or detection signal) for transmitting the detection data to the signal processor. The circuitry may also perform additional analog and/or digital signal processing in the sampling device 104. The sampling device 104 may include conductive vias 130 that perform signal routing (e.g., transmitting pixel signals to a signal processor). Pixel signals may also be transmitted through electrical contacts 132 of sampling device 104.
The sampling device 104 is discussed in further detail in U.S. Nonprovisional Patent Application Serial No. 16/874,599 (attorney docket No. ILLM 1011-4/IP-1750-US), entitled "Systems and Devices for Characterization and Performance Analysis of Pixel-Based Sequencing," filed May 14, 2020, which is incorporated herein by reference as if fully set forth herein. The sampling device 104 is not limited to the configurations or uses described above. In alternative embodiments, the sampling device 104 may take other forms. For example, the sampling device 104 may include a CCD device (such as a CCD camera) coupled to a flow cell, or moved to interact with a flow cell having reaction sites therein.
Fig. 2 shows one implementation of a flow cell 200 that includes clusters in its blocks. The flow cell 200 corresponds to the flow cell 102 of fig. 1, e.g., without the flow cap 136. Furthermore, the depiction of the flow cell 200 is symbolized in nature, and the flow cell 200 symbolically depicts the various channels and blocks therein, without showing the various other components therein. Fig. 2 shows a top view of a flow cell 200.
In one embodiment, the flow cell 200 is divided or partitioned into a plurality of channels, such as channels 202a, 202b, …, 202P, i.e., P channels. In the example of fig. 2, the flow cell 200 is shown to include 8 channels, i.e., in this example, p=8, but the number of channels within the flow cell is implementation specific.
In one implementation, each channel 202 is further partitioned into non-overlapping regions referred to as "tiles" 212. For example, fig. 2 shows an enlarged view of a section 208 of an exemplary channel. The section 208 is shown to include a plurality of blocks 212.
In one example, each channel 202 includes one or more columns of tiles. For example, in fig. 2, each channel 202 includes two corresponding columns of tiles 212, as shown within the enlarged section 208. The number of blocks in each column of blocks within each channel is implementation specific and in one example there may be 50 blocks, 60 blocks, 100 blocks, or another suitable number of blocks in each column of blocks within each channel.
Each block includes a corresponding plurality of clusters. During sequencing, clusters on the block and their surrounding background are imaged. For example, fig. 2 shows an exemplary cluster 216 within an exemplary tile.
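The flow cell → channel → block (tile) → cluster hierarchy described above can be sketched as a small data structure. The class and parameter names here are hypothetical, and the default counts merely mirror the example of 8 channels with two columns of 50 tiles each:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tile:
    tile_id: int          # non-overlapping region of a channel
    clusters: int = 0     # hundreds of thousands to millions in practice

@dataclass
class Channel:
    channel_id: int
    tiles: List[Tile] = field(default_factory=list)

def build_flow_cell(num_channels: int = 8,
                    tile_columns: int = 2,
                    tiles_per_column: int = 50) -> List[Channel]:
    """Build the channel/tile hierarchy; all counts are implementation specific."""
    channels = []
    tile_id = 0
    for c in range(num_channels):
        channel = Channel(channel_id=c)
        for _ in range(tile_columns * tiles_per_column):
            channel.tiles.append(Tile(tile_id=tile_id))
            tile_id += 1
        channels.append(channel)
    return channels

cell = build_flow_cell()
assert len(cell) == 8                                    # P = 8 channels
assert sum(len(ch.tiles) for ch in cell) == 800          # 8 * 2 * 50 tiles
```

During a run, each tile is imaged once per sensing cycle, so the tile is the natural unit over which per-cycle sensor data is organized.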
FIG. 3 shows an exemplary eight-channel Illumina GA-IIx™ flow cell, and also shows an enlarged view of one block and its clusters and their surrounding background. For example, there are one hundred blocks per channel in the Illumina Genome Analyzer II and sixty-eight blocks per channel in the Illumina HiSeq 2000. A block 212 accommodates hundreds of thousands to millions of clusters. An image generated from a block, with the clusters shown as bright spots, is shown at 308 (an enlarged image view of the block), with an exemplary cluster 304 marked. A cluster 304 comprises approximately one thousand identical copies of a template molecule, although clusters vary in size and shape. Clusters are generated from template molecules by bridge amplification of the input library prior to the sequencing run. The purpose of amplification and cluster growth is to increase the intensity of the emitted signal, since the imaging device cannot reliably sense a single fluorophore. However, the physical distance between the DNA fragments within the cluster 304 is small, so the imaging device perceives the cluster of fragments as a single spot 304.
Clusters and tiles are discussed in further detail with respect to U.S. non-provisional patent application No. 16/825,987 (attorney docket No. ILLM 1008-16/IP-1693-US) entitled "Training Data Generation For Artificial Intelligence-Based Sequencing," filed 3/20, 2020.
Fig. 4 is a simplified block diagram of a system for analyzing sensor data from a sequencing system, such as base detection sensor output (e.g., see fig. 1). In the example of fig. 4, the system includes a sequencing machine 400 and a configurable processor 450. The configurable processor 450 may execute a neural-network-based base detector and/or a non-neural-network-based base detector (discussed in further detail herein) in coordination with a runtime program executed by a host processor, such as a central processing unit (CPU) 402. The sequencing machine 400 includes base detection sensors and a flow cell 401 (e.g., as discussed with respect to figs. 1-3). The flow cell may comprise one or more blocks in which clusters of genetic material are exposed to a sequence of analyte flows used to cause reactions in the clusters to identify the bases in the genetic material, as discussed with respect to figs. 1-3. The sensors sense the reactions for each cycle of the sequence in each block of the flow cell to provide block data. Genetic sequencing is a data-intensive operation that converts base detection sensor data into sequences of base detections for each cluster of genetic material sensed during a base detection operation.
The system in this example includes a CPU 402 that executes a runtime program to coordinate base detection operations, a memory 403 for storing sequences of block data arrays, base detection reads generated by base detection operations, and other information used in base detection operations. Additionally, in this illustration, the system includes a memory 404 for storing configuration file(s), such as FPGA bit files, and model parameters of the neural network for configuring and reconfiguring the configurable processor 450 and executing the neural network. The sequencing machine 400 may include a program for configuring a configurable processor and, in some embodiments, a reconfigurable processor to execute a neural network.
The sequencing machine 400 is coupled to the configurable processor 450 via a bus 405. The bus 405 may be implemented using a high-throughput technology, such as, in one example, bus technology compatible with the PCIe (Peripheral Component Interconnect Express) standards currently maintained and developed by the PCI-SIG (PCI Special Interest Group). Also in this example, a memory 460 is coupled to the configurable processor 450 by a bus 461. The memory 460 may be an on-board memory disposed on a circuit board with the configurable processor 450. The memory 460 is used for high-speed access by the configurable processor 450 to working data used in the base detection operation. The bus 461 may also be implemented using a high-throughput technology, such as bus technology compatible with the PCIe standards.
Configurable processors, including Field Programmable Gate Arrays (FPGAs), coarse-grained reconfigurable arrays (CGRA), and other configurable and reconfigurable devices, may be configured to perform various functions more efficiently or faster than possible using general purpose processors executing computer programs. The configuration of a configurable processor involves compiling a functional description to produce a configuration file, sometimes referred to as a bit stream or bit file, and distributing the configuration file to the configurable elements on the processor.
The configuration file defines the logic functions to be performed by the configurable processor by configuring the circuit to set up data flow patterns, the use of distributed memory and other on-chip memory resources, look-up table contents, and the operation of the configurable logic blocks and configurable execution units (e.g., multiply-accumulate units, configurable interconnects, and other elements of the configurable array). A configurable processor is reconfigurable if the configuration file can be changed in the field by changing the loaded configuration file. For example, the configuration files may be stored in volatile SRAM elements, in non-volatile read-write memory elements, and combinations thereof, distributed among an array of configurable elements on a configurable or reconfigurable processor. A variety of commercially available configurable processors are suitable for use in the base detection operations described herein. Examples include commercially available products such as the Xilinx Alveo™ U200, Xilinx Alveo™ U250, Xilinx Alveo™ U280, Intel/Altera Stratix™ GX2800, and Intel Stratix™ GX10M. In some examples, the host CPU may be implemented on the same integrated circuit as the configurable processor.
The embodiments described herein implement a multi-cycle neural network using the configurable processor 450. The configuration file for the configurable processor may be implemented using a hardware description language (HDL) or a register transfer level (RTL) language specification to specify the logic functions to be performed. The specification may be compiled using resources designed for the selected configurable processor to generate the configuration file. The same or similar specifications may be compiled to generate a design for an application-specific integrated circuit, which may not be a configurable processor.
Thus, in all embodiments described herein, an alternative to a configurable processor comprises a configured processor, including an application-specific integrated circuit (ASIC), a group of integrated circuits, or a system-on-chip (SOC) device, configured to perform the neural-network-based base detection operations described herein.
Generally, a configurable processor or a configured processor, as described herein, configured to perform runs of a neural network is referred to herein as a neural network processor. Similarly, a configurable processor or a configured processor configured to perform runs of a non-neural-network-based base detector is referred to herein as a non-neural-network processor. In general, configurable processors and configured processors can be used to implement one or both of a neural-network-based base detector and a non-neural-network-based base detector, as discussed later herein.
In this example, the configurable processor 450 is configured by a configuration file, loaded by a program executed by the CPU 402 or by another source, which configures the array of configurable elements on the configurable processor 450 to perform the base detection function. In this example, the configuration includes data flow logic 451, which is coupled to the buses 405 and 461 and performs functions for distributing data and control parameters among the elements used in the base detection operation.
In addition, the configurable processor 450 is configured with base detection execution logic 452 to execute the multi-cycle neural network. The logic 452 includes a plurality of multi-cycle execution clusters (e.g., 453), including, in this example, multi-cycle cluster 1 through multi-cycle cluster X. The number of multi-cycle clusters may be selected based on a trade-off involving the required throughput of operation and the resources available on the configurable processor.
The multi-cycle clusters are coupled to the data flow logic 451 through a data flow path 454 implemented using configurable interconnect and memory resources on the configurable processor. In addition, the multi-cycle clusters are coupled to the data flow logic 451 through a control path 455, implemented using, for example, configurable interconnect and memory resources on the configurable processor, which provides control signals indicating the available clusters, readiness to provide input units to the available clusters for executing a run of the neural network, readiness to provide trained parameters for the neural network, readiness to provide output patches of base detection classification data, and other control data used for executing the neural network.
The configurable processor is configured to execute runs of the multi-cycle neural network using the trained parameters to generate classification data for sensing cycles of the base detection operation. A run of the neural network is executed to generate classification data for a subject sensing cycle of the base detection operation. A run of the neural network operates on a sequence of N arrays of block data from respective ones of N sensing cycles, where, in the examples described herein, the N sensing cycles provide sensor data for N different base detection operations, one base position per operation in the time sequence. Optionally, some of the N sensing cycles may be out of sequence, if desired, depending on the particular neural network model being executed. The number N may be any number greater than one. In some examples described herein, the N sensing cycles represent a set of sensing cycles including at least one sensing cycle preceding the subject sensing cycle in the time sequence and at least one sensing cycle following the subject sensing cycle. Examples are described herein in which the number N is an integer equal to or greater than five.
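The windowing of N sensing cycles around the subject cycle can be sketched as follows. The helper name, the array shapes, and the symmetric N = 5 window (two cycles before and two after the subject cycle) are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def input_window(block_data: list, subject_cycle: int, n: int = 5) -> np.ndarray:
    """Stack N per-cycle block-data arrays centered on the subject cycle."""
    half = n // 2
    lo, hi = subject_cycle - half, subject_cycle + half
    if lo < 0 or hi >= len(block_data):
        raise ValueError("subject cycle too close to the start or end of the run")
    # Resulting shape: (N, height, width, features)
    return np.stack(block_data[lo:hi + 1])

# Ten sensing cycles of toy 4x4 block data with two image features each.
cycles = [np.zeros((4, 4, 2)) for _ in range(10)]
window = input_window(cycles, subject_cycle=5)
assert window.shape == (5, 4, 4, 2)
```

One such window is the input unit for one run of the neural network; sliding the window by one cycle yields the input unit for the next subject cycle, which is why intermediate per-cycle results can be cached and reused rather than recomputed.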
The data flow logic 451 is configured to move block data and at least some trained parameters of the model from the memory 460 to the configurable processor for runs of the neural network, using an input unit for a given run that comprises block data of spatially aligned patches from the N arrays. The input unit may be moved by direct memory access in one DMA operation, or in smaller units moved in coordination with the execution of the deployed neural network during available time slots.
The block data for a sensing cycle as described herein may include an array of sensor data having one or more features. For example, the sensor data may include two images that are analyzed to identify one of four bases at a base position in a genetic sequence of DNA, RNA, or other genetic material. The tile data may also include metadata about the image and the sensor. For example, in an embodiment of a base detection operation, the tile data may include information about the alignment of the image with the cluster, such as information about the distance from the center, which indicates the distance of each pixel in the sensor data array from the center of the cluster of genetic material on the tile.
During execution of the multi-cycle neural network as described below, the block data may also include data generated during execution of the multi-cycle neural network, referred to as intermediate data, which may be reused during operation of the multi-cycle neural network rather than recalculated. For example, during execution of the multi-cycle neural network, the data flow logic may write intermediate data to memory 460 in place of sensor data for a given patch of the block data array. Embodiments similar thereto are described in more detail below.
As shown, a system for analyzing base detection sensor output is described that includes a memory (e.g., 460) accessible by a runtime program that stores block data that includes sensor data from blocks of a sense cycle of a base detection operation. In addition, the system includes a neural network processor, such as a configurable processor 450 that has access to memory. The neural network processor is configured to perform operation of the neural network using the trained parameters to generate classification data for the sensing cycle. As described herein, the operation of the neural network operates on a sequence of N arrays of block data from respective sense cycles of the N sense cycles (including the subject cycle) to generate classification data for the subject cycle. The data flow logic 451 is provided to move the tile data and trained parameters from memory to the neural network processor for operation of the neural network using an input unit (data comprising spatially aligned patches from N arrays of respective sense cycles of the N sense cycles).
In addition, a system is described in which a neural network processor has access to a memory and includes a plurality of execution clusters, an execution logic cluster of the plurality of execution clusters configured to execute the neural network. The data flow logic is capable of accessing the memory and an execution cluster of the plurality of execution clusters to provide input units of block data to available ones of the plurality of execution clusters, the input units including a number N of spatially aligned patches from an array of block data of a respective sensing cycle (including a subject sensing cycle), and causing the execution cluster to apply the N spatially aligned patches to the neural network to produce output patches of classification data of spatially aligned patches of the subject sensing cycle, wherein N is greater than 1.
FIG. 5 is a simplified diagram illustrating aspects of a base detection operation that includes the functionality of a runtime program executed by a host processor. In this figure, the output from the image sensor of a flow cell (such as the flow cell shown in fig. 1-2) is provided on line 500 to an image processing thread 501, which may perform processing on the image, such as resampling, alignment, and placement in the sensor data array of individual tiles, and may be used by a process that calculates a tile cluster mask for each tile in the flow cell that identifies pixels in the sensor data array that correspond to clusters of genetic material on the corresponding tile of the flow cell. To calculate the cluster mask, one exemplary algorithm is based on a process for detecting clusters that are unreliable in early sequencing cycles using metrics derived from the softmax output, then discarding data from those wells/clusters, and not generating output data for those clusters. For example, the process may identify clusters with high reliability during the first N1 (e.g., 25) base detections, and reject other clusters. The rejected clusters may be polyclonal or very weak or blurred by fiducial points. The program may be executed on a host CPU. In alternative embodiments, this information would potentially be used to identify the necessary clusters of interest to be returned to the CPU, limiting the storage required for intermediate data.
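The reliability filtering described above can be sketched as follows; the metric (mean top softmax probability over the first N1 cycles) and the threshold are illustrative stand-ins, since the specification does not fix a particular formula:

```python
def reliable_cluster_mask(softmax_probs, threshold=0.9):
    """softmax_probs: per cluster, a list of per-cycle softmax vectors over
    the four bases for the first N1 (e.g., 25) base detections. A cluster is
    kept when its mean top probability meets the threshold; rejected clusters
    (e.g., polyclonal, weak, or obscured by fiducials) produce no output."""
    mask = []
    for cycles in softmax_probs:
        top = [max(p) for p in cycles]          # confidence of each call
        mask.append(sum(top) / len(top) >= threshold)
    return mask
```

Downstream, only data for clusters whose mask entry is true would be stored or returned to the CPU, limiting intermediate storage as noted above.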
Depending on the state of the base detect operation, the output of the image processing thread 501 is provided on line 502 to scheduling logic 510 in the CPU that routes the block data array to the data cache 504 on the high speed bus 503 or to hardware 520, such as the configurable processor of FIG. 4, on the high speed bus 505. The hardware 520 may be a multi-cluster neural network processor that executes a neural network-based base detector, or may be hardware that executes a non-neural based base detector, as will be discussed later herein.
Hardware 520 returns classification data (e.g., output by a neural network-based base detector and/or a non-neural network-based base detector) to scheduling logic 510, which passes the information to data cache 504, or on line 511 to thread 502, which performs base detection and quality score computation using the classification data, and may arrange the data for base detection reads in a standard format. The output of thread 502, which performs base detection and quality score computation, is provided on line 512 to thread 503, which aggregates base detection reads, performs other operations such as data compression, and writes the resulting base detection output to a designated destination for customer use.
In some embodiments, the host may include threads (not shown) that perform final processing of the output of hardware 520 to support the neural network. For example, hardware 520 may provide an output of classification data from a final layer of the multi-cluster neural network. The host processor may perform output activation functions, such as softmax functions, on the classified data to configure data for use by the base detection and quality scoring threads 502. In addition, the host processor may perform input operations (not shown), such as resampling, batch normalization, or other adjustment of the tile data prior to input to hardware 520.
Fig. 6 is a simplified diagram of a configuration of a configurable processor, such as the configurable processor of fig. 4. In fig. 6, the configurable processor includes an FPGA with a number of high-speed PCIe interfaces. The FPGA is configured with a wrapper 600 that includes the data flow logic described with reference to fig. 4. The wrapper 600 manages the interface and coordination with a runtime program in the CPU through a CPU communication link 609, and manages communication with the on-board DRAM 602 (e.g., memory 460) via a DRAM communication link 610. The data flow logic in the wrapper 600 provides patch data, retrieved by traversing the number N cycles of arrays of block data on the on-board DRAM 602, to the clusters 601, and retrieves process data 615 from the clusters 601 for delivery back to the on-board DRAM 602. The wrapper 600 also manages data transfer between the on-board DRAM 602 and host memory, both for the input arrays of block data and for the output patches of classification data. The wrapper transfers patch data on line 613 to the assigned cluster 601. The wrapper provides trained parameters, such as weights and biases, retrieved from the on-board DRAM 602, on line 612 to the assigned cluster 601. The wrapper provides configuration and control data on line 611 to the cluster 601, provided from or generated in response to the runtime program on the host via the CPU communication link 609. The clusters may also provide status signals on line 616 to the wrapper 600, which are used in cooperation with control signals from the host to manage traversal of the arrays of block data to provide spatially aligned patch data, and to execute the multi-cycle neural network for base detection and/or the operations of a non-neural-network-based base detector on the patch data, using the resources of the clusters 601.
As described above, there may be multiple clusters on a single configurable processor managed by encapsulator 600 that are configured for execution on corresponding ones of multiple patches of block data. Each cluster may be configured to provide classification data for base detection in a subject sensing cycle using the tile data for multiple sensing cycles as described herein.
In an example of a system, model data (including kernel data, such as filter weights and offsets) may be sent from the host CPU to the configurable processor so that the model may be updated according to the number of cycles. As one representative example, a base detection operation may include on the order of hundreds of sensing cycles. In some embodiments, the base detection operation may comprise a double-ended read. For example, model training parameters may be updated once every 20 cycles (or other number of cycles) or according to an update pattern implemented for a particular system. In some implementations including double-ended reads, where the sequence of a given string in a genetic cluster on a block includes a first portion extending downward (or upward) along the string from a first end and a second portion extending upward (or downward) along the string from a second end, the trained parameters can be updated in the transition from the first portion to the second portion.
In some examples, image data for multiple cycles of sensing data for a tile may be sent from the CPU to the encapsulator 600. The encapsulator 600 can optionally perform some preprocessing and transformation of the sensing data, and write the information to the on-board DRAM 602. The input tile data for each sensing cycle may comprise arrays of sensor data including about 4000 x 3000 pixels or more per tile, where two features represent the colors of the two images of the tile and each feature is one or two bytes per pixel. For implementations in which the number N is three sensing cycles to be used in each run of the multi-cycle neural network, the block data arrays for each run of the multi-cycle neural network may consume approximately hundreds of megabytes per tile. In some embodiments of the system, the tile data further includes an array of DFC data, stored once per tile, or other types of metadata about the sensor data and the tile.
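The "hundreds of megabytes" figure can be checked with back-of-envelope arithmetic, using the illustrative numbers quoted above (roughly 4000 x 3000 pixels per tile, two features per pixel, up to two bytes per feature, N = 3 cycles); exact sizes vary by instrument:

```python
# Rough per-run input size for one tile, using the illustrative numbers
# from the text above (not exact instrument specifications).
pixels = 4000 * 3000                 # sensor data array per tile, per cycle
bytes_per_cycle = pixels * 2 * 2     # two features, two bytes per feature
window_bytes = bytes_per_cycle * 3   # N = 3 sensing cycles per run
print(window_bytes / 1e6)            # -> 144.0 (megabytes, order of 10^2)
```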
In operation, the encapsulator assigns a patch to a cluster when a multi-cycle cluster is available. The encapsulator retrieves the next patch of block data in the traversal of the block and sends it to the assigned cluster along with appropriate control and configuration information. In some systems, a cluster may be configured with sufficient memory on the configurable processor to hold both the patch of data being processed in place, which includes patches from multiple cycles, and the patch of data to be processed when processing of the current patch completes, using ping-pong buffering techniques or raster scanning techniques in various embodiments.
When the assigned cluster completes a run of the neural network on the current patch and produces an output patch, it signals the encapsulator. The encapsulator reads the output patch from the assigned cluster, or alternatively the assigned cluster pushes the data to the encapsulator. The encapsulator then assembles output patches for the processed tile in the DRAM 602. When processing of the entire tile has been completed and the output patches of data have been transferred to the DRAM, the encapsulator sends the processed output array for the tile back to the host/CPU in a specified format. In some embodiments, the on-board DRAM 602 is managed by memory management logic in the encapsulator 600. The runtime program may control the sequencing operations to complete analysis of all arrays of block data for all cycles in the run in a continuous stream, thereby providing real-time analysis.
Multiple base detectors
Fig. 6A shows a system 600 that performs a base detection operation on raw images (i.e., sensor data) output by a biosensor using two or more base detectors. For example, system 600 includes a sequencing machine 1404, such as that discussed with respect to fig. 1 (and also discussed later herein with respect to fig. 14). Sequencing machine 1404 includes a flow cell 1405, such as the flow cell discussed with respect to fig. 1-3. The flow cell 1405 includes a plurality of tiles 1406, and each tile 1406 includes a plurality of clusters 1407 (an exemplary cluster of individual tiles is shown in fig. 6A), e.g., as discussed with respect to fig. 2 and 3. As discussed with respect to fig. 4-6, sensor data 1412 including the raw image from block 1406 is output by sequencing machine 1404.
In one embodiment, the system 600 includes two or more base detectors, such as a first base detector 1414 and a second base detector 1416. Although two base detectors are shown, in one example, more than two base detectors may be present in system 600.
Each base detector of FIG. 6A outputs corresponding base detection classification information. For example, the first base detector 1414 outputs first base detection classification information 1434, and the second base detector 1416 outputs second base detection classification information 1436. The base detection combination module 1428 generates a final base detection 1440 based on one or both of the first base detection classification information 1434 and/or the second base detection classification information 1436.
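One simple way the base detection combination module 1428 could select among the two detectors' outputs is per-cluster score comparison. The sketch below is illustrative only: the data shapes (lists of (base, score) pairs) and the tie-breaking rule are assumptions, since the text leaves the combination strategy open:

```python
def combine_base_calls(info_a, info_b):
    """Per cluster, keep the call whose detector reports the higher
    confidence score. info_a/info_b: lists of (base, score) pairs from
    the first and second base detectors; ties go to the first detector."""
    final = []
    for (base_a, score_a), (base_b, score_b) in zip(info_a, info_b):
        final.append(base_a if score_a >= score_b else base_b)
    return final
```

Other combination rules (e.g., always preferring one detector for certain cycles, or blending the classification scores) fit the same interface.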
In one example, the first base detector 1414 is a neural network based base detector. For example, the first base detector 1414 is a nonlinear system employing one or more neural network models for base detection, as will be discussed later herein. The first base detector 1414 is also referred to herein as a deep rta (deep real time analysis) base detector or a deep neural network base detector.
In one example, the second base detector 1416 is a non-neural network based base detector. For example, the second base detector 1416 is at least partially a linear system for base detection. For example, the second base detector 1416 does not employ a neural network for base detection (or uses a smaller neural network model for base detection than the larger neural network model used by the first base detector 1414), as will be discussed later herein. The second base detector 1416 is also referred to herein as an RTA (real time analysis) base detector.
Examples of deep RTA (or deep neural network) base detectors and RTA base detectors are discussed in U.S. non-provisional patent application No. 16/826,126 (attorney docket No. ilm 1008-18/IP-1744-US), entitled "Artificial Intelligence-Based Base Calling," filed 3/20/2020, which is incorporated by reference for all purposes as if fully set forth herein.
Further details of the operation of the system 600 of FIG. 6A and further examples of the first base detector 1414 and the second base detector 1416 will be discussed in further detail later herein, e.g., with respect to FIG. 14.
Base detector that is not based on a neural network and is at least partially linear (second base detector 1416 of FIGS. 6A and 14)
As discussed with respect to fig. 6A, the second base detector 1416 is a non-neural network based and at least partially linear base detector. That is, the second base detector 1416 does not employ a neural network for base detection (or uses a smaller neural network model for base detection than the larger neural network model used by the first base detector 1414). An example of a second base detector 1416 is an RTA base detector.
RTA is a base detector that uses a linear intensity extractor to extract features from sequencing images for base detection. The following discussion describes one implementation of intensity extraction and base detection by RTA. In this implementation, the RTA performs a template generation step to generate a template image that identifies the locations of clusters on a tile, using sequencing images from a number of initial sequencing cycles called template cycles. The template image is used as a reference for the subsequent registration and intensity extraction steps. The template image is generated by detecting and merging bright spots in each sequencing image of the template cycles, which in turn involves sharpening the sequencing image (e.g., using a Laplacian convolution), determining an "on" threshold by a spatially isolated Otsu method, and subsequent five-pixel local maximum detection with sub-pixel position interpolation. In another example, fiducial markers are used to identify the locations of clusters on a tile. The solid support on which the biological specimen is imaged may include such fiducial markers in order to determine the orientation of the specimen, or an image thereof, relative to probes attached to the solid support. Exemplary fiducials include, but are not limited to, beads (with or without fluorescent moieties, or moieties such as nucleic acids to which labeled probes may bind), fluorescent molecules attached with known or determinable features, or structures that combine morphological shapes with fluorescent moieties. Exemplary fiducials are set forth in U.S. Patent Publication No. 2002/0150909, which is incorporated herein by reference.
The RTA then registers the current sequencing image against the template image. This is achieved by using image correlation to align the current sequencing image with the template image over subregions, or by using a transformation such as a full six-parameter linear affine transformation.
The RTA generates a color matrix to correct for crosstalk between color channels of the sequenced image. RTA implements empirical phasing correction to compensate for noise in the sequenced images caused by phase errors.
After applying different corrections to the sequencing image, the RTA extracts the signal intensity at each point location in the sequencing image. For example, for a given point location, the signal strength may be extracted by determining a weighted average of the intensities of the pixels in the point location. For example, a weighted average of the center pixel and the neighboring pixels may be performed using bilinear or bicubic interpolation. In some implementations, each point location in the image can include several pixels (e.g., 1 to 5 pixels).
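One simple form of the weighted average described above is bilinear interpolation at the sub-pixel spot center, which the following sketch illustrates (the function and its argument conventions are illustrative, not RTA's actual interface):

```python
def extract_intensity(image, x, y):
    """Bilinearly interpolate the intensity at sub-pixel spot center
    (x, y): a weighted average of the four surrounding pixels, with
    weights given by the fractional offsets. image is a row-major 2D list."""
    x0, y0 = int(x), int(y)
    fx, fy = x - x0, y - y0
    return (image[y0][x0]         * (1 - fx) * (1 - fy)
          + image[y0][x0 + 1]     * fx       * (1 - fy)
          + image[y0 + 1][x0]     * (1 - fx) * fy
          + image[y0 + 1][x0 + 1] * fx       * fy)
```

Bicubic interpolation, mentioned above, works the same way but draws on a 4 x 4 pixel neighborhood with cubic weights.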
The RTA then spatially normalizes the extracted signal intensities to account for illumination variations across the sampled image. For example, the intensity values may be normalized such that the 5 th percentile and the 95 th percentile have values of 0 and 1, respectively. The normalized signal intensity of the image (e.g., the intensity normalized for each channel) may be used to calculate the mean purity of the points in the image.
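The percentile normalization just described can be sketched as below; the nearest-rank percentile computation is a simplification chosen for illustration (RTA's exact percentile method is not specified in the text):

```python
def normalize_intensities(values):
    """Rescale extracted intensities so the 5th percentile maps to 0 and
    the 95th percentile maps to 1, as described above. Uses nearest-rank
    percentiles for simplicity; values outside [p05, p95] fall outside [0, 1]."""
    ranked = sorted(values)
    n = len(ranked)
    p05 = ranked[int(0.05 * (n - 1))]
    p95 = ranked[int(0.95 * (n - 1))]
    span = (p95 - p05) or 1.0          # guard against a degenerate span
    return [(v - p05) / span for v in values]
```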
In some implementations, the RTA uses an equalizer to maximize the signal-to-noise ratio of the extracted signal intensities. The equalizer may be trained (e.g., using least squares estimation or adaptive equalization algorithms) to maximize the signal-to-noise ratio of the cluster intensity data in the sequencing images. In some implementations, the equalizer is a set of look-up tables (LUTs) comprising a plurality of LUTs with sub-pixel resolution, also referred to as "equalizer filters" or "convolution kernels." In one implementation, the number of LUTs in the equalizer depends on the number of sub-pixels into which a pixel of the sequencing image can be divided. For example, if a pixel can be divided into n × n sub-pixels (e.g., 5 × 5 sub-pixels), the equalizer has n² LUTs (e.g., 25 LUTs).
In one implementation of training the equalizer, data from the sequencing images is binned by well sub-pixel location. For example, for a 5 × 5 LUT, 1/25 of the wells have a center in bin (1, 1) (e.g., the upper-left corner of the sensor pixel), 1/25 of the wells are in bin (1, 2), and so on. In one implementation, the equalizer coefficients for each bin are determined using a least squares estimate on the subset of data from the wells corresponding to the respective bin. In this way, the resulting estimated equalizer coefficients are different for each bin.
Each LUT/equalizer filter/convolution kernel has a plurality of coefficients learned from training. In one implementation, the number of coefficients in a LUT corresponds to the number of pixels used for base detection of a cluster. For example, if the local grid of pixels (image or pixel patch) used for base detection of a cluster is p × p in size (e.g., a 9 × 9 pixel patch), each LUT has p² coefficients (e.g., 81 coefficients).
In one implementation, training produces equalizer coefficients configured to mix/combine intensity values of pixels depicting intensity emissions from a target cluster being base detected and intensity emissions from one or more neighboring clusters in a manner that maximizes signal-to-noise ratio. The signal that is maximized in signal-to-noise ratio is the intensity emission from the target cluster, while the noise that is minimized in signal-to-noise ratio is the intensity emission from the neighboring cluster, i.e., spatial crosstalk, plus some random noise (e.g., to account for background intensity emission). The equalizer coefficients are used as weights and the mixing/combining includes performing an element-wise multiplication between the equalizer coefficients and the intensity values of the pixels to calculate a weighted sum of the intensity values of the pixels, i.e., a convolution operation.
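The element-wise multiply-and-sum described above reduces to the following sketch (shapes and names are illustrative; in RTA the coefficients come from the LUT selected by the well's sub-pixel bin):

```python
def equalize(patch, coeffs):
    """Apply one equalizer filter: element-wise multiply a p x p pixel
    patch by the LUT's p*p learned coefficients and sum the products,
    i.e., the convolution-style weighted sum described above.
    Both arguments are row-major 2D lists of equal size."""
    return sum(patch[i][j] * coeffs[i][j]
               for i in range(len(patch))
               for j in range(len(patch[0])))
```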
RTA then performs base detection by fitting a mathematical model to the optimized intensity data. Suitable mathematical models that may be used include, for example, k-means clustering algorithms, k-means like clustering algorithms, expectation-maximization clustering algorithms, histogram-based methods, and the like. Four gaussian distributions may be fitted to the set of two-channel intensity data such that one distribution is applied for each of the four nucleotides represented in the data set. In one particular implementation, a expectation-maximization (EM) algorithm may be applied. As a result of the EM algorithm, for each X, Y value (each of the two channel intensities), a value may be generated that represents the likelihood that a certain X, Y intensity value belongs to one of the four gaussian distributions to which the data is fitted. In the case of four bases giving four separate distributions, each X, Y intensity value will also have four associated possible values, one for each of the four bases. The maximum of the four possible values indicates base detection. For example, if the cluster is "off" in both channels, the base is detected as G. If the cluster is "off" in one channel and "on" in the other channel, the base is detected as either C or T (depending on which channel is on), and if the cluster is "on" in both channels, the base is detected as A.
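The two-channel on/off mapping spelled out at the end of the paragraph above can be written directly as a lookup; the assignment of channel 1 to C and channel 2 to T is an assumption for illustration, since the text only says the base depends on which channel is on:

```python
def decode_two_channel(on_ch1, on_ch2):
    """Map two-channel on/off states to a base per the mapping above:
    both off -> G, both on -> A, exactly one on -> C or T depending on
    which channel is on (channel-to-base assignment assumed here)."""
    if on_ch1 and on_ch2:
        return "A"
    if on_ch1:
        return "C"
    if on_ch2:
        return "T"
    return "G"
```

In the full pipeline, the "on"/"off" decision itself comes from the fitted Gaussian distributions: the distribution with the maximum likelihood for a cluster's (X, Y) intensity pair determines the call.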
Additional details regarding RTA can be found in U.S. non-provisional patent application Ser. No. 15/909,437 entitled "Optical Distortion Correction For Imaged Samples," filed on 1/3/2018; U.S. non-provisional patent application No. 14/530,299 entitled "Image Analysis Useful for Patterned Objects" filed on 10/31/2014; U.S. non-provisional patent application No. 15/153,953 entitled "Methods and Systems for Analyzing Image Data" filed on 12/3/2014; U.S. non-provisional patent application No. 13/006,206, filed on 1 month 13 2011, entitled "Data Processing System and Methods"; and U.S. non-provisional patent application No. 17/308,035 (attorney docket No. ILLM 1032-2/IP-1991-US) entitled "organization-Based Image Processing and Spatial Crosstalk Attenuator," filed on 4 and 5 of 2021, all of which are incorporated herein by reference as if fully set forth herein.
Base detector based on a neural network and at least partially nonlinear (e.g., the first base detector 1414 of FIG. 6A)
Fig. 7-13B discuss various examples of the first base detector 1414 of fig. 6A. For example, fig. 7 is a diagram of a multi-cycle neural network model that may be performed using the systems described herein. The multicycle neural network model is an example of the first base detector 1414 of FIG. 6A, although another neural network-based model may be used for the first base detector 1414.
The example shown in fig. 7 may be referred to as a five-cycle input, one-cycle output neural network. However, it is noted that the five-cycle input, one-cycle output neural network is merely an example, and a neural network may have a different number of input cycles (such as six, seven, nine, or another suitable number). For example, fig. 10, discussed later herein, has a nine-cycle input. Referring again to fig. 7, the input to the multi-cycle neural network model includes five spatially aligned patches (e.g., 700) of the block data arrays from five sensing cycles of a given block. Spatially aligned patches have the same aligned row and column dimensions (x, y) as the other patches in the set, so that the information relates to the same clusters of genetic material on the block across the sequencing cycles. In this example, the subject patch is a patch from the block data array of cycle K. The set of five spatially aligned patches includes the subject patch from cycle K, a patch from cycle K-2 two cycles before the subject patch, a patch from cycle K-1 one cycle before the subject patch, a patch from cycle K+1 one cycle after the subject patch, and a patch from cycle K+2 two cycles after the subject patch.
The model includes an isolation stack 701 of layers of the neural network for each of the input patches. Thus, stack 701 receives as input block data from the patch of cycle K+2, and is isolated from stacks 702, 703, 704, and 705 such that they do not share input data or intermediate data. In some embodiments, all of the stacks 701-705 may have the same model and the same trained parameters. In other embodiments, the models and trained parameters may be different in different stacks. Stack 702 receives as input block data from the patch of cycle K+1. Stack 703 receives as input block data from the patch of cycle K. Stack 704 receives as input block data from the patch of cycle K-1. Stack 705 receives as input block data from the patch of cycle K-2. The layers of the isolation stacks each perform a convolution operation of a kernel comprising a plurality of filters over the input data of the layer. As in the example above, patch 700 may include three features. The output of layer 710 may include more features, such as 10 to 20 features. Likewise, the output of each of layers 711-716 may include any number of features suitable for a particular implementation. The parameters of the filters are trained parameters of the neural network, such as weights and biases. The output feature sets (intermediate data) from each of the stacks 701-705 are provided as inputs to an inverse hierarchy 720 of temporal combining layers (in which intermediate data from the multiple cycles are combined). In the illustrated example, the inverse hierarchy 720 includes a first layer comprising three temporal combining layers 721, 722, 723, each of which receives intermediate data from three of the isolation stacks, and a final layer comprising a combining layer 730 that receives intermediate data from the three temporal layers 721, 722, 723.
The output of the final combining layer 730 is the output patch of classification data for clusters located in the corresponding patch from the block of cycle K. The output patches may be assembled into output array classification data for the blocks of cycle K. In some embodiments, the output patch may have a different size and dimension than the input patch. In some implementations, the output patch may include pixel-by-pixel data that may be filtered by the host to select cluster data.
Depending on the particular implementation, the output classification data 735 may then be applied to a softmax function 740 (or other output activation function) that is optionally executed by a host or on a configurable processor. An output function other than softmax may be used (e.g., base detection output parameters are generated based on maximum output, then base quality is given using a learned nonlinear mapping using context/network output).
Finally, the output of softmax function 740 may be provided as the base detection probability of cycle K (750) and stored in host memory for use in subsequent processing. Other systems may use another function for output probability computation, e.g., another nonlinear model.
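The softmax applied to the final layer's four per-base outputs is the standard function shown below (the numerically stable max-subtraction is a common implementation detail, not mandated by the text):

```python
import math

def softmax(logits):
    """Standard softmax over the four per-base network outputs for a
    cluster, yielding the base detection probabilities stored for
    cycle K. Subtracting the max keeps exponentials from overflowing."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting four probabilities sum to one; the largest corresponds to the detected base, and their relative magnitudes feed the quality score computation.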
The neural network may be implemented using a configurable processor with multiple execution clusters to complete the evaluation of one block cycle for a duration equal to or near the time interval of one sensing cycle, effectively providing output data in real time. The data flow logic may be configured to distribute the input units of block data and trained parameters to the execution clusters and to distribute the output patches for aggregation in the memory.
An input unit for the five-cycle input, one-cycle output neural network of fig. 7, for a base detection operation using two-channel sensor data, is described with reference to figs. 8A and 8B. For example, for a given base in a genetic sequence, the base detection operation may execute two analyte flows and two reactions that generate two channels of signals (such as images), which can be processed to identify which of the four bases is located at the current position of the genetic sequence for each cluster of genetic material. In other systems, a different number of channels of sensing data may be utilized. For example, base detection may be performed using one-channel methods and systems. The incorporated materials of U.S. Patent Application Publication No. 2013/007932 discuss base detection using various numbers of channels, such as one channel, two channels, or four channels.
Fig. 8A shows five-cycle block data arrays for a given block (block M) used for the purpose of executing a five-cycle input, one-cycle output neural network. In this example, the five cycles of input block data may be written to the on-board DRAM, or other memory in the system accessible by the data flow logic, and include array 801 for channel 1 and array 811 for channel 2 for cycle K-2, array 802 for channel 1 and array 812 for channel 2 for cycle K-1, array 803 for channel 1 and array 813 for channel 2 for cycle K, array 804 for channel 1 and array 814 for channel 2 for cycle K+1, and array 805 for channel 1 and array 815 for channel 2 for cycle K+2. In addition, an array 820 of metadata for the block, which in this case includes a DFC file, may be written once in memory, to serve as input to the neural network along with each cycle.
Although FIG. 8A discusses a two-channel base detection operation, the use of two channels is merely an example, and any other suitable number of channels may be used to perform base detection. For example, the incorporated materials of U.S. Patent Application Publication No. 2013/007932 discuss base detection using various numbers of channels, such as one channel, two channels, four channels, or another suitable number of channels.
The data flow logic composes input units of block data, which can be understood with reference to fig. 8B, comprising spatially aligned patches of the block data arrays for each execution cluster configured to perform the neural network over an input patch. The data flow logic composes an input unit for an assigned execution cluster by reading spatially aligned patches (e.g., 851, 852, 861, 862, 870) from each of the block data arrays 801-805, 811-815, 820 of the five input cycles, and delivering them via data paths (illustrated schematically at 850) to memory on the configurable processor allocated for use by the assigned execution cluster. The assigned execution cluster performs the operation of the five-cycle input, one-cycle output neural network and delivers, for subject cycle K, an output patch of classification data for the same patch of the block in subject cycle K.
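As a rough sketch, composing an input unit amounts to reading the same spatially aligned patch out of every per-cycle, per-channel block data array and stacking the results for one execution cluster. All dimensions and names below are illustrative assumptions, not values from the source:

```python
import numpy as np

# Hypothetical dimensions; real tile and patch sizes are implementation-specific.
TILE_H, TILE_W = 64, 64      # block data array size (illustrative)
PATCH_H, PATCH_W = 16, 16    # spatially aligned patch size (illustrative)
N_CYCLES, N_CHANNELS = 5, 2  # five-cycle input, two-channel sensor data

rng = np.random.default_rng(0)
# One data array per (cycle, channel), standing in for arrays 801-805 and 811-815.
tile_data = rng.random((N_CYCLES, N_CHANNELS, TILE_H, TILE_W))

def make_input_unit(tile_data, row, col):
    """Read the same spatial patch from every cycle/channel array and
    stack them into one input unit for an assigned execution cluster."""
    return tile_data[:, :, row:row + PATCH_H, col:col + PATCH_W]

unit = make_input_unit(tile_data, row=8, col=24)
print(unit.shape)  # (5, 2, 16, 16): cycles x channels x patch height x patch width
```

The key property is that the patch coordinates are identical for every cycle and channel, so the downstream network sees spatially aligned data across all five cycles.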
Fig. 9 is a simplified representation of a stack of neural networks that may be used in the same system as fig. 7 (e.g., 701 and 720). In this example, some functions of the neural network (e.g., 900, 902) are performed on the host computer, and other portions of the neural network (e.g., 901) are performed on the configurable processor.
In one example, the first function may be batch normalization (layer 910) performed on the CPU. However, in another example, batch normalization as a function may be fused into one or more other layers, in which case there may be no separate batch normalization layer.
As discussed above with respect to the configurable processor, a plurality of spatially isolated convolution layers are executed as a first set of convolution layers of the neural network. In this example, the first set of convolution layers applies 2D convolutions spatially.
As shown in fig. 9, a first spatial convolution 921 is performed, followed by a second spatial convolution 922, followed by a third spatial convolution 923, and so on, for the L/2 (L is described with reference to fig. 7) spatially isolated neural network layers in each stack. As indicated at 923A, the number of spatial layers may be any practical number, which, for context, may range from a few to more than 20 in different embodiments.
For sp_conv_0, the kernel weights are stored, for example, in a (1, 6, 3, L) structure, since there are 3 input channels for this layer.
For the other sp_conv layers, the kernel weights are stored in a (1, 6, 6, L) structure, since for each of these layers there are K (=L) inputs and outputs.
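The stated weight layouts can be sanity-checked with a small sketch; the value of L below is an arbitrary illustrative choice, and the interpretation of each axis beyond the input-channel count is implementation-specific:

```python
import numpy as np

L = 7  # number of feature maps per spatial layer (illustrative choice)

# Weight layouts as stated in the text: sp_conv_0 sees the 3 raw input
# channels, while the later sp_conv layers use the (1, 6, 6, L) structure.
w_sp_conv_0 = np.zeros((1, 6, 3, L))  # (1, 6, 3, L) structure
w_sp_conv_n = np.zeros((1, 6, 6, L))  # (1, 6, 6, L) structure
print(w_sp_conv_0.size, w_sp_conv_n.size)  # 18*L and 36*L weights per layer
```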
The outputs of the stack of spatial layers are provided to the temporal layers, including convolution layers 924, 925, which are executed on the FPGA. Layers 924 and 925 may be convolution layers that apply a 1D convolution across cycles. As indicated at 924A, the number of temporal layers may be any practical number, which, for context, may range from a few to more than 20 in different embodiments.
The first temporal layer, temp_conv_0 (layer 924), reduces the number of cycle channels from 5 to 3, as shown in fig. 7. The second temporal layer (layer 925) reduces the number of cycle channels from 3 to 1, as shown in fig. 7, and reduces the number of feature maps to four outputs per pixel, representing confidence in each base detection.
The outputs of the temporal layers are accumulated in output patches and delivered to the host CPU, which applies, for example, a softmax function 930 or another function to normalize the base detection probabilities.
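The normalization step can be sketched as follows; the raw scores are made up for illustration, and the base ordering is an assumption rather than something specified in the source:

```python
import numpy as np

def softmax(scores):
    """Normalize per-pixel base confidence scores into probabilities,
    as a host-side function such as 930 might do."""
    z = scores - scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Four raw outputs per pixel, one per base (ordering A, C, G, T assumed here).
raw = np.array([2.0, 0.1, -1.0, 0.3])
probs = softmax(raw)
print(probs.round(3))  # probabilities summing to 1; largest at index 0
```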
FIG. 10 shows an alternative implementation of a ten-input, six-output neural network that can be executed for base detection operations. In this example, block data from spatially aligned input patches of cycles 0 through 9 are applied to isolated stacks of spatial layers, such as stack 1001 for cycle 9. The outputs of the isolated stacks are applied to an inverse hierarchical arrangement of temporal stack 1020, having outputs 1035(2) through 1035(7), to provide base detection classification data for subject cycles 2 through 7.
FIG. 11 illustrates one implementation of a specialized architecture of a neural network-based base detector (e.g., fig. 7) used to isolate the processing of data for different sequencing cycles. First, the motivation for using the specialized architecture is described.
The neural network-based base detector processes data for a current sequencing cycle, one or more preceding sequencing cycles, and one or more succeeding sequencing cycles. Data for the additional sequencing cycles provides sequence-specific context. During training, the neural network-based base detector learns to use the sequence-specific context to improve base detection accuracy. Furthermore, data for the preceding and succeeding sequencing cycles provides second-order contributions of pre-phasing and phasing signals to the current sequencing cycle.
The spatial convolution layer uses so-called "isolation convolutions" that achieve isolation by independently processing the data of each of a plurality of sequencing cycles via a "dedicated unshared" convolution sequence. The isolated convolution convolves the data and resulting feature maps for only a given sequencing cycle (i.e., within a cycle), and does not convolve the data and resulting feature maps for any other sequencing cycle.
For example, consider that the input data includes (i) current data for a current (time t) sequencing cycle to be base detected, (ii) previous data for a previous (time t-1) sequencing cycle, and (iii) subsequent data for a subsequent (time t+1) sequencing cycle. The specialized architecture then initiates three separate data processing pipelines (or convolution pipelines), namely a current data processing pipeline, a previous data processing pipeline, and a subsequent data processing pipeline. The current data processing pipeline receives as input the current data for the current (time t) sequencing cycle and processes it independently through a plurality of spatial convolution layers to produce a so-called "current spatial convolution representation" as the output of a final spatial convolution layer. The previous data processing pipeline receives as input the previous data for the previous (time t-1) sequencing cycle and processes it independently through the plurality of spatial convolution layers to produce a so-called "previous spatial convolution representation" as the output of the final spatial convolution layer. The subsequent data processing pipeline receives as input the subsequent data for the subsequent (time t+1) sequencing cycle and processes it independently through the plurality of spatial convolution layers to produce a so-called "subsequent spatial convolution representation" as the output of the final spatial convolution layer.
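A minimal sketch of the three isolated pipelines follows, using 1x1 channel-mixing layers in place of real 2D spatial kernels; all shapes, layer counts, and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def spatial_pipeline(cycle_data, weight_stack):
    """One isolated pipeline: only this cycle's data flows through the stack.
    Each 'spatial layer' here is a 1x1 convolution (pure channel mixing)
    followed by ReLU; real layers would use 2D spatial kernels."""
    x = cycle_data  # (channels, H, W)
    for w in weight_stack:  # w: (out_channels, in_channels)
        x = np.maximum(np.einsum('oc,chw->ohw', w, x), 0.0)
    return x

channels, H, W, n_layers, feat = 2, 8, 8, 3, 4
weights = [rng.standard_normal((feat, channels))] + \
          [rng.standard_normal((feat, feat)) for _ in range(n_layers - 1)]

# Three independent pipelines for cycles t-1, t, t+1: no cross-cycle mixing,
# each producing its own spatial convolution representation.
reps = {dt: spatial_pipeline(rng.random((channels, H, W)), weights)
        for dt in (-1, 0, +1)}
print({dt: r.shape for dt, r in reps.items()})  # each: (4, 8, 8)
```

Sharing `weights` across the pipelines is one design choice; per-pipeline weights would work equally well in this sketch, since isolation concerns the data, not the parameters.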
In some implementations, the current, previous, and subsequent data processing pipelines are executed in parallel.
In some implementations, the spatial convolution layer is part of a spatial convolution network (or sub-network) within the specialized architecture.
The neural network-based base detector also includes temporal convolution layers that mix information between sequencing cycles (i.e., inter-cycle). The temporal convolution layers receive their inputs from the spatial convolution network and operate on the spatial convolution representations produced by the final spatial convolution layers of the corresponding data processing pipelines.
The temporal convolution layer uses a so-called "combined convolution" that convolves the input channels in subsequent inputs on a group-by-group basis on a sliding window basis. In one implementation, these subsequent inputs are subsequent outputs generated by previous spatial convolution layers or previous temporal convolution layers.
In some implementations, the temporal convolution layers are part of a temporal convolution network (or sub-network) within the specialized architecture. The temporal convolution network receives its input from the spatial convolution network. In one implementation, a first temporal convolution layer of the temporal convolution network combines the spatial convolution representations between sequencing cycles on a group-by-group basis. In another implementation, subsequent temporal convolution layers of the temporal convolution network combine successive outputs of previous temporal convolution layers. In one example, compression logic (or a compression network, compression sub-network, compression layer, or squeezing layer) processes the output of the temporal and/or spatial convolution network and generates a compressed representation of that output. In one implementation, the compression network includes a compression convolution layer that reduces the depth dimension of the feature maps generated by the network.
The output of the final temporal convolution layer (e.g., with or without compression) is fed to an output layer that produces an output. The output is used for base detection of one or more clusters at one or more sequencing cycles.
During forward propagation, the specialized architecture processes information from multiple inputs in two stages. In the first stage, isolation convolutions are used to prevent mixing of information between the inputs. In the second stage, combined convolutions are used to mix information between the inputs. The results from the second stage are used to make a single inference for the plurality of inputs.
This is different from batch-mode techniques, in which a convolution layer processes multiple inputs in a batch simultaneously and makes a corresponding inference for each input in the batch. In contrast, the specialized architecture maps the multiple inputs to a single inference. The single inference may include more than one prediction, such as a classification score for each of the four bases (A, C, T, and G).
In one implementation, the inputs have a time sequence such that each input is generated at a different time step and has multiple input channels. For example, the plurality of inputs may include the following three inputs: a current input generated by a current sequencing cycle at time step (t), a previous input generated by a previous sequencing cycle at time step (t-1), and a subsequent input generated by a subsequent sequencing cycle at time step (t+1). In another implementation, each input is derived from a current output, a previous output, and a subsequent output, respectively, produced by one or more previous convolutional layers, and includes k feature maps.
In one implementation, each input may include the following five input channels: a red image channel (red), a red distance channel (yellow), a green image channel (green), a green distance channel (violet), and a scaling channel (blue). In another implementation, each input may use blue and violet color channels (or one or more other suitable color channels) instead of, or in addition to, the red, green, violet, and/or yellow channels. In another implementation, each input may include k feature maps generated by previous convolution layers, with each feature map treated as an input channel. In yet another example, each input may have only one channel, two channels, or another different number of channels. The incorporated materials of U.S. patent application publication No. 2013/007932 discuss base detection using various numbers of channels (such as one channel, two channels, or four channels).
FIG. 12 depicts one implementation of isolation layers, each of which may include convolutions. Isolation convolutions process the plurality of inputs by applying a convolution filter to each input at once. With isolation convolutions, the convolution filters combine input channels within the same input and do not combine input channels across different inputs. In one implementation, the same convolution filter is applied to each input simultaneously. In another implementation, a different convolution filter is applied to each input simultaneously. In some implementations, each spatial convolution layer includes a bank of k convolution filters, where each convolution filter is applied to each input simultaneously.
FIG. 13A depicts one implementation of combined layers, each of which may include a convolution. Fig. 13B depicts another implementation of combined layers, each of which may include a convolution. The combined convolution mixes information between different inputs by grouping corresponding input channels of the different inputs and applying a convolution filter to each group. The grouping of these corresponding input channels and the application of convolution filters occurs on a sliding window basis. In this context, a window spans two or more subsequent input channels, which represent, for example, the output of two subsequent sequencing cycles. Since the window is a sliding window, most input channels are used in two or more windows.
In some implementations, the different inputs originate from output sequences generated by previous spatial convolution layers or previous temporal convolution layers. In this output sequence, these different inputs are arranged as subsequent outputs and are therefore treated as subsequent inputs by the subsequent temporal convolution layer. Then, in the subsequent temporal convolution layer, the combined convolutions apply convolution filters to corresponding groups of input channels in the subsequent inputs.
In one implementation, these subsequent inputs have a temporal order such that the current input is generated by a current sequencing cycle at time step (t), the previous input is generated by a prior sequencing cycle at time step (t-1), and the subsequent input is generated by a subsequent sequencing cycle at time step (t+1). In another implementation, each subsequent input is derived from a current output, a previous output, and a subsequent output, respectively, produced by one or more previous convolutional layers, and includes k feature maps.
In one implementation, each input may include the following five input channels: a red image channel (red), a red distance channel (yellow), a green image channel (green), a green distance channel (violet), and a scaling channel (blue). In another implementation, the additional input channel may be a violet channel. In another implementation, each input may include k feature maps generated by a previous convolution layer, with each feature map treated as an input channel.
The depth B of the convolution filter depends on the number of subsequent inputs whose corresponding input channels are convolved by the convolution filter on a sliding-window basis. In other words, the depth B is equal to the number of subsequent inputs in each sliding window, i.e., the group size.
In fig. 13A, the corresponding input channels from two subsequent inputs are combined in each sliding window, and thus b=2. In fig. 13B, the corresponding input channels from three subsequent inputs are combined in each sliding window, and thus b=3.
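The window arithmetic can be illustrated with a toy combined convolution; the shapes, filter values, and the einsum-based mixing below are illustrative assumptions rather than the actual kernel implementation:

```python
import numpy as np

def combined_conv(subsequent_inputs, filt):
    """Slide a depth-B filter across the sequence of per-cycle inputs,
    mixing the corresponding input channels within each window."""
    B = filt.shape[0]
    n = subsequent_inputs.shape[0]
    outs = []
    for start in range(n - B + 1):                          # sliding window
        group = subsequent_inputs[start:start + B]          # (B, channels, H, W)
        outs.append(np.einsum('bc,bchw->hw', filt, group))  # mix across the group
    return np.stack(outs)

rng = np.random.default_rng(2)
x = rng.random((5, 3, 6, 6))       # five per-cycle inputs, 3 channels each
f2 = rng.standard_normal((2, 3))   # depth B = 2, as in fig. 13A
f3 = rng.standard_normal((3, 3))   # depth B = 3, as in fig. 13B
print(combined_conv(x, f2).shape)  # (4, 6, 6): five inputs -> four windows
print(combined_conv(x, f3).shape)  # (3, 6, 6): five inputs -> three windows
```

With B = 3 the five-to-three reduction matches the cycle-channel reduction described for the first temporal layer; because the window slides by one, most inputs participate in two or more windows.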
In one implementation, the sliding windows share the same convolution filter. In another implementation, a different convolution filter is used for each sliding window. In some implementations, each temporal convolution layer includes a set of k convolution filters, where each convolution filter is applied to a subsequent input on a sliding window basis.
Further details of figs. 4-10 and variations thereof can be found in co-pending U.S. non-provisional patent application No. 17/176,147 (attorney docket No. ilm 1020-2/IP-1866-US), entitled "Hardware Execution and Acceleration of Artificial Intelligence-Based Base Caller," filed in 2021, which patent application is incorporated herein by reference as if fully set forth herein.
Base detection using multiple base detectors
FIG. 14 shows a base detection system 1400 that includes a plurality of base detectors to predict base detection of an unknown analyte that includes a base sequence.
Note that fig. 6A, discussed previously, illustrates only some of the components of the system 1400 of fig. 14, while fig. 14 illustrates various other components not shown in fig. 6A.
As discussed with respect to fig. 6A, the system 1400 of fig. 14 includes a sequencing machine 1404, such as the sequencing machine discussed with respect to fig. 1. The sequencing machine 1404 includes a flow cell 1405, such as the flow cell discussed with respect to figs. 1-3. The flow cell 1405 includes a plurality of tiles 1406, and each tile 1406 includes a plurality of clusters 1407 (exemplary clusters of an individual tile are shown in fig. 6A), e.g., as discussed with respect to figs. 2 and 3. As discussed with respect to figs. 4-6, sensor data 1412 including the raw images from the tiles 1406 is output by the sequencing machine 1404.
In one embodiment, the system 1400 includes two or more base detectors, such as a first base detector 1414 and a second base detector 1416. Although two base detectors are shown, in one example, more than two base detectors, such as three, four, or a higher number of base detectors, may be present in the system 1400.
In one example, base detectors 1414 and 1416 are local to sequencing machine 1404. Thus, base detectors 1414 and 1416 and sequencing machine 1404 are positioned adjacently (e.g., within the same housing, or within two adjacently positioned housings), and base detectors 1414 and 1416 receive sensor data 1412 directly from sequencing machine 1404.
In another example, the base detectors 1414 and 1416 are located remotely relative to the sequencing machine 1404, an example of which is a so-called cloud-based base detector. In that case, the base detectors 1414 and 1416 receive the sensor data 1412 from the sequencing machine 1404 via a computer network (such as the Internet).
Each of the base detectors 1414 and 1416 of FIG. 14 outputs corresponding base detection classification information. For example, the first base detector 1414 outputs first base detection classification information 1434, and the second base detector 1416 outputs second base detection classification information 1436. The base detection combination module 1428 generates a final base detection 1440 based on one or both of the first base detection classification information 1434 and the second base detection classification information 1436.
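As an illustrative sketch of how a combination module might form a final call from two sets of classification information, assume each detector emits per-base scores; the averaging rule, weighting, and base ordering below are assumptions for illustration, not the patent's actual combination rule:

```python
import numpy as np

BASES = "ACGT"  # assumed score ordering

def final_call(info_first, info_second, weight_first=0.5):
    """Combine first and second base detection classification information
    (e.g., 1434 and 1436) into a final call by weighted averaging of the
    per-base scores and taking the argmax. Illustrative rule only."""
    combined = weight_first * np.asarray(info_first) \
             + (1 - weight_first) * np.asarray(info_second)
    return BASES[int(np.argmax(combined))], combined

call, scores = final_call([0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1])
print(call)  # 'A'
```

Setting `weight_first` to 1.0 or 0.0 reproduces the cases described below, where the final base detection relies solely on one detector's classification information.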
In one example, the first base detector 1414 is a neural network based base detector. For example, the first base detector 1414 is a nonlinear system employing one or more neural network models for base detection, as previously discussed herein (e.g., see fig. 6-13B).
In one example, the second base detector 1416 is a non-neural network based base detector. For example, the second base detector 1416 is at least partially a linear system for base detection. For example, the second base detector 1416 does not employ a neural network for base detection (or uses a smaller neural network model for base detection than the larger neural network model used by the first base detector 1414), as previously discussed herein (e.g., see fig. 6 and subsequent discussion).
In one embodiment, the system 1400 includes a context information generation module 1418. The context information generation module 1418 generates context information 1420. In one embodiment, the base detection combination module 1428 operates based on the context information 1420. For example, based on the context information 1420, the base detection combination module 1428 uses one or both of the base detection classification information 1434 and the base detection classification information 1436 to generate a final base detection. The context information will be discussed later herein, for example with respect to fig. 16.
In one embodiment, the system 1400 further includes a switching module 1422. Note that in fig. 14, the switching module 1422, the context information generation module 1418, and the base detection combination module 1428 are shown as three separate components of the system 1400. However, in one example, one or more of these modules may be combined to form a combined module.
In one embodiment, the switching module 1422 selectively turns the base detectors 1414 and 1416 on or off. For example, depending on the context information 1420, if only one of the base detectors 1414 and 1416 is to analyze a particular set of sensor data 1412, then for that set of sensor data only the selected base detector is enabled and the other base detector is disabled, as will be discussed in further detail later herein.
Enabling or switching on a base detector for a set of sensor data means that the base detector will operate on, or execute for, that particular set of sensor data. Thus, enabling or switching on a base detector does not necessarily mean powering the base detector on; it simply means that the base detector is executed on the particular corresponding set of sensor data. Disabling or switching off a base detector for a set of sensor data means that the base detector will refrain from operating on, or executing for, that particular set of sensor data. Note that, for example, when a base detector is disabled for a first set of sensor data, the base detector may still be enabled for a second set of sensor data. In one example, the first base detector 1414 can be selectively enabled or disabled using an enable signal 1424, and the second base detector 1416 can be selectively enabled or disabled using an enable signal 1426. Thus, the enable signals 1424 and 1426 selectively enable (or disable) the corresponding base detectors 1414 and 1416, respectively.
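The per-set semantics of the enable signals can be sketched as follows; the function name and the stand-in detectors are illustrative assumptions, not components of the actual system:

```python
def run_detectors(sensor_set, enable_first, enable_second,
                  first_detector, second_detector):
    """Execute only the enabled detectors on this set of sensor data.
    A detector disabled here simply does not run for this set; it may
    still be enabled when a different set is processed."""
    results = {}
    if enable_first:
        results["first"] = first_detector(sensor_set)
    if enable_second:
        results["second"] = second_detector(sensor_set)
    return results

# Toy stand-ins for base detectors 1414 (neural network based) and 1416.
first = lambda s: f"NN classification for {s}"
second = lambda s: f"non-NN classification for {s}"

print(run_detectors("set-1", True, False, first, second))  # only the first runs
print(run_detectors("set-3", True, True, first, second))   # both run
```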
As discussed herein, "set of sensor data" refers to a section of sensor data 1412 or a dataset of sensor data 1412. For example, the set of sensor data may be sensor data from one or more specific clusters 1407 or one or more specific blocks 1406 of the flow cell 1405. The set of sensor data may be sensor data from one or more specific base sensing cycles. Thus, the collection of sensor data can be associated with a particular spatial aspect of the flow cell 1405 (e.g., one or more particular clusters 1407 from the flow cell 1405) and/or a particular temporal aspect of the base detection cycle (e.g., from one or more particular base detection cycles).
For a first set of sensor data 1412, base detection combining module 1428 may rely solely on base detection classification information 1434 from first base detector 1414 to generate a final base detection 1440 of the first set of sensor data 1412, as just one example. The base detection combination module 1428 decides to rely on the base detection classification information 1434 alone (and not on the base detection classification information 1436), for example, based on the context information 1420 associated with the first set of sensor data 1412. In one example, when processing a first set of sensor data, the switching module 1422 uses only the enable signal 1424 to enable the first base detector 1414 (e.g., the first base detector 1414 performs on the first set of data) and uses the enable signal 1426 to disable the second base detector 1416 (e.g., the second base detector 1416 does not perform on the first set of data), and the first base detection classification information 1434 from the first base detector 1414 is used to generate the final base detection 1440. However, in another example, although the first base detection classification information 1434 from the first base detector 1414 is used to generate the final base detection 1440 of the first set of data, the switching module 1422 enables the first base detector 1414 and optionally also enables the second base detector 1416, e.g., for reasons discussed later herein. In such an example, both base detection classification information 1434 and 1436 can be used for the first set of data, and the final base detection 1440 is based only on the first base detection classification information 1434.
As another example only, for a second set of sensor data 1412, the base detection combination module 1428 may rely solely on the base detection classification information 1436 from the second base detector 1416 to generate the final base detection 1440 for the second set of sensor data 1412. The base detection combination module 1428 decides to rely on the base detection classification information 1436 alone (and not on the base detection classification information 1434), for example, based on the context information 1420 associated with the second set of sensor data 1412. In one example, when processing the second set of sensor data, the switching module 1422 uses the enable signal 1426 to enable only the second base detector 1416 and uses the enable signal 1424 to disable the first base detector 1414, and the second base detection classification information 1436 from the second base detector 1416 is used to generate the final base detection 1440. However, in another example, although the second base detection classification information 1436 from the second base detector 1416 is used to generate the final base detection 1440 for the second set of data, the switching module 1422 enables the second base detector 1416 and optionally also enables the first base detector 1414, e.g., for reasons discussed later herein. In this example, both base detection classification information 1434 and 1436 are available, and the final base detection 1440 is based only on the base detection classification information 1436.
As yet another example only, for the third set of sensor data 1412, base detection combining module 1428 can rely on both base detection classification information 1434 and 1436 from base detectors 1414 and 1416, respectively, to generate a final base detection 1440 of the third set of sensor data 1412. The base detection combination module 1428 decides to rely on both base detection classification information 1434 and 1436, e.g., based on the context information 1420 associated with the third set of sensor data 1412. Thus, when processing the third set of sensor data, the switching module 1422 uses the enable signals 1424 and 1426 to enable both base detectors 1414 and 1416, respectively.
Thus, for a given set of sensor data, base detection combination module 1428 decides to rely on a particular one or both of base detection classification information 1434 and 1436 based on context information 1420 associated with the corresponding set of sensor data. Similarly, the switching module 1422 decides to enable a particular one or both of the base detectors 1414 and 1416 based on the context information 1420 associated with the corresponding set of sensor data.
Exemplary operation of the first base detector 1414 and the second base detector 1416
Fig. 15A, 15B, 15C, 15D, and 15E illustrate corresponding flowcharts depicting various operations of the base detection system 1400 of fig. 14 for corresponding sets of sensor data. For example, fig. 15A-15E illustrate various arrangements and combinations in which the system 1400 is operable.
The first base detector 1414 is enabled, and the final base detection 1440 is based on the first base detection classification information 1434
Fig. 15A illustrates operation of the system 1400 in which the first base detector 1414 is enabled and generates base detection classification information for the set of sensor data 1501a (e.g., while the second base detector 1416 does not operate on the set of sensor data 1501a), and the final base detection 1440 is based on the first base detection classification information 1434 for the set of sensor data 1501a.
Thus, in fig. 15A, the operation of system 1400 is shown for a set 1501a of sensor data generated by flow cell 1405. At 1505a, flow cell 1405 generates a set 1501a of sensor data. As discussed, the set 1501a of sensor data may be generated in a particular location of the flow cell, such as by a particular cluster of particular blocks or by a particular block, and generated for a particular base sequence cycle (i.e., the set is associated with a particular spatial location and a particular temporal base sequence cycle of the flow cell 1405). Also at 1505a, context information associated with the set 1501a of sensor data is accessed (e.g., by the switching module 1422 and/or the base detection combining module 1428). As discussed, the context information may be generated by the context information generation module 1418.
In the example of fig. 15A, the switching module 1422 decides that the first base detector 1414 (but not the second base detector 1416) will process the set 1501a of sensor data. Thus, at 1510a, the switching module 1422 enables the first base detector 1414, for example, by turning on an enable signal 1424. The second base detector 1416 may remain inactive, i.e., the second base detector 1416 does not operate on the set 1501a of sensor data.
At 1515a, the first base detector 1414 generates the first base detection classification information 1434 for the set 1501a of sensor data, while the second base detector 1416 refrains from generating any second base detection classification information 1436 for the set 1501a of sensor data.
At 1520a, the base detection combination module 1428 uses the first base detection classification information 1434 to generate a final base detection for the set 1501a of sensor data based on the context information 1420 associated with the set 1501a of sensor data.
The second base detector 1416 is enabled and the final base detection 1440 is based on the second base detection classification information 1436
Fig. 15B illustrates operation of the system 1400 in which the second base detector 1416 is enabled and generates base detection classification information for the set of sensor data 1501b (e.g., while the first base detector 1414 does not operate on the set of sensor data 1501b), and the final base detection 1440 is based on the second base detection classification information 1436 for the set of sensor data 1501b.
At 1505b, the flow cell 1405 generates a set 1501b of sensor data. As discussed, the set 1501b of sensor data may be generated in a particular location of the flow cell, such as by a particular cluster of a particular block or by a particular block, and generated for a particular base sequence cycle (i.e., the set is associated with a particular spatial location and a particular temporal base sequence cycle of the flow cell 1405). Also at 1505b, context information associated with the set 1501b of sensor data is accessed (e.g., by the switching module 1422 and/or the base detection combining module 1428). As discussed, the context information may be generated by the context information generation module 1418.
In the example of fig. 15B, the switching module 1422 determines that the second base detector 1416 (instead of the first base detector 1414) will process the set 1501b of sensor data. Thus, at 1510b, the switching module 1422 enables the second base detector 1416, for example, by using the enable signal 1426. The first base detector 1414 may remain inactive, i.e., the first base detector 1414 does not operate on the set 1501b of sensor data.
At 1515b, the second base detector 1416 generates the second base detection classification information 1436 for the set 1501b of sensor data, while the first base detector 1414 refrains from generating any first base detection classification information 1434 for the set 1501b of sensor data.
At 1520b, the base detection combination module 1428 uses the second base detection classification information 1436 to generate the final base detection for the set 1501b of sensor data, based on the context information 1420 associated with the set 1501b of sensor data.
The first base detector 1414 and the second base detector 1416 are enabled, and the final base detection 1440 is based on one or both of (i) the first base detection classification information 1434 and/or (ii) the second base detection classification information 1436
FIG. 15C illustrates the operation of the system 1400 in which both the first base detector 1414 and the second base detector 1416 are enabled (i.e., both base detectors operate on a corresponding set 1501c of sensor data) and generate corresponding base detection classification information for the set 1501c of sensor data, and the final base detection 1440 is based on one or both of (i) the first base detection classification information 1434 and/or (ii) the second base detection classification information 1436.
At 1505c, flow cell 1405 generates a set 1501c of sensor data. As discussed, the set 1501c of sensor data may be generated in a particular location of the flow cell, such as by a particular cluster of particular blocks or by a particular block, and generated for a particular base sequence cycle (i.e., the set is associated with a particular spatial location and a particular temporal base sequence cycle of the flow cell 1405 in fig. 14). Also at 1505c, context information associated with the set 1501c of sensor data is accessed (e.g., by the switching module 1422 and/or the base detection combining module 1428). As will be discussed in further detail herein, the context information may be generated by the context information generation module 1418.
In the example of fig. 15C, the switching module 1422 determines that both the first base detector 1414 and the second base detector 1416 will process the set 1501c of sensor data. Thus, at 1510c, the switching module 1422 (fig. 14) enables both the first base detector 1414 and the second base detector 1416, for example, using enable signals 1424 and 1426 (fig. 14). In one example, both the first base detector 1414 and the second base detector 1416 process the entire set 1501c of sensor data. In another example, the first base detector 1414 processes a first subset of the set 1501c of sensor data and the second base detector 1416 processes a second subset of the set 1501c of sensor data.
At 1515c, the first base detector 1414 generates first base detection classification information 1434 for the set 1501c of sensor data, and the second base detector 1416 generates second base detection classification information 1436 for the set 1501c of sensor data.
At 1520c, the base detection combining module 1428 uses the first base detection classification information 1434 and/or the second base detection classification information 1436 to generate a final base detection for the set 1501c of sensor data based on the context information 1420 associated with the set 1501c of sensor data.
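One plausible way a combining module could merge the two detectors' outputs is a weighted average of per-base probabilities. The sketch below assumes each detector emits a probability per base; the representation, function name, and 0.5 weighting are illustrative assumptions, not the claimed implementation.

```python
def combine_classifications(first_probs, second_probs, weight=0.5):
    """Weighted average of two detectors' per-base probabilities; returns
    the base with the highest combined score. The 0.5 weight is arbitrary."""
    combined = {base: weight * first_probs[base]
                      + (1 - weight) * second_probs[base]
                for base in "ACGT"}
    return max(combined, key=combined.get)

final_base = combine_classifications(
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.60, "C": 0.20, "G": 0.10, "T": 0.10},
)
```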
In the case where the final base detection cannot be generated using only the first base detection classification information 1434, the second base detector 1416 is enabled and the second base detection classification information 1436 is used
FIG. 15D illustrates the operation of the system 1400 in which the second base detection classification information 1436 is used for the final base detection 1440 in the event that the final base detection cannot be generated using only the first base detection classification information 1434.
At 1505d, flow cell 1405 generates a set 1501d of sensor data. As discussed, the set 1501d of sensor data may be generated in a particular location of the flow cell, such as by a particular cluster of particular blocks or by a particular block, and generated for a particular base sequence cycle (i.e., the set is associated with a particular spatial location and a particular temporal base sequence cycle of the flow cell 1405). Also at 1505d, context information associated with the set 1501d of sensor data is accessed (e.g., by the switching module 1422 and/or the base detection combining module 1428). As discussed, the context information may be generated by the context information generation module 1418.
In the example of fig. 15D, the switching module 1422 determines that the first base detector 1414 is to process the set 1501d of sensor data. Optionally, the switching module 1422 may also determine that the second base detector 1416 may also process the set 1501d of sensor data. Thus, at 1510d, the first base detector 1414 is enabled, and optionally the second base detector 1416 is also enabled.
At 1515d, the first base detector 1414 generates the first base detection classification information 1434 for the set 1501d of sensor data. In an optional operation at 1510d in which the second base detector 1416 is enabled, the second base detector 1416 optionally generates the second base detection classification information 1436 for the set 1501d of sensor data.
At 1520d, it is determined (e.g., by the switching module 1422 and/or the base detection combining module 1428) whether a final base detection can be generated from the first base detection classification information 1434 (e.g., without using the second base detection classification information 1436). For example, it may be determined that, if the final base detection 1440 were based only on the first base detection classification information 1434, the probability of error in the final base detection 1440 would be relatively high. Many examples of such determinations are discussed later herein. As just one example, if the first base detection classification information 1434 indicates a homopolymer (e.g., GGGGG) or near-homopolymer (e.g., GGTGG) sequence, the first base detection classification information 1434 may be insufficient for generating a final base detection (e.g., the second base detection classification information 1436 must be relied upon to generate the final base detection), e.g., as discussed later herein with respect to figs. 19B and 19C.
If yes at 1520d (i.e., a final base detection can be generated from the first base detection classification information 1434 without using the second base detection classification information 1436), the method 1500d proceeds to 1525d, where the first base detection classification information 1434 is used to generate a final base detection for the set 1501d of sensor data.
If "no" at 1520d (i.e., the final base detection cannot be generated from only the first base detection classification information 1434 without using the second base detection classification information 1436, for example), the method 1500d proceeds to 1530d, where the second base detector 1416 is enabled, and then proceeds to 1535d, where the second base detection classification information 1436 is generated using the second base detector 1416. Note that the operations at blocks 1530d and 1535d are optional and are therefore shown using dashed lines. For example, if the second base detector 1416 is optionally enabled at 1510d, operation 1530d may be skipped. Similarly, if a second base detector 1416 is optionally used at 1515d to generate second base detection classification information 1436, operation 1535d may be skipped.
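The branch at 1520d through 1540d can be summarized in a few lines. In this sketch, `is_sufficient`, `run_second_detector`, and `combine` are hypothetical placeholders for the sufficiency check, the second base detector, and the combining module described above.

```python
def final_call(first_info, is_sufficient, run_second_detector, combine):
    """Sketch of the fig. 15D flow: use the first detector's output alone
    when sufficient; otherwise fall back to the second detector."""
    if is_sufficient(first_info):                 # "yes" branch -> 1525d
        return first_info["call"]
    second_info = run_second_detector()           # "no" branch -> 1530d/1535d
    return combine(first_info, second_info)      # 1540d: combined final call
```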
Assume a scenario in which the second base detector 1416 is not enabled at 1510d and the second base detector 1416 is enabled at 1530d. Thus, at 1530d, the second base detector 1416 begins processing the set 1501d of sensor data. It can be noted that for a given base detection cycle, the second base detector 1416 cannot immediately begin processing the corresponding sensor data and generate a base detection. This is because, due to phasing discussed later herein (e.g., see fig. 17C, 17D), the second base detector 1416 must process the sensor data of one or more previous base detection cycles to satisfactorily detect the base of the current cycle. For example, it is assumed that base detection cycles 1 to 1000 are to be performed, and the set 1501d of sensor data includes images from base detection cycle 100 onward. It is also assumed that at 1530d, the second base detector 1416 is enabled to process the sensor data of base detection cycle 100 and one or more subsequent base detection cycles. As discussed, the second base detector 1416 must process sensor data from one or more previous cycles to satisfactorily detect bases of cycle 100 and subsequent cycles. Processing the sensor data from several previous cycles enables the second base detector 1416 to estimate the effect of phasing at cycle 100, which improves the quality of base detection at cycle 100. As just one example, five, ten, twenty, or another suitable number of previous cycles are to be processed by the second base detector 1416 so that the second base detector 1416 satisfactorily detects the base of cycle 100.
In the first example, assume that the second base detector 1416 must process the sensor data from the N1 previous cycles to satisfactorily detect the bases of cycle 100. In a second example, assume that the second base detector 1416 must process the sensor data from the N2 previous cycles to satisfactorily detect the bases of cycle 1000. Now, as will be discussed with respect to fig. 17C, 17D, the effects of phasing and pre-phasing are more pronounced as the base detection cycle progresses. Thus, phasing is more pronounced in cycle 1000 than intended in cycle 100. Therefore, in order to satisfactorily detect the bases of cycle 1000, the second base detector 1416 must process a higher number of previous cycles than the number of previous cycles to be processed to satisfactorily detect the bases of cycle 100. Thus, N2 is higher than N1.
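The point that N2 > N1 (later cycles need a longer history of previous cycles because phasing accumulates over the run) can be illustrated with an assumed linear scaling; the constants below are arbitrary example values, not taken from the source.

```python
def previous_cycles_needed(cycle, base_window=5, extra_per_100_cycles=1):
    """Number of earlier cycles to reprocess before calling `cycle`.

    The linear growth models the fact that phasing effects become more
    pronounced as the run progresses; both constants are illustrative.
    """
    return base_window + extra_per_100_cycles * (cycle // 100)

n1 = previous_cycles_needed(100)    # shorter history early in the run
n2 = previous_cycles_needed(1000)   # longer history late in the run (N2 > N1)
```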
Referring again to fig. 15D, after 1535d, at 1540d, the final base detection of the set 1501d of sensor data is generated using one or both of (i) the first base detection classification information 1434 and/or (ii) the second base detection classification information 1436.
In the case where the final base detection cannot be generated using only the second base detection classification information 1436, the first base detector 1414 is enabled and the first base detection classification information 1434 is used
FIG. 15E illustrates the operation of the system 1400 in which the first base detection classification information 1434 is used for final base detection 1440 in the event that final base detection cannot be generated using only the second base detection classification information 1436.
At 1505e, flow cell 1405 generates a set 1501e of sensor data. As discussed, the set 1501e of sensor data may be generated in a particular location of the flow cell, such as by a particular cluster of particular blocks or by a particular block, and generated for a particular base sequence cycle (i.e., the set is associated with a particular spatial location and a particular temporal base sequence cycle of the flow cell 1405). Also at 1505e, context information associated with the set 1501e of sensor data is accessed (e.g., by the switching module 1422 and/or the base detection combining module 1428). As discussed, the context information may be generated by the context information generation module 1418.
In the example of fig. 15E, the switching module 1422 decides that the second base detector 1416 is to process the set 1501E of sensor data, e.g., based on the associated context information. Optionally, the switching module 1422 may also determine that the first base detector 1414 may also process the set 1501e of sensor data. Thus, at 1510e, the second base detector 1416 is enabled, and optionally the first base detector 1414 is also enabled.
At 1515e, the second base detector 1416 generates the second base detection classification information 1436 for the set 1501e of sensor data. In the optional case in which the first base detector 1414 is also enabled, the first base detector 1414 generates the first base detection classification information 1434 for the set 1501e of sensor data.
At 1520e, it is determined (e.g., by the switching module 1422 and/or the base detection combining module 1428) whether a final base detection can be generated based solely on the second base detection classification information 1436 (e.g., without using the first base detection classification information 1434). For example, it may be determined (e.g., based on context information) that the probability of error in the final base detection 1440 may be relatively high if, for example, the final base detection 1440 is based only on the second base detection classification information 1436. Many examples of such determinations are discussed later herein. As just one example, if the context information indicates that a bubble is detected in a cluster, the final base detection cannot be generated from the second base detection classification information 1436 alone (e.g., without using the first base detection classification information 1434), e.g., as discussed later herein with respect to fig. 19D.
If "yes" at 1520e (i.e., a final base detection can be generated from the second base detection classification information 1436 without using the first base detection classification information 1434), the method 1500e proceeds to 1525e, where the second base detection classification information 1436 is used to generate a final base detection for the set 1501e of sensor data.
If "no" at 1520e (i.e., the final base detection cannot be generated from the second base detection classification information 1436 without using the first base detection classification information 1434), the method 1500e proceeds to 1530e, where the first base detector 1414 is enabled, and then proceeds to 1535e, where the first base detection classification information 1434 is generated using the first base detector 1414. Note that the operations at blocks 1530e and 1535e are optional and are therefore illustrated using dashed lines. For example, if the first base detector 1414 is optionally enabled at 1510e, operation 1530e may be skipped. Similarly, if first base detector 1414 is optionally used at 1515e to generate first base detection class information 1434, operation 1535e may be skipped.
Assume a scenario in which the first base detector 1414 is not enabled at 1510e and the first base detector 1414 is enabled at 1530e. Thus, at 1530e, the first base detector 1414 begins processing the set 1501e of sensor data. It can be noted that for a given base detection cycle, the first base detector 1414 cannot immediately begin processing the corresponding sensor data and generate a base detection. For example, assume that the first base detector 1414 is to operate on a corresponding set of data from base detection cycle Na. To satisfactorily generate a base detection for cycle Na, the first base detector 1414 must also operate on sensor data from at least several cycles prior to cycle Na, for example, because the base detection of the current cycle is also based on data from one or more past cycles and one or more future cycles, as discussed with respect to figs. 7 and 10. Thus, to generate the first base detection classification information 1434 for cycle Na, the first base detector 1414 must also process sensor data from several previous cycles (such as 2 cycles in the example of fig. 7 and 5 cycles in the example of fig. 10).
Subsequently, at 1540e, one or both of (i) the first base detection classification information 1434 and/or (ii) the second base detection classification information 1436 are used to generate a final base detection for the set 1501e of sensor data.
Contextual information
Fig. 16 shows the context information generation module 1418 of the base detection system 1400 of fig. 14 generating context information 1420 for an exemplary set 1601 of sensor data. For example, the context information generation module 1418 receives information about the set 1601 of sensor data and generates various types of context information for the set 1601 of sensor data, which are collectively referred to as the context information for the set 1601 of sensor data. For example, the context information generation module 1418 generates spatial context information 1604, temporal context information 1606, base sequence context information 1608, and other context information 1610 for the set 1601 of sensor data.
Spatial context information 1604
As the name suggests, spatial context information 1604 refers to context information associated with the spatial locations of the blocks and clusters from which the set 1601 of sensor data was generated. Fig. 17A and 17B below discuss an example of spatial context information 1604.
Fig. 17A shows a flow cell 1405 of the system 1400 of fig. 14, wherein the flow cell 1405 includes blocks 1406 that are categorized based on the spatial locations of the blocks. For example, as discussed with respect to fig. 2, the flow cell 1405 of fig. 17A includes a plurality of channels 1702 with a corresponding plurality of blocks 1406 within each channel. Fig. 17A shows a top view of flow cell 1405.
Each block is categorized based on the location of the block. For example, blocks adjacent to any edge of the flow cell 1405 are labeled as edge blocks 1406a (shown using grey boxes), and the remaining blocks are labeled as non-edge blocks 1406b (shown using dashed boxes).
For example, the blocks on a vertical edge (e.g., along the Y-axis) and/or a horizontal edge (e.g., along the X-axis) of the flow cell 1405 are categorized as edge blocks 1406a, as shown in fig. 17A. Thus, edge blocks 1406a are adjacent (e.g., directly adjacent) to corresponding edges of the flow cell 1405, and non-edge blocks are not adjacent to any edge of the flow cell 1405.
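Assuming blocks are indexed on a rows x cols grid within a lane (an assumption for illustration; the patent does not specify an indexing scheme), the edge/non-edge block classification could be sketched as:

```python
def classify_block(row, col, n_rows, n_cols):
    """Label a block as 'edge' (1406a) if it touches any edge of the
    rows x cols grid, else 'non-edge' (1406b)."""
    if row in (0, n_rows - 1) or col in (0, n_cols - 1):
        return "edge"
    return "non-edge"
```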
A base detection cycle is performed for clusters in each block of the flow cell 1405. In one example, parameters related to the base detection operation of a block may be based on the relative position of the block. For example, the excitation light 101 discussed with respect to fig. 1 is directed to the blocks of the flow cell, and different blocks may receive different amounts of excitation light 101, e.g., based on the location of each block and/or the location of one or more light sources emitting the excitation light 101. For example, if the light source emitting the excitation light 101 is located vertically above the flow cell, the non-edge blocks 1406b may receive a different amount of light than the edge blocks 1406a. In another example, ambient or external light (e.g., ambient light from outside the biosensor) around the flow cell 1405 may affect the amount and/or characteristics of the excitation light 101 received by the various blocks of the flow cell 1405. As just one example, the edge blocks 1406a may receive the excitation light 101 and an amount of ambient light from outside the flow cell 1405, while the non-edge blocks 1406b may primarily receive the excitation light 101. In yet another example, individual sensors (or pixels or photodiodes) included in the flow cell 1405 (e.g., sensors 106, 108, 110, 112, and 114 shown in fig. 1) may sense light based on the locations of the corresponding sensors, which are in turn based on the locations of the corresponding blocks. For example, the sensing operation performed by one or more sensors associated with an edge block 1406a may be affected relatively more by ambient light (in addition to the excitation light 101) than ambient light affects the sensing operation of one or more other sensors associated with a non-edge block 1406b.
In yet another example, the flow of reactants (e.g., which include any substances that can be used to obtain a desired reaction during base detection, such as reagents, enzymes, samples, other biomolecules, and buffer solutions) to various tiles may also be affected by the tile location. For example, a block near the source of reactant may receive a greater amount of reactant than a block farther from the source.
In one example, the spatial context information 1604 (see fig. 16) associated with the set 1601 of sensor data includes information regarding whether the set 1601 of sensor data is generated in the edge zone 1406a or in the non-edge zone 1406 b. As discussed above, the parameters associated with base detection may be slightly different for different classes of blocks. Thus, in one implementation, spatial context information 1604 that indicates whether the set 1601 of sensor data is generated from an edge block or a non-edge block may affect the selection of a base detector for processing the set 1601 of sensor data. As just one example and based on implementation details, the first base detector 1414 may be more suitable for processing sensor data from one of an edge or non-edge block, while the second base detector 1416 may be more suitable for processing sensor data from the other of an edge or non-edge block.
Fig. 17B shows a block 1406 of the flow cell 1405 of the system 1400 of fig. 14, wherein the block 1406 includes clusters 1407 that are categorized based on the spatial locations of the clusters.
In one example, based on sensor data (which may be, for example, image data) received from a tile, the locations of various clusters within the tile may be estimated. For example, the (x, y) coordinates of the clusters may be used to identify the location of each cluster. Thus, each cluster 1407 has a corresponding (x, y) coordinate that identifies the position of the cluster relative to the block. In fig. 17B, the clusters 1407 of the exemplary block 1406 are categorized as edge clusters 1407a or non-edge clusters 1407b. For example, clusters 1407 within a threshold distance L1 from the edge of the block are marked as edge clusters 1407a, and clusters 1407 outside the threshold distance L1 from the edge of the block are marked as non-edge clusters 1407b. Thus, the edge clusters 1407a are located near the perimeter of the block 1406, while the non-edge clusters 1407b are located near the center section of the block 1406. As discussed, using the (x, y) coordinates of a cluster, the distance of the cluster relative to the edge of the block may be determined (e.g., by the context information generation module 1418), based on which the context information generation module 1418 classifies the cluster as an edge cluster 1407a or a non-edge cluster 1407b. As a simple example, an imaginary dashed rectangle is shown in fig. 17B that is within the perimeter of the block 1406 and is a distance L1 from the perimeter of the block 1406. Clusters within the dashed rectangle are classified as non-edge clusters 1407b, while clusters between the perimeter of the dashed rectangle and the perimeter of the block 1406 are classified as edge clusters 1407a.
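The threshold-based cluster classification described above could be sketched as follows, assuming each cluster's (x, y) coordinates are given relative to a block spanning [0, width] x [0, height] (the coordinate convention is an assumption for illustration):

```python
def classify_cluster(x, y, width, height, l1):
    """Label a cluster as 'edge' (1407a) if it lies within threshold
    distance l1 of any edge of the block, else 'non-edge' (1407b)."""
    distance_to_nearest_edge = min(x, y, width - x, height - y)
    return "edge" if distance_to_nearest_edge <= l1 else "non-edge"
```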
As discussed with respect to fig. 1, flow cell 1405 may include lenses (such as filter layer 124 including an array of microlenses or other optical components) for capturing images of various clusters. In one example, the focus for the various clusters may be slightly different when capturing the image, such as when an image sensor or camera is moved around the flow cell. For example, when capturing an image of a cluster, the edge cluster 1407a may be slightly out of focus relative to the non-edge cluster 1407b. Defocus events may also occur due to heating or mechanical vibrations caused by lens movement. Thus, depending on the implementation, one of the first base detector 1414 or the second base detector 1416 may be more suitable for processing sensor data from the edge clusters 1407a, while the other of the first base detector 1414 or the second base detector 1416 may be more suitable for processing sensor data from the non-edge clusters 1407b, as will be discussed in further detail later herein (see figs. 19G, 20D). In one example, the spatial context information 1604 (see fig. 16) associated with the set of sensor data 1601 includes information regarding whether the set of sensor data 1601 was generated from one or more edge clusters 1407a or one or more non-edge clusters 1407b, based on which the set of sensor data 1601 can be processed by a particular one or both of the first base detector 1414 or the second base detector 1416.
Thus, summarizing the discussion above, spatial context information 1604 associated with a set 1601 of sensor data includes: (i) Information about whether the set 1601 of sensor data is generated from the edge block 1406a or the non-edge block 1406b, and/or (ii) information about whether the set 1601 of sensor data is generated from one or more edge clusters 1407a or one or more non-edge clusters 1407 b. Other suitable spatial context information is also contemplated based on the teachings of the present disclosure.
Temporal context information 1606
Referring again to fig. 16, the context information generation module 1418 also generates temporal context information 1606. For example, the base detection systems discussed herein can be configured to receive a sample whose bases are to be detected. Such base detection can be performed over a plurality of base detection cycles. In one example, the temporal context information 1606 of the set of sensor data 1601 indicates one or more base detection cycles for which the set of sensor data 1601 was generated. For example, assume that there are N base detection cycles, and the set 1601 of sensor data is associated with base detection cycles N1 to N2 out of the total N base detection cycles. The temporal context information 1606 of the set 1601 of sensor data will include such information. As discussed below, the selection of which base detector to use to process the set 1601 of sensor data may also be based on the number of base detection cycles with which the set 1601 of sensor data is associated.
Fig. 17C shows an example of fading in which the signal intensity decreases with the number of cycles of sequencing run of the base detection operation. The decay is an exponential decay in fluorescence signal intensity with the number of base detection cycles. As the sequencing run progresses, the analyte chains are over-washed, exposed to laser radiation that produces reactive species, and subjected to harsh environmental conditions. All this results in a gradual loss of fragments in each analyte, thereby reducing its fluorescence signal intensity. Fading is also known as darkening or signal attenuation. Fig. 17C shows one example of fading 1700C. In fig. 17C, the intensity values of analyte fragments with AC microsatellites exhibit an exponential decay.
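A minimal model of fading, assuming a simple exponential decay I(n) = I0 * exp(-k * n) of fluorescence intensity with cycle number n; the decay rate k below is an arbitrary example value, not a value from the source.

```python
import math

def faded_intensity(initial_intensity, cycle, decay_rate=0.002):
    """Exponential fading model: intensity decays with cycle number.
    The decay rate is an assumed illustrative constant."""
    return initial_intensity * math.exp(-decay_rate * cycle)

early = faded_intensity(1000.0, 20)   # brighter early in the run
late = faded_intensity(1000.0, 80)    # dimmer as the cycles progress
```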
Figure 17D conceptually illustrates the decreasing signal-to-noise ratio as the sequencing cycle progresses. For example, as sequencing proceeds, accurate base detection becomes increasingly difficult because the signal strength decreases and noise increases, resulting in a significant decrease in signal-to-noise ratio. Physically, it was observed that the later synthesis step attached the tag in a different position relative to the sensor than the earlier synthesis step. When the sensor is below the sequence being synthesized, the signal decays due to the tag being attached to the strand farther from the sensor in a later sequencing step than in an earlier step. This results in signal decay as the sequencing cycle progresses. In some designs, where the sensor is located above the substrate holding the clusters, the signal may be enhanced rather than attenuated as the sequencing proceeds.
In the flow cell design under investigation, as the signal decays, the noise grows. Physically, phasing and pre-phasing increase noise as sequencing proceeds. Phasing refers to a sequencing step in which a tag fails to progress along the sequence. Pre-phasing refers to a sequencing step in which the tag jumps forward by two positions instead of one during a sequencing cycle. Phasing and pre-phasing occur relatively infrequently, on the order of once in 500 to 1000 cycles. Phasing is somewhat more frequent than pre-phasing. Phasing and pre-phasing affect the individual strands in a cluster that produce intensity data, so as sequencing proceeds, the intensity noise profile from a cluster accumulates into two-, three-, four-cycle (and so on) expansions.
Further details of fading, signal attenuation, and signal-to-noise ratio degradation can be found in U.S. non-provisional patent application No. 16/874,599 (attorney docket No. ILLM 1011-4/IP-1750-US), entitled "Systems and Devices for Characterization and Performance Analysis of Pixel-Based Sequencing," filed on even date 14 in 2020, which is incorporated herein by reference as if fully set forth herein.
Thus, during base detection, the reliability or quality of the base detection (e.g., the probability of detecting a correct base) can be based on the number of base detection cycles for which the current base is being detected. Thus, the selection of the first base detector 1414 and/or the second base detector 1416 for processing the set 1601 of sensor data may also be based on a current number of cycles for which base detection operations are being performed, which may be included in the time context information 1606 of the set 1601 of sensor data, as will be discussed in further detail later herein.
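The cycle-dependent detector selection described above can be sketched as a tiny policy function; the notion of a single fixed cycle threshold and its value are illustrative assumptions only, not taken from the source.

```python
def detector_for_cycle(cycle, threshold=500):
    """Pick a detector from the cycle number alone, reflecting the idea
    that base detection gets harder (lower SNR) in later cycles. The
    fixed threshold is an arbitrary illustrative value."""
    return "second" if cycle > threshold else "first"
```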
Base sequence context information 1608
FIG. 18 shows base detection accuracy (1 - base detection error rate) of base detection for homopolymers (e.g., GGGGG) and sequences with near homopolymers or sequences with flanking homopolymers (e.g., GGTGG) for different exemplary configurations of base detectors (e.g., deepRTA, deepRTA-K0-06, deepRTA-349-K0-10-160p, deepRTA-KO-16, deepRTA-K0-16-Lanczos, deepRTA-KO-18, and deepRTA-K0-20). In one example, sequences having flanking homopolymers (e.g., GGTGG) include homopolymers flanking the base of interest (e.g., T) (e.g., GG). Similarly, a near-homopolymer includes sequences in which almost all or most of the bases are identical (e.g., 3 of 5 bases, or 4 of 7 bases are G). The table shown in fig. 18 shows data (e.g., base detection probability, or probability of correctly detecting a base) for various base detection cycles such as cycles 20, 40, 60, and 80. For example, at cycle 80, the probability of correctly detecting the intermediate base of the sequence GGGGG using the deepRTA base detector is 96.97%. It is noted that some examples of homopolymers, near homopolymers, or sequences with flanking homopolymers discussed in this disclosure are assumed to have five bases. However, any different number of bases may be present in such a particular sequence, such as three, five, six, seven, nine, or another suitable number.
As discussed above, in some implementations, the base detector performs base detection for a current sequencing cycle by processing sequencing image windows for multiple sequencing cycles, including the current sequencing cycle contextualized by a right sequencing cycle and a left sequencing cycle. Since the base "G" is indicated in the sequencing image by a dark or minimal signal state (also referred to herein as an off state, an illegible signal state, or an inactive state), the repeated pattern of base "G" can result in false base detection. Such false base detection can also occur when the current sequencing cycle is for a non-G base (e.g., base "T"), but is flanked on the right and left by G. Note that the non-G bases (i.e., A, C or T) are indicated in the sequencing image by a light or on (or active) state.
In one example, there are some specific base detection sequence patterns whose error probability in base detection is relatively high. Two such examples of GGGGG and GGTGG are shown in fig. 18. There may be other specific base detection sequence patterns, such as GGTCG, whose error probability in base detection is also relatively high. In one example, such a particular base detection sequence pattern has a plurality of G's, such as G's at least at the beginning and end of the sequence, and possibly a third G's between two end G's in a 5 base sequence. Other examples of such specific base-detecting sequences include GGXGG, GXGGG, GGGXG, GXXGG and GGXXG, where X can be either A, C, T or G.
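The error-prone 5-base patterns listed above (GGXGG, GXGGG, GGGXG, GXXGG, and GGXXG, where X can be A, C, T, or G) can be checked with a small template match; this sketch is illustrative only.

```python
ERROR_PRONE_TEMPLATES = {"GGXGG", "GXGGG", "GGGXG", "GXXGG", "GGXXG"}

def is_error_prone(window):
    """Return True if the 5-base window matches an error-prone template.
    'X' positions accept any base, including G (so GGGGG matches GGXGG)."""
    if len(window) != 5:
        return False
    return any(all(t == "X" or t == b for t, b in zip(template, window))
               for template in ERROR_PRONE_TEMPLATES)
```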
In one example, base sequence context information 1608 of the set of sensor data 1601 also provides an indication as to whether the set of sensor data 1601 is associated with any such special base sequence patterns. For example, if the set 1601 of sensor data is used to detect an intermediate base of the sequence GGGGG (or GGTGG), this may require special manipulation to generate a final base detection, as will be discussed herein (e.g., see fig. 19B, 19C, 20A).
Other contextual information 1610
Referring again to fig. 16, the context information generation module 1418 further generates other context information 1610. Other context information 1610 may cover any type of context information that is not covered by spatial, temporal, and base sequence context information.
Many examples of other contextual information 1610 are possible, some of which are discussed below.
Sometimes, bubbles form on one or more clusters during one or more base detection operations. Such bubbles may be pockets of gas (such as air) in any liquid present on the cluster (such as bubbles within the reagents used for base detection). The presence of bubbles may be detected based on analyzing images captured from affected clusters. For example, the presence of a bubble within a cluster may be estimated by detecting unique intensity signal features in the captured images of the cluster. In one example, other context information 1610 may indicate whether a set 1601 of sensor data from a cluster is associated with such a bubble. In other words, if an image within the set 1601 of sensor data from a cluster indicates the presence of a bubble in the cluster, the other context information 1610 provides an indication of such a bubble in the cluster. Detection of bubbles is discussed in further detail in co-pending U.S. patent application Ser. No. 63/170,072, entitled "Machine-Learning Model for Detecting a Bubble Within a Nucleotide-Sample Slide for Sequencing," filed in April 2021, which is incorporated herein by reference. Further details regarding the generation of a final base detection in the event a bubble is detected are discussed in turn later herein.
In one example, the reagents used in the flow cell play a major role in how base detection is performed. For example, if a first type of reagent is used, the first base detector 1414 may be suitable, while if a second type of reagent is used, the second base detector 1416 may be suitable. In one example, other context information 1610 provides an indication of the reagents used in the flow cell. Further details regarding the generation of final base detections based on reagent selection are discussed in turn later herein.
Selective use of first base detector 1414 and second base detector 1416: final base detection is a function (e.g., average, maximum, minimum, or another suitable function) of classification information from the two base detectors
FIG. 19A shows generation of a final base detection for a set of sensor data based on a function of the first base detection classification information 1434 from the first base detector 1414 and the second base detection classification information 1436 from the second base detector 1416 of the system 1400 of FIG. 14.
In one embodiment, each base detector 1414, 1416 outputs a corresponding probability that the base to be detected is A, C, G or T. For example, consider the first base detection classification information 1434 from the first base detector 1414. For a given base to be detected, the first base detection classification information 1434 is in the form of probability or confidence scores p1 (A), p1 (C), p1 (G), p1 (T). Herein, p1 (A) represents the probability that the base to be detected is A; p1 (C) represents the probability that the base to be detected is C; p1 (G) represents the probability that the base to be detected is G; and p1 (T) represents the probability that the base to be detected is T. As just one example, if p1 (A), p1 (C), p1 (G), p1 (T) are 0.6, 0.2, 0.15, and 0.05, respectively, the first base detector 1414 indicates, with a probability of 0.6, that the base to be detected is A.
In one example, the sum of p1 (A), p1 (C), p1 (G), p1 (T) is 1. Thus, in one example, the first base detector outputs a normalized probability for each base, for example, using a softmax function. In another example, other techniques (i.e., other than softmax) may be used. For example, the base detector may have an output layer that does not use softmax. For example, regression-based operations may be used that derive a probability measure for each base using, for example, Euclidean or Mahalanobis distances to the cloud center.
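As an illustration of the softmax normalization mentioned above, the following Python sketch (hypothetical, not part of the disclosure) converts raw per-base scores into probabilities that sum to 1:

```python
import math

def softmax(logits):
    """Convert raw per-base scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for (A, C, G, T) from an output layer.
probs = softmax([2.0, 0.5, 0.1, -1.0])
assert abs(sum(probs) - 1.0) < 1e-9     # normalized per-base probabilities
```

The largest raw score produces the largest probability, so the detected base is unchanged by the normalization.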
Similarly, for a given base to be detected, the second base detection classification information 1436 is in the form of probability or confidence scores p2 (A), p2 (C), p2 (G), p2 (T), where the sum of p2 (A), p2 (C), p2 (G), p2 (T) is 1.
In one embodiment, the base detector can output the corresponding detected base in addition to the probabilities discussed above. For example, the first base detector 1414 outputs a first detected base, and the second base detector 1416 outputs a second detected base.
A simple rule for base detection is as follows. For example, assume that, for a given base to be detected, the first base detection classification information 1434 output by the first base detector 1414 is p1 (A), p1 (C), p1 (G), p1 (T), where p1 (C) is greater than each of p1 (A), p1 (G), and p1 (T). Then, the first base detector 1414 may detect the base as C. In another example, the first base detector 1414 may detect the base as C only if the corresponding probability p1 (C) is above a threshold probability. In yet another example, assume p1 (C) > p1 (A) > p1 (T) > p1 (G). That is, p1 (C) has the highest probability, followed by probability p1 (A). Then, if p1 (C) is higher than p1 (A) by at least a threshold (i.e., if the difference between the probabilities of the two bases is at least a threshold amount), the first base detector 1414 may detect the base as C. Any other suitable rules for base detection are contemplated based on the teachings of the present disclosure. The second base detector 1416 may also detect bases accordingly.
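The threshold-based rules above can be sketched as follows; this Python example is illustrative only, and the function name and threshold values are hypothetical:

```python
def call_base(scores, min_prob=0.5, min_margin=0.1):
    """Illustrative base detection rule (names and thresholds hypothetical).

    scores: dict mapping base -> probability, e.g. {'A': 0.6, ...}.
    Returns the top-ranked base only if its probability exceeds min_prob
    and beats the runner-up by at least min_margin; otherwise returns None
    (i.e., no confident detection).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best_p), (_, second_p) = ranked[0], ranked[1]
    if best_p >= min_prob and (best_p - second_p) >= min_margin:
        return best_base
    return None

call_base({'A': 0.6, 'C': 0.2, 'G': 0.15, 'T': 0.05})  # 'A'
```

With the example probabilities from above (0.6, 0.2, 0.15, 0.05), the rule detects A; if the top two probabilities were nearly equal, no base would be detected.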
Referring again to FIG. 19A, the base detection combination module 1428 receives the first and second base detection classification information 1434, 1436 and the context information 1420. Assume that, based on the context information 1420, the base detection combination module 1428 decides to combine the first base detection classification information and the second base detection classification information, e.g., as discussed with respect to method 1500C of FIG. 15C. Thus, in this example, both base detectors 1414 and 1416 process the set of sensor data, and the final confidence scores pf (A), pf (C), pf (G), pf (T) and the final detected base are based on the outputs of both base detectors 1414 and 1416.
Average (e.g., arithmetic mean), minimum, maximum, or geometric mean of confidence scores from the two base detectors is used when the classification information from the two base detectors is identical or matches
Still referring to FIG. 19A, assume a scenario in which the first base detection classification information 1434 from the first base detector 1414 and the second base detection classification information 1436 from the second base detector 1416 match. For example, the first base detector 1414 outputs first base detection classification information 1434 including confidence scores p1 (A), p1 (C), p1 (G), p1 (T), and detects the base as C, just as one example. Similarly, the second base detector 1416 outputs second base detection classification information 1436 including confidence scores p2 (A), p2 (C), p2 (G), p2 (T), and also detects the base as C. Thus, in this example, the base detections from the two base detectors match and are C.
In this scenario, where the base detections from both base detectors 1414 and 1416 match, the final base detection 1440 includes a final detected base that matches the base detection made by base detectors 1414 and 1416.
In one embodiment, the final confidence scores pf (A), pf (C), pf (G), pf (T) are an appropriate function of the confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416.
For example, each of the final confidence scores pf (A), pf (C), pf (G), pf (T) may be an average or arithmetic mean of the corresponding confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the corresponding confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416. Thus, if both base detectors 1414 and 1416 detect the base under consideration as C, the base detection combination module 1428 outputs the final detected base as C and the final confidence scores as:

pf (A) = average of (p1 (A), p2 (A)),
pf (C) = average of (p1 (C), p2 (C)),
pf (G) = average of (p1 (G), p2 (G)), and
pf (T) = average of (p1 (T), p2 (T)). Equation 1
In another example, instead of an average or arithmetic mean, another mathematical function (such as a geometric mean) may be used. For example, if a geometric mean is used, Equation 1 may be rewritten as:

pf (A) = sqrt(p1 (A) × p2 (A)),
pf (C) = sqrt(p1 (C) × p2 (C)),
pf (G) = sqrt(p1 (G) × p2 (G)), and
pf (T) = sqrt(p1 (T) × p2 (T)).
In another example, if the base detection system 1400 wants to report a conservative score, each of the final confidence scores pf (A), pf (C), pf (G), pf (T) may be the minimum of the corresponding confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the corresponding confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416 (e.g., assuming that the base detections of the two base detectors match). Thus, if both base detectors 1414 and 1416 detect the base under consideration as C, the base detection combination module 1428 outputs the final detected base as C and the final confidence scores as:

pf (A) = minimum of (p1 (A), p2 (A)),
pf (C) = minimum of (p1 (C), p2 (C)),
pf (G) = minimum of (p1 (G), p2 (G)), and
pf (T) = minimum of (p1 (T), p2 (T)). Equation 2
In yet another example, if the base detection system 1400 wants to report a high confidence score, each of the final confidence scores pf (A), pf (C), pf (G), pf (T) may be the maximum of the corresponding confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the corresponding confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416 (e.g., assuming that the base detections of the two base detectors match). Thus, if both base detectors 1414 and 1416 detect the base under consideration as C, the base detection combination module 1428 outputs the final detected base as C and the final confidence scores as:

pf (A) = maximum of (p1 (A), p2 (A)),
pf (C) = maximum of (p1 (C), p2 (C)),
pf (G) = maximum of (p1 (G), p2 (G)), and
pf (T) = maximum of (p1 (T), p2 (T)). Equation 3
In another example, if the base detection system 1400 wants to report weighted confidence scores, each of the final confidence scores pf (A), pf (C), pf (G), pf (T) may be a normalized weighted sum of the corresponding confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the corresponding confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416 (e.g., assuming that the base detections of the two base detectors match). Thus, if both base detectors 1414 and 1416 detect the base under consideration as C, the base detection combination module 1428 outputs the final detected base as C and the final confidence scores as:

pf (A) = A1 × p1 (A) + A2 × p2 (A),
pf (C) = A1 × p1 (C) + A2 × p2 (C),
pf (G) = A1 × p1 (G) + A2 × p2 (G), and
pf (T) = A1 × p1 (T) + A2 × p2 (T). Equation 4
In one example, weights A1 and A2 in equation 4 are fixed, pre-specified weights such that a1+a2=1. In one example, weights A1 and A2 are adjusted or updated during the training process, e.g., based on training data.
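The combining functions of Equations 1 through 4 can be collected into a single helper. The following Python sketch is illustrative only; the function and parameter names are hypothetical, not from the disclosure:

```python
import math

BASES = ('A', 'C', 'G', 'T')

def combine(p1, p2, mode='mean', w1=0.5, w2=0.5):
    """Combine per-base confidence scores from two base detectors.

    Mirrors the options of Equations 1-4: arithmetic mean, geometric
    mean, min (conservative score), max (high confidence score), or a
    normalized weighted sum (with w1 + w2 == 1).  p1 and p2 are dicts
    mapping each base to its probability.
    """
    if mode == 'mean':
        return {b: (p1[b] + p2[b]) / 2 for b in BASES}
    if mode == 'geomean':
        return {b: math.sqrt(p1[b] * p2[b]) for b in BASES}
    if mode == 'min':
        return {b: min(p1[b], p2[b]) for b in BASES}
    if mode == 'max':
        return {b: max(p1[b], p2[b]) for b in BASES}
    if mode == 'weighted':
        return {b: w1 * p1[b] + w2 * p2[b] for b in BASES}
    raise ValueError(f"unknown mode: {mode}")
```

For example, with p1(A) = 0.6 and p2(A) = 0.8, the 'mean' mode yields pf(A) = 0.7 and the 'min' mode yields pf(A) = 0.6.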
Normalized weighting of confidence scores from the two base detectors based on the temporal context (e.g., base detection cycle number) associated with the sensor data
In one example, the base detection system 1400 generates each of the final confidence scores pf (A), pf (C), pf (G), pf (T) as a weighted average of the confidence scores p1 (A), p1 (C), p1 (G), p1 (T) output by the first base detector 1414 and the corresponding confidence scores p2 (A), p2 (C), p2 (G), p2 (T) output by the second base detector 1416 (e.g., assuming that the base detections of the two base detectors match), where the weights are based on the context information 1420. That is, the context information 1420 specifies the weight to be given to each of the confidence scores from the two base detectors.
Fig. 19A1 illustrates a look-up table (LUT) 1901 that indicates an exemplary weighting scheme to be used for the final confidence scores based on the temporal context information 1606 (see fig. 16). The actual weights included in LUT 1901 are merely examples and are not limiting.
Due to the phasing, prephasing, and fading discussed with respect to figs. 17C and 17D, it has been observed that the performance of the two base detectors 1414 and 1416 is comparable during the initial base detection cycles (see, e.g., fig. 17D, which shows relatively better signal quality and less noise during the initial base detection cycles). During later base detection cycles, the first base detector 1414 performs better than the second base detector 1416, because the first base detector 1414 may be better equipped to handle signal decay during the later base detection cycles. However, the first base detector 1414 may be computationally intensive to operate compared to the second base detector 1416.
Thus, in one example and as shown in FIG. 19A1, during an initial threshold number of base detection cycles, the confidence score from the second base detector 1416 is weighted more heavily than the confidence score from the first base detector 1414. As the base detection cycles progress, the confidence scores from the first base detector 1414 are weighted more heavily (e.g., because the first base detector 1414 performs better than the second base detector 1416 during later cycles).
Specifically, referring to the first row of LUT 1901, it is assumed that there are N base detection cycles. For base detection cycles 1-N1 (i.e., the initial N1 base detection cycles), a high (e.g., 90-100%) weight is given to the confidence score from the second base detector 1416, and a low (e.g., 0-10%) weight is given to the confidence score from the first base detector 1414. Thus, during the base detection cycle 1-N1, the first base detector 1414 may be disabled or not operated. This increases computational efficiency since the first base detector 1414 is computationally intensive to operate (e.g., compared to the operation of the second base detector 1416). As discussed, during the first N1 cycles, the two base detectors had comparable performance, and thus no degradation in base detection quality was observed.
Here, N1 is an appropriate number of base detection cycles between 1 and N2 (discussed later). As just one example, N1 may be 100, 150, 200, 250 or another suitable number of base detection cycles. N1 can be determined as the number of initial base detection cycles for which two base detectors provide reasonably comparable base detection quality.
Thus, for example, for a base detection cycle between cycles 1 and N1, the final confidence scores pf are given by the following (e.g., assuming 100% weight is given to the second base detector):

for cycles 1 to N1: pf (A) = p2 (A); pf (C) = p2 (C); pf (G) = p2 (G); and
pf (T) = p2 (T). Equation 5
As discussed, for at least the initial N1 cycles, the first base detector 1414 may be disabled (i.e., not operating on the corresponding sets of data for at least the initial N1 cycles). Note that the first base detector 1414 will operate on the corresponding sets of data from cycle (N1+1) onward (see the second row of LUT 1901). In order to satisfactorily generate base detections for cycle (N1+1) and subsequent cycles, the first base detector 1414 must also operate for at least several cycles prior to cycle (N1+1), for example, because the base detection for a current cycle is also based on data from one or more past cycles and one or more future cycles, as discussed with respect to figs. 7 and 10 (see also the discussion with respect to fig. 15E for further explanation). Thus, the first base detector 1414 may not operate on the corresponding sets of data between cycle 1 and cycle (N1−T), and may start operating from cycle (N1−T+1). Here, T is the threshold number of past cycles, counted back from cycle (N1+1), whose data the first base detector must process for base detection.
Referring now to the second row of LUT 1901, in one example, for base detection cycles (N1+1) through N2, a first weight is given to the confidence score from the second base detector 1416 and a second weight is given to the confidence score from the first base detector 1414. In the example of fig. 19A1, both the first weight and the second weight are medium weights, such as about 50%, just as one example. Thus, during these cycles, both base detectors 1414 and 1416 operate. Thus, for example, for a base detection cycle between cycles (N1+1) and N2, the final confidence scores pf are given by:

for cycles (N1+1) to N2:
pf(A)=0.5×p1(A)+0.5×p2(A);
pf(C)=0.5×p1(C)+0.5×p2(C);
pf (G) =0.5×p1 (G) +0.5×p2 (G); and
pf (T) =0.5×p1 (T) +0.5×p2 (T). Equation 6
Referring now to the third row of LUT 1901, in one example, for base detection cycles (N2+1) through N, a low (e.g., 0%) weight is given to the confidence score from the second base detector 1416 and a high (e.g., 100%) weight is given to the confidence score from the first base detector 1414. This is because, as discussed herein, during later base detection cycles, the first base detector 1414 performs better than the second base detector 1416, because the first base detector 1414 may be better equipped to handle signal decay during the later base detection cycles (see figs. 17C, 17D).
Thus, for example, for a base detection cycle between cycles (N2+1) and N, the final confidence scores pf are given by:

for cycles (N2+1) to N:
pf (A) = p1 (A);
pf (C) = p1 (C);
pf (G) = p1 (G); and
pf (T) = p1 (T). Equation 7
In one example, LUT 1901 (or any other LUT discussed herein) may be saved in a memory of system 1400 (the memory is not shown in fig. 14). The switching module 1422 and/or base detection combining module 1428 accesses the LUT 1901 from memory and receives context information 1420 (e.g., temporal context information) indicating the current number of base detection cycles. Based on the temporal context information, the switching module 1422 and/or base detection combining module 1428 selects the appropriate row of the LUT 1901 and operates according to the weights specified in the selected row.
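The row-selection logic of LUT 1901 can be sketched as follows. This is a hypothetical Python illustration, with `n1` and `n2` standing for the cycle thresholds N1 and N2 discussed above; the returned pair is the (first detector, second detector) weighting:

```python
def temporal_weights(cycle, n1, n2):
    """Return (w_first, w_second) weights for the two base detectors
    based on the current base detection cycle, following the three-row
    scheme of LUT 1901 (thresholds n1 and n2 are configuration values).
    """
    if cycle <= n1:
        return (0.0, 1.0)   # early cycles: second detector only (Eq. 5)
    if cycle <= n2:
        return (0.5, 0.5)   # middle cycles: equal weights (Eq. 6)
    return (1.0, 0.0)       # late cycles: first detector only (Eq. 7)
```

For example, with n1 = 100 and n2 = 200, cycle 50 uses only the second base detector, cycle 150 weights both equally, and cycle 250 uses only the first base detector.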
Confidence score correction based on base sequence context information indicating a special base sequence (where the base detections from the two base detectors match)
Assume a scenario in which the base detections from the two base detectors match, and the base sequence context information indicates that the detected bases from either of the base detectors include a special base sequence, such as a homopolymer (e.g., GGGGG), a sequence with flanking homopolymers (e.g., GGTGG), a near-homopolymer, or another special base sequence. In one example, five consecutive final base detections are made by the base detection combination module 1428 (or either of the base detectors 1414, 1416), and the five consecutive final base detections include a special base sequence. As previously discussed herein (see fig. 18), the probability of error for such special base sequences may be higher. Accordingly, the system 1400 may take special measures to potentially modify the confidence scores associated with the bases of such base sequences. It is also noted that some examples of special base sequences discussed in this disclosure (such as homopolymers, near-homopolymers, or sequences with flanking homopolymers) have five bases. However, any different number of bases may be present in such a special base sequence, such as three, five, six, seven, nine, or another suitable number.
FIG. 19B shows a LUT 1905 indicating a base detector to be used when a detected base includes a special base sequence. In LUT 1905, the letter "X" indicates any base, such as A, C, T or G. Thus, for any base sequence included in the LUT 1905 (such as GGXGG, GXGGG, GGGXG, GXXGG, GGXXG), the final confidence score will be determined using, for example, the confidence score from the second base detector 1416. This is because it has been determined through experiments by the inventors that the second base detector 1416 performs better than the first base detector 1414 when encountering any particular base sequence indicated in the LUT 1905.
Thus, if the five consecutive final base detections made by base detection combination module 1428 are any of the special base sequences of LUT 1905, base detection combination module 1428 modifies the confidence scores associated with each of the five bases (or at least some, such as the middle base) to correspond to the confidence scores output by second base detector 1416 for the five bases.
In one example, LUT 1905 (or any other LUT discussed herein) may be saved in a memory of system 1400 (the memory is not shown in fig. 14). The switching module 1422 and/or base detection combining module 1428 access the LUT from memory and receive the context information 1420. Based on the context information, the switching module 1422 and/or base detection combining module 1428 selects the appropriate row of the LUT and operates according to the base detection operation specified in the selected row. Unless otherwise stated, this applies to all LUTs discussed in this disclosure.
FIG. 19C shows LUT 1910 indicating the weight given to the confidence score of each base detector when the detected bases include a special base sequence. Note that the actual weights included in LUT 1910 are merely examples and do not limit the scope of the present disclosure. For example, referring to the first row of LUT 1910, if the special sequence GGXGG is encountered, a 60% weight may be given to the confidence score from the second base detector 1416 and a 40% weight may be given to the confidence score from the first base detector 1414. For example, assume that the middle (i.e., third) base of the sequence, indicated by "X", is T, and that the first base detector 1414 indicates a confidence score p1 (T) and the second base detector 1416 indicates a confidence score p2 (T). In this example, the final detected base is T, and the final confidence scores for the middle (i.e., third) base of the sequence are:

pf (A) = 0.4 × p1 (A) + 0.6 × p2 (A),
pf (C) = 0.4 × p1 (C) + 0.6 × p2 (C),
pf (G) = 0.4 × p1 (G) + 0.6 × p2 (G), and
pf (T) = 0.4 × p1 (T) + 0.6 × p2 (T). Equation 8
In one example, the weights in LUT 1910 may be determined empirically through testing and calibration.
Other rows of LUT 1910 may also have weights based on detected base sequences. These weights may be pre-specified and the optimal values of these weights may be determined via testing and calibration.
Note that of all the exemplary weights specified in LUT 1910, the weight of second base detector 1416 is higher than the weight of first base detector 1414. This is because, as discussed above, in some examples, the second base detector 1416 may perform better than the first base detector 1414 when any of the particular base sequences indicated in the LUT are encountered.
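One possible way to flag the special five-base patterns discussed above (GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, where X is any of A, C, G, T) is a simple pattern match. The sketch below is a hypothetical Python illustration, not the disclosure's implementation:

```python
import re

# Special patterns from the discussion above; X stands for any base.
SPECIAL_PATTERNS = [re.compile(p.replace('X', '[ACGT]'))
                    for p in ('GGXGG', 'GXGGG', 'GGGXG', 'GXXGG', 'GGXXG')]

def is_special_sequence(window):
    """True if a 5-base window matches any special base sequence."""
    return any(p.fullmatch(window) for p in SPECIAL_PATTERNS)

is_special_sequence('GGTGG')  # True (flanking homopolymer)
is_special_sequence('GGGGG')  # True (homopolymer, X = G)
is_special_sequence('ACGTA')  # False
```

When five consecutive final base detections match one of these patterns, the system could then apply the weighting of LUT 1905 or LUT 1910 to the affected confidence scores.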
Generating final classification information based on detection of bubbles in clusters
As previously discussed herein, bubbles may form on one or more clusters during one or more base detection operations. Such bubbles may be pockets of gas (such as air) in any liquid present on the cluster (such as bubbles within the reagents used for base detection). The presence of bubbles may be detected based on analyzing images captured from affected clusters. For example, the presence of a bubble within a cluster may be estimated by detecting unique intensity signal features in the captured images of the cluster. In one example, other context information 1610 may indicate whether a set 1601 of sensor data from a cluster is associated with such a bubble.
Fig. 19D shows LUT 1915 indicating the operation of base detection combination module 1428 of fig. 14 based on the detection of one or more bubbles in a cluster of flow cells. For example, referring to the first row of LUT 1915, if no bubbles are detected in the flow cell, the final base detection is performed normally by base detection combination module 1428, e.g., according to any suitable operating scheme discussed herein in this disclosure.
Referring to the second row of LUT 1915, a scenario in which one or more bubbles are detected in a cluster of flow cells is discussed. In general, the first base detector 1414 is better equipped to handle base detection of clusters that include such bubbles. Thus, in one embodiment, in response to the other context information 1610 (see FIG. 16) indicating the presence of a bubble in a cluster, the base detection combination module 1428 places a relatively higher weight (e.g., 90-100% weight) on the confidence score from the first base detector 1414 and a relatively lower weight (e.g., 0-10% weight) on the confidence score from the second base detector 1416.
Note that a block of a flow cell includes a plurality of clusters, and a bubble may be detected in, for example, a single cluster of the block. Thus, sensor data from that single cluster is processed primarily by the first base detector 1414 in accordance with the second row of LUT 1915, while sensor data from the other clusters of the block is processed by the first base detector 1414 and/or the second base detector 1416 in accordance with the first row of LUT 1915.
Assume that no bubble is detected in a cluster for base detection cycles 1 to Na, and that at cycle (Na+1) the cluster is detected to include a bubble. Thus, from base detection cycle (Na+1) onward, the first base detector 1414 will process sensor data from the cluster according to the second row of LUT 1915. However, assume that prior to cycle (Na+1) (i.e., from cycles 1 to Na), the first base detector 1414 was not operating on sensor data from the cluster, while the second base detector 1416 was operating on sensor data from the cluster. In order for the first base detector 1414 to detect bases starting from cycle (Na+1), the first base detector 1414 must also process data from the preceding few cycles (e.g., because, as discussed with respect to figs. 7 and 10, the base detection for a current cycle is also based on data from one or more past cycles and one or more future cycles; see also the discussion with respect to fig. 15E). Thus, in response to the context information indicating the presence of a bubble at cycle (Na+1), the first base detector processes sensor data for several cycles occurring prior to cycle (Na+1) (e.g., processes sensor data for cycles Na, (Na−1), (Na−2), …, (Na−T)), and based on processing such past cycles, is then ready to process and perform base detection for cycle (Na+1). Here, T is the threshold number of past base detection cycles that the first base detector must process in order to properly process the sensor data for the current base detection cycle.
Generating final classification information based on out-of-focus event detection in clusters
As previously discussed herein, flow cell 1405 may include lenses (such as filter layer 124 including an array of microlenses or other optical components) for capturing images of various clusters. In one example, the focus for the various clusters may be slightly different when capturing the images, such as when an image sensor or camera is moved around the flow cell. For example, when capturing an image of a cluster, the edge cluster 1407a may be slightly out of focus relative to the non-edge cluster 1407b. Out-of-focus events may also occur due to heating or mechanical vibrations caused by lens movement.
Fig. 19D1 shows a LUT 1917 indicating the operation of the base detection combination module 1428 of fig. 14 according to detection of a defocus image from a cluster of flow cells. For example, referring to the first row of LUT 1917, if no out-of-focus image is detected for the flow cell, final base detection is performed normally by base detection combining module 1428, e.g., according to any suitable operating scheme discussed herein in this disclosure.
Referring to the second row of LUT 1917, a scenario in which out-of-focus images are detected from one or more clusters of the flow cell is discussed. In general, the first base detector 1414 is better equipped to handle base detection of clusters that generate such out-of-focus images. Thus, in one embodiment, in response to the other context information 1610 (see FIG. 16) indicating the presence of an out-of-focus image from a cluster, the base detection combination module 1428 places a relatively higher weight (e.g., 90-100% weight) on the confidence score from the first base detector 1414 and a relatively lower weight (e.g., 0-10% weight) on the confidence score from the second base detector 1416.
Note that a block of a flow cell comprises a plurality of clusters, and out-of-focus images may be detected in, for example, a single cluster or several clusters (but not all or even most of the clusters) of the block. Thus, sensor data from the one or more clusters with out-of-focus images is processed primarily by the first base detector 1414 according to the second row of LUT 1917, while sensor data from the other clusters of the block is processed by the first base detector 1414 and/or the second base detector 1416 according to the first row of LUT 1917.
Normalized weighting of confidence scores from the two base detectors based on the reagents used
Reagents play a major role in base detection, as discussed above. As just one example, when a first set of reagents is used, the first base detector 1414 may be more suitable than the second base detector 1416, and when a second set of reagents is used, the first base detector 1414 may be less suitable than the second base detector 1416. In one embodiment, the other context information 1610 indicates the type of reagent used, and the context information generation module 1418 may assign normalized weights to the confidence scores from the two base detectors for determining the final confidence scores.
Fig. 19E shows a LUT 1920 indicating weights given to confidence scores of individual base detectors based on the reagent set used. For example, referring to the first row of LUT 1920, when using exemplary reagent set a, a1% weight is given to the confidence score from first base detector 1414 and a2% weight is given to the confidence score from second base detector 1416, where a1+a2=100. Similarly, referring to the second row of LUT 1920, when using exemplary reagent set B, b1% weight is given to the confidence score from first base detector 1414 and b2% weight is given to the confidence score from second base detector 1416, where b1+b2=100.
Normalized weighting of confidence scores from the two base detectors in the logarithmic probability domain
The various examples and embodiments above discuss confidence scores in terms of probabilities. However, in one embodiment, the confidence scores may be expressed using a logarithmic scale, and the mathematical operations discussed herein (e.g., with respect to Equations 1-8) may be performed with the confidence scores expressed using a logarithmic scale. For example, the Phred quality score is a measure of the quality of identification of nucleobases generated by automated DNA sequencing. The Phred quality score Q is defined as a property that is logarithmically related to the base detection error probability E:

Q = −10 × log10(E). Equation 9

Thus, a base detection accuracy of 90% (e.g., p1(C) has a value of 0.9, so E = 0.1) translates to a corresponding Phred score of 10, a base detection accuracy of 99% (e.g., p1(C) has a value of 0.99, so E = 0.01) translates to a corresponding Phred score of 20, and so on. Here, the base detection probability P is related to the error probability E as follows: P = (1 − E). Accordingly, the Phred quality score Q is related to the base detection probability P as follows: Q = −10 × log10(1 − P), wherein E = (1 − P) is the probability of error in detecting a particular base. Additional details of quality scores and error probabilities are discussed in co-pending U.S. provisional patent application No. 63/226,707 (attorney docket No. ILLM 1045-1/IP-2093-PRV), entitled "Quality Score Calibration of Basecalling Systems," filed on July 28, 2021, which is incorporated herein by reference.
In one embodiment, the mathematical operations discussed herein (e.g., with respect to equations 1 through 8) may be performed using Phred scores instead of probability-based confidence scores. Thus, in some examples where the mathematical operations use Phred or quality scores, the selection of the base detector to use may be based on Phred or quality scores (e.g., as discussed with respect to equations 1-8).
Generating a final confidence score from two base detectors based on spatial context associated with the sensor data, e.g., edge blocks
As previously discussed with respect to fig. 17A, some blocks may be classified as edge blocks based on their spatial locations. For example, in fig. 17A, the blocks adjacent to any edge of the flow cell 1405 are labeled as edge blocks 1406a, and the remaining blocks are labeled as non-edge blocks 1406b. For example, the blocks on a vertical edge (e.g., along the Y-axis) and/or a horizontal edge (e.g., along the X-axis) of the flow cell 1405 are categorized as edge blocks 1406a, as shown in fig. 14. Thus, the edge blocks 1406a are immediately adjacent to the corresponding edges of the flow cell 1405.
Also as discussed with respect to fig. 17A, in one example, parameters related to the base detection operation of a block may be based on the relative position of the block. For example, the excitation light 101 discussed with respect to fig. 1 is directed to the blocks of the flow cell, and different blocks may receive different amounts of excitation light 101, e.g., based on the location of each block and/or the location of one or more light sources emitting the excitation light 101. For example, if the light source emitting the excitation light 101 is located vertically above the flow cell, the non-edge blocks 1406b may receive a different amount of light than the edge blocks 1406a. In another example, ambient or external light (e.g., ambient light from outside the biosensor) around the flow cell 1405 may affect the amount and/or characteristics of the excitation light 101 received by the various blocks of the flow cell 1405. As just one example, an edge block 1406a may receive the excitation light 101 and an amount of ambient light from outside the flow cell 1405, while a non-edge block 1406b may primarily receive the excitation light 101. In yet another example, individual sensors (or pixels or photodiodes) included in the flow cell 1405 (e.g., sensors 106, 108, 110, 112, and 114 shown in fig. 1) may sense light based on the locations of the corresponding sensors, which are in turn based on the locations of the corresponding blocks. For example, the sensing operation performed by one or more sensors associated with an edge block 1406a may be affected relatively more by ambient light (in addition to the excitation light 101) than the sensing operation of one or more other sensors associated with a non-edge block 1406b.
In yet another example, the flow of reactants (e.g., which include any substances that can be used to obtain a desired reaction during base detection, such as reagents, enzymes, samples, other biomolecules, and buffer solutions) to various tiles may also be affected by the tile location. For example, a block near the source of reactant may receive a greater amount of reactant than a block farther from the source.
In one example, the spatial context information 1604 (see fig. 16) associated with the set 1601 of sensor data includes information regarding whether the set 1601 of sensor data is generated in an edge block 1406a or in a non-edge block 1406b.
Fig. 19F shows LUT 1925 indicating the operation of base detection combining module 1428 of fig. 14 according to the spatial classification of the block. For example, referring to the first row of LUT 1925, for non-edge blocks, final base detection is performed normally by base detection combining module 1428, e.g., according to any suitable operating scheme discussed herein in this disclosure.
Referring to the second row of LUT 1925, the scenario of final base detection for an edge block is discussed. Generally, as discussed herein, the first base detector 1414 is better equipped to handle base detection of edge blocks. Thus, in one embodiment, for an edge block, the base detection combination module 1428 places an E1 weight on the confidence score from the first base detector 1414 and an E2 weight on the confidence score from the second base detector 1416, where E1 is higher than E2 in one example, and the sum of E1 and E2 is 100% (i.e., the weights are normalized).
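The edge-block rule of LUT 1925 can be sketched as follows. The grid-position test and the E1/E2 values are illustrative assumptions; the disclosure does not give numeric weights:

```python
def is_edge_block(row, col, n_rows, n_cols):
    """A block is an edge block if it lies on any edge of the flow cell grid."""
    return row in (0, n_rows - 1) or col in (0, n_cols - 1)

# Hypothetical normalized weights for edge blocks, with E1 > E2 and
# E1 + E2 = 1 (i.e., 100%), favoring the first base detector 1414.
E1, E2 = 0.8, 0.2

def final_score(base, p1, p2, row, col, n_rows, n_cols):
    if is_edge_block(row, col, n_rows, n_cols):
        return E1 * p1[base] + E2 * p2[base]
    return 0.5 * (p1[base] + p2[base])  # non-edge: any normal scheme, e.g. average

p1 = {"A": 0.9, "C": 0.05, "G": 0.03, "T": 0.02}
p2 = {"A": 0.7, "C": 0.20, "G": 0.05, "T": 0.05}
edge_score = final_score("A", p1, p2, 0, 2, 4, 6)   # block on the top edge
inner_score = final_score("A", p1, p2, 2, 2, 5, 5)  # interior block
```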
Generating a final confidence score from two base detectors based on spatial context associated with the sensor data, e.g., edge clusters
As previously discussed with respect to fig. 17B, the clusters 1407 of the exemplary block 1406 are categorized as edge clusters 1407a or non-edge clusters 1407b. As also previously discussed herein, the flow cell 1405 may include lenses (such as the filter layer 124 including an array of microlenses or other optical components) for capturing images of the various clusters, and the edge clusters 1407a may be slightly out of focus relative to the non-edge clusters 1407b when images of the clusters are captured. Thus, depending on the implementation, one of the first base detector 1414 or the second base detector 1416 may be better suited to process sensor data from the edge clusters 1407a, while the other of the first base detector 1414 or the second base detector 1416 may be better suited to process sensor data from the non-edge clusters 1407b. In one example, the spatial context information 1604 (see fig. 16) associated with the set of sensor data 1601 includes information regarding whether the set of sensor data 1601 is generated from one or more edge clusters 1407a or one or more non-edge clusters 1407b, based on which the set of sensor data 1601 can be processed by a particular one, or both, of the first base detector 1414 and the second base detector 1416.
Fig. 19G shows a LUT 1930 indicating the operation of the base detection combination module 1428 of fig. 14 according to the spatial classification of clusters. For example, referring to the first row of LUT 1930, for non-edge clusters, final base detection is normally performed by base detection combining module 1428, e.g., according to any suitable operating scheme discussed herein in this disclosure.
Referring to the second row of LUT 1930, the scenario of final base detection for edge clusters is discussed. In general, as discussed herein, the first base detector 1414 can be better equipped to handle base detection of edge clusters. Thus, in one embodiment, for an edge cluster, the base detection combination module 1428 places a C1 weight on the confidence score from the first base detector 1414 and a C2 weight on the confidence score from the second base detector 1416, where in one example, C1 is higher than C2 and the sum of C1 and C2 is 100% (i.e., the weights are normalized). In one example, the weight C1 may be as high as 100%, in which case the classification information from the first base detector 1414 is exclusively used for base detection of the edge cluster.
Decreasing the final confidence score when the classification information from the two base detectors is inconsistent or does not match
As discussed with respect to fig. 19A, the first base detector 1414 outputs a first detected base and first confidence scores p1 (a), p1 (C), p1 (G), p1 (T); and the second base detector 1416 outputs the second detected base and the second confidence scores p2 (A), p2 (C), p2 (G), p2 (T). In one example, for a given base, a first detected base from the first base detector 1414 may not match a second detected base from the second base detector 1416.
For example, assume that the first base detector 1414 detects a base as A with a confidence score p1(A) and the second base detector 1416 detects the base as C with a confidence score p2(C). In this scenario, the final detected base output by the base detection combination module 1428 is:
final detected base = A, if p1(A) is higher than p2(C); or
final detected base = C, if p1(A) is lower than p2(C). Equation 10
Note that there is a higher probability of error because the two base detections from the two base detectors are not identical. Thus, the final confidence score may be reduced. For example, assume that p1(A) is higher than p2(C) (i.e., p1(A) > p2(C)), and that the final detected base is A. The final confidence score pf(A) corresponding to A is:
pf(A) = p1(A) × p2(A). Equation 11
Thus, the final confidence score is artificially reduced due to the inconsistency of the two base detectors for the detected base, as the product of the two probabilities is lower than either probability alone.
In another example, the final confidence score pf(A) is reduced as follows:
pf(A) = W1 × p1(A), where the weight W1 is smaller than 1. Equation 12
Thus, an appropriate weight W1 of less than 1 is used to reduce the final confidence score due to the inconsistency of the two base detectors for the detected bases.
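Equations 10 and 12 together can be sketched as follows; the penalty weight W1 = 0.8 is an illustrative assumption, not a value from the disclosure:

```python
def resolve_mismatch(p1, p2, w1=0.8):
    """Pick the final base per equation 10 and, when the two detectors
    disagree, scale the winning confidence score by W1 < 1 per equation 12."""
    b1 = max(p1, key=p1.get)  # base detected by the first detector
    b2 = max(p2, key=p2.get)  # base detected by the second detector
    if b1 == b2:
        return b1, max(p1[b1], p2[b2])  # agreement: no penalty applied here
    if p1[b1] >= p2[b2]:
        return b1, w1 * p1[b1]
    return b2, w1 * p2[b2]

p1 = {"A": 0.90, "C": 0.04, "G": 0.03, "T": 0.03}
p2 = {"A": 0.10, "C": 0.80, "G": 0.05, "T": 0.05}
call, score = resolve_mismatch(p1, p2)  # detectors disagree: A vs. C
```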
Combining with specific context information, such as a special base sequence, when the classification information from the two base detectors is inconsistent or does not match
As discussed above, the first base detector 1414 outputs a first detected base and a first confidence score p1 (a), p1 (C), p1 (G), p1 (T); and the second base detector 1416 outputs the second detected base and the second confidence scores p2 (A), p2 (C), p2 (G), p2 (T). In one example, for a given base, a first detected base from the first base detector 1414 may not match a second detected base from the second base detector 1416.
In one embodiment, if a first detected base from the first base detector 1414 does not match a second detected base from the second base detector 1416, and such a mismatch is accompanied by one or more specific context information, the context information may be taken into account for the final detected base.
FIG. 20A shows the LUT 2000 indicating the operation of the base detection combination module 1428 of FIG. 14 when (i) a particular base sequence is detected and (ii) a first detected base from the first base detector 1414 does not match a second detected base from the second base detector 1416. For example, referring to the first line of LUT 2000, a scenario is discussed in which a first detected base from first base detector 1414 matches a second detected base from second base detector 1416 and a particular base sequence is detected, such as a homopolymer (e.g., GGGGG), a sequence with flanking homopolymers, or a near homopolymer (such as GGXGG). Further examples of such special sequences are discussed with respect to fig. 19B. Because the first detected base from the first base detector 1414 matches the second detected base from the second base detector 1416, the final detected base will match the detected bases from the first base detector and the second base detector, and a confidence score may be calculated according to FIG. 19B and/or according to any suitable protocol discussed herein. As previously described, some examples of special base sequences discussed in this disclosure (such as homopolymers, near homopolymers, or sequences with flanking homopolymers) have five bases. However, any different number of bases may be present in such a particular base sequence, such as three, five, six, seven, nine or another suitable number.
Referring now to the second line of LUT 2000, a scenario is discussed in which a first detected base from first base detector 1414 does not match a second detected base from second base detector 1416 and a particular base sequence is detected, such as a homopolymer (e.g., GGGGG), a sequence with flanking homopolymers, or a near-homopolymer (such as GGXGG). Further examples of such special sequences are discussed with respect to fig. 19B. Because the first detected base from the first base detector 1414 does not match the second detected base from the second base detector 1416, the final detected base is based on the second detected base from the second base detector 1416 (e.g., the second base detector 1416 is more reliable in the case of such a particular base sequence for the reasons discussed with respect to FIGS. 19B and 19C). The confidence score for the final detected base may be, for example, the minimum or average (or another suitable function) of the corresponding confidence scores from the two base detectors.
Combining with specific context information, such as bubble detection, when the classification information from the two base detectors is inconsistent or does not match
FIG. 20B shows a LUT 2005 indicating the operation of the base detection combination module 1428 of FIG. 14 when (i) a bubble is detected in a cluster and (ii) a first detected base from the first base detector 1414 does not match a second detected base from the second base detector 1416.
For example, referring to the first line of LUT 2005, a scenario is discussed in which a first detected base from first base detector 1414 matches a second detected base from second base detector 1416 and no bubbles are detected in any cluster. Thus, the final base detection is performed according to any suitable protocol discussed herein.
Referring now to the second line of LUT 2005, a scenario is discussed in which (i) a first detected base from first base detector 1414 does not match a second detected base from second base detector 1416 and (ii) a bubble is detected in a cluster. Because the first detected base from the first base detector 1414 does not match the second detected base from the second base detector 1416, the final detected base is based on the first detected base from the first base detector 1414 (e.g., the first base detector 1414 is more reliable in the case of bubble detection for reasons discussed with respect to FIG. 19D). The confidence score for the final detected base may be, for example, the minimum or average (or another suitable function) of the corresponding confidence scores from the two base detectors.
Combining with specific context information, such as out-of-focus images, when the classification information from the two base detectors is inconsistent or does not match
FIG. 20C shows an LUT 2010 indicating the operation of the base detection combination module 1428 of FIG. 14 when (i) one or more out-of-focus images are detected from at least one cluster and (ii) a first detected base from the first base detector 1414 does not match a second detected base from the second base detector 1416.
For example, referring to the first row of LUT 2010, a scenario is discussed in which a first detected base from first base detector 1414 matches a second detected base from second base detector 1416 and no out-of-focus image is detected in any cluster. Thus, the final base detection is performed according to any suitable protocol discussed herein.
Referring now to the second row of LUT 2010, a scenario is discussed in which (i) a first detected base from first base detector 1414 does not match a second detected base from second base detector 1416 and (ii) one or more out-of-focus images are detected from at least one cluster. Because the first detected base from the first base detector 1414 does not match the second detected base from the second base detector 1416, the final detected base is based on the first detected base from the first base detector 1414 (e.g., the first base detector 1414 is more reliable in the case of out-of-focus image detection for reasons discussed with respect to FIG. 19D1). The confidence score for the final detected base may be, for example, the minimum or average (or another suitable function) of the corresponding confidence scores from the two base detectors.
Combining with specific context information, such as spatial context information indicating edge clusters, when the classification information from the two base detectors is inconsistent or does not match
FIG. 20D shows a LUT 2015 indicating the operation of the base detection combination module 1428 of FIG. 14 when (i) the sensor data is from an edge cluster and (ii) the first detected base from the first base detector 1414 does not match the second detected base from the second base detector 1416.
For example, referring to the first row of LUT 2015, a scenario is discussed in which a first detected base from first base detector 1414 matches a second detected base from second base detector 1416 and the collection of sensor data is from an edge cluster. Thus, final base detection is performed according to any suitable protocol discussed herein (such as the protocol discussed with respect to fig. 19G).
Referring now to the second row of LUT 2015, a scenario is discussed in which (i) a first detected base from first base detector 1414 does not match a second detected base from second base detector 1416 and (ii) sensor data is from an edge cluster. Because the first detected base from the first base detector 1414 does not match the second detected base from the second base detector 1416, the final detected base is based on the first detected base from the first base detector 1414 (e.g., the first base detector 1414 is more reliable in the case of edge clusters for reasons discussed with respect to FIG. 19G). The confidence score for the final detected base may be, for example, the minimum or average (or another suitable function) of the corresponding confidence scores from the two base detectors.
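The four mismatch LUTs (figs. 20A-20D) follow one pattern: when the two calls disagree, a context flag selects the preferred detector, and the final score is, for example, the minimum of the two scores. A compact sketch; the context keys are illustrative labels, not identifiers from the disclosure:

```python
# Preferred detector on a mismatch, per context (summarizing LUTs 2000, 2005,
# 2010, and 2015): detector 2 for special base sequences, detector 1 for
# bubbles, out-of-focus images, and edge clusters.
PREFERRED_ON_MISMATCH = {
    "special_sequence": 2,
    "bubble": 1,
    "out_of_focus": 1,
    "edge_cluster": 1,
}

def resolve_with_context(call1, score1, call2, score2, context):
    if call1 == call2:
        return call1, max(score1, score2)  # agreement: any suitable scheme
    final_call = call1 if PREFERRED_ON_MISMATCH[context] == 1 else call2
    return final_call, min(score1, score2)  # e.g., minimum of the two scores
```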
Detecting potentially unreliable confidence scores, and selectively switching between base detectors based on such detection
As discussed herein with respect to figs. 19B and 19C, for some particular detected base sequences, the performance of the first base detector 1414 may be unsatisfactory (e.g., relative to the second base detector 1416), for example, when a homopolymer (e.g., GGGGG), a sequence with flanking homopolymers, or a near-homopolymer (e.g., GGTGG) is detected. In one embodiment, for some such sequences, the first base detector 1414 may generate a high confidence score for its detected base, although such a high confidence score may overstate the true confidence of the detected base. Thus, for example, when the first base detector 1414 outputs a relatively high confidence score (e.g., above a threshold) for a homopolymer, a sequence having flanking homopolymers, or a near-homopolymer, such a high confidence score may be unreliable. In some such scenarios, a confidence score from the second base detector 1416 may be used instead.
In one example, for a homopolymer, a sequence with flanking homopolymers, or a near-homopolymer, the confidence score p1(A), p1(C), p1(G), or p1(T) associated with the middle (i.e., third) detected base of the sequence can be changed. For example, assume that for a sequence with flanking homopolymers (e.g., GGTGG), the first base detector 1414 detects the third base with some confidence score, where the confidence score for the third base being T is relatively high (e.g., above a threshold). In that case, the confidence scores p2(A), p2(C), p2(G), p2(T) from the second base detector 1416 can be used for the third base of the 5-base sequence with flanking homopolymers or the near-homopolymer, and can be used to determine the final confidence score.
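A minimal sketch of this switch, assuming 5-base windows and an illustrative reliability threshold of 0.95 (the disclosure does not specify a threshold value):

```python
def is_special_window(seq):
    """True for a 5-base homopolymer (e.g., GGGGG) or a sequence with
    flanking homopolymers / near-homopolymer (e.g., GGTGG or GGXGG)."""
    return len(seq) == 5 and seq[0] == seq[1] == seq[3] == seq[4]

def middle_base_score(seq, p1_mid, p2_mid, threshold=0.95):
    """If detector 1's score for the middle base of a special window is
    suspiciously high, fall back to detector 2's score instead."""
    if is_special_window(seq) and p1_mid > threshold:
        return p2_mid
    return p1_mid
```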
When the classification information from the two base detectors is inconsistent or does not match, the final base detection includes an indeterminate base detection
In one example, when the classification information from the two base detectors is inconsistent or does not match, the final base detection 1440 may include an indeterminate base detection and a corresponding confidence score. For example, the final confidence scores for the various bases may be generated using any of the methods discussed herein, such as a minimum, average, maximum, or normalized weighted confidence score, and the final base detection may be indicated as indeterminate.
For example, assume that for a given base to be detected, the first base detection classification information 1434 includes confidence scores p1(A), p1(C), p1(G), p1(T), and first detected base A (e.g., because p1(A) is higher than each of p1(C), p1(G), p1(T)). It is also assumed that for the given base to be detected, the second base detection classification information 1436 includes confidence scores p2(A), p2(C), p2(G), p2(T), and second detected base C (e.g., because p2(C) is higher than each of p2(A), p2(G), p2(T)). Because the two base detections do not match, the final base detection is "N", where in one example "N" represents an indeterminate base detection. In another example and for the particular use case discussed herein, "N" can represent either of bases A or C (i.e., the first base detection and the second base detection output by the two base detectors). The final base detection N may be accompanied by a final confidence score, which may be calculated using any of equations 1 through 8 discussed earlier herein.
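A sketch of the indeterminate-call behavior, using the average of the two top scores as one of the options the text mentions:

```python
def final_call_with_n(call1, score1, call2, score2):
    """Return 'N' (indeterminate) when the two detectors disagree; the
    final score here is the average of the two top scores."""
    if call1 == call2:
        return call1, max(score1, score2)
    return "N", 0.5 * (score1 + score2)

call, score = final_call_with_n("A", 0.9, "C", 0.7)  # mismatch -> 'N'
```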
Neural network-based final base detection determination module
FIG. 21 shows a base detection system 2100 that includes a plurality of base detectors to predict base detection of an unknown analyte that includes a base sequence, wherein a neural network-based final base detection determination module 2128 determines a final base detection 1440 based on the output of one or more of the plurality of base detectors. In one example, the final base detection determination module 2128 considers the context information and other variables (e.g., as discussed with respect to fig. 19A) to determine how to combine the first base detection classification information 1434 and the second base detection classification information 1436 to generate the final base detection 1440. The system 2100 is at least partially similar to the system 1400 of fig. 14. However, in the system 2100 of FIG. 21, the context information generation module 1418 and base detection combination module 1428 of FIG. 14 are replaced with a final base detection determination module 2128.
In one example, the final base detection determination module 2128 is a neural network-based module that has been trained using the outputs from the two base detectors 1414 and 1416. The trained final base detection determination module 2128 is then used for base detection. Training of the final base detection determination module 2128 can be based on one or more of the final base detection determination operations discussed herein. The operation of the system 2100 of fig. 21 will be apparent based on the discussion with respect to fig. 14 and the further discussion presented herein regarding the final base detection determination. In other examples, the final base detection determination module 2128 may be another suitable machine learning model, such as a logistic regression model, a gradient-boosted tree model, a random forest model, a naive Bayes model, and the like. In one example, the final base detection determination module 2128 can be any suitable machine learning model that can combine the two sets of classification scores to generate the final base detection 1440.
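As one concrete, deliberately tiny stand-in for the learned combiner 2128, the sketch below blends the two score vectors linearly and renormalizes with a softmax. The blend weights are illustrative assumptions, not trained values, and a real model would learn them (and any nonlinear interactions) from labeled base detections:

```python
import math

BASES = ("A", "C", "G", "T")
W1, W2 = 0.7, 0.3  # assumed blend weights for detectors 1414 and 1416

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def combine(p1, p2):
    """Blend the two classification-score vectors and renormalize, returning
    the final detected base and its final score vector."""
    logits = [W1 * p1[b] + W2 * p2[b] for b in BASES]
    scores = dict(zip(BASES, softmax(logits)))
    return max(scores, key=scores.get), scores

p1 = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
p2 = {"A": 0.70, "C": 0.20, "G": 0.05, "T": 0.05}
call, scores = combine(p1, p2)
```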
Weight estimation
Various weights have been discussed throughout this disclosure, wherein the weights are used to weight the first classification information 1434 and the second classification information 1436 in generating the final classification information. Various techniques may be employed to generate the weights.
In one example, the trained neural network model of the final base detection determination module 2128 of fig. 21 can be used to fine-tune the weights. In another example, the weights may be determined empirically, using trial and error or another suitable method. In yet another example, the prediction covariance matrix of the confidence scores may be empirically estimated and used to estimate the weights.
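One simple instance of covariance-based weight estimation is inverse-variance weighting: estimate each detector's confidence-score error variance empirically, and weight each detector in inverse proportion. A sketch under that assumption:

```python
def inverse_variance_weights(var1, var2):
    """Normalized weights proportional to the inverse of each detector's
    empirically estimated confidence-score error variance."""
    inv1, inv2 = 1.0 / var1, 1.0 / var2
    w1 = inv1 / (inv1 + inv2)
    return w1, 1.0 - w1

# A detector whose scores are four times less noisy gets four times the weight.
w1, w2 = inverse_variance_weights(0.01, 0.04)
```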
Base detection system architecture
FIG. 22 is a block diagram of a base detection system 2200 according to one implementation. The base detection system 2200 is operable to obtain any information or data related to at least one of biological or chemical substances. In some implementations, the base detection system 2200 is a workstation that may be similar to a desktop device or a desktop computer. For example, most (or all) of the systems and components for carrying out the desired reaction may be located within a common housing 2216.
In particular implementations, the base detection system 2200 is a nucleic acid sequencing system (or sequencer) configured for a variety of applications including, but not limited to, de novo sequencing, re-sequencing of whole genomes or target genomic regions, and metagenomics. The sequencer may also be used for DNA or RNA analysis. In some implementations, the base detection system 2200 can also be configured to generate reaction sites in the biosensor. For example, the base detection system 2200 may be configured to receive a sample and generate surface-attached clusters of clonally amplified nucleic acids derived from the sample. Each cluster may constitute or be part of a reaction site in the biosensor. The exemplary base detection system 2200 can include a system socket or interface 2212 configured to interact with the biosensor 2202 to perform a desired reaction within the biosensor 2202. In the following description with respect to fig. 22, the biosensor 2202 is loaded into the system socket 2212. However, it should be appreciated that a cartridge including the biosensor 2202 may be inserted into the system socket 2212, and in some cases, the cartridge may be temporarily or permanently removed. As noted above, the cartridge may include, among other things, a fluid control component and a fluid storage component.
In particular implementations, the base detection system 2200 is configured to perform a number of parallel reactions within the biosensor 2202. The biosensor 2202 includes one or more reaction sites at which a desired reaction may occur. The reaction sites may be, for example, immobilized to a solid surface of the biosensor or to beads (or other movable substrates) located within corresponding reaction chambers of the biosensor. The reaction sites may include, for example, clusters of clonally amplified nucleic acids. The biosensor 2202 may include a solid-state imaging device (e.g., a CCD or CMOS imaging device) and a flow cell mounted thereto. The flow cell may include one or more flow channels that receive the solution from the base detection system 2200 and direct the solution to the reaction sites. Optionally, the biosensor 2202 may be configured to engage a thermal element for transferring thermal energy into or out of the flow channel.
Base detection system 2200 can include various components, assemblies, and systems (or subsystems) that interact with each other to perform predetermined methods or assay protocols for biological or chemical analysis. For example, the base detection system 2200 includes a system controller 2204 that can communicate with various components, assemblies, and subsystems of the base detection system 2200, as well as the biosensor 2202. For example, in addition to the system socket 2212, the base detection system 2200 may include: a fluid control system 2206 for controlling the flow of fluid throughout the fluid network of the base detection system 2200 and the biosensor 2202; a fluid storage system 2208 configured to hold all fluids (e.g., gases or liquids) usable by the bioassay system; a temperature control system 2210 that can regulate the temperature of the fluid in the fluid network, fluid storage system 2208, and/or biosensor 2202; and an illumination system 2209 configured to illuminate the biosensor 2202. As described above, if a cartridge having the biosensor 2202 is loaded into the system socket 2212, the cartridge may further include a fluid control component and a fluid storage component.
As also shown, the base detection system 2200 can include a user interface 2214 for interaction with a user. For example, the user interface 2214 may include a display 2213 for displaying or requesting information from a user and a user input device 2215 for receiving user input. In some implementations, the display 2213 and the user input device 2215 are the same device. For example, the user interface 2214 may include a touch-sensitive display configured to detect the presence of an individual touch and also identify the location of the touch on the display. However, other user input devices 2215 may be used, such as a mouse, touchpad, keyboard, keypad, handheld scanner, voice recognition system, motion recognition system, and the like. As will be discussed in greater detail below, the base detection system 2200 can communicate with various components including a biosensor 2202 (e.g., in the form of a cartridge) to perform a desired reaction. The base detection system 2200 can also be configured to analyze data obtained from the biosensor to provide the user with desired information.
The system controller 2204 may include any processor-based or microprocessor-based system, including systems using microcontrollers, Reduced Instruction Set Computer (RISC) processors, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), logic circuits, and any other circuit or processor capable of executing the functions described herein. The above examples are exemplary only, and thus are not intended to limit in any way the definition and/or meaning of the term system controller. In an exemplary implementation, the system controller 2204 executes a set of instructions stored in one or more storage elements, memories, or modules in order to at least one of obtain detection data and analyze detection data. The detection data may include multiple sequences of pixel signals, such that a sequence of pixel signals from each of millions of sensors (or pixels) may be detected over many base detection cycles. The memory element may be in the form of an information source or a physical memory element within the base detection system 2200.
The instruction set may include various commands that instruct the base detection system 2200 or the biosensor 2202 to perform particular operations, such as the methods and processes of the various implementations described herein. The set of instructions may be in the form of a software program that may form part of one or more tangible, non-transitory computer-readable media. As used herein, the terms "software" and "firmware" are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The above memory types are exemplary only, and are thus not limiting as to the types of memory usable for storage of a computer program.
The software may be in various forms, such as system software or application software. Furthermore, the software may be in the form of a collection of separate programs, or in the form of program modules or portions of program modules within a larger program. The software may also include modular programming in the form of object-oriented programming. After obtaining the detection data, the detection data may be automatically processed by the base detection system 2200, processed in response to user input, or processed in response to a request by another processing machine (e.g., a remote request over a communication link). In the illustrated implementation, the system controller 2204 includes an analysis module 2338 (shown in fig. 23). In other implementations, the system controller 2204 does not include an analysis module 2338, but rather has access to the analysis module 2338 (e.g., the analysis module 2338 may be separately hosted on the cloud).
The system controller 2204 may be connected to the biosensor 2202 and other components of the base detection system 2200 via communication links. The system controller 2204 is also communicatively coupled to an offsite system or server. The communication link may be hardwired, wired, or wireless. The system controller 2204 may receive user inputs or commands from a user interface 2214 and a user input device 2215.
The fluid control system 2206 includes a fluid network and is configured to direct and regulate the flow of one or more fluids through the fluid network. The fluid network may be in fluid communication with the biosensor 2202 and the fluid storage system 2208. For example, selected fluid may be aspirated from the fluid storage system 2208 and directed to the biosensor 2202 in a controlled manner, or fluid may be aspirated from the biosensor 2202 and directed toward, for example, a waste reservoir in the fluid storage system 2208. Although not shown, the fluid control system 2206 may include a flow sensor that detects a flow rate or pressure of a fluid within a fluid network. The sensors may be in communication with the system controller 2204.
The temperature control system 2210 is configured to regulate the temperature of the fluid at different areas of the fluid network, fluid storage system 2208, and/or biosensor 2202. For example, the temperature control system 2210 may include a thermal cycler that interfaces with the biosensor 2202 and controls the temperature of fluid flowing along a reaction site in the biosensor 2202. Temperature control system 2210 can also regulate the temperature of solid elements or components of base detection system 2200 or biosensor 2202. Although not shown, the temperature control system 2210 may include sensors for detecting the temperature of the fluid or other components. The sensors may be in communication with the system controller 2204.
The fluid storage system 2208 is in fluid communication with the biosensor 2202 and may store various reaction components or reactants for performing a desired reaction therein. The fluid storage system 2208 may also store fluid for washing or cleaning the fluid network and the biosensor 2202, and for diluting the reactants. For example, the fluid storage system 2208 may include various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous solutions, non-polar solutions, and the like. In addition, fluid storage system 2208 may also include a waste reservoir for receiving waste from biosensor 2202. In implementations that include a cartridge, the cartridge may include one or more of a fluid storage system, a fluid control system, or a temperature control system. Accordingly, one or more of the components described herein in connection with those systems may be housed within a cartridge housing. For example, the cartridge may have various reservoirs to store samples, reagents, enzymes, other biomolecules, buffer solutions, aqueous and non-polar solutions, waste, and the like. Thus, one or more of the fluid storage system, the fluid control system, or the temperature control system may be removably engaged with the biometric system via a cartridge or other biosensor.
The illumination system 2209 may include a light source (e.g., one or more LEDs) and a plurality of optical components for illuminating the biosensor. Examples of light sources may include lasers, arc lamps, LEDs, or laser diodes. The optical component may be, for example, a reflector, dichroic mirror, beam splitter, collimator, lens, filter, wedge mirror, prism, mirror, detector, etc. In implementations using an illumination system, the illumination system 2209 may be configured to direct excitation light to a reaction site. As one example, the fluorophore may be excited by light of a green wavelength, and thus the wavelength of the excitation light may be about 532nm. In one implementation, illumination system 2209 is configured to produce illumination parallel to a surface normal of a surface of biosensor 2202. In another implementation, illumination system 2209 is configured to produce illumination at an off-angle relative to a surface normal of a surface of biosensor 2202. In yet another implementation, the illumination system 2209 is configured to produce illumination having a plurality of angles, including some parallel illumination and some off-angle illumination. The system socket or interface 2212 is configured to engage the biosensor 2202 in at least one of mechanical, electrical, and fluidic manners. The system socket 2212 may hold the biosensor 2202 in a desired orientation to facilitate fluid flow through the biosensor 2202. The system socket 2212 may also include electrical contacts configured to engage the biosensor 2202 such that the base detection system 2200 may communicate with the biosensor 2202 and/or provide power to the biosensor 2202. Further, system socket 2212 may include a fluid port (e.g., a nozzle) configured to engage biosensor 2202. In some implementations, the biosensor 2202 is mechanically, electrically, and fluidly removably coupled to the system socket 2212.
In addition, the base detection system 2200 can be in remote communication with other systems or networks or with other base detection systems 2200. The detection data obtained by the base detection system 2200 may be stored in a remote database.
Fig. 23 is a block diagram of a system controller 2204 that may be used in the system of fig. 22. In one implementation, the system controller 2204 includes one or more processors or modules that may communicate with each other. Each of these processors or modules may include algorithms (e.g., instructions stored on tangible and/or non-transitory computer-readable storage media) or sub-algorithms for performing particular processes. The system controller 2204 is conceptually illustrated as a collection of modules, but may be implemented using any combination of special purpose hardware boards, DSPs, processors, etc. Alternatively, the system controller 2204 may be implemented using an off-the-shelf PC having a single processor or multiple processors with functional operations distributed among the processors. As a further option, the modules described below may be implemented using a hybrid configuration, where some of the modular functions are performed using dedicated hardware, while the remaining modular functions are performed using an off-the-shelf PC or the like. Modules may also be implemented as software modules within a processing unit.
During operation, communication port 2320 may transmit information (e.g., commands) to or receive information (e.g., data) from biosensor 2202 (fig. 22) and/or subsystems 2206, 2208, 2210 (fig. 22). In implementations, communication port 2320 may output multiple sequences of pixel signals. Communication port 2320 may receive user input from user interface 2214 (fig. 22) and transmit data or information to user interface 2214. Data from the biosensor 2202 or subsystems 2206, 2208, 2210 may be processed in real-time by the system controller 2204 during a biometric session. Additionally or alternatively, the data may be temporarily stored in system memory during the biometric session and processed at a slower rate than real-time or offline operation.
As shown in fig. 23, the system controller 2204 may include a plurality of modules 2331-2339 in communication with a master control module 2330. The master control module 2330 may be in communication with a user interface 2214 (fig. 22). Although modules 2331-2339 are shown in direct communication with master control module 2330, modules 2331-2339 may also be in direct communication with each other, with user interface 2214 and biosensor 2202. Additionally, modules 2331-2339 may communicate with master control module 2330 through other modules.
The plurality of modules 2331-2339 includes system modules 2331-2333, 2339 that communicate with subsystems 2206, 2208, 2210, and 2209, respectively. The fluid control module 2331 may communicate with the fluid control system 2206 to control valves and flow sensors of the fluid network to control the flow of one or more fluids through the fluid network. The fluid storage module 2332 may notify the user when the fluid volume is low or when the waste reservoir is at or near capacity. The fluid storage module 2332 may also be in communication with the temperature control module 2333 such that the fluid may be stored at a desired temperature. The illumination module 2339 may communicate with the illumination system 2209 to illuminate the reaction sites at specified times during the protocol, such as after a desired reaction (e.g., binding event) has occurred. In some implementations, the illumination module 2339 can communicate with the illumination system 2209 to illuminate the reaction sites at a specified angle.
The plurality of modules 2331-2339 may also include a device module 2334 in communication with the biosensor 2202 and an identification module 2335 that determines identification information associated with the biosensor 2202. The device module 2334 can, for example, communicate with the system socket 2212 to confirm that the biosensor has established an electrical and fluid connection with the base detection system 2200. The identification module 2335 may receive signals identifying the biosensor 2202. The identification module 2335 may use the identity of the biosensor 2202 to provide other information to the user. For example, the identification module 2335 may determine and then display a lot number, a date of manufacture, or a protocol recommended to be run with the biosensor 2202.
The plurality of modules 2331-2339 also includes an analysis module 2338 (also referred to as a signal processing module or signal processor) that receives and analyzes signal data (e.g., image data) from the biosensor 2202. Analysis module 2338 includes memory (e.g., RAM or flash memory) for storing detection data. The detection data may include multiple pixel signal sequences such that pixel signal sequences from each of millions of sensors (or pixels) may be detected over a number of base detection cycles. The signal data may be stored for later analysis or may be transmitted to user interface 2214 to display desired information to a user. In some implementations, the signal data may be processed by a solid-state imaging device (e.g., CMOS image sensor) before the analysis module 2338 receives the signal data.
The analysis module 2338 is configured to obtain image data from the light detectors at each sequencing cycle of the plurality of sequencing cycles. The image data is derived from the emission signals detected by the light detectors. The image data for each of the plurality of sequencing cycles is processed through a neural network (e.g., a neural network-based template generator 2348, a neural network-based base detector 2358 (see, e.g., figs. 7, 9, and 10), and/or a neural network-based quality scorer 2368), and base detections are generated for at least some of the analytes at each of the plurality of sequencing cycles.
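As a hedged, minimal sketch of this per-cycle flow (the function names and the toy base caller below are illustrative assumptions, not the disclosed neural networks):

```python
def call_bases(image_data_per_cycle, base_caller):
    """Run a base caller over per-cycle image data and collect base calls."""
    calls = []
    for cycle_index, image_data in enumerate(image_data_per_cycle):
        # In the disclosed system this would be a neural-network-based detector.
        calls.append((cycle_index, base_caller(image_data)))
    return calls

def toy_base_caller(channels):
    """Toy stand-in: 'detect' the base from the brightest of four channels."""
    return "ACTG"[channels.index(max(channels))]

# Three hypothetical sequencing cycles, four intensity channels each.
cycles = [[9, 1, 1, 1], [1, 8, 2, 1], [0, 1, 1, 7]]
print(call_bases(cycles, toy_base_caller))  # [(0, 'A'), (1, 'C'), (2, 'G')]
```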
The protocol module 2336 and the protocol module 2337 communicate with the master control module 2330 to control the operation of the subsystems 2206, 2208, and 2210 when performing predetermined assay protocols. Protocol modules 2336 and 2337 may include a set of instructions for instructing base detection system 2200 to perform particular operations according to a predetermined protocol. As shown, the protocol module may be a sequencing-by-synthesis (SBS) module 2336 configured to issue various commands for performing a sequencing-by-synthesis process. In SBS, the extension of a nucleic acid primer along a nucleic acid template is monitored to determine the sequence of nucleotides in the template. The basic chemical process may be polymerization (e.g., catalyzed by a polymerase) or ligation (e.g., catalyzed by a ligase). In a specific polymerase-based SBS implementation, fluorescently labeled nucleotides are added to the primers (and thus the primers are extended) in a template-dependent manner, such that detection of the order and type of nucleotides added to the primers can be used to determine the sequence of the template. For example, to initiate a first SBS cycle, a command may be issued to deliver one or more labeled nucleotides, DNA polymerase, etc. to/through a flow cell containing an array of nucleic acid templates. The nucleic acid templates may be located at corresponding reaction sites. Those reaction sites where primer extension results in incorporation of the labeled nucleotide can be detected by imaging events. During an imaging event, illumination system 2209 may provide excitation light to a reaction site. Optionally, the nucleotide may also include a reversible termination property that terminates further primer extension upon addition of the nucleotide to the primer. For example, a nucleotide analog having a reversible terminator moiety may be added to the primer such that no subsequent extension occurs prior to delivery of the deblocking agent to remove the moiety. Thus, for implementations using reversible termination, a command may be given to deliver the deblocking agent to the flow-through cell (either before or after detection occurs). One or more commands may be issued to effect washing between the various delivery steps. The cycle may then be repeated n times to extend the primer n nucleotides, thereby detecting a sequence of length n.
Exemplary sequencing techniques are described in, for example, Bentley et al., Nature 456:53-59 (2008); WO 04/018497; US 7,057,026; WO 91/06678; WO 07/123744; US 7,329,492; US 7,211,414; US 7,315,019; and US 7,405,281, each of which is incorporated herein by reference.
For the nucleotide delivery step of the SBS cycle, a single type of nucleotide may be delivered at a time, or multiple different nucleotide types may be delivered (e.g., A, C, T and G together). For nucleotide delivery configurations where only a single type of nucleotide is present at a time, the different nucleotides need not have different labels, as they can be distinguished based on the temporal separation inherent in the individualized deliveries. Thus, sequencing methods or devices may use single color detection. For example, an excitation source need only provide excitation at a single wavelength or within a single wavelength range. For nucleotide delivery configurations in which delivery results in multiple different nucleotides being present in the flow-through cell at the same time, sites incorporating different nucleotide types can be distinguished based on different fluorescent labels attached to the corresponding nucleotide types in the mixture. For example, four different nucleotides may be used, each having one of four different fluorophores. In one implementation, excitation in four different regions of the spectrum may be used to distinguish between four different fluorophores. For example, four different excitation radiation sources may be used. Alternatively, fewer than four different excitation sources may be used, but optical filtering of excitation radiation from a single source may be used to produce different ranges of excitation radiation at the flow cell.
In some implementations, fewer than four different colors can be detected in a mixture of four different nucleotides. For example, a nucleotide pair may be detected at the same wavelength, but distinguished based on the difference in intensity of one member of the pair relative to the other member, or based on a change in one member of the pair (e.g., via chemical, photochemical, or physical modification) that results in the appearance or disappearance of a signal that is apparent from the detected signal of the other member of the pair. Exemplary devices and methods for distinguishing four different nucleotides using detection of fewer than four colors are described, for example, in U.S. patent application Ser. Nos. 61/538,294 and 61/619,878, which are incorporated herein by reference in their entireties. U.S. application Ser. No. 13/624,200, filed September 21, 2012, is also incorporated by reference in its entirety.
The plurality of protocol modules may also include a sample preparation (or generation) module 2337 configured to issue commands to the fluid control system 2206 and the temperature control system 2210 to amplify the products within the biosensor 2202. For example, the biosensor 2202 may be coupled to the base detection system 2200. Amplification module 2337 can issue instructions to fluid control system 2206 to deliver the necessary amplification components to the reaction chambers within biosensor 2202. In other implementations, the reaction site may already contain some components for amplification, such as template DNA and/or primers. After delivering the amplification components to the reaction chamber, the amplification module 2337 may instruct the temperature control system 2210 to cycle through different temperature stages according to known amplification protocols. In some implementations, amplification and/or nucleotide incorporation occurs isothermally.
The SBS module 2336 may issue a command to perform bridge PCR in which clusters of cloned amplicons are formed on localized areas within the channels of the flow-through cell. After the amplicon is generated by bridge PCR, the amplicon can be "linearized" to prepare single stranded template DNA or sstDNA, and the sequencing primers can be hybridized to the universal sequences flanking the region of interest. For example, a reversible terminator-based sequencing-by-synthesis method may be used as described above or below.
Each base detection or sequencing cycle can extend sstDNA by a single base, which can be accomplished, for example, by using a modified DNA polymerase and a mixture of four types of nucleotides. The different types of nucleotides may have unique fluorescent labels and each nucleotide may also have a reversible terminator that allows only single base incorporation to occur in each cycle. After a single base is added to sstDNA, excitation light can be incident on the reaction site and fluorescence emission can be detected. After detection, the fluorescent label and terminator can be chemically cleaved from the sstDNA. This may be followed by another similar cycle of base detection or sequencing. In such a sequencing protocol, SBS module 2336 may instruct fluid control system 2206 to direct reagent and enzyme solutions to flow past biosensor 2202. Exemplary reversible-terminator-based SBS methods that may be used with the devices and methods set forth herein are described in U.S. patent application publication No. 2007/0166705 A1, U.S. patent application publication No. 2006/0188901 A1, U.S. patent No. 7,057,026, U.S. patent application publication No. 2006/0240439 A1, U.S. patent application publication No. 2006/0281109 A1, PCT publication No. WO 05/065814, PCT publication No. WO 06/064199, US 7,541,444, US 7,427,673, US 7,566,537, and US 7,592,435, each of which is incorporated herein by reference in its entirety.
In some implementations, the amplification module and SBS module can operate in a single assay protocol, where, for example, template nucleic acids are amplified and then sequenced within the same cassette.
The base detection system 2200 may also allow a user to reconfigure the assay protocol. For example, the base detection system 2200 can provide the user with options for modifying the determined protocol through the user interface 2214. For example, if it is determined that the biosensor 2202 is to be used for amplification, the base detection system 2200 can request the temperature of the annealing cycle.
Further, the base detection system 2200 can alert the user if the user has provided user inputs that are generally not acceptable for the selected assay protocol.
In a particular implementation, the biosensor 2202 includes millions of sensors (or pixels), each of which generates multiple pixel signal sequences in subsequent base detection cycles. The analysis module 2338 detects multiple pixel signal sequences and attributes them to corresponding sensors (or pixels) based on the row-by-row and/or column-by-column locations of the sensors on the sensor array.
Each sensor in the sensor array may generate sensor data for a block of the flow cell, wherein the block is located on the flow cell in an area where clusters of genetic material are disposed during a base detection operation. The sensor data may comprise image data in an array of pixels. For a given cycle, the sensor data may include more than one image, producing multi-feature per pixel as tile data.
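As a hedged illustration of multi-feature tile data (the array shape and sizes below are assumptions made for the sketch, not values from the disclosure):

```python
import numpy as np

# Hypothetical dimensions for one tile (block) of the flow cell.
num_cycles = 3     # sensing cycles
num_features = 2   # e.g., two images per cycle -> two features per pixel
height, width = 4, 4

# tile_data[cycle, feature, row, col]: sensor data for the tile as pixel arrays.
tile_data = np.zeros((num_cycles, num_features, height, width), dtype=np.float32)

# Each sensor (pixel) then yields a signal sequence across the cycles:
pixel_sequence = tile_data[:, :, 0, 0]
print(pixel_sequence.shape)  # (3, 2): one multi-feature signal per cycle
```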
Fig. 24 is a simplified block diagram of a computer system 2400 that can be used to implement the disclosed technology. Computer system 2400 includes at least one Central Processing Unit (CPU) 2472 that communicates with a plurality of peripheral devices via bus subsystem 2455. These peripheral devices may include a storage subsystem 2410 (including, for example, a memory device and a file storage subsystem 2436), a user interface input device 2438, a user interface output device 2476, and a network interface subsystem 2474. Input devices and output devices allow users to interact with computer system 2400. Network interface subsystem 2474 provides an interface to external networks, including interfaces to corresponding interface devices in other computer systems.
User interface input device 2438 can include: a keyboard; pointing devices such as a mouse, trackball, touch pad, or tablet; a scanner; a touch screen incorporated into the display; audio input devices such as a speech recognition system and a microphone; as well as other types of input devices. Generally, the term "input device" is intended to include all possible types of devices and ways of inputting information into computer system 2400.
User interface output device 2476 can include a display subsystem, a printer, a facsimile machine, or a non-visual display (such as an audio output device). The display subsystem may include an LED display, a Cathode Ray Tube (CRT), a flat panel device such as a Liquid Crystal Display (LCD), a projection device, or some other mechanism for producing a viewable image. The display subsystem may also provide for non-visual displays, such as audio output devices. Generally, the term "output device" is intended to include all possible types of devices and ways to output information from computer system 2400 to a user or to another machine or computer system.
Storage subsystem 2410 stores programming structures and data structures that provide the functionality of some or all of the modules and methods described herein. These software modules are typically executed by the deep learning processor 2478.
In one implementation, the neural networks are implemented using deep learning processors 2478, which may be configurable and reconfigurable processors, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), coarse-grained reconfigurable architectures (CGRAs), and/or graphics processing units (GPUs), among other configured devices. The deep learning processors 2478 may be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, or Cirrascale™. Examples of deep learning processors 2478 include Google's Tensor Processing Unit (TPU)™, rackmount solutions (e.g., GX4 Rackmount Series™, GX8 Rackmount Series™), NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, and the like.
The memory subsystem 2422 used in the storage subsystem 2410 may include a number of memories, including a main Random Access Memory (RAM) 2434 for storing instructions and data during program execution and a Read Only Memory (ROM) 2432 in which fixed instructions are stored. File storage subsystem 2436 may provide persistent storage for program files and data files, and may include a hard disk drive, a floppy disk drive and associated removable media, a CD-ROM drive, an optical disk drive, or removable media cartridges. Modules implementing the functionality of certain implementations may be stored by file storage subsystem 2436 in storage subsystem 2410, or in other machines accessible to the processor.
Bus subsystem 2455 provides a mechanism for enabling the various components and subsystems of computer system 2400 to communicate with each other as intended. Although bus subsystem 2455 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple buses.
The computer system 2400 itself may be of different types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed group of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2400 depicted in FIG. 24 is intended only as a specific example for purposes of illustrating a preferred implementation of the present invention. Many other configurations of computer system 2400 are possible, having more or fewer components than the computer system depicted in FIG. 24.
Clauses
Clause set 1 (generating a final classification based on classification information of two base detectors)
1. A computer-implemented method for base detection using at least two base detectors, the method comprising:
performing at least a first base detector and a second base detector on sensor data generated for a sensing cycle of a series of sensing cycles;
generating, by the first base detector, first classification information associated with the sensor data based on executing the first base detector on the sensor data;
generating, by the second base detector, second classification information associated with the sensor data based on performing the second base detector on the sensor data; and
generating final classification information based on the first classification information and the second classification information, wherein the final classification information includes one or more base detections of the sensor data.
2. The method of clause 1, wherein at least one of the first base detector and the second base detector implements a nonlinear function, and wherein at least the other of the first base detector and the second base detector is at least partially linear.
3. The method of clause 1, wherein at least one of the first base detector and the second base detector implements a neural network model, and at least the other of the first base detector and the second base detector does not include a neural network model.
4. The method of clause 1, wherein:
for each base detection cycle, the first classification information generated by the first base detector includes: (i) A first plurality of scores, each score of the first plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a first detected base; and
for each base detection cycle, the second classification information generated by the second base detector includes: (i) A second plurality of scores, each score of the second plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a second detected base.
5. The method of clause 4, wherein:
for each base detection cycle, the final classification information includes: (i) A third plurality of scores, each score of the third plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) the base to be finally detected.
6. The method of clause 4, wherein at least one of the first base detector and the second base detector uses a softmax function to generate a corresponding plurality of scores.
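For illustration only (a minimal sketch; the detectors' actual score computation is not specified here), the softmax function referenced in clause 6 maps four raw scores to per-base probabilities for A, C, T, and G:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for bases A, C, T, G from one detector, one cycle.
raw_scores = [2.0, 0.5, 0.1, -1.0]
probabilities = softmax(raw_scores)
detected_base = "ACTG"[probabilities.index(max(probabilities))]
print(detected_base)  # prints 'A': the base with the highest probability
```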
7. The method of clause 1, wherein generating the final classification information comprises: the final classification information is generated by selectively combining the first classification information and the second classification information based on context information associated with the sensor data.
8. The method of clause 7, wherein the context information associated with the sensor data includes temporal context information, spatial context information, base sequence context information, and other context information.
9. The method of clause 7, wherein the context information associated with the sensor data comprises temporal context information indicating one or more base detection cycles associated with the sensor data.
10. The method of clause 7, wherein the context information associated with the sensor data includes spatial context information indicating a location of one or more tiles of the flow cell within which the sensor data is generated.
11. The method of clause 7, wherein the context information associated with the sensor data comprises spatial context information indicating a location within a block of the flow cell where one or more clusters of the sensor data were generated.
11A. The method of clause 11, wherein the spatial context information indicates whether the one or more clusters within the block of the flow-through cell that generated the sensor data are edge clusters or non-edge clusters.
11B. The method of clause 11A, wherein if a cluster is estimated to be within a threshold distance from an edge of the block, the cluster is classified as an edge cluster.
11C. The method of clause 11B, wherein if a cluster is estimated to be located greater than a threshold distance from any edge of the block, the cluster is classified as a non-edge cluster.
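A minimal sketch of the edge/non-edge cluster classification described above (the coordinate convention, tile dimensions, and threshold value are illustrative assumptions):

```python
def classify_cluster(x, y, tile_width, tile_height, threshold):
    """Classify a cluster as 'edge' or 'non-edge' based on its estimated
    distance from the nearest edge of the tile (block)."""
    distance_to_edge = min(x, y, tile_width - x, tile_height - y)
    return "edge" if distance_to_edge <= threshold else "non-edge"

print(classify_cluster(2, 50, 100, 100, threshold=5))   # 'edge' (near left edge)
print(classify_cluster(50, 50, 100, 100, threshold=5))  # 'non-edge' (tile center)
```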
12. The method of clause 7, wherein the context information associated with the sensor data comprises base sequence context information indicating a base sequence detected for the sensor data.
13. The method of clause 1, wherein:
for a particular base to be detected, the first classification information includes a first score, a second score, a third score, and a fourth score that indicate probabilities of the base to be detected being A, C, T and G, respectively;
for the particular base to be detected, the second classification information includes a fifth score, a sixth score, a seventh score, and an eighth score that indicate probabilities of the base to be detected being A, C, T and G, respectively; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information based on the first score, the second score, the third score, the fourth score, the fifth score, the sixth score, the seventh score, and the eighth score.
14. The method of clause 13, wherein:
the final classification information comprises a first final score that is a function of the first score and the fifth score, the first final score indicating a probability that the base to be detected is A;
the final classification information comprises a second final score that is a function of the second score and the sixth score, the second final score indicating a probability that the base to be detected is C;
the final classification information comprises a third final score that is a function of the third score and the seventh score, the third final score indicating a probability that the base to be detected is T; and
the final classification information comprises a fourth final score that is a function of the fourth score and the eighth score, the fourth final score indicating a probability that the base to be detected is G.
15. The method of clause 14, wherein:
the first final score is an average, normalized weighted average, minimum or maximum of the first score and the fifth score;
the second final score is an average, normalized weighted average, minimum or maximum of the second score and the sixth score;
the third final score is an average, normalized weighted average, minimum or maximum of the third score and the seventh score; and
The fourth final score is an average, normalized weighted average, minimum or maximum of the fourth score and the eighth score.
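A hedged sketch of the score combination in clauses 13-15 (the names below are assumptions, and averaging is only one of the listed combination functions; minimum or maximum could be substituted):

```python
BASES = "ACTG"

def combine_scores(first, second, mode="average"):
    """Combine two detectors' per-base scores into final per-base scores."""
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(first, second)]
    if mode == "min":
        return [min(a, b) for a, b in zip(first, second)]
    if mode == "max":
        return [max(a, b) for a, b in zip(first, second)]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical per-base scores (A, C, T, G) from each base detector.
first_scores = [0.70, 0.10, 0.10, 0.10]   # from the first base detector
second_scores = [0.60, 0.20, 0.10, 0.10]  # from the second base detector

final_scores = combine_scores(first_scores, second_scores)
final_base = BASES[final_scores.index(max(final_scores))]
print(final_base)  # prints 'A': the base with the highest combined score
```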
16. The method of clause 14, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G, the first detected base having a corresponding score that is the highest score of the first score, the second score, the third score, and the fourth score; and
for the particular base to be detected, the second classification information includes a second detected base that is one of A, C, T and G, the second detected base having a corresponding score that is the highest score of the fifth score, the sixth score, the seventh score, and the eighth score.
17. The method of clause 1, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is identical to the first detected base; and
generating the final classification information includes:
for the particular base to be detected, the final classification information is generated such that the final classification information includes a final detected base that matches the first detected base and the second detected base.
18. The method of clause 1, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is the other of A, C, T and G such that the second detected base does not match the first detected base; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information such that the final classification information includes a final detected base that is one of: (i) the first detected base, (ii) the second detected base, or (iii) a base labeled as indeterminate.
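A minimal sketch of the agreement/disagreement handling in clauses 17 and 18; the confidence-margin tie-break and the "N" indeterminate label are illustrative assumptions, not recited in the clauses:

```python
def resolve_call(base_1, conf_1, base_2, conf_2, margin=0.1):
    """Return the final detected base, or 'N' when labeled indeterminate."""
    if base_1 == base_2:
        return base_1               # clause 17: matching calls pass through
    if conf_1 >= conf_2 + margin:
        return base_1               # clause 18(i): keep the first call
    if conf_2 >= conf_1 + margin:
        return base_2               # clause 18(ii): keep the second call
    return "N"                      # clause 18(iii): labeled indeterminate

call = resolve_call("A", 0.9, "C", 0.5)  # first detector clearly stronger
```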
19. The method of clause 1, wherein:
at least one of the first classification information, the second classification information, or the final classification information indicates that the detected base sequence has a specific base sequence pattern; and
in response to the indication that the detected base sequence has the particular base sequence pattern, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
20. The method of clause 19, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
20a. The method of clause 19, wherein:
the specific base sequence pattern includes a homopolymer pattern or a pattern having a flanking homopolymer.
21. The method of clause 19, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base are G.
The method of clause 19, wherein:
the specific base sequence pattern includes at least five bases, wherein at least a first base and a last base are G.
22. The method of clause 19, wherein:
the specific base sequence pattern includes a plurality of bases, wherein most of the plurality of bases of the specific base sequence pattern are G.
The method of clause 19, wherein:
the specific base sequence pattern includes at least five bases, wherein at least three bases of the specific base sequence pattern are G.
The method of clause 19, wherein:
the specific base sequence pattern includes any one of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, wherein X is any one of A, C, T or G.
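The five-base G-rich patterns listed above can be matched with a simple regular expression; this is an illustrative sketch, and the function name and the use of `.` as the wildcard base X are assumptions:

```python
import re

# GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, where X is any one of A, C, T or G
G_RICH = re.compile(r"GG.GG|G.GGG|GGG.G|G..GG|GG..G")

def has_g_rich_window(read):
    """True if any five-base window of the read matches a listed pattern."""
    return G_RICH.search(read) is not None
```

Such a check could supply the base sequence context information that triggers the differing weights of clause 19.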
The method of clause 19, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
22B1. The method of clause 19, wherein:
the particular base sequence pattern includes at least five bases, wherein each of at least a first base and a last base is associated with an inactive base detection.
The method according to clause 19, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
The method of clause 19, wherein:
the particular base sequence pattern includes a plurality of bases, wherein at least a majority of the bases of the particular base sequence pattern are associated with inactive base detection.
The method according to clause 19, wherein:
the particular base sequence pattern includes a plurality of bases, wherein at least a majority of the bases of the particular base sequence pattern are associated with a dark cycle.
23. The method of clause 19, wherein:
the first weight is lower than the second weight such that the weight of the first classification information is less than the weight of the second classification information when the final classification information is generated.
24. The method of clause 23, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model.
25. The method of clause 19, wherein:
the first weight is higher than 90% and the second weight is lower than 10%.
26. The method of clause 1, wherein:
the sensor data includes: (i) First sensor data for a first one or more sensing cycles, and (ii) second sensor data for a second one or more sensing cycles occurring after the first one or more sensing cycles;
the final classification information includes:
(i) First final classification information for the first one or more sensing cycles generated by: (a) Placing a first weight on the first classification information associated with the first one or more sensing cycles, and (b) placing a second weight on the second classification information associated with the first one or more sensing cycles; and
(ii) Second final classification information for the second one or more sensing cycles generated by: (a) Placing a third weight on the first classification information associated with the second one or more sensing cycles, and (b) placing a fourth weight on the second classification information associated with the second one or more sensing cycles; and
the first weight, the second weight, the third weight, and the fourth weight are different.
27. The method of clause 26, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model;
the first weight is lower than the second weight such that for the first one or more sensing cycles, the second classification information from the second base detector is more important than the first classification information from the first base detector; and
the third weight is higher than the fourth weight such that for the second one or more sensing cycles, the first classification information from the first base detector is more important than the second classification information from the second base detector.
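An illustrative sketch of the per-cycle weighting of clauses 26 and 27; the crossover cycle and the concrete weight values are assumptions (the clauses only require that the weights differ in the stated directions):

```python
def cycle_weights(cycle, crossover=25):
    """Return (nn_weight, non_nn_weight) for a given sensing cycle."""
    if cycle < crossover:
        # early cycles: first weight < second weight, so the
        # non-neural-network detector's classification dominates
        return 0.3, 0.7
    # later cycles: third weight > fourth weight, so the
    # neural-network detector's classification dominates
    return 0.7, 0.3
```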
28. The method of clause 1, wherein:
the sensor data includes: (i) First sensor data from a first one or more clusters of a block of the flow-through cell, and
(ii) Second sensor data from a second one or more clusters of the block of the flow-through cell; the final classification information includes:
(i) First final classification information of the first sensor data from the first one or more clusters, the first final classification information generated by: (a) Placing a first weight on the first classification information from the first one or more clusters, and
(b) Placing a second weight on the second classification information from the first one or more clusters; and
(ii) Second final classification information of the second sensor data from the second one or more clusters, the second final classification information generated by: (a) Placing a third weight on the first classification information from the second one or more clusters, and
(b) Placing a fourth weight on the second classification information from the second one or more clusters; and
the first weight, the second weight, the third weight, and the fourth weight are different.
29. The method of clause 28, wherein:
the first one or more clusters are edge clusters disposed within a threshold distance from one or more edges of the block of the flow-through cell; and
the second one or more clusters are non-edge clusters disposed at more than the threshold distance from one or more edges of the block of the flow-through cell.
30. The method of clause 29, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that for the first one or more edge clusters, the first classification information from the first base detector is more important than the second classification information from the second base detector.
31. The method of clause 30, wherein:
the third weight is lower than or equal to the fourth weight such that for the second one or more non-edge clusters, the first classification information from the first base detector is no more important than the second classification information from the second base detector.
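The edge/non-edge weighting of clauses 28 through 31 can be sketched as below; the tile dimensions, the threshold distance, and the concrete weights are illustrative assumptions:

```python
def cluster_weights(x, y, tile_w=2048, tile_h=2048, threshold=64):
    """Return (nn_weight, non_nn_weight) from a cluster's tile position."""
    dist_to_edge = min(x, y, tile_w - x, tile_h - y)
    if dist_to_edge <= threshold:
        return 0.8, 0.2   # edge cluster: neural-network detector favored
    return 0.5, 0.5       # non-edge cluster: nn weight <= non-nn weight
```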
32. The method of clause 1, further comprising:
detecting the presence of one or more bubbles in at least one cluster of a block of the flow-through cell based on the sensor data,
wherein generating the final classification information comprises:
in response to the detection of the one or more bubbles, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
33. The method of clause 32, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that in response to the detection of the one or more bubbles, the first classification information from the first base detector is more important than the second classification information from the second base detector.
34. The method of clause 1, wherein the sensor data comprises at least one image, and wherein the method further comprises:
detecting that the at least one image is an out-of-focus image,
wherein generating the final classification information comprises:
in response to the detection of the out-of-focus image, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
35. The method of clause 34, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that in response to the detection of the out-of-focus image, the first classification information from the first base detector is more important than the second classification information from the second base detector.
36. The method of clause 1, wherein:
the sensor data is associated with a plurality of sequencing cycles;
the first classification information includes a first detected base sequence corresponding to the plurality of sequencing cycles, and the second classification information includes a second detected base sequence corresponding to the plurality of sequencing cycles;
the first detected base sequence and the second detected base sequence do not match, and at least one of the first detected base sequence or the second detected base sequence has a specific base sequence pattern;
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
generating the final classification information includes:
in response to (i) at least one of the first detected base sequence or the second detected base sequence having the particular base sequence pattern, and (ii) the second base detector not including the neural network model, generating the final classification information such that the final detected base sequence of the final classification information matches the second detected base sequence but does not match the first detected base sequence.
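A sketch of the override rule in clause 36: on a sequence mismatch where either read shows the special pattern, the non-neural-network detector's sequence is taken. The pattern test and the default branch for non-special mismatches are illustrative assumptions:

```python
def looks_special(seq):
    """Illustrative stand-in for the pattern test (e.g. a G-rich read)."""
    return "GG" in seq and seq.count("G") >= 3

def pick_sequence(nn_seq, classic_seq, is_special=looks_special):
    """Choose the final detected base sequence per clause 36."""
    if nn_seq == classic_seq:
        return nn_seq
    if is_special(nn_seq) or is_special(classic_seq):
        return classic_seq   # trust the non-neural-network detector
    return nn_seq            # assumption: otherwise keep the NN call
```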
37. The method of clause 36, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
38. The method of clause 36, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base are G.
39. The method of clause 36, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a majority of the bases of the specific base sequence pattern are G.
39a. The method of clause 36, wherein:
the specific base sequence pattern includes at least five bases, wherein at least three bases of the specific base sequence pattern are G.
39A. The method of clause 36, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
39b. the method of clause 36, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
39c. the method of clause 36, wherein:
the particular base sequence pattern includes a plurality of bases, wherein at least a majority of the bases of the particular base sequence pattern are associated with inactive base detection.
39d. the method of clause 36, wherein:
the particular base sequence pattern includes a plurality of bases, wherein at least a majority of the bases of the particular base sequence pattern are associated with a dark cycle.
40. The method of clause 1, wherein generating the final classification information comprises: receiving, by a machine learning model, the first classification information associated with the sensor data from the first base detector;
receiving, by the machine learning model, the second classification information associated with the sensor data from the second base detector; and
the final classification information is generated by the machine learning model based on the first classification information and the second classification information.
The method of clause 40, wherein the machine learning model is any one of a logistic regression model, a gradient-boosted tree model, a random forest model, a naive Bayes model, or a neural network model.
The method of clause 1, wherein generating the final classification information comprises: receiving, by a neural network model, the first classification information associated with the sensor data from the first base detector;
receiving, by the neural network model, the second classification information associated with the sensor data from the second base detector; and
the final classification information is generated by the neural network model based on the first classification information and the second classification information.
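A minimal sketch of the stacked meta-model of clauses 40 through the neural-network variant above: the model receives both detectors' four-score vectors as one eight-feature input and emits the final per-base scores. Here a fixed linear layer plus softmax stands in for a trained logistic regression or neural network; the weight values are illustrative placeholders, not trained parameters:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def meta_model(scores_1, scores_2, weights):
    """Map the concatenated 8-score input to 4 final scores."""
    x = scores_1 + scores_2  # stacked feature vector
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)

# Placeholder weights: output base k attends to both detectors' base-k scores.
W = [[3.0 if j % 4 == k else 0.0 for j in range(8)] for k in range(4)]
final = meta_model([0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1], W)
```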
41. A computer-implemented method, the method comprising:
generating sensor data for a sensing cycle of a series of sensing cycles;
performing at least a first base detector and a second base detector on at least a corresponding portion of the sensor data, and selectively switching the execution of the first base detector and the second base detector based on context information associated with the sensor data, wherein the first base detector is different from the second base detector;
generating first classification information and second classification information from the first base detector and the second base detector, respectively; and
generating a base detection based on one or both of the first classification information and the second classification information.
42. A non-transitory computer readable storage medium storing computer program instructions for progressively training a base detector, the instructions, when executed on a processor, implementing a method comprising:
performing at least a first base detector and a second base detector on sensor data generated for a sensing cycle of a series of sensing cycles;
generating, by the first base detector, first classification information associated with the sensor data based on executing the first base detector on the sensor data;
generating, by the second base detector, second classification information associated with the sensor data based on performing the second base detector on the sensor data; and
based on the first classification information and the second classification information, final classification information is generated, the final classification information including one or more base detections of the sensor data.
43. The non-transitory computer readable storage medium of clause 42, wherein at least one of the first base detector and the second base detector implements a non-linear function, and wherein at least the other of the first base detector and the second base detector is at least partially linear.
44. The non-transitory computer-readable storage medium of clause 42, wherein at least one of the first and second base detectors implements a neural network model and at least the other of the first and second base detectors does not include a neural network model.
45. The non-transitory computer readable storage medium of clause 42, wherein:
for each base detection cycle, the first classification information generated by the first base detector includes: (i) A first plurality of scores, each score of the first plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a first detected base; and
for each base detection cycle, the second classification information generated by the second base detector includes: (i) A second plurality of scores, each score of the second plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a second detected base.
46. The non-transitory computer readable storage medium of clause 45, wherein:
for each base detection cycle, the final classification information includes: (i) A third plurality of scores, each score of the third plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) the base to be finally detected.
47. The non-transitory computer-readable storage medium of clause 45, wherein at least one of the first base detector and the second base detector uses a softmax function to generate a corresponding plurality of scores.
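Clause 47 recites a softmax function for producing the plurality of scores; a standard, numerically stable implementation (the exact form used by the detectors is not specified in the clause) is:

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = softmax([2.0, 0.5, 0.1, 0.1])     # raw outputs for A, C, T, G
```

Subtracting the maximum logit before exponentiating leaves the result unchanged while avoiding overflow for large inputs.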
48. The non-transitory computer readable storage medium of clause 42, wherein generating the final classification information comprises:
the final classification information is generated by selectively combining the first classification information and the second classification information based on context information associated with the sensor data.
49. The non-transitory computer readable storage medium of clause 48, wherein the context information associated with the sensor data includes temporal context information, spatial context information, base sequence context information, and other context information.
50. The non-transitory computer readable storage medium of clause 48, wherein the context information associated with the sensor data comprises temporal context information indicating one or more base detection cycles associated with the sensor data.
51. The non-transitory computer readable storage medium of clause 48, wherein the context information associated with the sensor data comprises spatial context information indicating locations of one or more tiles of the flow cell at which the sensor data was generated.
52. The non-transitory computer readable storage medium of clause 48, wherein the context information associated with the sensor data comprises spatial context information indicating a location within a block of the flow cell at which one or more clusters of the sensor data were generated.
52A. The non-transitory computer readable storage medium of clause 52, wherein the spatial context information indicates whether the one or more clusters within the block of the flow cell that generated the sensor data are edge clusters or non-edge clusters.
52B. The non-transitory computer-readable storage medium of clause 52A, wherein if a cluster is estimated to be located within a threshold distance from an edge of the block, the cluster is classified as an edge cluster.
The non-transitory computer readable storage medium of clause 52A, wherein if a cluster is estimated to be located greater than a threshold distance from any edge of the block, the cluster is classified as a non-edge cluster.
53. The non-transitory computer readable storage medium of clause 48, wherein the context information associated with the sensor data comprises base sequence context information indicating a base sequence detected for the sensor data.
54. The non-transitory computer readable storage medium of clause 42, wherein:
for a particular base to be detected, the first classification information includes a first score, a second score, a third score, and a fourth score that indicate probabilities of the base to be detected being A, C, T and G, respectively;
for the particular base to be detected, the second classification information includes a fifth score, a sixth score, a seventh score, and an eighth score that indicate probabilities of the base to be detected being A, C, T and G, respectively; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information based on the first score, the second score, the third score, the fourth score, the fifth score, the sixth score, the seventh score, and the eighth score.
55. The non-transitory computer readable storage medium of clause 54, wherein:
the final score comprises a first final score that is a function of the first score and the fifth score, the first final score indicating a probability that the base to be detected is a;
the final score comprises a second final score that is a function of the second score and the sixth score, the second final score indicating a probability that the base to be detected is C;
the final score comprises a third final score that is a function of the third score and the seventh score, the third final score indicating a probability that the base to be detected is T; and
the final score includes a fourth final score that is a function of the fourth score and the eighth score, the fourth final score indicating a probability that the base to be detected is G.
56. The non-transitory computer readable storage medium of clause 55, wherein:
the first final score is an average, normalized weighted average, minimum or maximum of the first score and the fifth score;
the second final score is an average, normalized weighted average, minimum or maximum of the second score and the sixth score;
the third final score is an average, normalized weighted average, minimum or maximum of the third score and the seventh score; and
the fourth final score is an average, normalized weighted average, minimum or maximum of the fourth score and the eighth score.
57. The non-transitory computer readable storage medium of clause 55, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G, the first detected base having a corresponding score that is the highest score of the first score, the second score, the third score, and the fourth score; and
for the particular base to be detected, the second classification information includes a second detected base that is one of A, C, T and G, the second detected base having a corresponding score that is the highest score of the fifth score, the sixth score, the seventh score, and the eighth score.
58. The non-transitory computer readable storage medium of clause 42, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is identical to the first detected base; and
generating the final classification information includes:
for the particular base to be detected, the final classification information is generated such that the final classification information includes a final detected base that matches the first detected base and the second detected base.
59. The non-transitory computer readable storage medium of clause 42, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is the other of A, C, T and G such that the second detected base does not match the first detected base; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information such that the final classification information includes a final detected base that is one of: (i) the first detected base, (ii) the second detected base, or (iii) a base labeled as indeterminate.
60. The non-transitory computer readable storage medium of clause 42, wherein:
at least one of the first classification information, the second classification information, or the final classification information indicates that the detected base sequence has a specific base sequence pattern; and
in response to the indication that the detected base sequence has the specific base sequence pattern, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
61. The non-transitory computer readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
62. The non-transitory computer readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base are G.
63. The non-transitory computer readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a plurality of bases, wherein most of the bases of the specific base sequence pattern are G.
63A. The non-transitory computer-readable storage medium of clause 60, wherein:
the specific base sequence pattern includes any one of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, wherein X is any one of A, C, T or G.
63b. the non-transitory computer-readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
63c. the non-transitory computer readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
63d. the non-transitory computer-readable storage medium of clause 60, wherein:
the specific base sequence pattern includes a plurality of bases, wherein most of the bases of the specific base sequence pattern are associated with inactive base detection.
63e. the non-transitory computer readable storage medium of clause 60, wherein:
the specific base sequence pattern includes at least five bases, wherein at least three bases of the specific base sequence pattern are associated with a dark cycle.
64. The non-transitory computer readable storage medium of clause 60, wherein:
the first weight is lower than the second weight such that the weight of the first classification information is less than the weight of the second classification information when the final classification information is generated.
65. The non-transitory computer readable storage medium of clause 64, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model.
66. The non-transitory computer readable storage medium of clause 60, wherein:
the first weight is higher than 90% and the second weight is lower than 10%.
67. The non-transitory computer readable storage medium of clause 42, wherein:
the sensor data includes: (i) First sensor data for a first one or more sensing cycles, and (ii) second sensor data for a second one or more sensing cycles occurring after the first one or more sensing cycles;
the final classification information includes:
(i) First final classification information for the first one or more sensing cycles generated by: (a) Placing a first weight on the first classification information associated with the first one or more sensing cycles, and (b) placing a second weight on the second classification information associated with the first one or more sensing cycles; and
(ii) Second final classification information for the second one or more sensing cycles generated by: (a) Placing a third weight on the first classification information associated with the second one or more sensing cycles, and (b) placing a fourth weight on the second classification information associated with the second one or more sensing cycles; and
the first weight, the second weight, the third weight, and the fourth weight are different.
68. The non-transitory computer readable storage medium of clause 67, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model;
the first weight is lower than the second weight such that for the first one or more sensing cycles, the second classification information from the second base detector is more important than the first classification information from the first base detector; and
the third weight is higher than the fourth weight such that for the second one or more sensing cycles, the first classification information from the first base detector is more important than the second classification information from the second base detector.
69. The non-transitory computer readable storage medium of clause 42, wherein:
the sensor data includes: (i) First sensor data from a first one or more clusters of a block of the flow-through cell, and
(ii) Second sensor data from a second one or more clusters of the block of the flow-through cell; the final classification information includes:
(i) First final classification information of the first sensor data from the first one or more clusters, the first final classification information generated by: (a) Placing a first weight on the first classification information from the first one or more clusters, and (b) placing a second weight on the second classification information from the first one or more clusters; and
(ii) Second final classification information of the second sensor data from the second one or more clusters, the second final classification information generated by: (a) Placing a third weight on the first classification information from the second one or more clusters, and
(b) Placing a fourth weight on the second classification information from the second one or more clusters; and
the first weight, the second weight, the third weight, and the fourth weight are different.
70. The non-transitory computer readable storage medium of clause 69, wherein:
the first one or more clusters are edge clusters disposed within a threshold distance from one or more edges of the block of the flow-through cell; and
the second one or more clusters are non-edge clusters disposed at more than the threshold distance from one or more edges of the block of the flow-through cell.
71. The non-transitory computer readable storage medium of clause 70, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that for the first one or more edge clusters, the first classification information from the first base detector is more important than the second classification information from the second base detector.
72. The non-transitory computer readable storage medium of clause 71, wherein:
the third weight is lower than or equal to the fourth weight such that for the second one or more non-edge clusters, the first classification information from the first base detector is less important than, or equally important as, the second classification information from the second base detector.
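Clauses 69 to 72 describe weighting the two base detectors' outputs differently for edge and non-edge clusters of a tile. The sketch below illustrates one such scheme; the function names, weight values, and the coordinate convention are assumptions for illustration only, not taken from the patent.

```python
def is_edge_cluster(x, y, tile_width, tile_height, threshold):
    """A cluster is an edge cluster if it sits within `threshold` of any tile edge."""
    return (
        x < threshold or y < threshold
        or tile_width - x < threshold or tile_height - y < threshold
    )

def blend_scores(nn_scores, classical_scores, edge):
    # Hypothetical weights: favor the neural-network caller on edge clusters,
    # where a classical intensity model may degrade; weight equally elsewhere.
    w_nn, w_classical = (0.8, 0.2) if edge else (0.5, 0.5)
    total = w_nn + w_classical
    return [
        (w_nn * a + w_classical * b) / total
        for a, b in zip(nn_scores, classical_scores)
    ]
```

The per-base blended scores remain a probability distribution over A, C, T, and G because the weights are normalized by their sum.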
73. The non-transitory computer-readable storage medium of clause 42, further comprising: detecting the presence of one or more bubbles in at least one cluster of a block of the flow-through cell based on the sensor data,
wherein generating the final classification information comprises:
in response to the detection of the one or more bubbles, the final classification information is generated by placing a first weight to the first classification information and a second weight to the second classification information, wherein the first weight and the second weight are different.
74. The non-transitory computer readable storage medium of clause 73, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that in response to the detection of the one or more bubbles, the first classification information from the first base detector is more important than the second classification information from the second base detector.
75. The non-transitory computer readable storage medium of clause 73, wherein the sensor data comprises at least one image, and wherein the method further comprises:
detecting that the at least one image is an out-of-focus image,
wherein generating the final classification information comprises:
in response to the detection of the out-of-focus image, the final classification information is generated by placing a first weight to the first classification information and a second weight to the second classification information, wherein the first weight and the second weight are different.
76. The non-transitory computer readable storage medium of clause 75, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is higher than the second weight such that in response to the detection of the out-of-focus image, the first classification information from the first base detector is more important than the second classification information from the second base detector.
77. The non-transitory computer readable storage medium of clause 42, wherein:
the sensor data is associated with a plurality of sequencing cycles;
the first classification information includes a first detected base sequence corresponding to the plurality of sequencing cycles, and the second classification information includes a second detected base sequence corresponding to the plurality of sequencing cycles;
The first detected base sequence and the second detected base sequence do not match, and at least one of the first detected base sequence or the second detected base sequence has a specific base sequence pattern;
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
generating the final classification information includes:
in response to (i) at least one of the first detected base sequence or the second detected base sequence having the particular base sequence pattern, and (ii) the second base detector not including the neural network model, generating the final classification information such that the final detected base sequence of the final classification information matches the second detected base sequence but does not match the first detected base sequence.
78. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
79. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base are G.
80. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a plurality of bases, wherein most of the bases of the specific base sequence pattern are G.
80A. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
80B. The non-transitory computer-readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
80C. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes a plurality of bases, wherein most of the plurality of bases of the specific base sequence pattern are associated with inactive base detection.
80D. The non-transitory computer readable storage medium of clause 77, wherein:
the specific base sequence pattern includes at least five bases, wherein each of the at least three bases of the specific base sequence pattern is associated with a dark cycle.
81. The non-transitory computer readable storage medium of clause 42, wherein generating the final classification information comprises:
receiving, by a neural network model, the first classification information associated with the sensor data from the first base detector;
receiving, by the neural network model, the second classification information associated with the sensor data from the second base detector; and
the final classification information is generated by the neural network model based on the first classification information and the second classification information.
Clause set 2 (switching/selectively enabling two base detectors)
1. A computer-implemented method for base detection using at least two base detectors, the method comprising:
performing a first base detector on sensor data generated for a sensing cycle of a series of sensing cycles;
generating, by the first base detector, first classification information associated with the sensor data based on said executing the first base detector on the sensor data;
determining that the first classification information is insufficient for generating final classification information for the sensor data;
in response to determining the inadequacy of the first classification information, performing a second base detector on the sensor data, the second base detector being different from the first base detector;
Generating, by the second base detector, second classification information associated with the sensor data based on said executing the second base detector on the sensor data; and
based on the first classification information and the second classification information, the final classification information is generated, the final classification information including one or more base detections of the sensor data.
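The clause-1 method of this set invokes the second base detector only when the first detector's output is judged insufficient. A minimal control-flow sketch of that switching behavior; all callable names here are placeholders, not names from the patent:

```python
def call_bases(sensor_data, first_caller, second_caller, is_insufficient, combine):
    # Always run the first (primary) base detector.
    first_info = first_caller(sensor_data)
    if not is_insufficient(first_info):
        return first_info                        # first caller alone suffices
    # Insufficient: run the second (fallback) detector on demand,
    # then combine the two classification results.
    second_info = second_caller(sensor_data)
    return combine(first_info, second_info)
```

Because the second detector runs only on demand, compute cost is paid only for the reads or cycles where the primary caller is deemed unreliable.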
2. The method of clause 1, wherein the first classification information comprises a first detected base sequence, and wherein determining that the first classification information is insufficient comprises:
determining that the first detected base sequence matches a specific base sequence pattern; and
based on the first detected base sequence matching the specific base sequence pattern, it is determined that the first classification information is insufficient for generating the final classification information.
3. The method of clause 2, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
4. The method of clause 2, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base of the plurality of bases are G.
4A. The method according to clause 2, wherein:
The specific base sequence pattern includes at least five bases, wherein at least a first base and a last base are G.
5. The method of clause 2, wherein:
the specific base sequence pattern includes a plurality of bases, wherein at least three bases of the plurality of bases are G.
The method according to clause 2, wherein:
the specific base sequence pattern includes at least five bases, wherein at least three bases of the specific base sequence pattern are G.
6. The method of clause 2, wherein:
the specific base sequence pattern includes any one of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, wherein X is any one of A, C, T or G.
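An illustrative matcher (an assumption for illustration, not the patent's implementation) for the specific base sequence patterns of clause 6: GGXGG, GXGGG, GGGXG, GXXGG, and GGXXG, where X is any of A, C, T, or G. Each is a five-base window whose first and last bases are G and in which at least three bases are G, so homopolymer runs such as GGGGG also match.

```python
import re

# One alternation branch per pattern listed in clause 6.
SPECIFIC_PATTERNS = re.compile(
    r"GG[ACGT]GG|G[ACGT]GGG|GGG[ACGT]G|G[ACGT][ACGT]GG|GG[ACGT][ACGT]G"
)

def has_specific_pattern(called_sequence):
    """True if any window of the called sequence matches a specific pattern."""
    return SPECIFIC_PATTERNS.search(called_sequence) is not None
```

A detected sequence matching such a pattern could then trigger the reweighting or fallback behavior described in the surrounding clauses.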
The method according to clause 2, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
The method according to clause 2, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
The method according to clause 2, wherein:
the specific base sequence pattern includes at least five bases, wherein each of the at least three bases of the specific base sequence pattern is associated with inactive base detection.
The method according to clause 2, wherein:
the specific base sequence pattern includes at least five bases, wherein each of the at least three bases of the specific base sequence pattern is associated with a dark cycle.
7. The method of clause 2, wherein generating the final classification information comprises:
in response to the first detected base sequence matching the particular base sequence pattern, generating the final classification information by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
8. The method of clause 7, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model; and
the first weight is less than the second weight such that the weight of the first classification information is less than the weight of the second classification information when the final classification information is generated.
9. The method of clause 2, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model;
the second classification information includes a second detected base sequence;
The first detected base sequence does not match the second detected base sequence; and
in response to (i) the first detected base sequence matching the particular base sequence pattern, and (ii) the second base detector not including the neural network model, the final classification information is generated such that the final detected base sequence of the final classification information matches the second detected base sequence but does not match the first detected base sequence.
10. The method of clause 1, wherein determining that the first classification information is insufficient comprises:
detecting the presence of bubbles in the cluster from which the sensor data is generated; and
based on the detection of the bubble, it is determined that the first classification information is insufficient for generating the final classification information.
11. The method of clause 10, wherein the second base detector implements a neural network model and the first base detector does not include a neural network model, and generating the final classification information comprises:
the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the second weight is greater than the first weight.
12. The method of clause 1, wherein:
the sensor data is current sensor data;
the current sensor data is for a sensing cycle N1 and one or more subsequent sensing cycles, where N1 is a positive integer greater than 1; and
Performing the second base detector on the current sensor data includes:
first, the second base detector is performed on past sensor data associated with at least T sensing cycles occurring prior to the sensing cycle N1 to estimate phasing data associated with the at least T sensing cycles, and
subsequently, the second base detector is performed on the current sensor data associated with the sensing cycle N1 and the one or more subsequent sensing cycles using the estimated phasing data.
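Clause 12 describes a two-pass invocation when the second base detector joins mid-run at cycle N1: it first replays the T already-acquired cycles to estimate phasing data, then calls bases for cycle N1 onward using that estimate. A sketch of that flow; the method names `estimate_phasing` and `call_bases` are assumed for illustration:

```python
def run_second_caller_midrun(caller, past_sensor_data, current_sensor_data):
    # Pass 1: estimate phasing from the T past cycles (no base calls emitted).
    phasing = caller.estimate_phasing(past_sensor_data)
    # Pass 2: call bases for cycle N1 and later using the estimated phasing.
    return caller.call_bases(current_sensor_data, phasing=phasing)
```

The warm-up pass matters because phasing/prephasing parameters are normally accumulated cycle by cycle; a caller started cold at cycle N1 would otherwise lack them.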
13. The method of clause 1, wherein the sensor data is first sensor data generated from a first one or more clusters of a block of the flow-through cell, and wherein the method further comprises:
generating second sensor data from a second one or more clusters of the block of the flow-through cell for a sensing cycle of the series of sensing cycles; and
performing the first base detector on the second sensor data;
Generating, by the first base detector, third classification information associated with the second sensor data based on executing the first base detector on the second sensor data;
determining that the third classification information is sufficient to generate final classification information for the second one or more clusters;
wherein performing the second base detector on the first sensor data comprises:
the second base detector is performed on the first sensor data, but not on the second sensor data, such that (i) the final classification of the first one or more clusters is based on the output of the first base detector and the second base detector, and (ii) the final classification of the second one or more clusters is based on the output of the first base detector and not the second base detector.
14. The method of clause 1, wherein determining that the first classification information is insufficient comprises:
receiving context information associated with the sensor data;
determining, based on the context information, that the first classification information includes an error probability that is above a threshold probability; and
based on determining that the first classification information includes the error probability that is above the threshold probability, determining that the first classification information is insufficient for generating final classification information for the sensor data.
15. A system for base detection, the system comprising:
a memory storing images depicting intensity emissions of a set of analytes, the intensity emissions generated by analytes in the set of analytes during a sequencing cycle of a sequencing run, wherein the memory further stores a topology of the first base detector and the second base detector;
a context information generation module configured to generate context information associated with the images;
one or more processors configured to execute the first base detector on the images, thereby generating first classification information associated with the images; and
a final base detection determination module configured to determine a deficiency of the first classification information in generating final classification information associated with the images,
wherein in response to a determination of the deficiency of the first classification information, the one or more processors are configured to perform the second base detector on the images to generate second classification information associated with the images, and
Wherein the final base detection determination module is further configured to generate the final classification information comprising one or more final base detections of the sequencing run based at least in part on the second classification information.
16. The system of clause 15, wherein the final classification information is further generated based at least in part on the first classification information.
17. The system of clause 15, wherein the final classification information is generated based on a weighted sum of the first classification information and the second classification information.
18. The system of clause 15, wherein the first classification information comprises a first detected base sequence, and wherein to determine the deficiency of the first classification information, the final base detection determination module is to:
determining that the first detected base sequence matches a particular base sequence pattern; and
determining that the first classification information is insufficient for generating final classification information based on the first detected base sequence matching the particular base sequence pattern.
19. The system of clause 18, wherein:
the specific base sequence pattern includes a homopolymer pattern or a near-homopolymer pattern.
20. The system of clause 18, wherein:
The specific base sequence pattern includes a plurality of bases, wherein at least a first base and a last base are G.
21. The system of clause 18, wherein:
the specific base sequence pattern includes at least five bases, wherein at least three bases of the specific base sequence pattern are G.
22. The system of clause 18, wherein:
the specific base sequence pattern includes any one of GGXGG, GXGGG, GGGXG, GXXGG, GGXXG, wherein X is any one of A, C, T or G.
The system of clause 18, wherein:
the specific base sequence pattern includes a plurality of bases, wherein each of at least a first base and a last base is associated with inactive base detection.
The system of clause 18, wherein:
the specific base sequence pattern includes a plurality of bases, wherein base detection of each of at least a first base and a last base is associated with a dark cycle.
The system of clause 18, wherein:
the specific base sequence pattern includes at least five bases, wherein each of the at least three bases of the specific base sequence pattern is associated with inactive base detection.
The system of clause 18, wherein:
The specific base sequence pattern includes at least five bases, wherein each of the at least three bases of the specific base sequence pattern is associated with a dark cycle.
23. The system of clause 15, wherein to determine the deficiency of the first classification information, the final base detection determination module is to:
detecting the presence of bubbles in the clusters from which the images were generated; and
based on the detection of the bubble, it is determined that the first classification information is insufficient for generating the final classification information.
24. The system of clause 15, wherein to determine the deficiency of the first classification information, the final base detection determination module is to:
detecting the presence of out-of-focus images within the images; and
based on the detection of the out-of-focus image, it is determined that the first classification information is insufficient for generating the final classification information.
25. A non-transitory computer readable storage medium storing computer program instructions which, when executed on a processor, implement a method comprising:
performing a first base detector on sensor data generated for a sensing cycle of a series of sensing cycles to generate first classification information associated with the sensor data;
Processing (i) context information associated with the sensor data and (ii) the first classification information;
performing a second base detector on the sensor data based on processing the context information and the first classification information to generate second classification information associated with the sensor data; and
based on the first classification information and the second classification information, generating final classification information, the final classification information including one or more base detections of the sensor data.

Claims (30)

1. A computer-implemented method for base detection using at least two base detectors, the method comprising:
performing at least a first base detector and a second base detector on sensor data generated for a sensing cycle of a series of sensing cycles;
generating, by the first base detector, first classification information associated with the sensor data based on performing the first base detector on the sensor data;
generating, by the second base detector, second classification information associated with the sensor data based on performing the second base detector on the sensor data; and
based on the first classification information and the second classification information, final classification information is generated, the final classification information including one or more base detections of the sensor data.
2. The method of claim 1, wherein at least one of the first base detector and the second base detector implements a nonlinear function, and wherein at least the other of the first base detector and the second base detector is at least partially linear.
3. The method of claim 1 or 2, wherein at least one of the first base detector and the second base detector implements a neural network model, and at least the other of the first base detector and the second base detector does not include a neural network model.
4. A method according to any one of claims 1 to 3, wherein:
for each base detection cycle, the first classification information generated by the first base detector includes: (i) A first plurality of scores, each score of the first plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a first detected base; and
for each base detection cycle, the second classification information generated by the second base detector includes: (i) A second plurality of scores, each score of the second plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) a second detected base.
5. The method of any one of claims 1 to 4, wherein:
for each base detection cycle, the final classification information includes: (i) A third plurality of scores, each score of the third plurality of scores indicating a probability that the base to be detected is one of A, C, T or G, and (ii) the base ultimately detected.
6. The method of any one of claims 1-5, wherein at least one of the first base detector and the second base detector uses a softmax function to generate a corresponding plurality of scores.
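The per-cycle scores of claims 4 to 6 behave as a probability distribution over {A, C, T, G}, and claim 6 names softmax as one way to produce them. Below is a standard numerically stable softmax (a textbook construction assumed here, not quoted from the patent) paired with an argmax base call:

```python
import math

BASES = ("A", "C", "T", "G")

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def call_base(logits):
    """Return (detected base, four scores) for one cluster and one cycle."""
    scores = softmax(logits)             # four scores summing to 1
    return BASES[scores.index(max(scores))], scores
```

The detected base in claims 4 and 19 is simply the base whose score is highest, which is what the argmax over the softmax output computes.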
7. The method of any of claims 1-6, wherein generating the final classification information comprises:
the final classification information is generated by selectively combining the first classification information and the second classification information based on context information associated with the sensor data.
8. The method of claim 7, wherein the context information associated with the sensor data comprises temporal context information, spatial context information, base sequence context information, and other context information.
9. The method of claim 7 or 8, wherein the context information associated with the sensor data comprises temporal context information indicating one or more base detection cycles associated with the sensor data.
10. The method of any of claims 7-9, wherein the context information associated with the sensor data includes spatial context information indicating a location within a flow cell of one or more tiles generating the sensor data.
11. The method of any of claims 7-10, wherein the context information associated with the sensor data includes spatial context information indicating a location, within a block of the flow cell, of one or more clusters generating the sensor data.
12. The method of claim 10 or 11, wherein the spatial context information indicates whether the one or more clusters within the block of the flow-through cell that generated the sensor data are edge clusters or non-edge clusters.
13. The method of any of claims 10 to 12, wherein a cluster is classified as an edge cluster if the cluster is estimated to be within a threshold distance from an edge of the block.
14. The method of any of claims 10 to 13, wherein a cluster is classified as a non-edge cluster if the cluster is estimated to be located greater than a threshold distance from any edge of the block.
15. The method of any one of claims 7 to 14, wherein the context information associated with the sensor data comprises base sequence context information indicating a base sequence detected for the sensor data.
16. The method of any one of claims 7 to 15, wherein:
for a particular base to be detected, the first classification information includes a first score, a second score, a third score, and a fourth score that indicate probabilities of the base to be detected being A, C, T and G, respectively;
for the particular base to be detected, the second classification information includes a fifth score, a sixth score, a seventh score, and an eighth score that indicate probabilities of the base to be detected being A, C, T and G, respectively; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information based on the first score, the second score, the third score, the fourth score, the fifth score, the sixth score, the seventh score, and the eighth score.
17. The method according to claim 16, wherein:
the final classification information comprises a first final score that is a function of the first score and the fifth score, the first final score indicating a probability that the base to be detected is A;
the final classification information comprises a second final score that is a function of the second score and the sixth score, the second final score indicating a probability that the base to be detected is C;
the final classification information comprises a third final score that is a function of the third score and the seventh score, the third final score indicating a probability that the base to be detected is T; and
the final classification information includes a fourth final score that is a function of the fourth score and the eighth score, the fourth final score indicating a probability that the base to be detected is G.
18. The method according to claim 17, wherein:
the first final score is an average, normalized weighted average, minimum or maximum of the first score and the fifth score;
the second final score is an average, normalized weighted average, minimum or maximum of the second score and the sixth score;
the third final score is an average, normalized weighted average, minimum or maximum of the third score and the seventh score; and
The fourth final score is an average, normalized weighted average, minimum or maximum of the fourth score and the eighth score.
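Claims 17 and 18 describe each final per-base score as an average, normalized weighted average, minimum, or maximum of the two callers' scores for that base. A sketch of that fusion step (the function name and `mode` argument are assumptions for illustration):

```python
def fuse(first_scores, second_scores, mode="mean", w1=0.5, w2=0.5):
    """Combine two 4-element score vectors (A, C, T, G) into final scores."""
    pairs = list(zip(first_scores, second_scores))
    if mode == "mean":
        fused = [(a + b) / 2 for a, b in pairs]
    elif mode == "min":
        fused = [min(a, b) for a, b in pairs]
    elif mode == "max":
        fused = [max(a, b) for a, b in pairs]
    elif mode == "weighted":
        fused = [w1 * a + w2 * b for a, b in pairs]
    else:
        raise ValueError(mode)
    total = sum(fused)                   # renormalize so final scores sum to 1
    return [f / total for f in fused]
```

The renormalization keeps the four final scores interpretable as probabilities even for the min/max variants, whose raw outputs need not sum to one.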
19. The method according to claim 17 or 18, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G, the first detected base having a corresponding score that is the highest of the first score, the second score, the third score, and the fourth score; and
for the particular base to be detected, the second classification information includes a second detected base that is one of A, C, T and G, the second detected base having a corresponding score that is the highest score of the fifth score, the sixth score, the seventh score, and the eighth score.
20. The method of any one of claims 1 to 19, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is identical to the first detected base; and
Generating the final classification information includes:
the final classification information is generated for the particular base to be detected such that the final classification information includes a final detected base that matches the first detected base and the second detected base.
21. The method of any one of claims 1 to 20, wherein:
for a particular base to be detected, the first classification information includes a first detected base that is one of A, C, T and G;
for the particular base to be detected, the second classification information includes a second detected base that is the other of A, C, T and G such that the second detected base does not match the first detected base; and
generating the final classification information includes:
for the particular base to be detected, generating the final classification information such that the final classification information includes a final detected base that is one of: (i) the first detected base, (ii) the second detected base, or (iii) a base labeled as indeterminate.
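Claims 20 and 21 cover the two cases of agreement and disagreement between the callers. One illustrative resolution rule (the confidence margin and tie-break policy are assumptions, not taken from the patent): emit the shared base when the callers agree; otherwise emit the more confident caller's base, or mark the call indeterminate when neither is clearly more confident.

```python
BASES = ("A", "C", "T", "G")

def resolve(first_scores, second_scores, margin=0.1):
    b1 = max(range(4), key=lambda i: first_scores[i])   # first caller's base
    b2 = max(range(4), key=lambda i: second_scores[i])  # second caller's base
    if b1 == b2:
        return BASES[b1]                 # claim 20: matching detected bases
    # claim 21: mismatch -> first base, second base, or indeterminate ("N")
    if first_scores[b1] - second_scores[b2] > margin:
        return BASES[b1]
    if second_scores[b2] - first_scores[b1] > margin:
        return BASES[b2]
    return "N"
```

Here "N" stands in for the indeterminate label; a production pipeline might instead lower the quality score attached to the call.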
22. The method of any one of claims 1 to 21, wherein:
at least one of the first classification information, the second classification information, or the final classification information indicates that the detected base sequence has a specific base sequence pattern; and
In response to the indication that the detected base sequence has the particular base sequence pattern, the final classification information is generated by placing a first weight to the first classification information and a second weight to the second classification information, wherein the first weight and the second weight are different.
23. The method of any one of claims 1 to 22, wherein:
the sensor data includes: (i) First sensor data for a first one or more sensing cycles, and (ii) second sensor data for a second one or more sensing cycles occurring after the first one or more sensing cycles;
the final classification information includes:
(i) First final classification information for the first one or more sensing cycles generated by: (a) Placing a first weight on the first classification information associated with the first one or more sensing cycles, and (b) placing a second weight on the second classification information associated with the first one or more sensing cycles; and
(ii) Second final classification information for the second one or more sensing cycles generated by: (a) Placing a third weight on the first classification information associated with the second one or more sensing cycles, and (b) placing a fourth weight on the second classification information associated with the second one or more sensing cycles; and
The first weight, the second weight, the third weight, and the fourth weight are different.
24. The method according to claim 23, wherein:
the first base detector implements a neural network model, and the second base detector does not include a neural network model;
the first weight is lower than the second weight such that for the first one or more sensing cycles, the second classification information from the second base detector is more important than the first classification information from the first base detector; and
the third weight is higher than the fourth weight such that for the second one or more sensing cycles, the first classification information from the first base detector is more important than the second classification information from the second base detector.
25. The method of any one of claims 1 to 24, wherein:
the sensor data includes: (i) first sensor data from a first one or more clusters of a tile of a flow cell, and (ii) second sensor data from a second one or more clusters of the tile of the flow cell;
the final classification information includes:
(i) first final classification information of the first sensor data from the first one or more clusters, the first final classification information generated by: (a) placing a first weight on the first classification information from the first one or more clusters, and (b) placing a second weight on the second classification information from the first one or more clusters; and
(ii) second final classification information of the second sensor data from the second one or more clusters, the second final classification information generated by: (a) placing a third weight on the first classification information from the second one or more clusters, and (b) placing a fourth weight on the second classification information from the second one or more clusters; and
the first weight, the second weight, the third weight, and the fourth weight are different.
26. The method according to claim 25, wherein:
the first one or more clusters are edge clusters disposed within a threshold distance from one or more edges of the tile of the flow cell; and
the second one or more clusters are non-edge clusters disposed at more than the threshold distance from the one or more edges of the tile of the flow cell.
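Claims 25 and 26 weight the two detectors differently for edge clusters versus interior clusters. A sketch of how the weight pair could be selected per cluster (the coordinate convention, threshold, and weight values are illustrative assumptions, not from the patent):

```python
def cluster_weights(x, y, tile_width, tile_height, threshold,
                    edge_w=(0.7, 0.3), interior_w=(0.4, 0.6)):
    """Return (first-detector, second-detector) weights for one cluster.

    A cluster at (x, y) is treated as an edge cluster when it lies
    within `threshold` units of any edge of the tile (claim 26);
    edge and interior clusters receive different weight pairs.
    """
    dist_to_edge = min(x, y, tile_width - x, tile_height - y)
    return edge_w if dist_to_edge <= threshold else interior_w
```

A downstream combiner would then apply the returned pair to the first and second classification information for that cluster.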
27. The method of any one of claims 1 to 26, the method further comprising:
detecting, from the sensor data, the presence of one or more bubbles in at least one cluster of a tile of a flow cell,
wherein generating the final classification information comprises:
in response to the detection of the one or more bubbles, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
28. The method of any one of claims 1 to 27, wherein the sensor data comprises at least one image, and wherein the method further comprises:
detecting that the at least one image is an out-of-focus image,
wherein generating the final classification information comprises:
in response to the detection of the out-of-focus image, the final classification information is generated by placing a first weight on the first classification information and a second weight on the second classification information, wherein the first weight and the second weight are different.
29. A computer-implemented method, the method comprising:
generating sensor data for a sensing cycle of a series of sensing cycles;
performing at least a first base detector and a second base detector on at least a corresponding portion of the sensor data, and selectively switching execution between the first base detector and the second base detector based on context information associated with the sensor data, wherein the first base detector is different from the second base detector;
generating first classification information and second classification information by the first base detector and the second base detector, respectively; and
generating base detections based on one or both of the first classification information and the second classification information.
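Claim 29's selective switching can be pictured as a small dispatch policy driven by the context information. The flags, threshold, and policy below are assumptions for the sketch; the patent does not enumerate a specific policy:

```python
def select_detectors(context):
    """Choose which base detector(s) to execute for a portion of sensor data.

    `context` is a dict of illustrative signals (bubble detected,
    out-of-focus image, current sensing cycle). Difficult conditions
    run both detectors so their classification information can be
    combined; otherwise a single detector is executed.
    """
    if context.get("bubble") or context.get("out_of_focus"):
        return ("first", "second")   # run both and combine downstream
    if context.get("cycle", 0) < 25:
        return ("second",)           # early cycles: second detector only
    return ("first",)                # later cycles: first detector only
```

Running both detectors only when the context flags a difficult condition keeps compute cost down while still producing combined classification information where it matters most.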
30. A non-transitory computer-readable storage medium storing computer program instructions for progressively training a base detector, the instructions, when executed on a processor, implementing a method comprising:
performing at least a first base detector and a second base detector on sensor data generated for a sensing cycle of a series of sensing cycles;
generating, by the first base detector, first classification information associated with the sensor data based on performing the first base detector on the sensor data;
generating, by the second base detector, second classification information associated with the sensor data based on performing the second base detector on the sensor data; and
generating, based on the first classification information and the second classification information, final classification information including one or more base detections of the sensor data.
CN202280044106.4A 2021-08-03 2022-08-02 Base detection using multiple base detector model Pending CN117546248A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/228954 2021-08-03
US17/876528 2022-07-28
US17/876,528 US20230041989A1 (en) 2021-08-03 2022-07-28 Base calling using multiple base caller models
PCT/US2022/039208 WO2023014741A1 (en) 2021-08-03 2022-08-02 Base calling using multiple base caller models

Publications (1)

Publication Number Publication Date
CN117546248A true CN117546248A (en) 2024-02-09

Family

ID=89794364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280044106.4A Pending CN117546248A (en) 2021-08-03 2022-08-02 Base detection using multiple base detector model

Country Status (1)

Country Link
CN (1) CN117546248A (en)

Similar Documents

Publication Publication Date Title
US20220301657A1 (en) Tile location and/or cycle based weight set selection for base calling
US20230041989A1 (en) Base calling using multiple base caller models
AU2022319125A1 (en) Quality score calibration of basecalling systems
CN117546248A (en) Base detection using multiple base detector model
CN117616474A (en) Intensity extraction with interpolation and adaptation for base detection
EP4381514A1 (en) Base calling using multiple base caller models
US20230029970A1 (en) Quality score calibration of basecalling systems
JP2024510539A (en) Tile position and/or cycle-based weight set selection for base calling
US20220415445A1 (en) Self-learned base caller, trained using oligo sequences
US20230026084A1 (en) Self-learned base caller, trained using organism sequences
WO2022197752A1 (en) Tile location and/or cycle based weight set selection for base calling
CN117529780A (en) Mass fraction calibration of base detection systems
US20230298339A1 (en) State-based base calling
US20230087698A1 (en) Compressed state-based base calling
CA3224382A1 (en) Self-learned base caller, trained using oligo sequences
CN117546249A (en) Self-learning base detector trained using oligonucleotide sequences
WO2023049215A1 (en) Compressed state-based base calling
EP4374343A1 (en) Intensity extraction with interpolation and adaptation for base calling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination