IL309786A - Quality score calibration of basecalling systems
Info
- Publication number
- IL309786A
- Authority
- IL
- Israel
- Prior art keywords
- sensor data
- range
- clusters
- computer
- subset
- Prior art date
Classifications
- G—PHYSICS
  - G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    - G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
      - G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
        - G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
        - G16B40/20—Supervised data analysis
      - G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/04—Architecture, e.g. interconnection topology
            - G06N3/045—Combinations of networks
            - G06N3/0464—Convolutional networks [CNN, ConvNet]
          - G06N3/08—Learning methods
            - G06N3/084—Backpropagation, e.g. using gradient descent
            - G06N3/09—Supervised learning
- C—CHEMISTRY; METALLURGY
  - C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    - C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
      - C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
        - C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
          - C12Q1/6869—Methods for sequencing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Public Health (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Analytical Chemistry (AREA)
- Epidemiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Organic Chemistry (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Microbiology (AREA)
- Immunology (AREA)
- Signal Processing (AREA)
- Biochemistry (AREA)
- Genetics & Genomics (AREA)
Claims (20)
1. A computer-implemented method of generating base calls by a base caller, comprising: receiving, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identifying a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; mapping at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; processing the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remapping each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
2. The computer-implemented method of claim 1, wherein the second range is fully encompassed within the first range.
3. The computer-implemented method of claim 1 or 2, wherein one or more outlier sensor data within the first range are absent from the second range of sensor data.
4. The computer-implemented method of any of claims 1-3, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
5. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 0.5% or less.
6. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 1.0% or less.
7. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 0.5% or less.
8. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 1% or less.
9. The computer-implemented method of any of claims 4-8, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and prior to the mapping, assigning the low value to the first outlier sensor data, and assigning the high value to the second outlier sensor data, such that the first outlier sensor data and the second outlier sensor data are within the second range subsequent to the assignment.
10. The computer-implemented method of any of claims 4-9, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and excluding the first outlier sensor data and the second outlier sensor data from the subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels during the mapping, for being outside the second range, such that the first outlier sensor data and the second outlier sensor data are not mapped to the third range.
11. The computer-implemented method of any of claims 1-10, wherein mapping at least a subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels comprises: mapping a first sensor data within the subset from a first value that is within the second range to a second value that is within the third range; and mapping a second sensor data within the subset from a third value that is within the second range to a fourth value that is within the third range.
12. The computer-implemented method of any of claims 1-11, wherein individual sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels comprises corresponding intensity of a corresponding section of an image generated from the flow cell.
13. The computer-implemented method of any of claims 1-12, further comprising: processing the plurality of normalized sensor data in a base caller, to assign, to the corresponding base called for the target cluster, a first quality score indicating a first probability of the corresponding base being an A, a second quality score indicating a second probability of the corresponding base being a C, a third quality score indicating a third probability of the corresponding base being a T, and a fourth quality score indicating a fourth probability of the corresponding base being a G.
14. The computer-implemented method of claim 13, wherein the plurality of quality scores corresponding to the base call comprise the first quality score, the second quality score, the third quality score, and the fourth quality score.
15. The computer-implemented method of claim 14, further comprising: quantizing each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
16. A non-transitory computer readable storage medium comprising computer program instructions that, when executed on a processor, cause a computing device to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
17. The non-transitory computer readable storage medium of claim 16, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
18. The non-transitory computer readable storage medium of claim 17, further comprising computer program instructions that, when executed on the processor, cause the computing device to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
19. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions thereon that, when executed by the at least one processor, cause the system to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
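The normalization, remapping, and quantization steps recited in claims 1, 4, 9, 10, and 15 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the default 0.5% thresholds on each side, the [0, 1] third range, the linear-interpolated percentile, the piecewise-linear calibration table, and the quantization levels are all assumptions chosen for illustration; the claims do not specify them.

```python
import math
from typing import List, Sequence, Tuple


def percentile(sorted_vals: Sequence[float], pct: float) -> float:
    """Linearly interpolated percentile of pre-sorted values, pct in [0, 100]."""
    if not sorted_vals:
        raise ValueError("no sensor data")
    rank = (len(sorted_vals) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def identify_second_range(data: Sequence[float],
                          lower_pct: float = 0.5,
                          upper_pct: float = 0.5) -> Tuple[float, float]:
    """Pick low/high values so that roughly lower_pct of the data fall below
    the low value and upper_pct fall above the high value (cf. claim 4)."""
    s = sorted(data)
    return percentile(s, lower_pct), percentile(s, 100.0 - upper_pct)


def normalize(data: Sequence[float], low: float, high: float,
              target: Tuple[float, float] = (0.0, 1.0),
              clip: bool = True) -> List[float]:
    """Linearly map values from [low, high] to the target (third) range.

    clip=True assigns the low/high value to outliers first (cf. claim 9);
    clip=False leaves outliers out of the mapping (cf. claim 10).
    """
    if high <= low:
        raise ValueError("second range must have positive width")
    t_lo, t_hi = target
    out = []
    for x in data:
        if x < low or x > high:
            if not clip:
                continue  # outlier is not mapped to the third range
            x = min(max(x, low), high)
        out.append(t_lo + (x - low) / (high - low) * (t_hi - t_lo))
    return out


def remap_quality(q: float, table: Sequence[Tuple[float, float]]) -> float:
    """Remap a raw score through a hypothetical calibration table of
    (raw, calibrated) points with strictly increasing raw values."""
    if q <= table[0][0]:
        return table[0][1]
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if q <= x1:
            return y0 + (q - x0) * (y1 - y0) / (x1 - x0)
    return table[-1][1]


def quantize(q: float, levels: Sequence[float]) -> float:
    """Snap a remapped score to the nearest of a small set of levels."""
    return min(levels, key=lambda lv: abs(lv - q))
```

In this sketch, `identify_second_range` would be called once per flow cell region, `normalize` would prepare the base caller input for that region, and `quantize(remap_quality(q, table), levels)` would be applied to each quality score the base caller emits. How the calibration table is obtained is outside the scope of the sketch.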
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163226707P | 2021-07-28 | 2021-07-28 | |
US17/839,387 US20230029970A1 (en) | 2021-07-28 | 2022-06-13 | Quality score calibration of basecalling systems |
PCT/US2022/038729 WO2023009758A1 (en) | 2021-07-28 | 2022-07-28 | Quality score calibration of basecalling systems |
Publications (1)
Publication Number | Publication Date |
---|---|
IL309786A (en) | 2024-02-01 |
Family
ID=83149575
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IL309786A (en) | 2021-07-28 | 2022-07-28 | Quality score calibration of basecalling systems |
Country Status (7)
Country | Link |
---|---|
EP (1) | EP4377960A1 (en) |
JP (1) | JP2024532049A (en) |
KR (1) | KR20240037882A (en) |
AU (1) | AU2022319125A1 (en) |
CA (1) | CA3223746A1 (en) |
IL (1) | IL309786A (en) |
WO (1) | WO2023009758A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118053503A (en) * | 2024-01-11 | 2024-05-17 | 中国农业科学院农业基因组研究所 | Method and system for constructing invasive biology multi-group database |
Family Cites Families (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1991006678A1 (en) | 1989-10-26 | 1991-05-16 | Sri International | Dna sequencing |
US6090592A (en) | 1994-08-03 | 2000-07-18 | Mosaic Technologies, Inc. | Method for performing amplification of nucleic acid on supports |
US5641658A (en) | 1994-08-03 | 1997-06-24 | Mosaic Technologies, Inc. | Method for performing amplification of nucleic acid with two primers bound to a single solid support |
AU6846698A (en) | 1997-04-01 | 1998-10-22 | Glaxo Group Limited | Method of nucleic acid amplification |
JP2001517948A (en) | 1997-04-01 | 2001-10-09 | グラクソ、グループ、リミテッド | Nucleic acid sequencing |
AR021833A1 (en) | 1998-09-30 | 2002-08-07 | Applied Research Systems | METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID |
US20020150909A1 (en) | 1999-02-09 | 2002-10-17 | Stuelpnagel John R. | Automated information processing in randomly ordered arrays |
EP1368460B1 (en) | 2000-07-07 | 2007-10-31 | Visigen Biotechnologies, Inc. | Real-time sequence determination |
AU2002227156A1 (en) | 2000-12-01 | 2002-06-11 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
AR031640A1 (en) | 2000-12-08 | 2003-09-24 | Applied Research Systems | ISOTHERMAL AMPLIFICATION OF NUCLEIC ACIDS IN A SOLID SUPPORT |
US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
US20040002090A1 (en) | 2002-03-05 | 2004-01-01 | Pascal Mayer | Methods for detecting genome-wide sequence variations associated with a phenotype |
EP2607369B1 (en) | 2002-08-23 | 2015-09-23 | Illumina Cambridge Limited | Modified nucleotides for polynucleotide sequencing |
PT3147292T (en) | 2002-08-23 | 2018-11-22 | Illumina Cambridge Ltd | Labelled nucleotides |
GB0321306D0 (en) | 2003-09-11 | 2003-10-15 | Solexa Ltd | Modified polymerases for improved incorporation of nucleotide analogues |
US20110059865A1 (en) | 2004-01-07 | 2011-03-10 | Mark Edward Brennan Smith | Modified Molecular Arrays |
WO2006044078A2 (en) | 2004-09-17 | 2006-04-27 | Pacific Biosciences Of California, Inc. | Apparatus and method for analysis of molecules |
WO2006064199A1 (en) | 2004-12-13 | 2006-06-22 | Solexa Limited | Improved method of nucleotide detection |
EP1888743B1 (en) | 2005-05-10 | 2011-08-03 | Illumina Cambridge Limited | Improved polymerases |
US8045998B2 (en) | 2005-06-08 | 2011-10-25 | Cisco Technology, Inc. | Method and system for communicating using position information |
GB0514936D0 (en) | 2005-07-20 | 2005-08-24 | Solexa Ltd | Preparation of templates for nucleic acid sequencing |
GB0517097D0 (en) | 2005-08-19 | 2005-09-28 | Solexa Ltd | Modified nucleosides and nucleotides and uses thereof |
US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
GB0522310D0 (en) | 2005-11-01 | 2005-12-07 | Solexa Ltd | Methods of preparing libraries of template polynucleotides |
EP2021503A1 (en) | 2006-03-17 | 2009-02-11 | Solexa Ltd. | Isothermal methods for creating clonal single molecule arrays |
EP3722409A1 (en) | 2006-03-31 | 2020-10-14 | Illumina, Inc. | Systems and devices for sequence by synthesis analysis |
US20080242560A1 (en) | 2006-11-21 | 2008-10-02 | Gunderson Kevin L | Methods for generating amplified nucleic acid arrays |
US7595882B1 (en) | 2008-04-14 | 2009-09-29 | General Electric Company | Hollow-core waveguide-based raman systems and methods |
CA2859660C (en) | 2011-09-23 | 2021-02-09 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
WO2016064703A1 (en) * | 2014-10-21 | 2016-04-28 | Life Technologies Corporation | Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data |
BR112020014542A2 (en) * | 2018-01-26 | 2020-12-08 | Quantum-Si Incorporated | MACHINE LEARNING ENABLED BY PULSE AND BASE APPLICATION FOR SEQUENCING DEVICES |
US11347965B2 (en) * | 2019-03-21 | 2022-05-31 | Illumina, Inc. | Training data generation for artificial intelligence-based sequencing |
2022
- 2022-07-28 IL IL309786A patent/IL309786A/en unknown
- 2022-07-28 JP JP2023579782A patent/JP2024532049A/en active Pending
- 2022-07-28 AU AU2022319125A patent/AU2022319125A1/en active Pending
- 2022-07-28 KR KR1020237043770A patent/KR20240037882A/en unknown
- 2022-07-28 WO PCT/US2022/038729 patent/WO2023009758A1/en active Application Filing
- 2022-07-28 EP EP22761681.0A patent/EP4377960A1/en active Pending
- 2022-07-28 CA CA3223746A patent/CA3223746A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
AU2022319125A1 (en) | 2024-01-18 |
EP4377960A1 (en) | 2024-06-05 |
WO2023009758A1 (en) | 2023-02-02 |
JP2024532049A (en) | 2024-09-05 |
CA3223746A1 (en) | 2023-02-02 |
KR20240037882A (en) | 2024-03-22 |