WO2025006874A1 - Machine-learning model for recalibrating genotype calls corresponding to germline variants and somatic mosaic variants - Google Patents
Machine-learning model for recalibrating genotype calls corresponding to germline variants and somatic mosaic variants
- Publication number
- WO2025006874A1 (PCT/US2024/036003)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variant
- call
- genotype
- recalibration
- variants
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- nucleobase sequencing platforms determine individual nucleobases within sequences from germ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods.
- SBS: sequencing-by-synthesis
- existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset.
- a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls.
- existing SBS platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify germline variants within the germline cells of a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), and/or other structural variants, and genotype calls.
- SNPs: single nucleotide polymorphisms
- indels: insertions or deletions
- existing nucleobase sequencing platforms and sequencing data analysis software (together and hereinafter, existing sequencing systems) often (a) limit variant calling to germline variants only and/or (b) cannot accurately detect both somatic mosaic variant calls and germline variant calls.
- some existing systems utilize extensive statistical data analysis, such as Bayesian probabilistic modeling, to implement computational tools (e.g., Java-based tools) for identifying somatic mosaic and germline variant calls within existing sequence data.
- Such Bayesian-based systems require significant computation time, processing, and resources and can often result in multiple false positives in identifying somatic mosaic variants.
- Such limits and shortcomings also apply to state-of-the-art machine-learning-based sequencing systems.
- In both machine-learning-based and statistical or probabilistic models, existing sequencing systems exhibit the technical limits of (a) and (b) in part due to the nature of somatic mosaic variants. Germline variants of a genomic sample are inherited from the sample’s parents by the time the zygote forms and are present in the sample’s germ cells. By contrast, somatic mosaic variants typically constitute mutations that (i) were introduced after zygote formation during cell development (e.g., in 1 of 4 early cells), (ii) were not inherited from the given sample’s parents, and (iii) were not introduced by a form of cancer or tumor in the given sample. Consequently, a relatively small proportion of a given sample’s cells include such somatic mosaic variants. Depending on when in development or in which cell type a somatic mosaic variant was introduced, the variant allele fraction of a somatic mosaic variant in a given sample’s cells can range from 10-50% to much smaller percentages, such as 0.1%.
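To make the variant allele fraction discussed above concrete, the following minimal sketch computes the fraction of reads at a locus that support an alternate allele. The function name and the read counts are hypothetical and chosen only for illustration; they do not come from the disclosure.

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Return the fraction of reads at a locus that support the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


# Hypothetical counts: a heterozygous germline site versus a low-fraction mosaic site.
germline_het = variant_allele_fraction(alt_reads=15, total_reads=30)
mosaic = variant_allele_fraction(alt_reads=3, total_reads=1000)
```

With these illustrative counts, the germline site sits at a fraction of one half, while the mosaic site is supported by only three reads in a thousand, which is why such variants are easy to confuse with sequencing noise.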
- SSEs: sequence specific errors
- Existing sequencing systems often determine false-positive somatic mosaic variant calls based on various noise sources common in DNA sequencing, such as sequence specific errors (SSEs) induced by one or more of inverted repeats, homopolymers, or nucleotide context; uneven read depth or coverage across genomic regions of a reference genome, where certain genomic regions comprising somatic mosaic variants may lack read coverage (e.g., below 10X or 20X); sequencing platform-specific errors induced by, for example, barcode swapping or allele capture bias against somatic mosaic variants; DNA sample contamination; misclassification of germline variants versus somatic mosaic variants induced by differences among corresponding variant allele fractions; nucleotide read mapping errors that obscure or hide somatic mosaic variants because nucleotide reads reflecting such somatic mosaic variants may be incorrectly mapped to the wrong genomic region; polymerase chain reaction (PCR) errors during the process of growing clusters of oligonucleotides; and DNA damage caused by reagents, heat, or other environmental sources.
- existing sequencing systems often limit sequencing pipelines or sequencing data analysis software to a single variant type. For instance, one existing sequencing system is configured to determine germline variant calls only and another existing sequencing system is configured to determine somatic mosaic variant calls only. When clinicians, research laboratories, or other parties seek to identify both germline variants and somatic mosaic variants for a given sample, however, such existing sequencing systems limit options to using two separate single-variant-type pipelines or sequencing data analysis software to separately determine somatic mosaic variant calls and germline variant calls for the given sample.
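One of the noise sources listed above is insufficient read coverage (e.g., below 10X or 20X). As a rough sketch of how such regions might be flagged, the snippet below filters a mapping of region names to mean depth. The function name, the region labels, the depth values, and the default threshold of 20 are all assumptions made for illustration.

```python
from typing import Dict, List


def low_coverage_regions(depth_by_region: Dict[str, float],
                         min_depth: float = 20.0) -> List[str]:
    """Return the regions whose mean read depth falls below min_depth."""
    return [region for region, depth in depth_by_region.items()
            if depth < min_depth]


# Hypothetical mean depths per genomic region.
depths = {
    "chr1:1000-2000": 35.2,
    "chr1:2000-3000": 8.4,
    "chr2:500-1500": 19.9,
}
flagged = low_coverage_regions(depths)
```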
- Such separate, single-variant-type sequencing unnecessarily consumes more memory, processing, and computational time for both (i) specialized sequencing platforms for primary analysis to determine nucleotide reads for a given sample and (ii) computing devices executing sequencing data analysis software for secondary analysis to determine variant calls for the given sample.
- This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that can utilize a machine-learning model to recalibrate genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants.
- the disclosed systems can utilize one or more machine-learning models to jointly generate genotype probabilities that account for both germline variants and somatic mosaic variants within a genomic sample.
- the disclosed systems can determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample, generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants, and generate variant calls corresponding to germline variants or somatic mosaic variants.
- the disclosed systems train or utilize a variant-call-recalibration machine-learning model to generate predictions for genotype calls based on training data with known somatic mosaic variants at relatively low allele frequencies.
- the systems can utilize various sources of such training data, including training data generated by synthetically modifying existing ground truth sequencing files or by implementing an admixture of germline truth sets to simulate somatic mosaicisms.
- After training, the disclosed variant-call-recalibration machine-learning model generates genotype probabilities for genotypes, where some of the genotypes for which probabilities are determined include either a germline variant or a somatic mosaic variant. Based on the generated probabilities, the disclosed systems can confirm or change various fields of sequencing information within a genotype-call data file or other output data file.
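The following sketch illustrates one way generated genotype probabilities could be turned into updated output fields. It assumes VCF-style conventions (a genotype string such as "0/1" and a Phred-scaled genotype quality capped at 99); the disclosure refers only to "various fields," so the field choice, the function name, and the example probabilities are assumptions rather than the disclosed method.

```python
import math
from typing import Dict, Tuple


def call_with_quality(genotype_probs: Dict[str, float],
                      cap: int = 99) -> Tuple[str, int]:
    """Pick the most probable genotype and a Phred-scaled quality for it."""
    total = sum(genotype_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize so the values behave as probabilities.
    normalized = {gt: p / total for gt, p in genotype_probs.items()}
    best = max(normalized, key=normalized.get)
    # Probability that the chosen genotype is wrong.
    p_error = 1.0 - normalized[best]
    if p_error <= 0.0:
        return best, cap
    return best, min(cap, round(-10.0 * math.log10(p_error)))


# Hypothetical recalibrated probabilities for one site.
call, quality = call_with_quality({"0/0": 0.01, "0/1": 0.98, "1/1": 0.01})
```

A downstream step could compare the returned call and quality with the values already present in the output file and either confirm them or overwrite them.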
- FIG. 1 illustrates a block diagram of a sequencing system including a dual-variant-type call recalibration system in accordance with one or more embodiments.
- FIG. 2 illustrates an overview of the dual-variant-type call recalibration system utilizing a variant-call-recalibration machine-learning model to generate genotype probabilities and, based on such probabilities, generating genotype calls in accordance with one or more embodiments.
- FIGS. 3A-3C illustrate the dual-variant-type call recalibration system determining or identifying sequencing metrics in accordance with one or more embodiments.
- FIG. 4A illustrates the dual-variant-type call recalibration system generating genotype probabilities utilizing a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
- FIG. 4B illustrates the dual-variant-type call recalibration system determining genotype calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments.
- FIGS. 5A-5B illustrate example processes for the dual-variant-type call recalibration system generating modified sample sequencing data with corresponding ground-truth somatic mosaic variants in accordance with one or more embodiments.
- FIG. 6 illustrates an example process for the dual-variant-type call recalibration system training a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
- FIGS. 7A-7F illustrate graphs of experimental results of utilizing the dual-variant-type call recalibration system to identify somatic mosaic variants within modified genomic samples in accordance with one or more embodiments.
- FIG. 8 illustrates further experimental results of utilizing the dual-variant-type call recalibration system relative to existing sequencing systems in identifying somatic mosaic variants within a genomic dataset in accordance with one or more embodiments.
- FIG. 9 illustrates a flowchart of a series of acts for generating variant calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments.
- FIG. 10 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
- This disclosure describes embodiments of a dual-variant-type call recalibration system that uses machine learning to recalibrate or confirm genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants in a genomic sample.
- the disclosed dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model trained to jointly generate genotype probabilities accounting for germline variants and somatic mosaic variants within a genomic sample.
- the disclosed dual-variant-type call recalibration system can train a variant-call-recalibration machine-learning model utilizing ground-truth training data generated to include somatic mosaic variants at various allele frequencies, such as synthetically modified existing ground truth sequences or an admixture of germline truth sets simulating mosaic sequences. Accordingly, in some embodiments, the disclosed dual-variant-type call recalibration system trains the variant-call-recalibration machine-learning model to generate more accurate genotype probabilities using ground-truth training data for both germline variants and somatic mosaic variants.
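To illustrate why an admixture of germline truth sets can simulate mosaic variants, the sketch below computes the expected variant allele fraction at a site when reads from two samples are mixed. It assumes both samples contribute reads in proportion to a single mixing fraction and have comparable coverage at the site; the disclosure does not specify the mixing procedure in this passage, so the function name and the numbers are hypothetical.

```python
def admixture_vaf(vaf_a: float, vaf_b: float, fraction_a: float) -> float:
    """Expected allele fraction when fraction_a of the reads come from sample A."""
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must be between 0 and 1")
    return fraction_a * vaf_a + (1.0 - fraction_a) * vaf_b


# A variant heterozygous in sample A (0.5) and absent in sample B (0.0),
# with 10% of reads drawn from A, appears at a mosaic-like fraction.
simulated_mosaic = admixture_vaf(vaf_a=0.5, vaf_b=0.0, fraction_a=0.1)

# A variant homozygous in both samples keeps its germline fraction.
shared_germline = admixture_vaf(vaf_a=1.0, vaf_b=1.0, fraction_a=0.1)
```

Under these assumptions, variants private to the minor sample land at low fractions with a known ground truth, while variants shared by both samples retain germline-like fractions.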
- the disclosed dual-variant-type call recalibration system utilizes a trained variant-call-recalibration machine-learning model to generate genotype probabilities for genotypes at genomic regions of a genomic sample, where at least some of the genotypes for which genotype probabilities are determined include either a germline variant or a somatic mosaic variant.
- the dual-variant-type call recalibration system determines base-call-quality metrics, mapping quality metrics, and/or other sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample.
- the dual-variant-type call recalibration system executes the machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants.
- genotype probabilities can represent probabilities of homozygous reference calls, homozygous variant calls, or heterozygous calls for a given reference nucleobase and corresponding alternate base or bases (e.g., alternate base 1 and alternate base 2).
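- To make the set of possible genotypes concrete, the sketch below enumerates the unphased diploid genotypes for a reference base plus one or two alternate bases. The function name is hypothetical, and the ordering follows the VCF convention (genotype j/k with j <= k sits at index k*(k+1)/2 + j), which is an assumption about presentation rather than something this disclosure specifies.

```python
from typing import List


def diploid_genotypes(n_alleles: int) -> List[str]:
    """List unphased diploid genotypes for alleles 0..n_alleles-1.

    Allele 0 is the reference base; alleles 1, 2, ... are alternate bases.
    Genotypes are emitted in VCF order: j/k (j <= k) at index k*(k+1)/2 + j.
    """
    if n_alleles < 1:
        raise ValueError("need at least the reference allele")
    return [f"{j}/{k}" for k in range(n_alleles) for j in range(k + 1)]
```

- For a reference base and a single alternate base this yields 0/0, 0/1, and 1/1; adding a second alternate base adds 0/2, 1/2, and 2/2.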
- the dual-variant-type call recalibration system determines at least one variant call corresponding to a germline variant in the genomic sample and at least one variant call corresponding to a somatic mosaic variant in the genomic sample. While a single, integrated variant-call-recalibration machine-learning model can be used as described by this disclosure, in some embodiments, the dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model comprising two or more machine-learning sub-models to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants.
- the dual-variant-type call recalibration system can utilize a first machine-learning sub-model trained for germline variant identification and a second machine-learning sub-model trained for integrated germline and somatic mosaic variant identification.
- the dual-variant-type call recalibration system can provide improved accuracy in identifying variants within genomic regions corresponding to germline and somatic mosaic variants.
- the dual-variant-type call recalibration system includes a user-selectable option for implementing detection of candidate somatic mosaic variants in addition to germline variants. For example, upon receiving an indication of user selection of a provided variant-sensitivity option, the dual-variant-type call recalibration system can execute the aforementioned variant-call-recalibration machine-learning model instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
- the dual-variant-type call recalibration system provides several technical advantages, benefits, and/or improvements over existing sequencing systems, including variant callers and other sequencing data analysis software.
- the dual-variant-type call recalibration system increases the flexibility and variant-type breadth with which a sequencing system can determine, modify, or update genotype calls corresponding to germline variants and somatic mosaic variants.
- many existing machine-learning-based variant callers, for instance, are limited to determining variant calls exclusively for germline variants. Such machine-learning-based variant callers accordingly perform better when facilitating genotype calls exhibiting an allele frequency of approximately 0.5 or 1.0 in a genomic sample.
- the dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model trained to generate genotype probabilities that help distinguish somatic mosaic variants from various sources of noise that currently prevent existing sequencing systems from identifying variants exhibiting relatively low allele frequencies.
- the dual-variant-type call recalibration system successfully identifies variants corresponding to somatic mosaic variants within genomic samples, where existing sequencing systems can only accurately identify variants corresponding to germline variants.
- the dual-variant-type call recalibration system provides improved accuracy over existing sequencing systems, particularly in identifying variants corresponding to somatic mosaic variants within a genomic sample. For example, by training a variant-call-recalibration machine-learning model utilizing ground truth sequencing data that includes ground truth somatic mosaic variants (e.g., synthetic or naturally occurring-derived ground truth) at various target allele frequencies, the dual-variant-type call recalibration system can identify variants within genomic samples that correspond to somatic mosaic variants where existing sequencing systems are generally limited to variants corresponding to germline variants. This disclosure further illustrates such improved accuracy below with respect to at least FIGS. 7A-7F.
- the dual-variant-type call recalibration system improves the computing efficiency with which a sequencing system identifies somatic mosaic variants within genomic sequences.
- some existing Bayesian-based sequencing systems are configured to identify somatic mosaic variants within existing sequencing data by extensive statistical data analysis. While some such systems can also determine germline variants, they require significant computation time, processing, memory, and other computational resources and often result in multiple false-positive somatic mosaic variants.
- existing sequencing systems that employ two separate single-variant-type pipelines or sequencing data analysis software to separately determine somatic mosaic variant calls and germline variant calls for a given sample burn unnecessary memory, processing, and computational time.
- some existing sequencing systems utilize computationally expensive, slow neural network architectures (e.g., deep learning architectures such as convolutional neural networks) that require many hours (e.g., tens to hundreds of hours) across multiple-core processors to implement for processing read data to generate variant calls for a sample. Such deep learning architectures can further require several days (or weeks) to train.
- the dual-variant-type call recalibration system utilizes a comparatively lightweight, fast architecture for generating variant calls as described herein.
- the dual-variant-type call recalibration system requires under an hour (for both germline and mosaic variant calling) of runtime (e.g., on a single field programmable gate array and/or a multicore processor) to generate variant calls for a genomic sample (see, e.g., FIG. 8 and the corresponding text).
- the dual-variant-type call recalibration system is significantly faster and less computationally expensive than many deep learning approaches to somatic mosaic variant calling.
- Not only are the models of the dual-variant-type call recalibration system faster and less computationally expensive to implement, but the disclosed variant-call-recalibration machine-learning models are also much faster and less computationally expensive to train than many existing deep learning systems.
- the dual-variant-type call recalibration system provides improved efficiency compared to existing variant caller systems.
- the variant-call-recalibration machine-learning model is trained to generate genotype probabilities that account for both germline variants and somatic mosaic variants. Based on such dual-purpose genotype probabilities, the disclosed dual-variant-type call recalibration system can generate, for genomic regions of a genomic sample, accurate variant calls corresponding to germline variants and variant calls corresponding to somatic mosaic variants.
- sample nucleotide sequence refers to a sequence of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence).
- sample nucleotide sequence includes a segment of a nucleic acid polymer that is isolated or extracted from a genomic sample and composed of nitrogenous heterocyclic bases.
- a sample nucleotide sequence can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. More specifically, in some cases, the sample nucleotide sequence is found in a sample prepared or isolated by a kit and received by a sequencing device.
- genomic sample refers to a target genome or portion of a genome undergoing an assay or sequencing.
- a genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence).
- a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases.
- a genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below.
- the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
- genotype call refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus.
- a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region.
- a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0
- a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base.
- a genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
- nucleobase call refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a sample genome.
- a nucleobase call can indicate (i) a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls) or (ii) a determination or prediction of the type of nucleobase that is present at a genomic coordinate or region within a genome, including a variant call or a non-variant call in a digital output file.
- a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell).
- a nucleobase call includes a determination or a prediction of a nucleobase from chromatogram peaks or electrical current changes resulting from nucleotides passing through a nanopore of a nucleotide-sample slide.
- a nucleobase call can also include a final prediction of a nucleobase at a genomic coordinate of a sample genome for a variant call file (VCF) or another base-call-output file — based on nucleotide reads corresponding to the genomic coordinate.
- a nucleobase call can include a base call corresponding to a genomic coordinate and a reference genome, such as an indication of a variant or a non-variant at a particular location corresponding to the reference genome.
- a nucleobase call can refer to a variant call, including but not limited to, a single nucleotide variant (SNV), an insertion or a deletion (indel), or base call that is part of a structural variant.
- a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or an uracil (U) call.
- nucleotide read refers to an inferred sequence of one or more nucleobases (or nucleobase pairs) from all or part of a sample nucleotide sequence (e.g., a sample genomic sequence, complementary DNA).
- a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample.
- the dual-variant-type call recalibration system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell.
- a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads).
- nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
- the dual-variant-type call recalibration system determines sequencing metrics for nucleobase calls of nucleotide reads.
- sequencing metric refers to a quantitative measurement or score indicating a degree to which an individual nucleobase call (or a sequence of nucleobase calls) aligns, compares, or quantifies with respect to a genomic coordinate or genomic region of a reference genome, with respect to nucleobase calls from nucleotide reads, or with respect to external genomic sequencing or genomic structure.
- a sequencing metric includes a quantitative measurement or score indicating a degree to which (i) individual nucleobase calls align, map, or cover a genomic coordinate or reference base of a reference genome; (ii) nucleobase calls compare to reference or alternative nucleotide reads in terms of mapping, mismatch, base call quality, or other raw sequencing metrics; or (iii) genomic coordinates or regions corresponding to nucleobase calls demonstrate mappability, repetitive base call content, DNA structure, or other generalized metrics.
- the dual-variant-type call recalibration system determines various types of sequencing metrics from different sources, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics.
- read-based sequencing metrics refers to sequencing metrics derived from nucleotide reads of a sample nucleotide sequence.
- read-based sequencing metrics include sequencing metrics determined by applying statistical tests to detect differences between a reference sequence and nucleotide reads.
- read-based sequencing metrics can include a comparative-mapping-quality-distribution metric that indicates a comparison between mapping qualities or a comparative-mismatch-count metric that indicates a comparison between mismatch counts.
- read-based sequencing metrics can correspond to genotype calls generated from different read types, such as assembled nucleotide reads and/or SBS reads.
- externally sourced sequencing metrics refer to sequencing metrics identified or obtained from one or more external databases.
- externally sourced sequencing metrics include metrics relating to mappability of nucleotides, replication timing, or DNA structure that are available outside of the dual-variant-type call recalibration system.
- call-model-generated sequencing metrics refers to internal, model-specific sequencing metrics generated or extracted by a call generation model.
- call-model-generated sequencing metrics include variant calling sequencing metrics extracted or determined via variant caller components of a call generation model and mapping-and-alignment sequencing metrics extracted or determined via mapping-and-alignment components of a call generation model.
- call-model-generated sequencing metrics can include alignment metrics that quantify a degree to which nucleotide reads align with genomic coordinates of a reference genome or other example nucleic acid sequence, such as deletion-size metrics or mapping-quality metrics.
- call-model-generated sequencing metrics can include depth metrics that quantify the depth of nucleobase calls for nucleotide reads at genomic coordinates of a reference genome or other example nucleic acid sequence, such as forward-reverse-depth metrics or normalized-depth metrics.
- Call-model-generated sequencing metrics can also include call-quality metrics that quantify a quality or accuracy of nucleobase calls, such as nucleobase-call-quality metrics, callability metrics, or somatic-quality metrics.
- base-call-quality metric refers to a specific score or other measurement indicating an accuracy of a nucleobase call.
- a base-call-quality metric comprises a value indicating a likelihood that one or more predicted nucleobase calls for a genomic coordinate contain errors.
- a base-call-quality metric can comprise a Q score (e.g., a PHil’s Read EDitor (PHRED) quality score) predicting the error probability of any given nucleobase call.
- a quality score (or Q score) may indicate that a probability of an incorrect nucleobase call at a genomic coordinate is equal to 1 in 100 for a Q20 score, 1 in 1,000 for a Q30 score, 1 in 10,000 for a Q40 score, etc.
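- The Q20, Q30, and Q40 figures above follow from the standard Phred relationship Q = -10 * log10(p), where p is the error probability. A minimal sketch of the conversion in both directions follows; the function names are illustrative, not identifiers from this disclosure.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)
```

- With this relationship, each additional 10 points of Q score corresponds to a tenfold lower chance that the nucleobase call is incorrect.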
- genomic coordinate refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome).
- a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome.
- a genomic coordinate or coordinates may include a number, name, or other identifier for a chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870).
- a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY).
- a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001).
- a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
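- The coordinate forms described above (a chromosome or source identifier with a position or range, or a bare position) can be parsed with a small helper. The sketch below is one possible approach; the `GenomicInterval` type, the `parse_coordinate` name, and the choice to represent a single position as a range whose start equals its end are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class GenomicInterval(NamedTuple):
    contig: Optional[str]  # chromosome or source name; None if omitted
    start: int             # first position, as written in the string
    end: int               # last position; equals start for a single position


# Optional "<contig>:" prefix, then a position, then an optional "-<end>".
_COORD = re.compile(
    r"^(?:(?P<contig>[^:\s]+)\s*:\s*)?(?P<start>\d+)(?:\s*-\s*(?P<end>\d+))?$"
)


def parse_coordinate(text: str) -> GenomicInterval:
    """Parse strings such as 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError("range end precedes range start")
    return GenomicInterval(match["contig"], start, end)
```

- Because contig names such as SARS-CoV-2 contain hyphens, the pattern treats everything before the colon as the identifier and only looks for a range separator after the first position.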
- multiallelic genomic coordinate refers to a genomic coordinate associated with three or more alleles.
- a multiallelic genomic coordinate includes a genomic coordinate of a nucleotide sequence where nucleotide reads indicate three or more possible alleles corresponding to the coordinate, such as a reference allele, a first alternate allele, a second alternate allele, and so forth.
- a multiallelic genomic coordinate corresponds to a genomic coordinate where a read pileup occurs or where an insertion occurs.
- a multiallelic genomic coordinate can exhibit a multiallelic genotype, such as a 1/2 genotype, where the first allele at the coordinate corresponds to an allele from a first alternate nucleotide sequence and the second allele corresponds to an allele from a second alternate nucleotide sequence.
- genomic coordinates within a nucleotide sequence can exhibit different genotypes.
- a “homozygous reference genotype” refers to a genotype where both nucleobases at a given coordinate of a sample nucleotide sequence match a reference nucleobase of a reference sequence or a reference genome (represented as 0/0).
- a “homozygous alternate genotype” refers to a genotype at a given coordinate where both nucleobases differ from a reference nucleobase of a reference sequence or a reference genome (represented as 1/1).
- a “heterozygous genotype” refers to a genotype where the nucleobases at a given coordinate are not the same.
- a heterozygous genotype includes a genotype in which one nucleobase matches a reference nucleobase and the other nucleobase differs from the reference nucleobase (represented as 0/1 or 1/0).
- genotypes can exhibit nucleobases from more than one alternate nucleobase differing from a reference nucleobase of a reference genome.
- a multiallelic heterozygous genotype can be represented as 1/2, where one nucleobase call matches a first alternate nucleobase differing from a reference nucleobase and the other nucleobase call matches a second alternate nucleobase differing from the reference nucleobase.
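- The genotype notations defined above (0/0, 0/1 or 1/0, 1/1, and 1/2) can be mapped back to their names with a few comparisons. The following sketch assumes diploid genotype strings with integer allele indices separated by "/" or "|"; the function name and the returned labels are illustrative.

```python
import re


def classify_genotype(gt: str) -> str:
    """Name the zygosity of a diploid genotype string such as '0/1' or '1/2'."""
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or not all(a.isdigit() for a in alleles):
        raise ValueError(f"expected a diploid genotype like '0/1', got {gt!r}")
    first, second = (int(a) for a in alleles)
    if first == second:
        # Both alleles agree: either both reference or both the same alternate.
        return "homozygous reference" if first == 0 else "homozygous alternate"
    if 0 in (first, second):
        # One reference allele and one alternate allele.
        return "heterozygous"
    # Two different alternate alleles and no reference allele.
    return "multiallelic heterozygous"
```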
- a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome.
- the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species.
- a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium.
- a reference genome may include a reference graph genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
- genomic region refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome.
- the dual-variant-type call recalibration system can utilize a machine-learning model to modify sequencing metrics and update a genotype call.
- the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data.
- a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness.
- Example machine-learning models include various types of decision trees, support vector machines, Bayesian networks, or neural networks.
- the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm), while in other cases the call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression.
- the dual-variant-type call recalibration system utilizes a call-recalibration machine-learning model to generate outputs for confirming, modifying, or updating a genotype call based on sequencing metrics.
- the term “variant-call-recalibration machine-learning model” refers to a machine-learning model that generates variant-call classifications (e.g., genotype probabilities).
- the variant-call-recalibration machine-learning model is trained to generate variant-call classifications indicating various probabilities or predictions for genotype calls (e.g., variant calls) based on the aforementioned sequencing metrics.
- a variant-call-recalibration machine-learning model is a variant-call-recalibration machine-learning model.
- the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm or treelite algorithm for an ensemble of decision trees), while in other cases the variant-call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression.
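- To show the general mechanics of how an ensemble of boosted trees can yield per-genotype probabilities, the toy sketch below sums per-class scores from a few hand-written depth-1 trees and applies a softmax. Everything in it is hypothetical: the feature names (`vaf`, `mapq`), the thresholds, and the leaf scores are made up for illustration and are not trained values or features taken from this disclosure, and the toy does not attempt to model somatic mosaic variants.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Features = Dict[str, float]
Scores = Tuple[float, float, float]
GENOTYPES = ("0/0", "0/1", "1/1")


def stump(feature: str, threshold: float, below: Scores, at_or_above: Scores) -> Callable[[Features], Scores]:
    """Build a depth-1 tree that returns one raw score per genotype class."""
    def tree(x: Features) -> Scores:
        return below if x[feature] < threshold else at_or_above
    return tree


# Hand-written toy ensemble; a real model would learn these from training data.
TOY_TREES: List[Callable[[Features], Scores]] = [
    stump("vaf", 0.25, (1.0, -0.5, -1.0), (-1.0, 0.5, 0.0)),
    stump("vaf", 0.75, (0.0, 0.5, -1.0), (-1.0, -0.5, 1.5)),
    stump("mapq", 20.0, (0.5, -0.25, -0.25), (0.0, 0.0, 0.0)),
]


def genotype_probabilities(x: Features, trees: Sequence[Callable[[Features], Scores]] = TOY_TREES) -> Dict[str, float]:
    """Sum per-class tree scores, then normalize them with a softmax."""
    raw = [sum(tree(x)[k] for tree in trees) for k in range(len(GENOTYPES))]
    peak = max(raw)  # subtract the maximum for numerical stability
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return {gt: e / total for gt, e in zip(GENOTYPES, exps)}
```

- Because inference reduces to a handful of threshold comparisons and additions per tree, this style of model is cheap to evaluate compared with a deep neural network, which is consistent with the lightweight architecture the disclosure describes.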
- a variant-call-recalibration machine-learning model includes multiple sub-models or operates in tandem with another call-recalibration machine-learning model. For instance, a first call-recalibration machine-learning model (e.g., an ensemble of gradient boosted trees) generates a first set of variant-call classifications and a second call-recalibration machine-learning model (e.g., a random forest) generates a second set of variant-call classifications.
- a variant-call-recalibration machine-learning model includes a first machine-learning sub-model configured to generate genotype probabilities for genotype calls corresponding to germline variants only and a second machine-learning sub-model configured to generate genotype probabilities for genotype calls corresponding to somatic mosaic variants (in some cases, in addition to genotype probabilities for genotype calls corresponding to germline variants).
- variant-call classification refers to a predicted classification from a variant-call-recalibration machine-learning model that indicates a probability, score, or other quantitative measurement associated with some aspect of a genotype or variant call based on one or more sequencing metrics.
- a variant-call classification can include a specialized prediction depending on the application of a call-recalibration machine-learning model.
- variant-call classifications for a biallelic genomic coordinate include (i) a false-positive probability that a genotype call is a false positive, (ii) a genotype-error probability that a genotype for the genotype call is incorrect, and (iii) a true-positive probability that the genotype call is a true positive.
- variant-call classifications can include: (i) a reference probability that a genotype call comprises a homozygous reference genotype at a multiallelic genomic coordinate, (ii) a zygosity-error probability that the genotype call comprises a genotype-zygosity error at a multiallelic genomic coordinate, and (iii) a true-positive variant probability that the genotype call constitutes a true positive variant at a multiallelic genomic coordinate.
- variant-call classifications can include: (i) a first genotype probability of a first genotype at the genomic coordinate and (ii) a second genotype probability of a second genotype at the genomic coordinate.
- the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype, and the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype.
- variant-call classifications can include: (i) a false-positive probability or a homozygous reference classification indicating a probability that a genotype call is a false positive or a homozygous reference genotype, respectively; (ii) a zygosity-error probability or a heterozygous genotype classification indicating a probability that a genotype (e.g., an indication of a heterozygous or homozygous genotype for a variant call at a particular location) is incorrect or a heterozygous genotype, respectively; and/or (iii) a true-positive classification or a homozygous alternate classification indicating a probability that a genotype call is a true positive or a homozygous alternate genotype, respectively.
- the variant-call classifications accordingly represent intermediate scoring metrics and/or a predicted probability that a genotype for a genotype call is accurate.
- genotype probability refers to a likelihood, probability, or score of a particular genotype at a genomic coordinate or genomic region.
- a genotype probability includes a likelihood of a homozygous reference genotype, a likelihood of a heterozygous variant genotype, or a likelihood of a homozygous variant genotype at one or more genomic coordinates.
- a genotype probability can refer to a posterior genotype probability.
- a genotype probability determined by a variant-call-recalibration machine-learning model can be presented in (or modified to be presented in) a posterior genotype probability (GP) field of a VCF or other sequencing data file, such as a recalibrated VCF or other recalibrated sequencing data file.
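- As a sketch of how genotype probabilities might be written out, the helper below renders a biallelic sample column in a `GT:GP` layout. It assumes the VCF 4.3 convention that GP values are probabilities between 0 and 1 listed in 0/0, 0/1, 1/1 order, and it assumes the reported genotype is simply the most probable one; the function name and both choices are illustrative and may differ from how a recalibrated VCF is actually produced.

```python
from typing import Sequence

GENOTYPE_ORDER = ("0/0", "0/1", "1/1")  # biallelic diploid genotypes, VCF order


def format_gt_gp(probabilities: Sequence[float], digits: int = 4) -> str:
    """Render 'GT:GP' sample text from three genotype probabilities."""
    if len(probabilities) != len(GENOTYPE_ORDER):
        raise ValueError("expected probabilities for 0/0, 0/1, and 1/1")
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize so the written values sum to one.
    normalized = [p / total for p in probabilities]
    best = max(range(len(normalized)), key=normalized.__getitem__)
    gp = ",".join(f"{p:.{digits}g}" for p in normalized)
    return f"{GENOTYPE_ORDER[best]}:{gp}"
```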
- a genotype probability can include a specialized prediction depending on the application of a call-recalibration machine-learning model, such as for predicting SNPs.
- the variant-call-recalibration machine-learning model can be a neural network.
- the term “neural network” refers to a machine-learning model that can be trained and/or tuned based on inputs to determine classifications or approximate unknown functions.
- a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., generated digital images) based on a plurality of inputs provided to the neural network.
- a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions in data.
- a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
- the dual-variant-type call recalibration system can generate variant-call classifications that indicate or reflect a likelihood of identifying a variant corresponding to a germline variant or a somatic mosaic variant (i.e., two variant types) at a genomic coordinate.
- variant refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome.
- a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence.
- the term “germline variant” refers to a variant or mutation inherited by a sample organism from biological parents or present within germ cells.
- a germline variant is a heritable variant that tends to be present in every somatic and germline cell of offspring.
- the term “somatic mosaic variant” refers to a variant or mutation introduced or derived from a post-zygotic event.
- a somatic mosaic variant includes a variant or mutation that was (i) introduced after zygote formation during cell development (e.g., 1 of 4 early cells, 1 of 32 early cells), but is (ii) not inherited from a sample organism’s biological parents and (iii) not introduced by a form of cancer or tumor in the given sample organism.
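- The early-cell examples above imply low expected allele frequencies. As a back-of-the-envelope sketch, suppose a mutation arises on one chromosome copy in some of the cells at an early stage and every cell lineage contributes equally to the sequenced tissue; both are simplifying assumptions made here, not statements from this disclosure. The expected fraction of variant-supporting alleles is then the mutated cell count divided by the total number of chromosome copies.

```python
def expected_mosaic_vaf(mutated_cells: int, total_cells: int, ploidy: int = 2) -> float:
    """Expected allele frequency for a mutation on one copy per mutated cell."""
    if total_cells <= 0 or ploidy <= 0:
        raise ValueError("total_cells and ploidy must be positive")
    if not 0 <= mutated_cells <= total_cells:
        raise ValueError("mutated_cells must be between 0 and total_cells")
    # Each mutated cell carries the variant on one of its `ploidy` copies.
    return mutated_cells / (ploidy * total_cells)
```

- Under these assumptions, a mutation in 1 of 4 early cells gives an expected allele frequency of 12.5%, and a mutation in 1 of 32 early cells gives about 1.6%, both well below the roughly 50% expected for a heterozygous germline variant.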
- a “variant call” refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference.
- a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
- a “non-variant call” refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference.
- a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
- a variant allele frequency (sometimes referred to as a variant allele fraction (VAF)) is the percentage (or fraction) of sequence reads observed at a genomic coordinate or region that match a particular variant. Accordingly, a variant with an allele frequency of approximately 50% (or 0.5) or 100% (or 1.0) is more likely to be a germline variant, whereas a variant with a relatively low allele frequency (in particular, an allele frequency below 50% (or 0.5)) is more likely to be a somatic mosaic variant.
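As an illustrative, non-limiting sketch of the definition above, a variant allele fraction can be computed from read counts at a genomic coordinate. The function names, the `tolerance` parameter, and the labels returned by the heuristic are assumptions made for illustration only; they are not part of the disclosed system, which relies on a machine-learning model rather than a fixed threshold.

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a coordinate that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


def rough_variant_type_hint(vaf: float, tolerance: float = 0.1) -> str:
    """Hypothetical heuristic: VAF near 0.5 or 1.0 hints at a germline variant,
    a lower VAF hints at a somatic mosaic variant. Not a classifier."""
    if abs(vaf - 0.5) <= tolerance or abs(vaf - 1.0) <= tolerance:
        return "more likely germline"
    if vaf < 0.5:
        return "more likely somatic mosaic"
    return "ambiguous"


if __name__ == "__main__":
    vaf = variant_allele_fraction(alt_reads=15, total_reads=50)
    print(vaf, rough_variant_type_hint(vaf))
```

Such a threshold is only a coarse prior; low coverage or mapping errors can shift an observed fraction away from the expected 0.5 or 1.0.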
- variants of relatively low allele frequency within a sample can be relatively difficult to distinguish within a genomic sample due to, for example, lack of read coverage, GC bias, sequencing specific errors (SSEs), mapping inaccuracies, and so forth.
- the dual-variant-type call recalibration system identifies and/or stores sequencing metrics within one or more sequencing data files.
- sequencing data file refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
- the dual-variant-type call recalibration system modifies data fields corresponding to a genotype-call data file, such as a variant call file.
- a genotype-call data file refers to a digital file that indicates or represents one or more genotype calls (e.g., including reference and/or variant calls) compared to a reference genome along with other information pertaining to the genotype calls (e.g., variant calls).
- a genotype-call data file can include a variant call file, such as but not limited to a variant call format (VCF) file (as well as a genomic variant call format (gVCF) file).
- a genotype-call data file can include a General Feature Format (GFF) file, a Genome Variant Format (GVF) file, or other suitable data file comprising genotype calls for a sample nucleotide sequence.
- a “variant call file” refers to a particular genotype-call data file that comprises a text file format that contains information about variants at specific genomic coordinates.
- a variant call file can include meta-information lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant).
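The three kinds of lines described above (meta-information lines, a header line, and data lines) can be separated with a few lines of code. The following is a minimal sketch that assumes a tab-delimited VCF-style text; the function name `split_vcf_text` and all sample values in `EXAMPLE` are hypothetical and chosen only for illustration.

```python
def split_vcf_text(text: str):
    """Split VCF-style text into meta lines, header columns, and data rows.

    Each data row is returned as a dict keyed by the header column names.
    """
    meta, header, rows = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            meta.append(line)                  # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")      # header line (leading '#' dropped)
        else:
            if header is None:
                raise ValueError("data line encountered before header line")
            rows.append(dict(zip(header, line.split("\t"))))  # one genotype call
    return meta, header, rows


# Illustrative content only; the values are invented for this sketch.
EXAMPLE = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chr1\t12345\t.\tT\tC\t48.2\tPASS\tDP=50\tGT:GQ\t0/1:45",
])

if __name__ == "__main__":
    meta, header, rows = split_vcf_text(EXAMPLE)
    print(len(meta), header[:5], rows[0]["POS"], rows[0]["ALT"])
```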
- the dual-variant-type call recalibration system can generate different versions of genotype-call data files, including a pre-filter variant call file comprising variant genotype calls that either pass or fail a quality filter for base-call-quality metrics or a post-filter variant call file comprising variant genotype calls that pass the quality filter and excluding variant genotype calls that fail the quality filter.
- one or more sequencing data files in which the dual-variant-type call recalibration system identifies or stores sequencing metrics include an alignment data file containing information from a read processing and mapping procedure.
- alignment data file refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence.
- an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
- the dual-variant-type call recalibration system modifies data fields corresponding to metrics of a genotype call associated with a variant call file, such as fields for call quality, genotype, and genotype quality.
- the term “call quality” when used with respect to a data field in a variant call file refers to a measure or an indication of a likelihood or a probability that a variant exists at a given location.
- a call quality field (or QUAL field) corresponding to a VCF file may include a base-call-quality metric, such as a PHRED-scaled quality or Q score, representing a probability that a genomic coordinate of a sample genome includes a variant.
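A PHRED-scaled score relates to a probability through Q = -10 * log10(p), where p is the probability that the assertion being scored is wrong. The following sketch shows the conversion in both directions; the function names are assumptions made for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a PHRED-scaled score Q to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a PHRED-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    print(phred_to_error_prob(30))    # error probability for Q30
    print(error_prob_to_phred(0.01))  # PHRED score for a 1% error probability
```

Under this scaling, each increase of 10 in the score corresponds to a tenfold decrease in the probability of error.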
- a “genotype quality” when used with respect to a field refers to a likelihood or a probability that a particular predicted genotype for a nucleobase call is correct.
- the dual-variant-type call recalibration system utilizes a call generation model to determine initial genotype calls or initial variant calls.
- the term “call generation model” refers to a probabilistic model that generates sequencing data from nucleotide reads of a sample nucleotide sequence, including nucleobase calls, variant calls, and/or genotype calls along with associated metrics. Accordingly, in some cases, a call generation model may be a variant call generation model.
- a call generation model refers to a Bayesian probability model that generates variant calls based on nucleotide reads of a sample nucleotide sequence.
- Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses including foreign reads, missing reads, joint detection, and more.
- a call generation model may likewise include multiple components, including, but not limited to, different software applications or components for mapping and aligning, sorting, duplicate marking, computing read pileup depths, and variant calling.
- a call generation model refers to an ILLUMINA DRAGEN model for variant calling functions and mapping and alignment functions (e.g., a DRAGEN variant caller or “DRAGEN VC”).
- FIG. 1 illustrates a schematic diagram of a system environment (or “environment”) 100 in which a dual-variant-type call recalibration system 106 operates in accordance with one or more embodiments.
- the computing system 100 includes one or more server device(s) 102 connected to a client device 108 and a sequencing device 114 via a network 112. While FIG. 1 shows an embodiment of the dual-variant-type call recalibration system 106, this disclosure describes alternative embodiments and configurations below.
- the server device(s) 102, the client device 108, and the sequencing device 114 can communicate with each other via the network 112.
- the network 112 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 10.
- the sequencing device 114 comprises a device for sequencing one or more nucleic acid polymers.
- the sequencing device 114 analyzes nucleic acid segments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems (described herein) either directly or indirectly on the sequencing device 114. More particularly, the sequencing device 114 receives and analyzes, within nucleotide-sample slides (e.g., flow cells), nucleic acid sequences extracted from genomic samples.
- the sequencing device 114 utilizes SBS to sequence nucleic acid polymers (e.g., clusters of oligonucleotides) into nucleotide reads.
- the sequencing device 114 bypasses the network 112 and communicates directly with the client device 108.
- the server device(s) 102 may generate, receive, analyze, store, and transmit digital data, such as data for determining nucleobase calls or sequencing nucleic acid polymers.
- the sequencing device 114 may send (and the server device(s) 102 may receive) call data from the sequencing device 114.
- the server device(s) 102 may also communicate with the client device 108.
- the server device(s) 102 can send data to the client device 108, including sequencing data files, such as genotype-call data files or alignment data files, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics associated with nucleobase calls or genotype calls.
- the server device(s) 102 comprise a distributed collection of servers where the server device(s) 102 include a number of server devices distributed across the network 112 and located in the same or different physical locations. Further, the server device(s) 102 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. In some cases, the server device(s) 102 are located at a same physical location as the sequencing device 114.
- the server device(s) 102 can include a sequencing system 104.
- the sequencing system 104 analyzes call data, such as sequencing metrics received from the sequencing device 114, to determine nucleobase sequences for nucleic acid polymers.
- the sequencing system 104 can receive raw data from the sequencing device 114 and can determine a nucleobase sequence for a sample nucleotide sequence (e.g., a genomic sample).
- the sequencing system 104 determines the sequences of nucleobases in DNA and/or RNA segments or oligonucleotides.
- the sequencing system 104 also generates a genotype-call data file, such as a variant call file, indicating one or more genotype calls and/or variant calls for one or more genomic coordinates.
- the dual-variant-type call recalibration system 106 analyzes call data, such as sequencing metrics from the sequencing device 114, to recalibrate genotype calls for sample nucleotide sequences that were previously generated (e.g., by a call generation model).
- the dual-variant-type call recalibration system 106 includes a variant-call-recalibration machine-learning model trained to identify variant calls at genomic regions corresponding to germline variants and somatic mosaic variants.
- the dual-variant-type call recalibration system 106 determines sequencing metrics for sample nucleotide sequences based on information stored in existing sequencing data files, such as alignment data files.
- the dual-variant-type call recalibration system 106 trains and applies a variant-call-recalibration machine-learning model to confirm or recalibrate genotype calls for the sample sequence at genomic coordinates corresponding to candidate germline variants and candidate somatic mosaic variants.
- the dual-variant-type call recalibration system 106 further utilizes the variant-call-recalibration machine-learning model to generate sets of variant-call classifications (e.g., genotype probabilities) to update or modify the genotype calls (e.g., variant calls).
- the dual-variant-type call recalibration system 106 can update data fields corresponding to a genotype-call data file, such as a variant call file, to update a genotype call (e.g., a variant call) for improved accuracy.
- the dual-variant-type call recalibration system 106 outputs an updated variant call file (or other format of genotype-call data file) with the modified or updated genotype calls and/or variant calls.
- the client device 108 can generate, store, receive, and send digital data.
- the client device 108 can receive sequencing metrics from the sequencing device 114.
- the client device 108 may communicate with the server device(s) 102 to receive sequencing data comprising genotype calls and/or other metrics, such as a call-quality, a genotype indication, and a genotype quality.
- the client device 108 can accordingly present or display information pertaining to the genotype call within a graphical user interface to a user associated with the client device 108.
- the client device 108 illustrated in FIG. 1 may comprise various types of client devices.
- the client device 108 includes non-mobile devices, such as desktop computers or servers, or other types of client devices.
- the client device 108 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 108 are discussed below with respect to FIG. 10.
- the client device 108 includes a sequencing application 110.
- the sequencing application 110 may be a web application or a native application stored and executed on the client device 108 (e.g., a mobile application, desktop application).
- the sequencing application 110 can include instructions that (when executed) cause the client device 108 to receive data from the dual-variant-type call recalibration system 106 and present, for display at the client device 108, data from a variant call file and/or an updated variant call file.
- the sequencing application 110 can instruct the client device 108 to display a visualization of sequencing metrics of a nucleobase call or genotype call.
- the dual-variant-type call recalibration system 106 may be located on the client device 108 as part of the sequencing application 110 or on the sequencing device 114. Accordingly, in some embodiments, the dual-variant-type call recalibration system 106 is implemented by (e.g., located entirely or in part on) the client device 108. In yet other embodiments, the dual-variant-type call recalibration system 106 is implemented by one or more other components of the computing system 100, such as the sequencing device 114.
- the dual-variant-type call recalibration system 106 can be implemented in a variety of different ways across the server device(s) 102, the network 112, the client device 108, and the sequencing device 114.
- the dual-variant-type call recalibration system 106 can be downloaded from the server device(s) 102 to the client device 108 and/or to the sequencing device 114 where all or part of the functionality of the dual-variant-type call recalibration system 106 is performed at each respective device within the computing system 100.
- the computing system 100 includes a database 116.
- the database 116 can store information, such as sequencing data files, sample nucleotide sequences, nucleotide reads, nucleobase calls, genotype calls (e.g., variant calls), and sequencing metrics.
- the server device(s) 102, the client device 108, and/or the sequencing device 114 communicate with the database 116 (e.g., via the network 112) to store and/or access information, such as sequencing data files, sample nucleotide sequences, nucleotide reads, nucleobase calls, genotype calls (e.g., variant calls), and sequencing metrics.
- the database 116 also stores one or more models, such as a variant-call-recalibration machine-learning model.
- Although FIG. 1 illustrates the components of computing system 100 communicating via the network 112, in certain implementations, the components of computing system 100 can also communicate directly with each other, bypassing the network 112.
- the client device 108 communicates directly with the sequencing device 114.
- the client device 108 communicates directly with the dual-variant-type call recalibration system 106.
- the dual-variant-type call recalibration system 106 can access one or more databases housed on or accessed by the server device(s) 102 or elsewhere in the computing system 100.
- the dual-variant-type call recalibration system 106 can determine genotype calls for germline and somatic mosaic variants based on genotype probabilities generated by a variant-call-recalibration machine-learning model.
- the dual-variant-type call recalibration system 106 can determine genotype probabilities for variants of various allele frequencies based on sequencing metrics associated with nucleotide reads for a genomic sample.
- FIG. 2 illustrates an overview of the dual-variant-type call recalibration system 106 utilizing a variant-call-recalibration machine-learning model to determine genotype probabilities based on sequencing metrics and, based on such probabilities, generating genotype calls in accordance with one or more embodiments.
- the dual-variant-type call recalibration system 106 performs an act 202 to determine sequencing metrics.
- the dual-variant-type call recalibration system 106 determines sequencing metrics, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics.
- the dual-variant-type call recalibration system 106 determines sequencing metrics that indicate various attributes or data in relation to various nucleobase calls of nucleotide reads from a sample nucleotide sequence from a genomic sample. Additional detail regarding determining the various types of sequencing metrics is provided below in reference to FIGS. 3A-3C.
- the dual-variant-type call recalibration system 106 can determine sequencing metrics for nucleotide reads comprising (or including supporting evidence for) variants of various allele frequencies within a given genomic sample.
- a first variant relative to a reference genome cytosine (C) in lieu of thymine (T)
- the allele frequency of the second variant indicates that the second variant is more likely a germline variant, while the relatively low allele frequency of the first variant indicates that the first variant may be a somatic mosaic variant. While FIG. 2 illustrates a particular allele frequency of the first variant (i.e., less than or equal to 0.35), variants within a genomic sample may occur at any variant allele frequency. In some cases, such variant allele frequencies are accounted for in the determined sequencing metrics.
- the dual-variant-type call recalibration system 106 performs an act 204 to generate genotype probabilities. More specifically, the dual-variant-type call recalibration system 106 generates (or updates or refines) variant call classifications, such as genotype probabilities, from sequencing metrics utilizing a variant-call-recalibration machine-learning model. To elaborate, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model to process or analyze one or more sequencing metrics associated with one or more nucleotide reads to generate a set of classifications (e.g., predicted probabilities associated with genotypes). As shown in FIG.
- the dual-variant-type call recalibration system 106 generates, utilizing the variant-call-recalibration machine-learning model, certain genotype probabilities associated with a candidate genotype indicated by the sequencing metrics, including genotype probabilities for variants within genomic regions corresponding to both candidate germline variants and candidate somatic mosaic variants.
- Based on the generated genotype probabilities, as further illustrated in FIG. 2, the dual-variant-type call recalibration system 106 also performs an act 206 to determine genotype calls, such as a reference call or a variant call corresponding to a germline variant or a somatic mosaic variant. More particularly, the dual-variant-type call recalibration system 106 confirms, determines, or updates a preliminary genotype call by a call generation model (e.g., a Bayesian probabilistic-based variant caller) for a sample nucleotide sequence at a genomic coordinate within a reference genome.
- the dual-variant-type call recalibration system 106 determines initial genotype calls utilizing a call generation model and edits or updates certain initial genotype calls based on the genotype probabilities generated by the variant-call-recalibration machine-learning model.
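One simple way to picture the confirm-or-update step is to select the most probable genotype from a set of genotype probabilities and compare it with the initial call. The sketch below is a hedged illustration, not the disclosed implementation: the function name `recalibrate_call`, the genotype labels, the argmax selection rule, and the cap of 99 on the PHRED-scaled genotype quality are all assumptions.

```python
import math


def recalibrate_call(initial_genotype: str, genotype_probs: dict, max_phred: int = 99) -> dict:
    """Pick the most probable genotype and derive a PHRED-scaled genotype quality.

    genotype_probs maps a genotype label (e.g., "0/1") to a non-negative score;
    scores are normalized to sum to 1 before use.
    """
    total = sum(genotype_probs.values())
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    probs = {gt: p / total for gt, p in genotype_probs.items()}

    best = max(probs, key=probs.get)   # assumed selection rule: argmax
    p_wrong = 1.0 - probs[best]        # probability the chosen genotype is wrong
    if p_wrong <= 0:
        gq = max_phred
    else:
        gq = min(max_phred, round(-10 * math.log10(p_wrong)))

    return {
        "genotype": best,
        "genotype_quality": gq,
        "changed": best != initial_genotype,  # True when the initial call is updated
    }


if __name__ == "__main__":
    print(recalibrate_call("0/0", {"0/0": 0.05, "0/1": 0.94, "1/1": 0.01}))
```

In this example the initial reference call ("0/0") would be updated to a heterozygous variant call because the model assigns most of the probability mass to "0/1".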
- the dual-variant-type call recalibration system 106 outputs genotype calls corresponding to the nucleotide reads described above in relation to the act 202.
- FIG. 2 shows, in relation to the act 206, a first genotype call at a first genomic coordinate (i.e., position or POS) indicating an alternate allele (cytosine) with an allele frequency of 0.3 and a second genotype call at a second genomic coordinate indicating an alternate allele (thymine) with an allele frequency of 0.5.
- the dual-variant-type call recalibration system 106 utilizes a call generation model to process or analyze sequencing metrics (e.g., one or more of the same sequencing metrics used to generate the genotype probabilities in act 204) to determine genotype calls (e.g., initial genotype calls) that a genomic sample comprises reference bases or variants at certain genomic coordinates based on one or more of the sequencing metrics.
- the dual-variant-type call recalibration system 106 applies a number of Bayesian probabilistic models or algorithms to derive various probabilities for different reference bases or variant bases, quality metrics, mapping metrics, joint metrics, and other data occurring within the sample nucleotide sequence to include within a variant call file.
- the dual-variant-type call recalibration system 106 determines genotype calls (e.g., calls indicating differences or likenesses to reference bases from a reference genome) that indicate predicted nucleobases for the sample genome at corresponding genomic coordinates.
- the dual-variant-type call recalibration system 106 can confirm or update the initial genotype call, and/or corresponding sequencing metrics in various fields, based on the genotype probabilities from the variant-call-recalibration machine-learning model. As described further below with respect to FIG.
- the dual-variant-type call recalibration system 106 can modify one or more of a base-call-quality metric, a genotype-probability metric, a genotype metric, a genotype-likelihood metric, or a genotype-quality metric for the initial genotype call, including calls corresponding to germline variants and somatic mosaic variants.
- the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics for nucleobase calls or genotype calls at particular genomic coordinates, such as genomic coordinates corresponding to candidate germline variants and/or candidate somatic mosaic variants.
- the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics from sequence data (e.g., one or more sequencing data files) for calls corresponding to nucleotide reads of a sample nucleotide sequence.
- FIGS. 3A-3C illustrate determining sequencing metrics in accordance with one or more embodiments. Specifically, FIG. 3A illustrates determining read-based sequencing metrics, while FIG. 3B illustrates determining call-model-generated sequencing metrics, and FIG. 3C illustrates determining externally sourced sequencing metrics.
- the dual-variant-type call recalibration system 106 accesses, retrieves, or otherwise obtains nucleotide reads 302.
- the nucleotide reads 302 are generated utilizing the sequencing device 114, the nucleotide reads 302 comprising nucleobase calls for regions from a sample nucleotide sequence (e.g., sample genome).
- the nucleotide reads 302 can be generated utilizing sequencing-by-synthesis (SBS) techniques and/or Sanger sequencing techniques to determine nucleobase calls for oligonucleotide clusters from wells in a flow cell and/or via fluorescent tagging.
- the nucleotide reads 302 are generated utilizing cluster generation and SBS chemistry to sequence millions or billions of clusters in a flow cell.
- for each cluster, the nucleobase calls from the nucleotide reads 302 are stored and, in some embodiments, provided directly to the dual-variant-type call recalibration system 106, for every cycle of sequencing via real-time analysis (RTA) software.
- the dual-variant-type call recalibration system 106 performs read processing and mapping 304.
- the read processing and mapping 304 can include utilizing real-time analysis (RTA) software to store base call data in the form of individual base call data files (or BCLs).
- the read processing and mapping 304 further includes converting the BCL files into sequence data (e.g., via BCL to FASTQ conversion) to be analyzed by a call-generation model 310 to determine genotype calls for the nucleotide reads 302.
- the read processing and mapping 304 includes aligning the nucleotide reads 302 with a reference genome or receiving information pertaining to the read alignment. Specifically, the read processing and mapping 304 determines which nucleobase(s) of a given read align with which genomic coordinate of a reference sequence (or receives information indicating alignment). Different reads have different lengths and include different nucleobases. Accordingly, in some cases, the read processing and mapping 304 includes analysis of each nucleotide of each read to determine (or receives information indicating) where the read “fits” in relation to a reference sequence — e.g., where the bases within the read align with bases in the reference.
- the read processing and mapping 304 includes alignment of many reads at a single genomic coordinate, thus resulting in a read pileup.
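A read pileup can be pictured as a per-coordinate tally of the bases contributed by every read that overlaps that coordinate. The sketch below assumes each read has already been aligned without gaps and is given as a hypothetical `(start_coordinate, sequence)` pair; real alignments also carry insertions, deletions, and soft clips, which this illustration ignores.

```python
from collections import Counter, defaultdict


def build_pileup(aligned_reads):
    """Tally bases per genomic coordinate from gapless (start, sequence) alignments."""
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1  # one observation of `base` at this coordinate
    return pileup


def depth_at(pileup, coordinate: int) -> int:
    """Number of reads covering the coordinate."""
    return sum(pileup[coordinate].values()) if coordinate in pileup else 0


if __name__ == "__main__":
    reads = [(100, "ACGT"), (102, "GTTA"), (101, "CCT")]
    pileup = build_pileup(reads)
    print(dict(pileup[102]), depth_at(pileup, 102))
```

Here three reads overlap coordinate 102, two supporting G and one supporting C, which is the kind of per-coordinate evidence that allele-frequency and depth metrics summarize.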
- the dual-variant-type call recalibration system 106 performs additional statistical tests to determine or detect differences between metrics associated with a reference nucleotide sequence (e.g., within a reference genome) and metrics associated with alternative supporting nucleotide reads. Through these statistical tests, the dual-variant-type call recalibration system 106 re-engineers raw sequencing metrics to determine read-based sequencing metrics.
- the dual-variant-type call recalibration system 106 determines raw sequencing metrics that include one or more of (i) alignment metrics for quantifying alignment of sample nucleotide sequences with genomic coordinates of an example nucleotide sequence (e.g., a reference genome or a nucleotide sequence from an ancestral haplotype), (ii) depth metrics for quantifying depth of nucleobase calls for sample nucleotide sequences at genomic coordinates of the example nucleotide sequence, or (iii) call-quality metrics for quantifying quality of nucleobase calls for sample nucleotide sequences at genomic coordinates of the example nucleotide sequence.
- the dual-variant-type call recalibration system 106 determines mapping-quality metrics (e.g., the MAPQ metrics indicated in FIG. 3A), soft-clipping metrics, or other alignment metrics that measure an alignment of sample sequences with a reference genome.
- the dual-variant-type call recalibration system 106 extracts forward-reverse-depth metrics (or other such depth metrics) or callability metrics for variant genotype calls (or other such call-quality metrics).
- the dual-variant-type call recalibration system 106 re-engineers the raw sequencing metrics to generate read-based sequencing metrics that are more informative for comparing metrics associated with a reference nucleotide sequence with metrics associated with various supporting alternative nucleotide reads. For example, the dual-variant-type call recalibration system 106 determines various metrics for a sample sequence in relation to a reference sequence and further determines various metrics for the sample sequence in relation to alternative supporting sequences. In addition, in some embodiments, the dual-variant-type call recalibration system 106 performs comparative analyses between metrics associated with the reference sequence and the metrics associated with the alternative supporting reads.
- the dual-variant-type call recalibration system 106 compares how nucleobases of a sample nucleotide sequence (e.g., sample genome) map to a reference sequence with how the nucleobases map to various alternative supporting reads. In some cases, the dual-variant-type call recalibration system 106 determines mapping qualities associated with the reference sequence to compare with mapping qualities associated with alternative supporting reads. For example, the dual-variant-type call recalibration system 106 determines mapping quality statistics reflecting differences in the distribution of reads supporting a reference sequence versus reads supporting alternative alleles.
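The description does not name a specific statistical test for comparing the two mapping-quality distributions. As one possible illustration, the sketch below computes a Mann-Whitney U (rank-sum) statistic for reads supporting the alternative allele against reads supporting the reference. The choice of test, the function name `rank_sum_u`, and the sample MAPQ values are assumptions made for illustration.

```python
def rank_sum_u(ref_values, alt_values) -> float:
    """Mann-Whitney U statistic for the alt group, with average ranks for ties.

    U ranges from 0 to len(ref_values) * len(alt_values); a value near 0 means
    alt-supporting reads tend to have lower values than ref-supporting reads.
    """
    labeled = [(v, 0) for v in ref_values] + [(v, 1) for v in alt_values]
    labeled.sort(key=lambda item: item[0])

    ranks = [0.0] * len(labeled)
    i = 0
    while i < len(labeled):
        j = i
        while j + 1 < len(labeled) and labeled[j + 1][0] == labeled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank over the tie block
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    alt_rank_sum = sum(r for r, (_, group) in zip(ranks, labeled) if group == 1)
    n_alt = len(alt_values)
    return alt_rank_sum - n_alt * (n_alt + 1) / 2


if __name__ == "__main__":
    ref_mapq = [60, 60, 60, 55]  # illustrative MAPQ values for ref-supporting reads
    alt_mapq = [20, 30, 60]      # illustrative MAPQ values for alt-supporting reads
    u = rank_sum_u(ref_mapq, alt_mapq)
    print(u, u / (len(ref_mapq) * len(alt_mapq)))
```

A normalized value well below 0.5, as in this example, would suggest that reads supporting the alternative allele map less confidently than reads supporting the reference, which is one signal that a candidate variant may be a mapping artifact.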
- the dual-variant-type call recalibration system 106 determines mismatch counts between the sample sequence and the reference sequence and between the reference sequence and alternative supporting reads. The dual-variant-type call recalibration system 106 further compares the mismatch counts to determine a comparative-mismatch-count metric. Further, the dual-variant-type call recalibration system 106 determines soft-clipping metrics for the sample sequence in relation to the reference sequence and further determines soft-clipping metrics in relation to alternative supporting reads. The dual-variant-type call recalibration system 106 also compares the soft-clipping metrics between the reference sequence and the alternative supporting reads to generate a comparative-soft-clipping metric.
- the dual-variant-type call recalibration system 106 compares base-call-quality metrics in relation to the reference sequence and alternative supporting reads and/or compares query positions of the sample sequence in relation to the reference sequence with those in relation to alternative supporting reads.
- the dual-variant-type call recalibration system 106 utilizes the comparisons and/or other statistical tests to generate the read-based sequencing metrics 306, including, for example: (i) a comparative-mapping-quality-distribution metric indicating a mapping quality distribution comparing mapping qualities in relation to the reference sequence and mapping qualities in relation to alternative supporting reads, (ii) a comparative-secondary-mapping-alignment metric indicating a comparison between secondary mapping in relation to bases in the reference sequence and bases in alternative supporting reads, (iii) a comparative-mismatch-count metric indicating a comparison between mismatched nucleobases in relation to the reference sequence and mismatched bases in relation to alternative supporting reads, (iv) a comparative-soft-clipping metric indicating a comparison between soft-clipping metrics in relation to the reference sequence and soft-clipping metrics in relation to alternative supporting reads, (v) one or more comparative-read-depth metrics indicating comparisons
- the dual-variant-type call recalibration system 106 determines or re-engineers additional or alternative read-based sequencing metrics 306. [0092] In addition to the read-based sequencing metrics 306, as illustrated in FIG. 3B, the dual-variant-type call recalibration system 106 generates call-model-generated sequencing metrics 312 utilizing a call-generation model 310. In particular, the dual-variant-type call recalibration system 106 generates the call-model-generated sequencing metrics 312 from sequence data 308 utilizing the call-generation model 310. For example, the dual-variant-type call recalibration system 106 extracts or determines sequence data 308 based on the read processing and mapping 304 described in relation to FIG. 3A. In some cases, the dual-variant-type call recalibration system 106 generates the sequence data 308 as part of one or more digital files, such as BCL and FASTQ files.
- the sequencing device 114 utilizes cluster generation and SBS chemistry to sequence millions or billions of clusters in a flow cell.
- the sequencing device 114 stores nucleobase calls from the nucleotide reads 302 for every cycle of sequencing via real-time analysis (RTA) software.
- the sequencing device 114 (or the call-generation model 310) utilizes RTA software to further store base call data in the form of individual base call data files (or BCLs).
- the sequencing device 114 (or the call-generation model 310) further converts the BCL files into sequence data (e.g., via BCL to FASTQ conversion).
- the sequencing device 114 (or the call-generation model 310) generates a FASTQ file from the nucleotide reads 302, where the FASTQ file includes the sequence data 308 (or a portion thereof).
- the call-generation model 310 generates the sequence data 308 for each cluster that passes an initial quality filter from a sample sequence. For example, the call-generation model 310 generates entries for each cluster, where each entry includes four lines (or four items of sequence data): (i) a sequence identifier with information about the sequencing run and the cluster, (ii) nucleobase calls that make up the sequence (e.g., a sequence of A, C, T, G, and/or N calls), (iii) a separator (e.g., a “+” sign), and (iv) base-call-quality metrics indicating probabilities of correctness for the nucleobase calls (Phred +33 encoded).
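The four-line FASTQ entry described above can be illustrated with a minimal parsing sketch. It assumes the standard Phred+33 encoding named in the text; the function names and the example identifier `run1:cluster7` are hypothetical and are not taken from the call-generation model 310.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str       # line 1, without the leading "@"
    bases: str            # line 2, nucleobase calls (A, C, G, T, N)
    qualities: List[int]  # line 4, decoded Phred scores


def parse_fastq_record(lines: List[str]) -> FastqRecord:
    """Parse one four-line FASTQ entry (Phred+33 quality encoding)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ entry has exactly four lines")
    header, bases, separator, quality_line = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("malformed FASTQ entry")
    if len(bases) != len(quality_line):
        raise ValueError("base and quality lines differ in length")
    # Phred+33: the quality score is the ASCII code of the character minus 33.
    qualities = [ord(ch) - 33 for ch in quality_line]
    return FastqRecord(header[1:], bases, qualities)


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical entry: identifier, base calls, separator, quality string.
record = parse_fastq_record(["@run1:cluster7", "ACGTN", "+", "IIII#"])
print(record.qualities)  # [40, 40, 40, 40, 2]
```

A score of 40 corresponds to an error probability of 1 in 10,000, which is why the low score on the `N` call stands out.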
- the dual-variant-type call recalibration system 106 implements, utilizes, or applies the call-generation model 310, which processes or analyzes the sequence data 308 to generate genotype calls. Indeed, in some embodiments, the dual-variant-type call recalibration system 106 determines the call-model-generated sequencing metrics 312 by utilizing the call-generation model 310 to re-engineer raw sequencing metrics (e.g., raw sequencing metrics within the sequence data 308). In particular, the call-generation model 310 includes mapping-and-alignment components to map and align nucleobase calls from the sequence data 308.
- the call-generation model 310 includes variant calling components to generate genotype calls (e.g., reference-base calls such as variant calls or non-variant calls) from the sequence data 308.
- the dual -variant-type call recalibration system 106 determines the call-model-generated sequencing metrics 312 that have been generated utilizing the mapping-and-alignment components and the variant calling components of the call-generation model 310.
- the dual-variant-type call recalibration system 106 generates variant calling metrics including one or more of: (i) genotype metrics corresponding to a GT field of a VCF file and indicating a genotype of a genomic coordinate, (ii) base-call-quality metrics (e.g., DRAGEN QUAL scores) indicating quality scores for genotype calls generated via the call-generation model 310, (iii) genotype quality metrics (e.g., a GQ score) indicating a measure of confidence or quality of a predicted genotype for a genomic coordinate, (iv) genotype probability metrics indicating one or more probabilities of various genotypes occurring at a genomic coordinate, (v) PHRED-scaled-likelihood metrics or non-PHRED-scaled-likelihood metrics indicating probabilities of errors associated with genotype calls, (vi) a call-model-generated-foreign-read-detection metric (e.
- the dual-variant-type call recalibration system 106 generates the call-model-generated sequencing metrics 312 from internal (e.g., proprietary and model-specific) variables that reflect interacting processing paths, corner cases, and difficult predictions/decisions.
- the dual -variant-type call recalibration system 106 determines FRD scores according to the methods described in U.S. Patent Application No. 16/280,022 to Eric Jon Ojard, entitled System and Method for Correlated Error Event Mitigation for Variant Calling, filed February 19, 2019, which is incorporated by reference herein in its entirety.
- the dual -variant-type call recalibration system 106 also (or alternatively) determines BQD scores, FRD scores, HMM statistics, and/or other variant calling metrics according to the methods described in U.S. Patent Application Nos. 17/165,828, 15/643,381, and 14/811,836, which are incorporated by reference herein in their entireties.
- the call-model-generated sequencing metrics 312 include, but are not limited to, variant calling metrics determined via the variant calling components of the call-generation model 310.
- the dual-variant-type call recalibration system 106 determines or generates (e.g., via metric re-engineering) variant calling metrics including one or more of: (i) a number of samples in a population, (ii) a number of reads processed for generating genotype calls, (iii) a number of variants (e.g., SNPs, indels, and MNPs), (iv) a number of biallelic sites (e.g., genomic coordinates that contain two observed alleles), (v) a number of multiallelic sites (e.g., a number of sites in a variant call file that contain three or more observed alleles), (vi) a number of SNPs, indels, and MNPs), (iv) a number of biallelic sites (e.g., genomic coordinates that
- the call-model-generated sequencing metrics can include mapping-and-alignment sequencing metrics determined via the mapping-and-alignment components of the call-generation model 310.
- the dual -variant-type call recalibration system 106 determines or generates (e.g., via metric re-engineering) mapping-and-alignment metrics including one or more of: (i) a number of total input reads, (ii) a number of duplicate marked reads, (iii) a number of duplicate marked and mate reads removed, (iv) a number of unique reads, (v) a number of reads with mate sequenced, (vi) a number of reads without mate sequenced, (vii) indications of reads that fail quality checks, (viii) indications of mapped reads, (ix) a number of unique and mapped reads, (x) a number of unmapped reads, (xi) a number of singleton reads (e.g., where the read is
- the dual-variant-type call recalibration system 106 generates, extracts, or determines externally sourced sequencing metrics 316.
- the dual -variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 from one or more databases external to the dual -variant-type call recalibration system 106, such as a sequencing information database 314 (e.g., the database 116).
- the dual-variant-type call recalibration system 106 accesses sequencing metrics that are generic or applicable to sequencing nucleotides generally.
- the dual-variant-type call recalibration system 106 accesses or determines sequencing information about a particular reference sequence (e.g., stored within the sequencing information database 314).
- the dual-variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 including: (i) a mappability metric indicating an ease or difficulty of mapping a particular nucleotide sequence (or a particular nucleotide read or nucleobase call), (ii) a guanine-cytosine-content metric indicating a count (or a dropout or a mean) of guanine-cytosine content in a reference nucleotide sequence (e.g., reference genome), (iii) a replication-timing metric indicating a time required to replicate a particular number of nucleotides from a reference sequence, (iv) one or more DNA-structure metrics indicating DNA structures of a reference sequence (e.g., reference genome), (v) a conservation metric indicating a measure of sequence conservation across multiple species (e.g., a measure of change relative to an average), (vi) a confidence classification
- the dual-variant-type call recalibration system 106 determines the externally sourced sequencing metrics 316 by analyzing one or more genomic regions of a reference genome corresponding to (or aligning with) the one or more genomic coordinates for an initial genotype call. Many challenging variant calls occur in low complexity genomic regions of the reference genome. In some cases, these genomic regions are characterized by some combination of multiple instances of long repeat sequences (e.g., more than 50 base pairs), a very high number (e.g., more than 10) of shorter repeat sequences (e.g., 4-8 repeated bases), and, on occasion, a composition limited to a subset of the bases (e.g., As and Ts but no Cs or Gs).
- nucleotide reads that are aligned correctly to such low complexity genomic regions often have portions or fragments of the nucleotide reads that map to a more unique sequence flanking a repeat-heavy region.
- a reference genome or genomic sample may include some intermediate breaks (e.g., single bases in between the primary repeat pattern that break the repetitiveness) that help with alignment of nucleotide reads with a low complexity genomic region of a reference genome.
- the dual -variant-type call recalibration system 106 monitors externally sourced sequencing metrics 316 (associated with complexity) which can be augmented with read-based sequencing metrics to provide an overall assessment of the likelihood of the presence of a variant (for both Bayesian and machine-learning approaches).
- the dual-variant-type call recalibration system 106 accesses or determines sequencing information about a particular reference genome (e.g., stored within the sequencing information database 314). In some cases, the dual -variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 including a tandem repeat length in nucleobases of a target genomic region within a reference genome corresponding to a candidate region of a genomic sample.
- the dual-variant-type call recalibration system 106 analyzes portions of a reference genome that correspond to variant regions of a genomic sample to identify tandem repeats (e.g., sequences of two or more bases that are repeated numerous times in a head-to-tail manner) and to further determine lengths (e.g., numbers of base pairs) within the tandem repeats.
- the dual -variant-type call recalibration system 106 determines an externally sourced sequencing metric in the form of a repetitiveness metric or homopolymer metric. Indeed, one indicator of a likelihood of a mis-mapping that needs to be corrected (e.g., a mis-mapping that results in a false positive) is based on repetitiveness of bases within a reference sequence.
- the dual -variant-type call recalibration system 106 can utilize various sequencing metrics to measure this repetitiveness, including: (i) a maximum repeat pattern length that indicates the maximum length of a sequence of bases that is repeated at least two times over the span of the (reference genome corresponding to the) candidate region, (ii) a maximum repeat length percentage that indicates the percentage of the (portion of the reference genome corresponding to the) region that is consumed or occupied by the maximum repeat pattern length, and (iii) a maximum homopolymer length that indicates the length of the longest sequence of the same base in the (portion of the reference genome corresponding to the) candidate region.
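The three repetitiveness metrics listed above can be sketched as follows. This is an illustration under stated assumptions rather than the system's implementation: repeated occurrences of a pattern are allowed to overlap, and the percentage is taken as the pattern length divided by the region length. All function names are hypothetical.

```python
from itertools import groupby


def max_homopolymer_length(region: str) -> int:
    """Length of the longest run of one repeated base in the region."""
    return max((len(list(run)) for _, run in groupby(region)), default=0)


def max_repeat_pattern_length(region: str) -> int:
    """Length of the longest substring occurring at least twice.

    Assumption: the two occurrences may overlap.
    """
    n = len(region)
    for k in range(n - 1, 0, -1):  # try the longest candidates first
        seen = set()
        for i in range(n - k + 1):
            kmer = region[i:i + k]
            if kmer in seen:
                return k
            seen.add(kmer)
    return 0


def max_repeat_length_percentage(region: str) -> float:
    """Share of the region taken up by the longest repeated pattern.

    Assumption: percentage = 100 * pattern length / region length.
    """
    if not region:
        return 0.0
    return 100.0 * max_repeat_pattern_length(region) / len(region)


print(max_homopolymer_length("AAAAC"))           # 4
print(max_repeat_pattern_length("ACGTACGA"))     # 3 ("ACG" occurs twice)
print(max_repeat_length_percentage("ACGTACGA"))  # 37.5
```

The quadratic search is adequate for short candidate regions; a suffix-array approach would be the usual choice for long ones.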
- the dual-variant-type call recalibration system 106 determines an externally sourced sequencing metric in the form of a permutation entropy of nucleobases. For example, the dual-variant-type call recalibration system 106 determines a measure of randomness of nucleotide sequences, which can be predictive of mapping/alignment accuracy. In some cases, the dual-variant-type call recalibration system 106 determines a permutation entropy by determining an entropy over permutations of a nucleotide sequence of a given length. For instance, the dual-variant-type call recalibration system 106 can determine permutation entropy according to the following formula:
- $S_4 = \{\text{AAAA}, \text{AAAC}, \text{AAAG}, \text{AAAT}, \text{AACA}, \ldots, \text{TTGT}, \text{TTTA}, \text{TTTC}, \text{TTTG}, \text{TTTT}\}$
- $S_N$ is the set of all permutations of length-$N$ base sequences, and where:
- the dual-variant-type call recalibration system 106 normalizes the permutation entropy as: where $K \subseteq \{0, \ldots, 4^N - 1\}$ is the set of indices such that $p_{N,k} > 0$.
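The formula bodies did not survive in this text, so the following is only a sketch of one common way to compute such a metric. It assumes a Shannon entropy over the observed frequencies of length-N base sequences, summed only over the sequences that occur (the index set K), and a normalization by the maximum possible entropy, log2(4^N). Both choices and the function names are assumptions, not the formula from the source.

```python
import math
from collections import Counter


def permutation_entropy(sequence: str, n: int) -> float:
    """Shannon entropy (bits) of the length-n subsequences of `sequence`.

    Only subsequences that actually occur contribute, which mirrors summing
    over the indices k with nonzero frequency.
    """
    kmers = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def normalized_permutation_entropy(sequence: str, n: int) -> float:
    """Entropy divided by log2(4**n), the maximum over 4**n possible sequences.

    Assumption: this normalization is illustrative only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return permutation_entropy(sequence, n) / math.log2(4 ** n)


print(permutation_entropy("AAAAAAAA", 2))         # 0.0 (a single repeated pattern)
print(normalized_permutation_entropy("ACGT", 1))  # 1.0 (all four bases equally frequent)
```

Low values indicate repetitive sequence, which is where mapping and alignment tend to be least reliable.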
- the dual -variant-type call recalibration system 106 can further determine an externally sourced sequencing metric in the form of identifying a presence or absence of a cytosine quadruplex (C-quadruplex) or a guanine quadruplex (G-quadruplex) in a target genomic region.
- the dual -variant-type call recalibration system 106 determines counts of cytosine calls and guanine calls within a target genomic region of a reference genome corresponding to a variant region of a genomic sample or genomic region under consideration for an initial variant call.
- the dual-variant-type call recalibration system 106 identifies occurrences (within the target genomic region) of four or more instantiations of three consecutive cytosine bases separated by one or more different nucleobases (e.g., a pattern of CCC A CCC A CCC A CCC).
- the dual-variant-type call recalibration system 106 identifies occurrences (within the target genomic region) of four or more instantiations of three consecutive guanine bases separated by one or more different nucleobases (e.g., a pattern of GGG T GGG T GGG T GGG).
- the dual-variant-type call recalibration system 106 identifies a C-quadruplex or a G-quadruplex where up to a threshold number of nucleobases (e.g., up to 7 nucleobases) occur between instantiations of triple Cs or triple Gs. For instance, the dual-variant-type call recalibration system 106 identifies GGGTACC GGGTGTACA GGG AAGTCT GGG as a G-quadruplex. In some cases, G-quadruplexes (and C-quadruplexes) are known to cause issues with sequencing. Accordingly, the dual-variant-type call recalibration system 106 uses the presence of such sequences to adjust the confidence in the mapping and alignment of reads and the accuracy of subsequent contiguous sequence construction.
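The quadruplex patterns described above lend themselves to a regular-expression check. The sketch below assumes four runs of three identical bases separated by 1 to 7 intervening nucleobases, following the example threshold in the text. The function name and the decision to allow any base in the spacer are assumptions.

```python
import re


def has_quadruplex(region: str, base: str, max_gap: int = 7) -> bool:
    """Return True if `region` contains four triplets of `base`
    separated by 1..max_gap intervening nucleobases.

    Assumption: spacers may contain any of A, C, G, T, or N.
    """
    if base not in ("C", "G"):
        raise ValueError("base must be 'C' or 'G'")
    pattern = f"(?:{base}{{3}}[ACGTN]{{1,{max_gap}}}){{3}}{base}{{3}}"
    return re.search(pattern, region.upper()) is not None


# The G-quadruplex example from the text, written without spaces.
print(has_quadruplex("GGGTACCGGGTGTACAGGGAAGTCTGGG", "G"))  # True
print(has_quadruplex("CCCACCCACCCACCC", "C"))               # True
print(has_quadruplex("GGGTGGGTGGG", "G"))                   # False (only three triplets)
```

The resulting boolean can serve directly as a presence-or-absence feature for the target genomic region.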
- the dual-variant-type call recalibration system 106 determines a data compression metric as part of the externally sourced sequencing metrics 316.
- the dual-variant-type call recalibration system 106 determines a data compression metric that quantifies a measure of randomness of a sequence using one or more data compression algorithms.
- One such data compression algorithm for lossless compression is the Lempel-Ziv-Welch algorithm.
- the dual-variant-type call recalibration system 106 builds a dictionary of unique k-mers starting with a length of one and comes up with an encoding for each entry in the dictionary.
- the dual-variant-type call recalibration system 106 can utilize the number of keys in the dictionary for the structural variant and the flanking regions in the reference genome as a sequencing metric.
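The dictionary-size metric can be sketched with a textbook Lempel-Ziv-Welch dictionary build. The code assumes the dictionary is seeded with the four single bases and that the final key count is the metric. The function name is hypothetical and nothing here is taken from the system's implementation.

```python
def lzw_dictionary_size(sequence: str, alphabet: str = "ACGT") -> int:
    """Number of dictionary keys after an LZW pass over `sequence`.

    The dictionary starts with the length-one entries; each time the current
    phrase plus the next base is unseen, that longer k-mer is added.
    """
    dictionary = {base: code for code, base in enumerate(alphabet)}
    current = ""
    for base in sequence:
        candidate = current + base
        if candidate in dictionary:
            current = candidate  # keep extending the known phrase
        else:
            dictionary[candidate] = len(dictionary)  # assign the next code
            current = base
    return len(dictionary)


print(lzw_dictionary_size("AAAAAAAA"))  # 7
print(lzw_dictionary_size("ACGTTGCA"))  # 11
```

For sequences of equal length, a repetitive sequence produces fewer keys than a varied one, so a low count signals low complexity.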
- the dual-variant-type call recalibration system 106 determines a structural variant sequence alignment metric as part of the externally sourced sequencing metrics 316. For instance, the dual-variant-type call recalibration system 106 uses gapless alignment scoring and Smith-Waterman alignment scoring of a proposed deletion sequence against the left/right flanking genomic regions in the reference. If there are multiple alignments that score above a threshold gapless alignment score and/or a threshold Smith-Waterman alignment score, the variant-call-integration machine-learning model may process the structural variant sequence alignment metric as an indicator that there is a higher likelihood of an imprecise variant call.
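A minimal sketch of the Smith-Waterman part of this check follows. It assumes a linear gap penalty and illustrative scores (match +2, mismatch -1, gap -2). The threshold, sequences, and function names are hypothetical, and gapless scoring is omitted for brevity.

```python
from typing import Iterable


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def count_flank_hits(deletion: str, flanks: Iterable[str], threshold: int) -> int:
    """Count flanking regions where the deletion sequence aligns at or above threshold."""
    return sum(1 for flank in flanks if smith_waterman_score(deletion, flank) >= threshold)


deletion = "ACGTACGT"
left_flank, right_flank = "TTACGTACGTTT", "GGACGTACGTGG"
# Both flanks contain the deleted sequence, so its placement is ambiguous.
print(count_flank_hits(deletion, [left_flank, right_flank], threshold=12))  # 2
```

A count above one would be the signal that the breakpoints could be placed in more than one way.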
- the dual -variant-type call recalibration system 106 can also determine a simulated read alignment metric as an externally sourced sequencing metric. Assuming that the contiguous sequence representing or including a variant is accurate, there should theoretically be many nucleotide reads with good alignment to the contiguous sequence, even for heterozygous deletions. However, for low evidence true-positive cases of variants, there is a likelihood of missing reads because the reads corresponding to the structural variant (SV) region were either mapped elsewhere or unmapped. The dual -variant-type call recalibration system 106 can thus determine a likelihood of missing reads by simulating reads.
- the dual-variant-type call recalibration system 106 chooses segments from the contiguous sequence equal in length to the SBS reads.
- the dual-variant-type call recalibration system 106 chooses segments of the contiguous sequence that cross the breakend(s), that are equivalent to SBS read length, and that are aligned to the reference sequence in the SV region. For cases where alignment is ambiguous, alternate alignment scores will be higher and can serve as a possible guide for expected read depth.
- the dual -variant-type call recalibration system 106 can further use the segment of the contiguous sequence equivalent to read length that is symmetric about the breakend to obtain the highest alignment scores.
- the dual-variant-type call recalibration system 106 can further determine additional offsets from this symmetric point to check alternate alignment scores for a range of overlaps.
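The segment selection described above can be sketched as follows. The code assumes the breakend is given as the index of the first contig base after the junction and that a simulated read must contain at least one base on each side of it. The function name and the toy contig are hypothetical.

```python
from typing import Iterable, List, Tuple


def simulate_breakend_reads(contig: str, breakend: int, read_length: int,
                            offsets: Iterable[int] = (0,)) -> List[Tuple[int, str]]:
    """Cut read-length segments from `contig` that cross `breakend`.

    Offset 0 gives the segment symmetric about the breakend; other offsets
    shift the window to probe a range of overlaps.
    """
    reads = []
    for offset in offsets:
        start = breakend - read_length // 2 + offset
        end = start + read_length
        if start < 0 or end > len(contig):
            continue  # window falls off the contig
        if not (start < breakend < end):
            continue  # window does not cross the breakend
        reads.append((offset, contig[start:end]))
    return reads


contig = "A" * 10 + "C" * 10  # toy contig with the junction at index 10
for offset, read in simulate_breakend_reads(contig, 10, 6, offsets=(-2, 0, 2)):
    print(offset, read)
# -2 AAAAAC
# 0 AAACCC
# 2 ACCCCC
```

Each simulated read would then be aligned back to the reference so that its best and alternate alignment scores can be compared.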
- the dual-variant-type call recalibration system 106 determines, receives, or extracts additional or alternative sequencing metrics, including read-based sequencing metrics, call-model-generated sequencing metrics, and/or externally sourced sequencing metrics. For example, the dual-variant-type call recalibration system 106 determines, extracts, or receives the sequencing metrics in the following table, where each of the metrics belongs to one or more of the read-based sequencing metrics, call-model-generated sequencing metrics, and/or externally sourced sequencing metrics.
- the dual-variant-type call recalibration system 106 generates genotype probabilities for variants in genomic regions corresponding to germline and somatic mosaic variants.
- the dual-variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model to generate genotype probabilities corresponding to various genomic coordinates.
- the call recalibration system 106 updates or modifies a genotype call by generating an updated genotype-call data file, such as a variant call file (e.g., a recalibrated variant call file), based on the genotype probabilities and/or the variant-call classifications.
- the dual-variant-type call recalibration system 106 determines output genotype calls and generates or modifies a genotype-call file, such as a variant call file, with various information corresponding to the output genotype calls.
- FIGS. 4A-4B illustrate the dual-variant-type call recalibration system 106 generating genotype probabilities and determining output genotype calls according to one or more embodiments.
- the dual-variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model together with a call-generation model to generate genotype calls in genomic regions corresponding to germline and/or somatic mosaic variants.
- the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model to modify data fields corresponding to a variant call file representing one or more genotype calls.
- FIG. 4A illustrates generating variant calls by modifying a variant call file utilizing a variant-call-recalibration machine-learning model and a call-generation model in accordance with one or more embodiments.
- the dual -variant-type call recalibration system 106 accesses a sequencing information database 402 (e.g., the sequencing information database 314), a reference sequence 403, and sequence data 404 (e.g., the sequence data 308) extrapolated from one or more nucleotide reads (e.g., the nucleotide reads 302).
- the dual-variant-type call recalibration system 106 performs sequencing-metric extraction 410 to extract or re-engineer sequencing metrics as described above in relation to FIGS. 3A-3C.
- the dual-variant-type call recalibration system 106 generates read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics.
- the dual-variant-type call recalibration system 106 utilizes mapping-and-alignment components 406 of a call-generation model 420 (e.g., the call-generation model 310) to determine mapping-and-alignment sequencing metrics as described above.
- the dual-variant-type call recalibration system 106 utilizes variant-caller components 408 of the call-generation model 420 to generate variant calling metrics as described above.
- the dual-variant-type call recalibration system 106 determines read-based sequencing metrics and externally sourced sequencing metrics as well (e.g., from the sequencing information database 402 and/or the reference sequence 403).
- the dual -variant-type call recalibration system 106 generates genotype probabilities 414. More specifically, the dual -variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 from the sequencing metrics extracted via the sequencing-metric extraction 410. For example, the variant-call-recalibration machine-learning model 412 generates genotype probabilities 414 for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. While not shown in FIG.
- the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 412 (or a different machine-learning model) to generate variant-call classifications in place of the genotype probabilities 414 (e.g., when identifying indels and/or variants at multiallelic genomic coordinates).
- the dual-variant-type call recalibration system 106 generates genotype probabilities for genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants.
- the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 for a candidate genomic coordinate (e.g., “chr5:4”), including: (i) a first genotype probability that the genomic sample includes a homozygous reference genotype (e.g., “L(0/0)@chr5:4”) at the candidate genomic coordinate, (ii) a second genotype probability that the candidate genomic coordinate includes a heterozygous variant genotype (e.g., “L(0/1)@chr5:4”), and (iii) a third genotype probability that the candidate genomic coordinate includes a homozygous variant genotype (e.g., “L
- the first genotype probability indicates a likelihood of 0.10 that the genotype call is a homozygous reference genotype
- the second genotype probability indicates a likelihood of 0.76 that the genotype call is a heterozygous variant genotype
- the third genotype probability indicates a likelihood of 0.14 that the genotype call is a homozygous variant genotype.
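As a worked illustration of the three example probabilities above (0.10, 0.76, 0.14), the sketch below picks the most probable genotype and derives a Phred-scaled confidence from the probability that the pick is wrong. The Phred scaling, the cap at 99, and the function name are assumptions for illustration. The source does not state how the system converts probabilities into an output call.

```python
import math
from typing import Dict, Tuple


def call_genotype(probabilities: Dict[str, float], max_gq: int = 99) -> Tuple[str, int]:
    """Return the most probable genotype and a Phred-scaled quality.

    Assumption: quality = -10 * log10(1 - p_best), rounded and capped at max_gq.
    """
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    genotype = max(probabilities, key=probabilities.get)
    p_wrong = 1.0 - probabilities[genotype] / total
    if p_wrong <= 0.0:
        return genotype, max_gq
    return genotype, min(max_gq, round(-10.0 * math.log10(p_wrong)))


example = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}
print(call_genotype(example))  # ('0/1', 6)
```

With a 0.24 chance of error the Phred-scaled value is about 6.2, which rounds to 6 and would mark this heterozygous call as low confidence.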
- the variant indicated by a “1” in a candidate genotype corresponds to a germline variant.
- the variant indicated by a “1” in a candidate genotype corresponds to a somatic mosaic variant.
- both types of variants (i.e., germline variants and somatic mosaic variants) are accounted for by the genotype probabilities 414 output by the variant-call-recalibration machine-learning model 412.
- FIG. 4A shows a particular formatting of the genotype probabilities 414
- alternative embodiments can include additional or alternative formatting, such as genotype probabilities for candidate genotypes in which the format specifies a candidate germline variant at a particular position(s) and a candidate somatic mosaic variant at another position.
- the output genotype probabilities may take the form of probabilities for candidate genotypes represented by three positions (e.g., “L(0/1/0)” or “L(0/0/1)”) in which (i) the initial two positions of binary code separated by a slash represent a candidate germline genotype corresponding to either a reference base (i.e., designated as “0” in either the first or second position) or a germline variant (i.e., designated as “1” in the first or second position) and (ii) the last position of binary code following the second slash represents a presence or absence of a candidate somatic mosaic variant (i.e., no somatic mosaic variant designated as “0” and a somatic mosaic variant designated as “1” in the third position).
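The three-position format can be made concrete with a small parser. It assumes labels of the exact form `L(a/b/s)` with binary digits. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Tuple

_LABEL = re.compile(r"^L\(([01])/([01])/([01])\)$")


class DualTypeGenotype(NamedTuple):
    germline: Tuple[int, int]  # first two positions: 0 = reference, 1 = germline variant
    somatic_mosaic: bool       # third position: presence of a somatic mosaic variant


def parse_genotype_label(label: str) -> DualTypeGenotype:
    """Parse a label such as "L(0/1/0)" into its germline and somatic parts."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized genotype label: {label!r}")
    first, second, somatic = (int(group) for group in match.groups())
    return DualTypeGenotype((first, second), bool(somatic))


print(parse_genotype_label("L(0/1/0)"))
# DualTypeGenotype(germline=(0, 1), somatic_mosaic=False)
print(parse_genotype_label("L(0/0/1)"))
# DualTypeGenotype(germline=(0, 0), somatic_mosaic=True)
```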
- the dual-variant-type call recalibration system 106 generates a genotype call for a haploid genomic coordinate.
- the variant-call-recalibration machine-learning model 412 generates the genotype probabilities 414 for a haploid genomic coordinate as follows: (i) a first genotype probability of a first genotype at the genomic coordinate and (ii) a second genotype probability of a second genotype at the genomic coordinate.
- the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype
- the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype.
- the variant-call-recalibration machine-learning model 412 can also generate variant-call classifications including: (i) a false-positive probability or a homozygous reference classification indicating a probability that a genotype call is a false positive or a homozygous reference genotype, respectively; (ii) a zygosity-error probability or a heterozygous genotype classification indicating a probability that a genotype (e.g., an indication of a heterozygous or homozygous genotype for a variant call at a particular location) is incorrect or a heterozygous genotype, respectively; and/or (iii) a true-positive classification or a homozygous alternate classification indicating a probability that a genotype call is a true positive or a homozygous alternate genotype, respectively.
- the generated variant-call classifications accordingly represent intermediate scoring metrics and
- the variant-call-recalibration machine-learning model 412 is an ensemble of gradient boosted trees that processes the sequencing metrics to generate the genotype probabilities 414.
- the variant-call-recalibration machine-learning model 412 can include a series of weak learners such as non-linear decision trees that are trained in a logistic regression to generate the genotype probabilities 414.
- the variant-call-recalibration machine-learning model 412 includes metrics within various trees that define how the variant-call-recalibration machine-learning model 412 processes the sequencing metrics to generate the genotype probabilities 414. Additional detail regarding the training of the variant-call-recalibration machine-learning model 412 is provided below with reference to FIGS. 5A-5B and 6.
- the variant-call-recalibration machine-learning model 412 is a different type of machine learning model such as a neural network, a support vector machine, or a random forest.
- the variant-call-recalibration machine-learning model 412 includes one or more layers each with neurons that make up the layer for processing the sequencing metrics.
- the variant-call-recalibration machine-learning model 412 generates the genotype probabilities 414 by extracting latent vectors from the sequencing metrics, passing the latent vectors from layer to layer (or neuron to neuron) to manipulate the vectors until utilizing an output layer (e.g., one or more fully connected layers) to generate the genotype probabilities.
- the dual -variant-type call recalibration system 106 utilizes statistics to summarize a mapping quality distribution of reference supporting reads and alternative supporting reads (e.g., for a comparative- mapping-quality-distribution metric).
- the dual -variant-type call recalibration system 106 can determine and utilize the mean of the MAPQ for reads supporting an alternative allele from SBS reads and from assembled nucleotide reads.
- the variant-call-recalibration machine-learning model 412 learns from the data that, when the MAPQ of an alternative allele (indicated by SBS reads or assembled nucleotide reads) is low and a depth metric is high relative to other MAPQ and depth metrics in distributions, a resultant genotype call is more likely to be a false positive. Indeed, as the probability of a false positive increases, the MAPQ metrics would likely decrease.
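The summary statistics discussed above can be sketched as simple features computed from per-read MAPQ values. The feature names and example values are hypothetical, and the relationship between these features and false positives is learned by the model, not hard-coded as in any rule here.

```python
from statistics import mean
from typing import Dict, Sequence


def comparative_mapq_features(ref_mapqs: Sequence[int],
                              alt_mapqs: Sequence[int]) -> Dict[str, float]:
    """Summarize MAPQ for reference-supporting and alternative-supporting reads."""
    ref_mean = mean(ref_mapqs) if ref_mapqs else 0.0
    alt_mean = mean(alt_mapqs) if alt_mapqs else 0.0
    return {
        "ref_mean_mapq": ref_mean,
        "alt_mean_mapq": alt_mean,
        "alt_minus_ref_mapq": alt_mean - ref_mean,  # negative when alt reads map worse
        "alt_depth": len(alt_mapqs),
    }


features = comparative_mapq_features([60, 60, 60], [10, 20, 30])
print(features["alt_minus_ref_mapq"])  # -40
```

A strongly negative difference combined with a large `alt_depth` matches the pattern the text associates with likely false positives.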
- the dual -variant-type call recalibration system 106 compares a mapping quality (e.g., MAPQ) associated with an SBS read and/or an assembled nucleotide read with a mapping-quality threshold.
- the dual-variant-type call recalibration system 106 utilizes a mapping-quality threshold such as a threshold difference between best and second-best alignment scores. Upon determining that one or more of mapping qualities for the different read types does not satisfy the threshold, the dual-variant-type call recalibration system 106 adjusts one or more of the genotype probabilities 414 accordingly (e.g., to select a read with a higher MAPQ).
- the dual-variant-type call recalibration system 106 can determine the genotype probabilities 414 by utilizing an accumulation of statistical analyses over complex functions (depending on the architecture of the variant-call-recalibration machine-learning model 412) to determine how to best fit the data. For example, as described above, the dual-variant-type call recalibration system 106 trains the variant-call-recalibration machine-learning model 412 to minimize a loss generated from a number of (different types of) sequencing metrics to determine weights and biases that best fit the data (e.g., that result in a reduced or minimized loss).
- the dual -variant-type call recalibration system 106 performs data field generation 416. More specifically, the dual -variant-type call recalibration system 106 generates or modifies data fields for a variant call file 418. To generate (or modify) the variant call file 418, the dual-variant-type call recalibration system 106 utilizes the variant-caller components 408 of the call-generation model 420 and modifies or maintains values for such data fields based on the genotype probabilities 414 generated by the variant-call-recalibration machine-learning model 412.
- the dual -variant-type call recalibration system 106 modifies various metrics such as quality metrics, mapping metrics, or other metrics associated with the genotype call. As mentioned, in some cases, the dual -variant-type call recalibration system 106 selects metrics associated with nucleotide reads and/or associated with the genotype probabilities 414. In other cases, the dual -variant-type call recalibration system 106 generates new metrics from the data generated by the call-generation model 420 and/or the variant-call-recalibration machine-learning model 412.
- the genotype call is represented or defined by the variant call file 418 which includes metrics corresponding to the data fields, such as a call-quality metric corresponding to a call-quality field, a genotype metric corresponding to a genotype field, and a genotype-quality metric corresponding to a genotype-quality field.
- the dual -variant-type call recalibration system 106 indicates variant calls within the variant call file 418 (or other sequencing data file) without an indication of a germline variant or a somatic mosaic variant.
- the dual-variant-type call recalibration system 106, having predicted or otherwise determined that an identified variant call corresponds to a somatic mosaic variant, includes an indication within the variant call file 418 that the identified variant is a somatic mosaic variant.
- Such an indicator may take the form of an acronym (e.g., “GV” for germline variant and “SMV” for somatic mosaic variant), a color (e.g., green for germline variant and red for somatic mosaic variant), a code (e.g., “7” for germline variant and “9” for somatic mosaic variant), or any combination or other suitable indicator.
- the dual -variant-type call recalibration system 106 generates and reports variant calls in a VCF or other sequencing data file (e.g., a first variant call at a first genomic coordinate and a second variant call at a second genomic coordinate different than the first genomic coordinate) but does not include a specific indication that a particular variant call is either a germline variant or a somatic mosaic variant.
- the VCF or other sequencing data file includes no acronym, color, or code indicating that a particular variant call is a germline variant or a somatic mosaic variant.
- the dual -variant-type call recalibration system 106 generates (data fields for) a genotype call utilizing the variant-caller components 408 together with the genotype probabilities 414. For instance, the dual -variant-type call recalibration system 106 generates, for inclusion within the variant call file 418, data fields for various metrics of a genotype call such as nucleotide(s) included in the call, a call quality (QUAL), a genotype (GT), a genotype quality (GQ), one or more normalized PHRED-scale likelihoods (PL), a genotype probability (GP), an allele frequency (AF), allele count (AC), and/or total number of alleles (AN).
- QUAL call quality
- GT genotype
- GQ genotype quality
- PL normalized PHRED-scale likelihoods
- GP genotype probability
- AF allele frequency
- AC allele count
- AN total number of alleles
- the allele frequency (AF) can indicate whether a variant call corresponds to a germline variant (e.g., with an AF of approximately 0.5 or 1.0) or a somatic mosaic variant (e.g., with a relatively low AF, such as an AF less than 0.5).
- the dual-variant-type call recalibration system 106 can require a threshold AF before including a particular call within the variant call file 418, wherein the threshold AF is sufficiently high to allow for identification of somatic mosaic variants of relatively low AF (e.g., a threshold AF of 0.05, 0.1, or 0.15).
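The allele-frequency reasoning above can be illustrated with a minimal sketch. The function name `af_hint`, the tolerance `tol`, and the returned labels are hypothetical and chosen only for illustration; the source gives the example AF values (approximately 0.5 or 1.0 for germline, below 0.5 for somatic mosaic, and a minimum threshold such as 0.05) but does not prescribe this logic.

```python
def af_hint(af, min_af=0.05, tol=0.1):
    """Return a coarse, illustrative label for a variant call from its allele frequency.

    min_af: assumed minimum AF required before a call is reported at all.
    tol:    assumed tolerance around the germline-typical AFs of 0.5 and 1.0.
    """
    if not 0.0 <= af <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    if af < min_af:
        return "below_threshold"      # not included in the variant call file
    if abs(af - 0.5) <= tol or abs(af - 1.0) <= tol:
        return "germline_like"        # AF near heterozygous or homozygous expectation
    if af < 0.5:
        return "mosaic_like"          # relatively low AF, consistent with mosaicism
    return "ambiguous"                # AF between the germline-typical values


for value in (0.02, 0.12, 0.5, 0.75, 0.97):
    print(value, af_hint(value))
```

AF alone is only a hint; in the described system the genotype probabilities from the machine-learning model, not a fixed AF rule, drive recalibration.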
- the dual -variant-type call recalibration system 106 recalibrates or modifies a genotype call (or generates a new genotype call) using the genotype probabilities 414 from the variant-call-recalibration machine-learning model 412. As described, the dual -variant-type call recalibration system 106 modifies the genotype call by modifying or recalibrating data fields for one or more of the metrics associated with the genotype call (e.g., as included within the variant call file 418).
- the dual-variant-type call recalibration system 106 determines how each of the genotype probabilities 414 impacts or affects the base-call-quality metric. For example, the dual-variant-type call recalibration system 106 determines that a high probability for a genotype error results in a lower overall genotype quality and possibly a different overall call quality. As another example, the dual-variant-type call recalibration system 106 determines that a high probability for a false positive variant results in a lower overall call quality.
- QUAL call-quality metric
- the dual- variant-type call recalibration system 106 determines that a high probability for a true positive variant results in a higher overall (variant) call quality.
- the dual-variant-type call recalibration system 106 accordingly updates the genotype along with the genotype quality and the call quality associated with the genotype call.
- the dual -variant-type call recalibration system 106 generates a combination (e.g., a weighted combination or an average) of the genotype probabilities 414 to recalibrate the call-quality metric.
- the dual -variant-type call recalibration system 106 weights the various predictions of the genotype probabilities 414 according to their respective impact on (variant) call quality.
- the dual-variant-type call recalibration system 106 weights each genotype probability equally, while in other cases the dual-variant-type call recalibration system 106 determines different weights for each.
- the dual -variant-type call recalibration system 106 determines a weighted combination or a weighted average of the genotype probabilities 414 to recalibrate (increase or decrease) a call-quality metric for a genotype call (e.g., an initial variant call).
- the dual -variant-type call recalibration system 106 utilizes one or more of the genotype probabilities 414. For example, the dual -variant-type call recalibration system 106 compares the various constituent predictions of each to determine which of the genotype probabilities 414 has a highest probability. In some cases, the dual -variant-type call recalibration system 106 utilizes the genotype probability with the highest probability to recalibrate the genotype metric (e.g., from 0 as corresponding to the reference base to 1 as corresponding to a first alternative supporting read).
- the dual-variant-type call recalibration system 106 utilizes one or more of the genotype probabilities 414. More specifically, the dual-variant-type call recalibration system 106 determines how each of the genotype probabilities 414 affects the genotype-quality metric. The dual-variant-type call recalibration system 106 recalibrates the genotype-quality metric accordingly (e.g., by increasing or decreasing the quality score on a scale from 0 to 10, from 0 to 100, or on some other scale).
- the dual -variant-type call recalibration system 106 determines that a higher genotype error probability (generally) indicates a lower genotype-quality metric, and the dual -variant-type call recalibration system 106 reduces the metric accordingly.
- the dual-variant-type call recalibration system 106 determines a combination (e.g., a weighted combination or a weighted average) of the genotype probabilities 414 to modify the genotype-quality metric. For example, the dual-variant-type call recalibration system 106 determines a combined effect that the genotype probabilities 414 have on the genotype-quality metric. As another example, the dual-variant-type call recalibration system 106 determines individual impacts that each constituent prediction of the genotype probabilities 414 has on the genotype-quality metric and weights each accordingly. The dual-variant-type call recalibration system 106 further recalibrates the genotype-quality metric by increasing or decreasing its value based on the indicated probabilities.
- the dual-variant-type call recalibration system 106 generates an output genotype call from the same set of sequencing metrics (or a subset of the sequencing metrics that are shared between the variant-call-recalibration machine-learning model 412 and the call-generation model 420). Indeed, the dual-variant-type call recalibration system 106 can operate the variant-call-recalibration machine-learning model 412 in parallel with the call-generation model 420 to generate metrics for an output genotype call and genotype probabilities 414 for recalibrating the generated metrics.
- the dual-variant-type call recalibration system 106 updates or otherwise modifies the data fields for the variant call file 418 according to particular algorithms. After modifying such data fields, the dual-variant-type call recalibration system 106 can generate the variant call file 418 (e.g., a post-filter variant call file) to include metrics reflecting the updated data fields. For instance, in some cases, the dual-variant-type call recalibration system 106 updates the QUAL field for every variant based on the probability of a false positive variant. As indicated above, in some cases, QUAL indicates the probability that there is some kind of variant (or other nucleobase call) at a given location, measured in PHRED scale.
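As a hedged illustration of updating QUAL from a false-positive probability, the following sketch uses the standard PHRED relation Q = -10 · log10(p), where p is an error probability. The function names and the cap of 99 are assumptions made for this example; the source states only that QUAL is measured in PHRED scale and is updated based on the probability of a false positive variant.

```python
import math


def phred_from_error(p_error, cap=99.0):
    """Convert an error probability to a PHRED-scaled quality, Q = -10 * log10(p).

    cap is an assumed upper bound that keeps Q finite when p_error is 0.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_error == 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


def recalibrated_qual(p_false_positive):
    """Hypothetical helper: derive a QUAL value from a model's false-positive probability."""
    return phred_from_error(p_false_positive)


# A false-positive probability of 0.01 corresponds to a QUAL of about 20,
# and 0.001 corresponds to about 30.
for p in (0.5, 0.01, 0.001):
    print(p, round(recalibrated_qual(p), 2))
```

Under this relation, a higher false-positive probability lowers QUAL, which matches the direction of adjustment described in the text.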
- the dual-variant-type call recalibration system 106 increases or decreases a base-call-quality metric (e.g., Q score) for a genotype call. Based on the genotype probabilities 414, for example, the dual-variant-type call recalibration system 106 increases base-call-quality metrics for genotype calls that would not have previously passed a quality filter and determines that the increased base-call-quality metrics now pass the quality filter. In some such cases, the dual-variant-type call recalibration system 106 includes genotype calls with such increased base-call-quality metrics (passing the quality filter) in a post-filter variant call file.
- the dual-variant-type call recalibration system 106 decreases base-call-quality metrics for genotype calls that previously would have passed a quality filter and determines that the decreased base-call-quality metrics now fail the quality filter. In some such cases, the dual -variant-type call recalibration system 106 excludes genotype calls with decreased base-call-quality metrics (failing the quality filter) from a post-filter variant call file but includes the genotype calls with such decreased base-call-quality metrics in a pre-filter variant call file.
- the dual-variant-type call recalibration system 106 can remove false positive variant calls and recover false negative variant calls by changing corresponding base-call-quality metrics.
- the dual-variant-type call recalibration system 106 decreases the base-call-quality metric of a genotype call that initially passed a quality filter, based on the genotype probabilities 414 from the variant-call-recalibration machine-learning model 412. Based on determining that the decreased base-call-quality metric falls below a threshold metric (e.g., a Q score of 3.0 or 10.0), the dual-variant-type call recalibration system 106 determines that the genotype call no longer passes the quality filter. The dual-variant-type call recalibration system 106 thus filters out, or removes, the false-positive genotype call that initially passed the filter by changing its base-call-quality metric.
- the dual-variant-type call recalibration system 106 does not identify the genotype call as a variant and, in some cases, excludes data for the genotype call from the variant call file 418. For instance, the dual-variant-type call recalibration system 106 can use a null-data indicator for a genotype call (or a particular field) of the variant call file 418. In some cases, the dual-variant-type call recalibration system 106 uses a null-data indicator in cases where a certain sequencing metric does not apply to a particular variant call or VCF field (e.g., where SBS-based calls use different metrics than assembled-nucleotide-read-based calls).
- the dual-variant-type call recalibration system 106 increases the base-call-quality metric of a genotype call that initially failed a quality filter. Based on determining that the increased base-call-quality metric exceeds a threshold metric, the dual-variant-type call recalibration system 106 determines that the genotype call passes the quality filter. The dual-variant-type call recalibration system 106 thus recovers a false-negative genotype call that was initially filtered out by changing its base-call-quality metric.
- Based on the differing genotype of the updated genotype call and a passing base-call-quality metric, the dual-variant-type call recalibration system 106 identifies the genotype call as a variant and includes the genotype call within the variant call file 418.
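The filtering behavior described above, in which recalibrated quality metrics cause some calls to newly pass and others to newly fail a quality filter, can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based inputs, and the default threshold of 20 are assumptions, not details taken from the source.

```python
def refilter_calls(original_q, recalibrated_q, threshold=20.0):
    """Re-apply a quality filter after recalibration and report what changed.

    original_q:     dict mapping a call identifier to its initial quality score.
    recalibrated_q: dict mapping a call identifier to its recalibrated score
                    (calls missing here keep their original score).
    threshold:      assumed pass threshold for the quality filter.
    """
    outcome = {"kept": [], "recovered": [], "removed": [], "still_filtered": []}
    for call_id, old_q in original_q.items():
        new_q = recalibrated_q.get(call_id, old_q)
        passed_before = old_q >= threshold
        passes_now = new_q >= threshold
        if passes_now and passed_before:
            outcome["kept"].append(call_id)
        elif passes_now:
            outcome["recovered"].append(call_id)       # likely false negative recovered
        elif passed_before:
            outcome["removed"].append(call_id)         # likely false positive removed
        else:
            outcome["still_filtered"].append(call_id)
    return outcome


initial = {"chr1:100": 25.0, "chr1:200": 12.0, "chr2:50": 30.0, "chr3:7": 5.0}
updated = {"chr1:100": 8.0, "chr1:200": 27.5}
result = refilter_calls(initial, updated)
print(result)
```

In this sketch, a post-filter variant call file would contain the "kept" and "recovered" calls, while a pre-filter file could retain all four groups.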
- the dual-variant-type call recalibration system 106 operates in a specific sequential order utilizing the call-generation model 420 and the variant-call-recalibration machine-learning model 412. For example, the dual-variant-type call recalibration system 106 generates a FASTQ file by converting a BCL file to FASTQ.
- the dual-variant-type call recalibration system 106 (subsequently) utilizes the mapping-and-alignment components 406 of the call-generation model 420 to map and align nucleobases from a sample nucleotide sequence.
- the dual-variant-type call recalibration system 106 maps and aligns the nucleobases of the sample sequence in relation to the reference sequence 403 (e.g., reference genome) and/or various alternative supporting reads.
- the dual-variant-type call recalibration system 106 utilizes the variant-caller components 408 of the call-generation model 420 to generate an initial genotype call for the sample sequence corresponding to a particular genomic coordinate — based on various sequencing metrics.
- the dual-variant-type call recalibration system 106 also applies the variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 from sequencing metrics extracted via the mapping and aligning, the variant calling, and/or from other sources as described above.
- the dual-variant-type call recalibration system 106 recalibrates the genotype call (e.g., by modifying various data fields corresponding to specific metrics of the nucleobase call, such as QUAL, GT, GQ, GP, AF, and/or PL), as described above.
- the dual -variant-type call recalibration system 106 further applies a quality filter to the genotype call to determine whether the genotype call passes the quality filter (e.g., a hard pass filter of Q20 or other Q score).
- the dual-variant-type call recalibration system 106 subsequently identifies a subset of genotype calls that represent variants from reference bases and pass the quality filter.
- the dual-variant-type call recalibration system 106 further generates a modified or updated variant call file (e.g., the variant call file 418) that includes the subset of genotype calls and recalibrated metrics for the subset of genotype calls, such as updated QUAL metrics, updated GT metrics, updated GQ metrics, updated GP metrics, and/or updated PL metrics.
- the dual-variant-type call recalibration system 106 can utilize multiple call-recalibration machine-learning sub-models together to generate genotype probabilities and/or genotype calls for nucleotide reads within regions corresponding to germline and somatic mosaic variants.
- For example, FIG. 4B illustrates the dual-variant-type call recalibration system 106 utilizing a machine-learning sub-model 424a (e.g., a germline-specific sub-model) of a variant-call-recalibration machine-learning model 422 to generate a first set of genotype probabilities 430a and a machine-learning sub-model 424b (i.e., a dual-variant-type sub-model or a somatic-mosaic-specific sub-model) of the variant-call-recalibration machine-learning model 422 (e.g., with the same or a different architecture) to generate a second set of genotype probabilities 430b.
- the dual -variant-type call recalibration system 106 utilizes two (or more) different call-recalibration machine-learning models in parallel, each trained with different truth datasets, resulting in different genotype probabilities from the same sequencing metrics. For instance, in some implementations, the dual -variant-type call recalibration system 106 trains the machine-learning sub-model 424a to identify germline variants within sample nucleotide sequences by training the machine-learning sub-model 424a with truth datasets comprising ground truth germline variants.
- the dual-variant-type call recalibration system 106 trains the machine-learning sub-model 424b to identify variants corresponding to germline variants and somatic mosaic variants by training the machine-learning sub-model 424b with truth datasets comprising ground truth germline variants and ground truth somatic mosaic variants.
- the dual-variant-type call recalibration system 106 trains the machine-learning sub-model 424b to exclusively identify somatic mosaic variants. Additional details with respect to training a variant-call-recalibration machine-learning model (or sub-models thereof) are provided below in relation to FIGS. 5A-5B and 6.
- the dual -variant-type call recalibration system 106 determines or extracts sequencing metrics 426 from sequence data 428 corresponding to one or more nucleotide reads of a genomic sample (e.g., as described above in relation to FIGS. 3A-3C and 4A).
- the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics 426 including read-based metrics, call-model-generated metrics, and externally sourced metrics from the sequence data 428 and utilizes the variant-call-recalibration machine-learning model 422 to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants based on the determined or extracted sequencing metrics 426.
- the variant-call-recalibration machine-learning model 422 utilizes the machine-learning sub-model 424a to generate the first set of genotype probabilities 430a corresponding to one or more germline variants and, based on the first set of genotype probabilities 430a, determines one or more genotype calls 432a comprising one or more germline variants.
- the variant-call-recalibration machine-learning model 422 utilizes the machine-learning sub-model 424b to generate the second set of genotype probabilities 430b corresponding to one or more germline variants and/or one or more somatic mosaic variants and, based on the second set of genotype probabilities 430b, determines one or more genotype calls 432b comprising one or more somatic mosaic variants and, in some cases, one or more germline variants.
- the variant-call-recalibration machine-learning model 422 utilizes the second machine-learning sub-model 424b to exclusively determine genotype calls corresponding to somatic mosaic variants.
- the dual-variant-type call recalibration system 106 can generate (or modify) a genotype call file, such as a variant call file 434, to include genotype calls from the first set of genotype calls 432a and/or the second set of genotype calls 432b.
- the variant call file 434 includes germline variants from the first set of genotype calls 432a and somatic mosaic variants and/or germline variants from the second set of genotype calls 432b.
- the dual-variant-type call recalibration system 106 can compare the genotype probabilities 430a generated by the machine-learning sub-model 424a with the genotype probabilities 430b generated by the machine-learning sub-model 424b to identify one or more genotype calls as somatic mosaic variant calls 432c.
- the dual-variant-type call recalibration system 106 compares, for a genotype call at a candidate genomic coordinate, a first genotype probability of the genotype probabilities 430a generated by the machine-learning sub-model 424a with a second genotype probability of the genotype probabilities 430b generated by the machine-learning sub-model 424b. Based on the comparison, the dual-variant-type call recalibration system 106 identifies the genotype call comprising a variant at the candidate genomic coordinate as a somatic mosaic variant.
- the dual-variant-type call recalibration system 106 may determine that the second genotype probability of the genotype probabilities 430b exceeds (or exceeds by a threshold percentage or fixed number) the first genotype probability of the genotype probabilities 430a and, therefore, identify the genotype call comprising a variant at the candidate genomic coordinate as a somatic mosaic variant.
- the dual -variant-type call recalibration system 106 can specifically identify the somatic mosaic variant calls 432c within the variant call file 434.
- the dual-variant-type call recalibration system 106 may determine that the first genotype probability of the genotype probabilities 430a exceeds (or exceeds by a threshold percentage or fixed number) the second genotype probability of the genotype probabilities 430b and, therefore, identify the genotype call comprising a variant at the candidate genomic coordinate as a germline variant. In some embodiments, having identified one or more genotype calls as germline variants, the dual-variant-type call recalibration system 106 can specifically identify the germline variants within the variant call file 434.
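The comparison between the two sub-models' genotype probabilities can be sketched as a simple margin rule. The function name, the fixed `margin` value, and the "undetermined" label are assumptions introduced for illustration; the source says only that one probability may exceed the other, optionally by a threshold percentage or fixed number.

```python
def label_variant(p_germline_model, p_dual_model, margin=0.1):
    """Label a call by comparing variant probabilities from two sub-models.

    p_germline_model: probability from the germline-specific sub-model (424a).
    p_dual_model:     probability from the dual-variant-type or
                      somatic-mosaic-specific sub-model (424b).
    margin:           assumed fixed amount by which one must exceed the other.
    """
    difference = p_dual_model - p_germline_model
    if difference > margin:
        return "somatic_mosaic"   # second sub-model is clearly more confident
    if -difference > margin:
        return "germline"         # first sub-model is clearly more confident
    return "undetermined"         # neither exceeds the other by the margin


print(label_variant(0.20, 0.90))
print(label_variant(0.95, 0.50))
print(label_variant(0.60, 0.62))
```

A percentage-based threshold could be substituted by comparing the ratio of the two probabilities instead of their difference.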
- the dual-variant-type call recalibration system 106 further generates a combined set of genotype probabilities from the different genotype probabilities generated via the different sub-models of the variant-call-recalibration machine-learning model 422. In some cases, the dual-variant-type call recalibration system 106 selects genotype probabilities from the set of genotype probabilities 430a generated by the machine-learning sub-model 424a and the set of genotype probabilities 430b generated by the machine-learning sub-model 424b.
- the dual-variant-type call recalibration system 106 determines an average or a weighted combination of the respective sets of genotype probabilities to generate combined genotype probabilities for recalibrating a genotype call. In some embodiments, the dual-variant-type call recalibration system 106 determines a mean for each genotype probability across each sub-model of the variant-call-recalibration machine-learning model 422 and renormalizes the mean genotype probability. In other embodiments, the dual-variant-type call recalibration system 106 learns linear weights and adapts the weights to minimize overall error or loss for the genotype probabilities. In still other embodiments, the dual-variant-type call recalibration system 106 weights the genotype probabilities for each sub-model based on the inverse of average error across the models.
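Two of the combination strategies named above, the renormalized mean and inverse-error weighting, can be sketched as follows. The function names are hypothetical, and the example assumes each sub-model emits a probability vector over the same ordered set of outcomes; the learned-linear-weights variant is omitted because it requires a training loop.

```python
def mean_renormalized(prob_sets):
    """Average each probability across sub-models, then renormalize to sum to 1."""
    count = len(prob_sets)
    means = [sum(column) / count for column in zip(*prob_sets)]
    total = sum(means)
    return [m / total for m in means]


def inverse_error_weighted(prob_sets, avg_errors):
    """Weight each sub-model's probabilities by the inverse of its average error."""
    weights = [1.0 / err for err in avg_errors]
    weight_sum = sum(weights)
    combined = [
        sum(w * p for w, p in zip(weights, column)) / weight_sum
        for column in zip(*prob_sets)
    ]
    total = sum(combined)
    return [c / total for c in combined]


# Two sub-models, three outcome probabilities each (illustrative values only).
sub_model_a = [0.8, 0.1, 0.1]
sub_model_b = [0.6, 0.3, 0.1]
print(mean_renormalized([sub_model_a, sub_model_b]))
print(inverse_error_weighted([sub_model_a, sub_model_b], avg_errors=[0.1, 0.3]))
```

With inverse-error weighting, the sub-model with the lower average error (here the first) pulls the combined probabilities toward its own predictions.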
- the dual-variant-type call recalibration system 106 provides a selectable option to a user for adjusting a variant sensitivity of the variant-call-recalibration machine-learning model 422.
- the variant sensitivity of the variant-call-recalibration machine-learning model in generating genotype probabilities can be adjusted to implement detection of candidate somatic mosaic variants.
- such variant sensitivity may be set to detect and report variants that equal or exceed a particular genotype probability (e.g., 0.45 or 0.50) and/or that equal or exceed a particular allele frequency (e.g., 0.15 or 0.20) as supported by nucleotide reads covering a genomic coordinate.
- the dual-variant-type call recalibration system 106 can exclusively generate genotype probabilities corresponding to germline variants (e.g., by solely utilizing the machine-learning sub-model 424a of the variant-call-recalibration machine-learning model 422).
- the dual-variant-type call recalibration system 106 executes the variant-call-recalibration machine-learning model 422 to generate the genotype probabilities corresponding to candidate germline variants and candidate somatic mosaic variants (e.g., by executing the machine-learning sub-model 424b or both machine-learning sub-models of the variant-call-recalibration machine-learning model 422 as described above).
- the dual -variant-type call recalibration system 106 can utilize various sources of truth data to train a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants.
- FIG. 5A illustrates the dual-variant-type call recalibration system 106 generating training data by synthetically modifying existing sequencing files to include somatic mosaic variants at various allele frequencies.
- the dual-variant-type call recalibration system 106 synthetically modifies existing sequencing files to include variants in nucleotide reads at different read depths to simulate the relatively lower depths of somatic mosaic variants in read data and/or to include somatic mosaic variants in exome regions that may include somatic mosaic variants.
- the dual -variant-type call recalibration system 106 identifies or receives a genome sample 502 comprising sample nucleotide reads 504.
- the sample nucleotide reads 504 include one or more known germline variants usable as ground truth for training a call-recalibration machine-learning model to identify such variants within sample nucleotide sequences.
- To generate ground truth data for training the variant-call-recalibration machine-learning model to identify somatic mosaic variants, the dual-variant-type call recalibration system 106 generates multiple synthetic nucleotide reads 506 within the genome sample 502 by altering a portion of the sample nucleotide reads 504.
- the dual-variant-type call recalibration system 106 generates the synthetic nucleotide reads 506 within the genome sample 502 at one or more allele frequencies representative of one or more ground-truth somatic mosaic variants 512.
- the synthetic nucleotide reads 506 can likewise be generated from methods using or not using polymerase chain reaction (PCR).
- the dual-variant-type call recalibration system 106 generates the synthetic nucleotide reads 506 at one or more read depths (e.g., 10X, 15X) to mimic the relatively lower read depths of somatic mosaic variants relative to germline variants. Also, in some embodiments, the dual-variant-type call recalibration system 106 can generate the synthetic nucleotide reads 506 in one or more exome regions.
- the dual -variant-type call recalibration system 106 utilizes one or more editing tools (e.g., BAMSurgeon or similar applications/tools) to add simulated variants to existing alignment data files (e.g., binary alignment map (BAM) files or similar files).
- BAM binary alignment map
- the dual -variant-type call recalibration system 106 can add single nucleotide variants (SNVs), insertions or deletions (INDELs), and/or several forms of structural variants (SV) to existing alignment data files to generate sample data and corresponding ground truth for training of a variant-call-recalibration machine-learning model, as further illustrated in FIG. 5A.
- SNVs single nucleotide variants
- INDELs insertions or deletions
- SV structural variants
- the dual -variant-type call recalibration system 106 generates modified sample sequencing data 508, including the synthetic nucleotide reads 506 and at least a portion of the original sample nucleotide reads 504 (i.e., any remaining unaltered reads of the sample nucleotide reads 504). Further, the dual-variant-type call recalibration system 106 determines or extracts sample sequencing metrics 510 based on the modified sample sequencing data 508 (e.g., such as described above in relation to FIGS. 3A-3C).
- the sample sequencing metrics 510 include sample-read-based sequencing metrics based on the remaining unaltered reads of the sample nucleotide reads 504, as well as synthetic-read-based sequencing metrics based on the synthetic nucleotide reads 506.
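The idea of altering a portion of sample reads to create synthetic reads at a chosen allele frequency can be illustrated with a toy sketch. This is not how BAMSurgeon or any alignment-file editor works internally; it operates on plain strings, and the function name, the `offset` parameter, and the fixed random seed are all assumptions made for the example.

```python
import random


def spike_in_snv(reads, offset, alt_base, target_af, seed=0):
    """Return a copy of `reads` in which a fraction carries a simulated SNV.

    reads:     list of read strings that all cover the same position.
    offset:    index within each read of the position to alter (assumed shared).
    alt_base:  alternate base to write at that position.
    target_af: desired allele frequency of the simulated somatic mosaic variant.
    """
    rng = random.Random(seed)                      # fixed seed for reproducibility
    n_alt = round(target_af * len(reads))          # number of reads to alter
    chosen = set(rng.sample(range(len(reads)), n_alt))
    modified = []
    for index, read in enumerate(reads):
        if index in chosen:
            read = read[:offset] + alt_base + read[offset + 1:]
        modified.append(read)
    return modified


sample_reads = ["ACGTACGT"] * 40                   # 40X coverage of one position
synthetic = spike_in_snv(sample_reads, offset=3, alt_base="A", target_af=0.1)
observed_af = sum(r[3] == "A" for r in synthetic) / len(synthetic)
print(observed_af)
```

The altered reads play the role of the synthetic nucleotide reads 506, the untouched reads correspond to the remaining sample nucleotide reads 504, and the spiked-in position with its target allele frequency serves as the ground-truth label.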
- the dual-variant-type call recalibration system 106 can utilize the modified sample sequencing data 508, the ground-truth somatic mosaic variants 512, and any ground-truth germline variants remaining after modifying the genome sample 502 to train a variant-call-recalibration machine-learning model to generate genotype probabilities for germline and somatic mosaic variants.
- the dual-variant-type call recalibration system 106 can also utilize other sample nucleotide reads and corresponding sequencing data — which have not been synthetically modified — as ground truth training data for germline variants.
- the dual-variant-type call recalibration system 106 utilizes an admixture of germline truth sets to simulate somatic mosaicisms in ground truth data for training a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants.
- FIG. 5B illustrates the dual-variant-type call recalibration system 106 determining subsets (e.g., percentages) of sample genomic sequences from a combination of male and female genomic samples that together simulate variant-allele frequencies of a genome sample with somatic mosaicism.
- the dual -variant-type call recalibration system 106 determines subsets of sample nucleotide sequences from different genomic samples forming an admixture genome. When the corresponding subsets are mixed together, the admixture genome simulates a genomic sample with somatic mosaicism.
- the dual -variant-type call recalibration system 106 determines a percentage of sample nucleotide sequences 522a from a first genome sample 520a and a percentage of sample nucleotide sequences 522b from a second genome sample 520b that, when mixed together, simulate variant-allele frequencies of a genomic sample exhibiting characteristics of somatic mosaicism.
- the dual -variant-type call recalibration system 106 estimates the variant-allele frequencies of different subset mixtures (or percentage mixtures) from truth set bases of Platinum Genomes for the first genome sample 520a and the second genome sample 520b.
- FIG. 5B illustrates an example of the dual -variant-type call recalibration system 106 determining subsets of sample nucleotide sequences for one such admixture genome and determining corresponding variant allele frequencies. As depicted in FIG. 5B, the dual-variant-type call recalibration system 106 determines the variant-allele frequencies for SNPs of both heterozygous and homozygous alleles for an admixture genome.
- the dual-variant-type call recalibration system 106 determines or predicts the relevant variant allele frequencies by referencing the truth set bases of the first genome sample 520a (e.g., NA12877) and the second genome sample 520b (e.g., NA12878) from Platinum Genomes.
- While FIG. 5B depicts variant allele frequencies for SNPs, the dual-variant-type call recalibration system 106 can determine admixture genomes and variant allele frequencies for other specific variant types, such as insertions, deletions, or structural variants. Also, the dual-variant-type call recalibration system 106 can determine admixture genomes utilizing nucleotide reads from sample library fragments generated by various means, such as nucleotide reads generated using or not using PCR techniques.
- As shown in an allele-frequency table 524 presented in FIG. 5B, the dual-variant-type call recalibration system 106 determines that unique homozygous alleles and unique heterozygous alleles from the second genome sample 520b occur at variant allele frequencies of 0.4 and 0.2, respectively, in the admixture genome. As further shown, the dual-variant-type call recalibration system 106 determines that unique homozygous alleles and unique heterozygous alleles from the first genome sample 520a occur at variant allele frequencies of 0.6 and 0.3, respectively, in the admixture genome.
- the dual-variant-type call recalibration system 106 determines that common alleles present in the 60%-and-40% admixture genome as homozygous-homozygous combinations, heterozygous-homozygous combinations, homozygous- heterozygous combinations, and heterozygous-heterozygous combinations — according to the corresponding allele zygosities in the second genome sample 520b and the first genome sample 520a — occur at variant allele frequencies of 1.0, 0.8, 0.7 and 0.5, respectively.
- the dual-variant-type call recalibration system 106 can determine variant allele frequencies from truth set bases of various combinations (and percentages) of genome samples in a given admixture genome. In addition to the variant allele frequencies present in the 60%-and-40% admixture genome depicted in FIG. 5B, in some embodiments, the dual-variant-type call recalibration system 106 determines variant allele frequencies for other possible admixture genomes to simulate a genomic sample with somatic mosaicism.
- the dual-variant-type call recalibration system 106 determines that 30% of sample nucleotide sequences from the first genome sample 520a and 70% of sample nucleotide sequences from the second genome sample 520b would produce unique homozygous alleles from the second genome sample 520b and from the first genome sample 520a at variant allele frequencies of 0.7 and 0.3, respectively, as well as unique heterozygous alleles from the second genome sample 520b and from the first genome sample 520a at variant allele frequencies of 0.35 and 0.15, respectively.
- the dual-variant-type call recalibration system 106 determines or predicts that common alleles present in such a 30%-and-70% admixture genome as homozygous-homozygous combinations, heterozygous-homozygous combinations, homozygous-heterozygous combinations, and heterozygous-heterozygous combinations, according to the corresponding allele zygosities in the first genome sample 520a and the second genome sample 520b under the same 30% and 70% admixture, would produce variant allele frequencies of 1.0, 0.85, 0.65 and 0.5, respectively.
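- As a minimal sketch of the arithmetic behind such an allele-frequency table, the following Python snippet computes the expected variant allele frequency of an allele in a two-sample admixture. The function name `admixture_vaf`, the zygosity labels, and the assumption that reads are drawn from each diploid sample in proportion to its mixing fraction with no allelic bias are illustrative and are not taken from the disclosure.

```python
# Fraction of a diploid sample's reads expected to carry the variant allele,
# keyed by that sample's zygosity at the site (illustrative labels).
ALT_FRACTION = {"absent": 0.0, "het": 0.5, "hom": 1.0}


def admixture_vaf(frac_first, zyg_first, zyg_second):
    """Expected VAF when `frac_first` of the reads come from the first sample
    and the remainder come from the second sample."""
    if not 0.0 <= frac_first <= 1.0:
        raise ValueError("frac_first must lie between 0 and 1")
    frac_second = 1.0 - frac_first
    return (frac_first * ALT_FRACTION[zyg_first]
            + frac_second * ALT_FRACTION[zyg_second])


if __name__ == "__main__":
    # 60% of reads from the first sample, 40% from the second.
    for zyg_first in ("absent", "het", "hom"):
        for zyg_second in ("absent", "het", "hom"):
            vaf = admixture_vaf(0.6, zyg_first, zyg_second)
            print(f"first={zyg_first:6s} second={zyg_second:6s} VAF={vaf:.2f}")
```

- With a 60% share for the first sample, the sketch reproduces the frequencies listed for the 60%-and-40% admixture genome, for example 0.8 for an allele that is homozygous in the first sample and heterozygous in the second.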
- the dual-variant-type call recalibration system 106 determines variant allele frequencies from combinations of different sample genomes to identify a suitable admixture genome simulating a genomic sample with somatic mosaicism.
- the dual-variant-type call recalibration system 106 can select the admixture genome that more closely (or most closely) simulates the variant allele frequencies of a target somatic mosaicism (e.g., a somatic mosaic variant in a genomic region of interest) and use data from such a simulated genomic sample as ground truth data for training a variant-call-recalibration machine-learning model.
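- One way to picture this selection step is a nearest-match search over candidate mixing fractions. The sketch below is hypothetical: the function name `closest_mixing_fraction`, the candidate list, and the use of absolute difference as the closeness measure are assumptions for illustration, and the disclosure does not specify how closeness is measured.

```python
def closest_mixing_fraction(target_vaf, candidate_fractions, alt_fraction=0.5):
    """Return the candidate share of the minor sample whose unique alleles land
    nearest `target_vaf`.

    `alt_fraction` is 0.5 for alleles heterozygous in the minor sample and
    1.0 for alleles homozygous in it (diploid samples assumed).
    """
    if not candidate_fractions:
        raise ValueError("at least one candidate fraction is required")
    return min(candidate_fractions,
               key=lambda frac: abs(frac * alt_fraction - target_vaf))


if __name__ == "__main__":
    candidates = [0.1, 0.2, 0.3, 0.4]
    # A heterozygous allele unique to the minor sample should appear at ~15% VAF.
    print(closest_mixing_fraction(0.15, candidates))
```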
- the dual-variant-type call recalibration system 106 selects an admixture genome that includes somatic mosaic variants at relatively lower depths and in particular exome regions, analogous to the synthetic nucleotide reads 506 within the genome sample 502.
- the dual-variant-type call recalibration system 106 can generate training data implementing various other features for intelligently training the variant-call-recalibration machine-learning model.
- the dual-variant-type call recalibration system 106 can generate training data (e.g., according to the methods described above in relation to FIGS. 5A and 5B) to include simulated read data of varying depth, read data with variants mimicking somatic mosaic variants in exome regions, and so forth.
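- To make the idea of read data with variants mimicking somatic mosaic variants concrete, the following sketch plants an alternate base into a chosen fraction of reads. It is an illustration only: the function name `spike_in_variant`, the fixed read offset, and the simplification that every read covers the site at the same offset are assumptions, not details from the disclosure.

```python
import random


def spike_in_variant(reads, offset, alt_base, vaf, seed=0):
    """Return copies of `reads` in which round(vaf * len(reads)) randomly
    chosen reads carry `alt_base` at position `offset`."""
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must lie between 0 and 1")
    rng = random.Random(seed)  # seeded so the simulated data are reproducible
    n_alt = round(vaf * len(reads))
    chosen = set(rng.sample(range(len(reads)), n_alt))
    return [
        read[:offset] + alt_base + read[offset + 1:] if index in chosen else read
        for index, read in enumerate(reads)
    ]


if __name__ == "__main__":
    pileup = ["ACGTACGT"] * 20              # 20 reads covering the same site
    modified = spike_in_variant(pileup, offset=3, alt_base="A", vaf=0.10)
    print(sum(read[3] == "A" for read in modified), "of", len(modified), "reads altered")
```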
- the dual-variant-type call recalibration system 106 trains or tunes a variant-call-recalibration machine-learning model (e.g., the variant- call-recalibration machine-learning model 412 or one or more sub-models of the variant-call- recalibration machine-learning model 422).
- the dual-variant-type call recalibration system 106 utilizes an iterative training process to fit a variant-call-recalibration machine-learning model by adjusting or adding decision trees or learning parameters that result in accurate genotype probabilities (e.g., genotype probabilities 414, 430a, or 430b).
- FIG. 6 illustrates the dual-variant-type call recalibration system 106 training a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
- the dual-variant-type call recalibration system 106 accesses modified sample sequencing data 603, determines or extracts sample sequencing metrics 604 from the modified sample sequencing data 603, and receives or obtains some metrics (e.g., externally sourced metrics) from a database 602 (e.g., the database 116, the sequencing information database 314, or the sequencing information database 402).
- the dual-variant-type call recalibration system 106 can access the modified sample sequencing data 603 in the form of the modified sample sequencing data 508 generated in FIG. 5A or in the form of simulated admixture data generated in FIG. 5B.
- the dual-variant-type call recalibration system 106 determines or extracts the sample sequencing metrics 604 as part of training a variant-call-recalibration machine-learning model 606. For example, the dual-variant-type call recalibration system 106 determines or extracts sample sequencing metrics in the form of sample read-based metrics, sample externally sourced sequencing metrics, and sample call-model-generated sequencing metrics.
- the modified sample sequencing data 603 has corresponding ground truth data 616 indicating ground truth genotype calls corresponding to somatic mosaic variants and germline variants.
- the ground truth data 616 also includes various ground truth metrics that result from the set of sample sequencing metrics 604.
- the dual-variant-type call recalibration system 106 also accesses or extracts sample sequencing metrics from genomic data comprising ground truth germline variants.
- the dual-variant-type call recalibration system 106 utilizes ground truth data from a training dataset from the U.S. Food and Drug Administration, called the PrecisionFDA dataset, for ground truth data comprising germline variants alongside ground truth data corresponding to synthesized somatic mosaic variants (e.g., as described above in relation to FIG. 5A or 5B).
- the dual-variant-type call recalibration system 106 generates predicted genotype probabilities 608 based on the determined or extracted sample sequencing metrics 604. Specifically, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 606 to generate the predicted genotype probabilities 608. In some embodiments, the variant-call-recalibration machine-learning model 606 generates a set of three predicted genotype probabilities 608, as described above (e.g., probabilities for homozygous reference calls, homozygous variant calls, or heterozygous calls at a given genomic coordinate). The predicted genotype probabilities 608 can accordingly take the form of any of the variant-call classifications described above.
- the dual-variant-type call recalibration system 106 determines one or more predicted genotype calls 610 and, in some implementations, data field entries corresponding to predicted genotype calls 610. As indicated above, the dual-variant-type call recalibration system 106 can utilize (i) existing genotype calls generated by a call generation model and included with the modified sample sequencing data 603 and (ii) the variant-call-recalibration machine-learning model 606 to modify data fields corresponding to a variant call file (e.g., data fields corresponding to initial genotype calls of the modified sample sequencing data 603).
- Such modified or recalibrated values are output by, for example, the variant-call-recalibration machine-learning model 606.
- the dual-variant-type call recalibration system 106 determines recalibrated values for particular metrics corresponding to the predicted genotype calls 610, including a base-call-quality metric (QUAL), a genotype metric (GT), a genotype-quality metric (GQ), allele frequency (AF), allele count (AC), total number of alleles (AN), and so forth.
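- As a hedged illustration of how three genotype probabilities could be turned into recalibrated GT, GQ, and QUAL entries, the sketch below applies the Phred scaling conventionally used in variant call files. The function names, the cap of 100, and the choice to derive QUAL from the homozygous-reference probability are assumptions made for this example and are not stated in the disclosure.

```python
import math

GENOTYPES = ("0/0", "0/1", "1/1")  # homozygous reference, heterozygous, homozygous variant


def phred(p_error, max_q=100.0):
    """Phred-scale an error probability, capped so that p_error == 0 stays finite."""
    if p_error <= 0.0:
        return max_q
    return min(max_q, -10.0 * math.log10(p_error))


def recalibrated_fields(p_hom_ref, p_het, p_hom_alt):
    """Derive illustrative GT, GQ, and QUAL values from three genotype probabilities."""
    probs = (p_hom_ref, p_het, p_hom_alt)
    total = sum(probs)
    probs = tuple(p / total for p in probs)          # renormalise defensively
    best = max(range(3), key=lambda i: probs[i])     # most probable genotype
    return {
        "GT": GENOTYPES[best],
        "GQ": round(phred(1.0 - probs[best])),       # confidence in the chosen genotype
        "QUAL": round(phred(probs[0]), 2),           # confidence that the site is variant
    }


if __name__ == "__main__":
    print(recalibrated_fields(0.001, 0.989, 0.010))
```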
- the dual-variant-type call recalibration system 106 performs a comparison 612. Specifically, the dual-variant-type call recalibration system 106 performs the comparison 612 between (i) predicted genotype calls 610 and/or corresponding data fields output by the variant-call-recalibration machine-learning model 606 and (ii) genotype calls and/or corresponding data fields in the ground truth data 616. In some embodiments, the dual-variant-type call recalibration system 106 utilizes a loss function 614 to compare genotype calls and/or corresponding data fields (e.g., to determine an error or a measure of loss between them).
- the dual-variant-type call recalibration system 106 utilizes a mean squared error loss function (e.g., for regression) and/or a logarithmic loss function (e.g., for classification) as the loss function 614.
- the dual-variant-type call recalibration system 106 can utilize a cross entropy loss function, an L1 loss function, or a mean squared error loss function as the loss function 614.
- the dual-variant-type call recalibration system 106 utilizes the loss function 614 to determine a difference between predicted genotype calls and/or corresponding data fields and the ground truth data 616.
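- For reference, the two loss functions named above can be written in a few lines. The sketch below is a generic illustration under the assumption that a ground truth genotype is given as an index into the three predicted probabilities; the function names and the clipping constant are not from the disclosure.

```python
import math


def log_loss(true_index, probs, eps=1e-15):
    """Logarithmic (cross-entropy) loss for one site: the negative log of the
    probability assigned to the ground truth genotype."""
    p = min(max(probs[true_index], eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(p)


def mean_squared_error(predicted, target):
    """Mean squared error between predicted and ground truth numeric fields."""
    if len(predicted) != len(target):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


if __name__ == "__main__":
    # Ground truth is heterozygous (index 1) and the model assigns it 0.8.
    print(round(log_loss(1, [0.1, 0.8, 0.1]), 4))
    print(mean_squared_error([30.0, 20.0], [30.0, 24.0]))
```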
- the dual-variant-type call recalibration system 106 performs model fitting 618.
- the dual-variant-type call recalibration system 106 fits the variant-call-recalibration machine-learning model 606 based on the comparison 612.
- the dual-variant-type call recalibration system 106 performs modifications or adjustments to the variant-call-recalibration machine-learning model 606 to reduce the measure of loss from the loss function 614 for a subsequent training iteration.
- the dual-variant-type call recalibration system 106 trains the variant-call-recalibration machine-learning model 606 on the gradients of the errors determined by the loss function 614. For instance, the dual-variant-type call recalibration system 106 solves a convex optimization problem (e.g., of infinite dimensions) while regularizing the objective to avoid overfitting. In certain implementations, the dual-variant-type call recalibration system 106 scales the gradients to emphasize corrections to under-represented classes (e.g., where there are significantly more true positives than false positive variant calls).
- the dual-variant-type call recalibration system 106 adds a new weak learner (e.g., a new boosted tree) to the variant-call-recalibration machine-learning model 606 for each successive training iteration as part of solving the optimization problem.
- the dual-variant-type call recalibration system 106 finds a feature (e.g., a sequencing metric) that minimizes a loss from the loss function 614 and either adds the feature to the current iteration’s tree or starts to build a new tree with the feature.
- the dual-variant-type call recalibration system 106 trains a logistic regression to learn parameters for generating one or more genotype probabilities and/or other variant call classifications.
- the dual-variant-type call recalibration system 106 further regularizes based on hyperparameters such as the learning rate, stochastic gradient boosting, the number of trees, the tree-depth(s), complexity penalization, and L1/L2 regularization.
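- The boosting loop described above (add one weak learner per iteration, fit it to the gradient of the loss, stop once a loss threshold is met) can be sketched in plain Python. This is a toy illustration and not the disclosed model: it assumes a single numeric feature, depth-one trees (stumps), squared-error loss on 0/1 labels, and hypothetical names such as `fit_stump` and `boost`.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on one feature that best splits the residuals
    (least squares) and return (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # constant feature: fall back to a single leaf
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


def boost(xs, ys, n_rounds=50, learning_rate=0.3, loss_threshold=1e-3):
    """Add one stump per round until the mean squared error meets the threshold."""
    model = (sum(ys) / len(ys), learning_rate, [])
    history = []
    for _ in range(n_rounds):
        preds = [predict(model, x) for x in xs]
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
        history.append(loss)
        if loss <= loss_threshold:
            break
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        model[2].append(fit_stump(xs, residuals))
    return model, history


if __name__ == "__main__":
    vaf = [0.05, 0.10, 0.15, 0.45, 0.50, 0.55]   # hypothetical sequencing metric
    is_mosaic = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # hypothetical ground truth labels
    model, history = boost(vaf, is_mosaic)
    print(len(model[2]), "stumps; final loss", round(history[-1], 5))
```

- In this toy setup the learning rate and the number of rounds play the regularizing role described above: a smaller learning rate needs more stumps to reach the same loss.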
- the dual-variant-type call recalibration system 106 performs the model fitting 618 by modifying internal parameters (e.g., weights) of the variant-call-recalibration machine-learning model 606 to reduce the measure of loss for the loss function 614.
- the dual-variant-type call recalibration system 106 modifies how the variant-call-recalibration machine-learning model 606 analyzes and passes data between layers and neurons by modifying the internal network parameters.
- the dual-variant-type call recalibration system 106 improves the accuracy of the variant-call-recalibration machine-learning model 606.
- the dual-variant-type call recalibration system 106 repeats the training process illustrated in FIG. 6 for multiple iterations. For example, the dual-variant-type call recalibration system 106 repeats the iterative training by selecting a new set of sequencing metrics for each genotype call along with a corresponding ground truth genotype call in corresponding ground truth data. The dual-variant-type call recalibration system 106 further generates a new set of predicted genotype probabilities for each iteration.
- the dual-variant-type call recalibration system 106 also compares genotype calls and/or corresponding data fields at each iteration with the corresponding genotype calls and/or data fields from the corresponding ground truth data and further performs model fitting 618.
- the dual-variant-type call recalibration system 106 repeats this process until the variant-call-recalibration machine-learning model 606 generates predicted genotype probabilities that result in variant calls that satisfy a threshold measure of loss.
- the dual-variant-type call recalibration system 106 provides improvements in flexibility and accuracy over existing systems.
- the dual-variant-type call recalibration system 106 provides the flexibility of calling variants corresponding to germline variants and somatic mosaic variants while identifying somatic mosaic variants with increased accuracy.
- FIGS. 7A-7F show experimental results of the dual-variant-type call recalibration system 106 in identifying somatic mosaic variants within sample genomic sequences.
- FIGS. 7A-7B show graphs illustrating experimental results of utilizing the dual-variant-type call recalibration system 106 to identify mosaic variants within two modified whole genome sequence (WGS) PrecisionFDA datasets (specifically, HG002 and HG003) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies.
- the two modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7B.
- the dual -variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets.
- FIGS. 7C-7D show graphs illustrating experimental results of utilizing the dual -variant-type call recalibration system 106 to identify mosaic variants within four modified whole exome sequence (WES) PrecisionFDA datasets (specifically, HG002 from four different exome libraries) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies.
- the four modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7D.
- the dual -variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets, with additional improvements in accuracy when analyzing WES sequences (in comparison with WGS sequences as shown in FIG. 7A).
- FIGS. 7E-7F show graphs illustrating experimental results of utilizing the dual -variant-type call recalibration system 106 to identify mosaic variants within four additional modified whole exome sequence (WES) PrecisionFDA datasets (specifically, HG003 from four different exome libraries) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies.
- the four modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7F. Indeed, as shown in FIG.
- the dual -variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets, with additional improvements in accuracy when analyzing WES sequences (in comparison with WGS sequences as shown in FIG. 7A).
- the table below illustrates numerical results corresponding to the results shown in FIGS. 7C-7F.
- the dual-variant-type call recalibration system 106 improves the computing efficiency with which somatic mosaic variants are identified within a genomic sequence.
- the dual-variant-type call recalibration system 106 generates accurate variant calls corresponding to somatic mosaic variants with increased speed and requiring fewer computational resources compared to existing sequencing systems.
- FIG. 8 shows experimental results of the dual-variant-type call recalibration system 106 identifying variants within a genomic dataset comprising mosaic variants of various variant allele frequencies (VAF).
- FIG. 8 includes a bar graph 800 illustrating a number of variants within a genomic dataset (indicated as “SetA M3 -12”) with variant allele frequencies between 4% and 32%.
- the genomic dataset comprises approximately 6,000 variants with a corresponding allele frequency of 0.04 (4%), approximately 350 variants with a corresponding allele frequency of 0.08 (8%), approximately 2,750 variants with a corresponding allele frequency of 0.096 (9.6%), approximately 2,100 variants with a corresponding allele frequency of 0.16 (16%), approximately 100 variants with a corresponding allele frequency of 0.192 (19.2%), and approximately 150 variants with a corresponding allele frequency of 0.32 (32%).
- the variants represented by the bar graph 800 accordingly exhibit relatively low variant allele frequencies consistent with (or that mimic) those of somatic mosaic variants.
- FIG. 8 includes a table 810 of experimental results for run time required by the dual-variant-type call recalibration system 106, in comparison with two existing deep-learning-based sequencing systems (indicated as “Prior System A” and “Prior System B”), to identify variants within the genomic dataset represented by the bar graph 800.
- the run-time results for the dual -variant-type call recalibration system 106 include time for mapping and alignment of reads from the genomic dataset, as well as variant calling of variants, whereas the run-time results provided for the two existing sequencing systems only include the computation time utilized for mosaic variant calling.
- each of the three listed sequencing systems utilized the same read alignment method and candidate genomic coordinates for the genomic dataset to determine their respective variant calls.
- the table 810 indicates the computation hardware upon which each respective sequencing system was implemented.
- the dual-variant-type call recalibration system 106 determines variant calls for the provided dataset within a significantly reduced computational run-time compared to the existing deep-learning-based sequencing systems.
- Because the run-time results for the dual-variant-type call recalibration system 106 also include the time for read mapping and read alignment, whereas the run-time results for the two existing sequencing systems exclude such time, the run time of approximately 0.3 hours in the table 810 understates the superior speed with which the dual-variant-type call recalibration system 106 determines somatic mosaic variants relative to the 5.5 hours and 12.8 hours consumed by the existing sequencing systems to determine somatic mosaic variants.
- the extensive statistical data analysis performed by such existing deep-learning-based sequencing systems requires excessive computation time relative to the dual-variant-type call recalibration system 106.
- FIG. 9 illustrates an example flowchart of a series of acts of generating variant calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments.
- While FIG. 9 illustrates acts according to one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 9.
- the acts of FIG. 9 can be performed as part of a method.
- a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 9.
- a system can comprise at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 9.
- the series of acts 900 includes an act 902 of determining sequencing metrics for nucleotide reads, an act 904 of generating genotype probabilities for variants corresponding to candidate germline variants and candidate somatic mosaic variants, and an act 906 of generating a first variant call corresponding to a germline variant and a second variant call corresponding to a somatic mosaic variant.
- the series of acts 900 can include acts to perform any of the operations described in the following clauses:
- CLAUSE 1 A method comprising: determining sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generating, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generating, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
- CLAUSE 2 The method of clause 1, further comprising: generating, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generating, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
- CLAUSE 3 The method of any of clauses 1-2, further comprising generating, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
- CLAUSE 4 The method of any of clauses 1-3, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
- CLAUSE 6 The method of any of clauses 1-5, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
- CLAUSE 7 The method of any of clauses 1-6, further comprising: generating, utilizing a germline-variant-call-recalibration machine-learning model and based on the sequencing metrics, additional genotype probabilities for germline variants within the genomic regions corresponding to the candidate germline variants; and generating, for the genomic regions and based on the additional genotype probabilities, one or more additional candidate variant calls corresponding to one or more germline variants in the genomic sample.
- CLAUSE 8 The method of any of clauses 1-7, further comprising: comparing, for a genomic coordinate for the second variant call, a genotype probability generated by the variant-call-recalibration machine-learning model with an additional genotype probability generated by the germline-variant-call-recalibration machine-learning model; and identifying the second variant call as a somatic mosaic variant based on a comparison of the genotype probability and the additional genotype probability.
- CLAUSE 9 The method of any of clauses 1-8, wherein the variant-call-recalibration machine-learning model comprises a first machine-learning sub-model configured to generate a first type of genotype probabilities accounting for a set of candidate germline variants and a second machine-learning sub-model configured to generate a second type of genotype probabilities accounting for a set of candidate somatic mosaic variants.
- CLAUSE 10 The method of any of clauses 1-9, further comprising: accessing, based on user input, sequencing data comprising sample nucleotide reads and synthetic nucleotide reads comprising modified nucleobases representing ground-truth somatic mosaic variants; determining the sequencing metrics for the sequencing data by determining sample-read-based sequencing metrics for the sample nucleotide reads and synthetic-read-based sequencing metrics for the synthetic nucleotide reads; and training the variant-call-recalibration machine-learning model to generate, based on the sample-read-based sequencing metrics and the synthetic-read-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants based on comparisons of variant calls and the ground-truth somatic mosaic variants.
- CLAUSE 11 The method of any of clauses 1-10, further comprising causing the system to generate the synthetic nucleotide reads by modifying existing nucleotide reads to include the ground-truth somatic mosaic variants at one or more variant allele frequencies representative of one or more somatic mosaic variants.
- CLAUSE 12 The method of any of clauses 1-11, further comprising: identifying an admixture of genomic samples that simulates variant-allele frequencies of ground-truth somatic mosaic variants and ground-truth germline variants; accessing a mixture of nucleotide reads comprising a first set of nucleotide reads from a first genomic sample of the admixture of genomic samples and a second set of nucleotide reads from a second genomic sample of the admixture of genomic samples; determining the sequencing metrics for the nucleotide reads by determining admixture-based sequencing metrics for the mixture of nucleotide reads; and training the variant-call-recalibration machine-learning model to generate, based on the admixture-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants and germline variants based on comparisons of predicted variant calls with the ground-truth somatic mosaic variants and the ground-truth germline variants.
- CLAUSE 13 The method of any of clauses 1-12, further comprising: receiving an indication of a user selection of a variant-sensitivity option corresponding to detection of the candidate somatic mosaic variants; and executing the variant-call-recalibration machine-learning model to generate the genotype probabilities instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
- CLAUSE 14 The method of any of clauses 1-13, wherein the variant-call-recalibration machine-learning model comprises one or more of a gradient boost decision tree or a random forest model.
- The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another, are particularly applicable.
- the process to determine the nucleotide sequence of a target nucleic acid i.e., a nucleic acid polymer
- Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
- SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand.
- a single nucleotide monomer may be provided to a target nucleic acid in the presence of a polymerase in each delivery.
- more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
- SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties.
- Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below.
- the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery.
- the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
- SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like.
- the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used.
- the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by
- Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P.
- the nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array.
- An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images.
- the images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
- cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
- This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference.
- the availability of fluorescently labeled terminators in which the termination can be reversed and the fluorescent label can be cleaved facilitates efficient cyclic reversible termination (CRT) sequencing.
- Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
- the labels do not substantially inhibit extension under SBS reaction conditions.
- the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features.
- each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step.
- each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
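As an illustration of the four-image scheme described above, the following minimal sketch picks, for one feature in one cycle, the detection channel with the strongest signal. The channel-to-nucleotide order, the function names, and the `min_signal` background cutoff are assumptions made for this example; they are not taken from the disclosure or from any instrument software.

```python
# Hypothetical channel order: index i of an intensity tuple corresponds to
# the image taken with the detection channel selective for CHANNEL_TO_BASE[i].
CHANNEL_TO_BASE = ("A", "C", "G", "T")


def call_base_four_channel(intensities, min_signal=0.0):
    """Call one base for one feature from four per-channel intensities.

    Returns "N" when no channel exceeds the (assumed) background cutoff.
    """
    if len(intensities) != len(CHANNEL_TO_BASE):
        raise ValueError("expected one intensity per detection channel")
    # Index of the brightest channel for this feature in this cycle.
    best = max(range(len(intensities)), key=lambda i: intensities[i])
    if intensities[best] <= min_signal:
        return "N"
    return CHANNEL_TO_BASE[best]


def call_read(cycles, min_signal=0.0):
    """Concatenate per-cycle calls for a single feature.

    The feature's position does not change between images, so the same
    feature can be followed from cycle to cycle.
    """
    return "".join(call_base_four_channel(c, min_signal) for c in cycles)
```

Because the relative position of each feature is unchanged across images, a per-feature list of intensity tuples (one tuple per cycle) is enough input for this sketch.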
- nucleotide monomers can include reversible terminators.
- reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference).
- Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety).
- Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst.
- the fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light.
- disulfide reduction or photocleavage can be used as a cleavable linker.
- Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP.
- the presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance.
- Some embodiments can utilize detection of four different nucleotides using fewer than four different labels.
- SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
- a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
- nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal.
- one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels.
- An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g.
- sequencing data can be obtained using a single channel.
- the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated.
- the third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
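The single-channel scheme described in the preceding items can be read as a two-bit code: each nucleotide type is identified by whether signal is present in the first image and in the second image. The sketch below is an assumption-laden illustration only; the threshold value and all names are hypothetical, and the nucleotide types are left as "first" through "fourth" because no base assignment is given here.

```python
# (signal in first image, signal in second image) -> nucleotide type.
# first:  labeled, but the label is removed after the first image
# second: labeled only after the first image is generated
# third:  retains its label in both images
# fourth: unlabeled in both images
SIGNAL_PATTERN_TO_TYPE = {
    (True, False): "first",
    (False, True): "second",
    (True, True): "third",
    (False, False): "fourth",
}


def decode_single_channel(image1_intensity, image2_intensity, threshold=0.5):
    """Decode one feature from its intensity in two images of one channel.

    `threshold` is a hypothetical cutoff separating signal from background.
    """
    pattern = (image1_intensity > threshold, image2_intensity > threshold)
    return SIGNAL_PATTERN_TO_TYPE[pattern]
```

Two images with an on/off signal give four distinguishable patterns, which is why one channel can suffice for four nucleotide types.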
- Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
- the oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize.
- images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images.
- Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope." Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties).
- the target nucleic acid passes through a nanopore.
- the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
- each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
- Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
- Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
- the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al.
- Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product.
- sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference.
- Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
- the above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously.
- different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner.
- the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner.
- the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface.
- the array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
- the methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
- an advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above.
- an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like.
- a flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No.
- one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method.
- one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above.
- an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods.
- Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
- the term "sample," and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target.
- the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids.
- the sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids.
- the term also includes any isolated nucleic acid sample such as genomic DNA, or a fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen.
- the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA.
- the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
- the nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA).
- the sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples.
- low molecular weight material includes enzymatically or mechanically fragmented DNA.
- the sample can include cell-free circulating DNA.
- the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples.
- the sample can be an epidemiological, agricultural, forensic or pathogenic sample.
- the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source.
- the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus.
- the source of the nucleic acid molecules may be an archived or extinct sample or species.
- forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel.
- the nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids.
- the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA.
- target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum.
- target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim.
- nucleic acids including one or more target sequences can be obtained from a deceased animal or human.
- target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA.
- target sequences or amplified target sequences are directed to purposes of human identification.
- the disclosure relates generally to methods for identifying characteristics of a forensic sample.
- the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein.
- a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
- the components of the dual-variant-type call recalibration system 106 can include software, hardware, or both.
- the components of the dual-variant-type call recalibration system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 108). When executed by the one or more processors, the computer-executable instructions of the dual-variant-type call recalibration system 106 can cause the computing devices to perform the methods described herein.
- the components of the dual-variant-type call recalibration system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions.
- the components of the dual-variant-type call recalibration system 106 can include a combination of computer-executable instructions and hardware.
- components of the dual-variant-type call recalibration system 106 performing the functions described herein with respect to the dual-variant-type call recalibration system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model.
- components of the dual-variant-type call recalibration system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device.
- the components of the dual-variant-type call recalibration system 106 may be implemented in any application that provides sequencing services including, but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
- Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below.
- Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
- one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein).
- Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
- Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices).
- Computer-readable media that carry computer-executable instructions are transmission media.
- embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
- Non-transitory computer-readable storage media includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
- a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
- a network or another communications connection can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
- program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa).
- computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system.
- non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
- Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure.
- the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
- the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
- the disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks.
- program modules may be located in both local and remote memory storage devices.
- Embodiments of the present disclosure can also be implemented in cloud computing environments.
- “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources.
- cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources.
- the shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
- a cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth.
- a cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
- a cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
- a “cloud-computing environment” is an environment in which cloud computing is employed.
- FIG. 10 illustrates a block diagram of a computing device 1000 that may be configured to perform one or more of the processes described above.
- the computing device 1000 may implement the dual-variant-type call recalibration system 106 and the sequencing system 104.
- the computing device 1000 can comprise a processor 1002, a memory 1004, a storage device 1006, an I/O interface 1008, and a communication interface 1010, which may be communicatively coupled by way of a communication infrastructure 1012.
- the computing device 1000 can include fewer or more components than those shown in FIG. 10. The following paragraphs describe components of the computing device 1000 shown in FIG. 10 in additional detail.
- the processor 1002 includes hardware for executing instructions, such as those making up a computer program.
- the processor 1002 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1004, or the storage device 1006 and decode and execute them.
- the memory 1004 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s).
- the storage device 1006 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
- the I/O interface 1008 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1000.
- the I/O interface 1008 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces.
- the I/O interface 1008 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
- the I/O interface 1008 is configured to provide graphical data to a display for presentation to a user.
- the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
- the communication interface 1010 can include hardware, software, or both. In any event, the communication interface 1010 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1000 and one or more other computing devices or networks.
- the communication interface 1010 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
- the communication interface 1010 may facilitate communications with various types of wired or wireless networks.
- the communication interface 1010 may also facilitate communications using various communication protocols.
- the communication infrastructure 1012 may also include hardware, software, or both that couples components of the computing device 1000 to each other.
- the communication interface 1010 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein.
- the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
Abstract
This disclosure describes methods, non-transitory computer readable media, and systems that can utilize a machine-learning model to recalibrate genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants. For instance, based on sequencing metrics for nucleotide reads of a genomic sample, the disclosed systems can utilize a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. Further, the disclosed systems can generate genotype calls, such as variant calls corresponding to somatic mosaic variants, based on the generated genotype probabilities.
Description
MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/511,605, entitled, “MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS,” filed on June 30, 2023, and U.S. Provisional Patent Application No. 63/607,446, entitled, “MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS,” filed on December 7, 2023. Each of the aforementioned applications is hereby incorporated by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining germline variant calls for genomic samples. For instance, some existing nucleobase sequencing platforms determine individual nucleobases within sequences from germ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods. When using SBS, existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset. For instance, a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls. After capturing such images, existing SBS platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify germline variants within the germline cells of a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), and/or other structural variants, and genotype calls.
[0003] Despite these recent advances in sequencing and germline variant calling, existing nucleobase sequencing platforms and sequencing data analysis software (together and hereinafter, existing sequencing systems) often (a) limit variant calling to germline variants only and/or (b) cannot accurately detect both somatic mosaic variant calls and germline variant calls. For example, some existing systems utilize extensive statistical data analysis, such as Bayesian probabilistic modeling, to implement computational tools (e.g., Java-based tools) for identifying somatic mosaic and germline variant calls within existing sequence data. But such Bayesian-based systems require significant computation time, processing, and resources and can often result in multiple false positives in identifying somatic mosaic variants. Such limits and shortcomings also apply to state-of-the-art machine-learning-based sequencing systems. In both machine-learning-based and statistical or probabilistic models, existing sequencing systems exhibit the technical limits of (a) and (b) in part due to the nature of somatic mosaic variants. Germline variants of a genomic sample are inherited from the sample’s parents by the time of zygote formation and are present in the sample’s germ cells. By contrast, somatic mosaic variants typically constitute mutations that (i) were introduced after zygote formation during cell development (e.g., in 1 of 4 early cells), (ii) were not inherited from the given sample’s parents, and (iii) were not introduced by a form of cancer or tumor in the given sample. Consequently, a relatively small proportion of a given sample’s cells include such somatic mosaic variants. Depending on when in development or in which cell type a somatic mosaic variant has been introduced, the variant allele fraction of a somatic mosaic variant in a given sample’s cells can range from 10-50% to much smaller percentages, such as 0.1%.
[0004] In addition to the relatively low variant allele fraction of somatic mosaic variants, existing sequencing systems often lack computational models (or other mechanisms) for filtering noise during DNA sequencing. Consequently, as indicated above, existing sequencing systems cannot accurately determine both somatic mosaic variant calls and germline variant calls for a given sample. For instance, existing sequencing systems often determine false-positive somatic mosaic variant calls based on various noise sources common in DNA sequencing, such as sequence specific errors (SSEs) induced by one or more of inverted repeats, homopolymers, nucleotide context; uneven read depth or coverage across genomic regions of a reference genome, where certain genomic regions comprising somatic mosaic variants may lack read coverage (e.g., below 10X or 20X); sequencing platform-specific errors induced by, for example, barcode swapping or allele capture bias against somatic mosaic variants; DNA sample contamination; misclassification of germline variants versus somatic mosaic variants induced by differences among corresponding variant allele fractions; nucleotide read mapping errors that obscure or hide somatic mosaic variants because nucleotide reads reflecting such somatic mosaic variants may be incorrectly mapped to the wrong genomic region; polymerase chain reaction (PCR) errors during the process of growing clusters of oligonucleotides; and DNA damage caused by reagents, heat, or other environmental sources.
[0005] Complicating such technical hurdles further, the low allele fractions and sui generis nature of somatic mosaic variants make ground-truth training data for machine-learning models difficult (if not practically impossible) to find from naturally occurring genomic samples. Without such training data from naturally occurring sources, existing sequencing systems have yet to successfully train a machine-learning model to generate outputs with sufficient sensitivity, recall, precision, or other measures of accuracy that facilitate somatic variant calling for clinical use, at least not in combination with germline variant calling.
[0006] In part due to the difficulty of obtaining training data from naturally occurring sources, many existing sequencing systems leverage only limited sets of data in determining germline variant calls, including some machine-learning-based variant callers. For instance, existing sequencing systems frequently rely exclusively on information extracted directly from nucleotide reads of a sample, such as read depth, mismatch counts, sequence alignment scores, and mapping quality, to determine germline variant calls. While sequence information from nucleotide reads can provide valuable insight for determining germline variants from a given sample, existing sequencing systems that solely rely on these data can underperform in accurately determining germline variant calls and, relatedly, lack computational tools to also determine somatic mosaic variants. Indeed, some existing sequencing systems that rely on raw sequence data incorrectly determine SNPs, indels, or other variants in a genomic sample sequence in comparison to more complex models.
[0007] Because of relatively low variant allele fractions, noise-induced false positives for somatic mosaic variants, and the dearth of ground-truth training data from naturally occurring variants, among other factors, existing sequencing systems often limit sequencing pipelines or sequencing data analysis software to a single variant type. For instance, one existing sequencing system is configured to determine germline variant calls only and another existing sequencing system is configured to determine somatic mosaic variant calls only. When clinicians, research laboratories, or other parties seek to identify both germline variants and somatic mosaic variants for a given sample, however, such existing sequencing systems limit options to using two separate single-variant-type pipelines or sequencing data analysis software to separately determine somatic mosaic variant calls and germline variant calls for the given sample. Such separate, single-variant-type sequencing unnecessarily consumes more memory, processing, and computational time for both (i) specialized sequencing platforms for primary analysis to determine nucleotide reads for a given sample and (ii) computing devices executing sequencing data analysis software for secondary analysis to determine variant calls for the given sample.
SUMMARY
[0008] This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that can utilize a machine-learning model to recalibrate genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants. As described below, the disclosed systems can utilize one or more machine-learning models to jointly generate genotype probabilities that account for both germline variants and somatic mosaic variants within a genomic sample. For example, the disclosed systems can determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample, generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants, and generate variant calls corresponding to germline variants or somatic mosaic variants.
[0009] To facilitate such variant calling, in some embodiments, the disclosed systems train or utilize a variant-call-recalibration machine-learning model to generate predictions for genotype calls based on training data with known somatic mosaic variants at relatively low allele frequencies. As disclosed below, the systems can utilize various sources of such training data, including training data generated by synthetically modifying existing ground truth sequencing files or by implementing an admixture of germline truth sets to simulate somatic mosaicisms. After training, the disclosed variant-call-recalibration machine-learning model generates genotype probabilities for genotypes, where some of the genotypes for which probabilities are determined include either a germline variant or a somatic mosaic variant. Based on the generated probabilities, the disclosed systems can confirm or change various fields of sequencing information within a genotype-call data file or other output data file.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The detailed description refers to the drawings briefly described below.
[0011] FIG. 1 illustrates a block diagram of a sequencing system including a dual-variant-type call recalibration system in accordance with one or more embodiments.
[0012] FIG. 2 illustrates an overview of the dual-variant-type call recalibration system utilizing a variant-call-recalibration machine-learning model to generate genotype probabilities and, based on such probabilities, generating genotype calls in accordance with one or more embodiments.
[0013] FIGS. 3A-3C illustrate the dual-variant-type call recalibration system determining or identifying sequencing metrics in accordance with one or more embodiments.
[0014] FIG. 4A illustrates the dual-variant-type call recalibration system generating genotype probabilities utilizing a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
[0015] FIG. 4B illustrates the dual-variant-type call recalibration system determining genotype calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments.
[0016] FIGS. 5A-5B illustrate example processes for the dual-variant-type call recalibration system generating modified sample sequencing data with corresponding ground-truth somatic mosaic variants in accordance with one or more embodiments.
[0017] FIG. 6 illustrates an example process for the dual-variant-type call recalibration system training a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
[0018] FIGS. 7A-7F illustrate graphs of experimental results of utilizing the dual-variant-type call recalibration system to identify somatic mosaic variants within modified genomic samples in accordance with one or more embodiments.
[0019] FIG. 8 illustrates further experimental results of utilizing the dual-variant-type call recalibration system relative to existing sequencing systems in identifying somatic mosaic variants within a genomic dataset in accordance with one or more embodiments.
[0020] FIG. 9 illustrates a flowchart of a series of acts for generating variant calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments.
[0021] FIG. 10 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0022] This disclosure describes embodiments of a dual-variant-type call recalibration system that uses machine learning to recalibrate or confirm genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants in a genomic sample. In particular, the disclosed dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model trained to jointly generate genotype probabilities accounting for germline variants and somatic mosaic variants within a genomic sample.
[0023] To facilitate such dual germline and somatic mosaic variant calling, for example, the disclosed dual-variant-type call recalibration system can train a variant-call-recalibration machine-learning model utilizing ground-truth training data generated to include somatic mosaic variants at various allele frequencies, such as synthetically modified existing ground truth sequences or an admixture of germline truth sets simulating mosaic sequences. Accordingly, in some embodiments, the disclosed dual-variant-type call recalibration system trains the variant-call-recalibration machine-learning model to generate more accurate genotype probabilities using ground-truth training data for both germline variants and somatic mosaic variants.
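To illustrate how an admixture of germline truth sets can simulate a somatic mosaic variant at a target allele frequency, the following sketch computes the alternate-allele fraction expected at one locus after mixing reads from two samples. The function name `admixture_vaf`, its parameters, and the assumption of uniform read sampling from diploid samples are hypothetical and are not specified by this disclosure.

```python
def admixture_vaf(minor_read_fraction: float,
                  alt_copies_minor: int,
                  alt_copies_major: int = 0,
                  ploidy: int = 2) -> float:
    """Expected alternate-allele fraction at one locus after mixing two samples.

    minor_read_fraction: share of the mixed reads drawn from the minor sample.
    alt_copies_minor / alt_copies_major: number of alternate alleles each
    sample carries at the locus (0, 1, or 2 for a diploid germline genotype).
    Assumes reads are sampled uniformly from both alleles of each sample.
    """
    if not 0.0 <= minor_read_fraction <= 1.0:
        raise ValueError("minor_read_fraction must be between 0 and 1")
    for copies in (alt_copies_minor, alt_copies_major):
        if not 0 <= copies <= ploidy:
            raise ValueError("alternate allele copies must be between 0 and ploidy")
    major_read_fraction = 1.0 - minor_read_fraction
    return (minor_read_fraction * alt_copies_minor
            + major_read_fraction * alt_copies_major) / ploidy


# A variant heterozygous in the minor sample and absent from the major sample,
# mixed at 10% of reads, would be expected near a 5% allele fraction.
simulated_mosaic_vaf = admixture_vaf(0.10, alt_copies_minor=1)
```

Under these assumptions, varying the mixing fraction provides a range of target allele frequencies while the germline truth sets supply the known variant positions.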
[0024] After training, the disclosed dual-variant-type call recalibration system utilizes a trained variant-call-recalibration machine-learning model to generate genotype probabilities for genotypes at genomic regions of a genomic sample, where at least some of the genotypes for which genotype probabilities are determined include either a germline variant or a somatic mosaic variant. To implement a trained version of the variant-call-recalibration machine-learning model, for instance, the dual-variant-type call recalibration system determines base-call-quality metrics, mapping-quality metrics, and/or other sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample. By feeding such sequencing metrics or re-engineered versions of such sequencing metrics to the variant-call-recalibration machine-learning model, the dual-variant-type call recalibration system executes the machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. As illustrated below, such genotype probabilities can represent probabilities of homozygous reference calls, homozygous variant calls, or heterozygous calls for a given reference nucleobase and corresponding alternate base or bases (e.g., alternate base 1 and alternate base 2). Based on the genotype probabilities, in some cases, the dual-variant-type call recalibration system determines at least one variant call corresponding to a germline variant in the genomic sample and at least one variant call corresponding to a somatic mosaic variant in the genomic sample. While a single, integrated variant-call-recalibration machine-learning model can be used as described by this disclosure, in some embodiments, the dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model comprising two or more machine-learning sub-models to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. For example, the dual-variant-type call recalibration system can utilize a first machine-learning sub-model trained for germline variant identification and a second machine-learning sub-model trained for integrated germline and somatic mosaic variant identification. By utilizing multiple machine-learning models to generate genotype probabilities for a genomic sample, for example, the dual-variant-type call recalibration system can provide improved accuracy in identifying variants within genomic regions corresponding to germline and somatic mosaic variants.
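As a minimal sketch of how genotype probabilities such as those described above might be turned into a genotype call, the following example selects the most probable genotype and reports a Phred-scaled quality. The function name `call_genotype`, the genotype labels, and the quality convention are assumptions borrowed from common VCF practice; the sketch does not represent the disclosed model and does not by itself distinguish germline variants from somatic mosaic variants.

```python
import math
from typing import Dict, Tuple


def call_genotype(genotype_probs: Dict[str, float]) -> Tuple[str, float]:
    """Pick the most probable genotype and a Phred-scaled confidence.

    genotype_probs maps genotype labels (e.g., "0/0", "0/1", "1/1") to
    non-negative scores; they are normalized here so they sum to 1.
    """
    total = sum(genotype_probs.values())
    if total <= 0:
        raise ValueError("genotype probabilities must have a positive sum")
    normalized = {gt: p / total for gt, p in genotype_probs.items()}
    best = max(normalized, key=normalized.get)
    # Probability that the chosen genotype is wrong; the floor avoids log10(0).
    error = max(1.0 - normalized[best], 1e-10)
    quality = -10.0 * math.log10(error)
    return best, quality


# Hypothetical model output for one candidate variant.
genotype, quality = call_genotype({"0/0": 0.01, "0/1": 0.98, "1/1": 0.01})
```

In this hypothetical case, the heterozygous genotype is selected and the residual 2% error probability corresponds to a quality of roughly 17.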
[0025] To facilitate user-friendly implementation of the disclosed model, in some embodiments, the dual-variant-type call recalibration system includes a user-selectable option for implementing detection of candidate somatic mosaic variants in addition to germline variants. For example, upon receiving an indication of user selection of a provided variant-sensitivity option, the dual-variant-type call recalibration system can execute the aforementioned variant-call-recalibration machine-learning model instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
[0026] As suggested above, the dual-variant-type call recalibration system provides several technical advantages, benefits, and/or improvements over existing sequencing systems, including variant callers and other sequencing data analysis software. In some embodiments, for instance, the dual-variant-type call recalibration system increases the flexibility and variant-type breadth with which a sequencing system can determine, modify, or update genotype calls corresponding to germline variants and somatic mosaic variants. As indicated above, many existing machine-learning-based variant callers, for instance, are limited to determining variant calls exclusively for germline variants. Such machine-learning-based variant callers accordingly perform better when facilitating genotype calls exhibiting an allele frequency of approximately 0.5 or 1.0 in a genomic sample. Such machine-learning-based variant callers often miss somatic mosaic variants due to, for example, lack of read coverage, GC bias, sequence-specific errors (SSEs), mapping inaccuracies, and other factors described above. In contrast, in one or more embodiments, the dual-variant-type call recalibration system utilizes a variant-call-recalibration machine-learning model trained to generate genotype probabilities that help distinguish somatic mosaic variants from various sources of noise that currently prevent existing sequencing systems from identifying variants exhibiting relatively low allele frequencies. Indeed, in one or more implementations, the dual-variant-type call recalibration system successfully identifies variants corresponding to somatic mosaic variants within genomic samples, where existing sequencing systems can only accurately identify variants corresponding to germline variants.
[0027] In addition to improved flexibility in determining genotype calls, the dual-variant-type call recalibration system provides improved accuracy over existing sequencing systems, particularly in identifying variants corresponding to somatic mosaic variants within a genomic sample. For example, by training a variant-call-recalibration machine-learning model utilizing ground truth sequencing data that includes ground truth somatic mosaic variants (e.g., synthetic or naturally occurring-derived ground truth) at various target allele frequencies, the dual-variant-type call recalibration system can identify variants within genomic samples that correspond to somatic mosaic variants where existing sequencing systems are generally limited to variants corresponding to germline variants. This disclosure further illustrates such improved accuracy below with respect to at least FIGS. 7A-7F.
[0028] Beyond improved flexibility and accuracy, the dual-variant-type call recalibration system improves the computing efficiency with which a sequencing system identifies somatic mosaic variants within genomic sequences. As indicated above, some existing Bayesian-based sequencing systems are configured to identify somatic mosaic variants within existing sequencing data by extensive statistical data analysis. While some such systems can also determine germline variants, they require significant computation time, processing, memory, and other computational resources and often result in multiple false-positive somatic mosaic variants. Further, existing sequencing systems that employ two separate single-variant-type pipelines or sequencing data analysis software to separately determine somatic mosaic variant calls and germline variant calls for a given sample burn unnecessary memory, processing, and computational time.
[0029] To further illustrate, some existing sequencing systems utilize computationally expensive, slow neural network architectures (e.g., deep learning architectures such as convolutional neural networks) that require many hours (e.g., tens to hundreds of hours) across multiple-core processors to implement for processing read data to generate variant calls for a sample. Such deep learning architectures can further require several days (or weeks) to train. Conversely, in some embodiments, the dual-variant-type call recalibration system utilizes a comparatively lightweight, fast architecture for generating variant calls as described herein. In contrast to the many hours across multiple-core processors required by existing deep-learning-based sequencing systems, the dual-variant-type call recalibration system requires under an hour (for both germline and mosaic variant calling) of runtime (e.g., on a single field programmable gate array and/or a multicore processor) to generate variant calls for a genomic sample (see, e.g., FIG. 8 and the corresponding text). Thus, the dual-variant-type call recalibration system is significantly faster and less computationally expensive than many deep learning approaches to somatic mosaic variant calling. Indeed, not only are the models of the dual-variant-type call recalibration system faster and less computationally expensive to implement, but the disclosed variant-call-recalibration machine-learning models are also much faster and less computationally expensive to train than many existing deep learning systems.
[0030] By utilizing a variant-call-recalibration machine-learning model that facilitates accurately identifying a genomic sample’s variants corresponding to both germline variants and somatic mosaic variants, the dual-variant-type call recalibration system provides improved efficiency compared to existing variant caller systems. As described below, the variant-call-recalibration machine-learning model is trained to generate genotype probabilities that account for both germline variants and somatic mosaic variants. Based on such dual-purpose genotype probabilities, the disclosed dual-variant-type call recalibration system can generate, for genomic regions of a genomic sample, accurate variant calls corresponding to germline variants and variant calls corresponding to somatic mosaic variants.
[0031] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the dual-variant-type call recalibration system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “sample nucleotide sequence” or “sample sequence” refers to a sequence of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a sample nucleotide sequence includes a segment of a nucleic acid polymer that is isolated or extracted from a genomic sample and composed of nitrogenous heterocyclic bases. For example, a sample nucleotide sequence can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. More specifically, in some cases, the sample nucleotide sequence is found in a sample prepared or isolated by a kit and received by a sequencing device.
[0032] Relatedly, as used herein, the term “genomic sample” refers to a target genome or portion of a genome undergoing an assay or sequencing. For example, a genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases. A genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. In some cases, the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
[0033] As further used herein, the term “genotype call” refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus. In particular, a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region. For instance, in some cases, a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0|0 or heterozygous for a variant on a particular strand represented as 0|1). Accordingly, a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base. A genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
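The genotype notation used in this definition can be interpreted programmatically. The following sketch assumes VCF-style genotype strings for a diploid locus, in which allele index 0 denotes the reference base, indices 1 and above denote alternate alleles, and a vertical bar indicates phased alleles while a slash indicates unphased alleles. The function name `describe_genotype` is hypothetical.

```python
import re
from typing import Dict, Union


def describe_genotype(gt: str) -> Dict[str, Union[list, bool, str]]:
    """Summarize a diploid VCF-style genotype string such as "0|1" or "1/2"."""
    alleles = re.split(r"[|/]", gt.strip())
    if len(alleles) != 2 or not all(a.isdigit() for a in alleles):
        raise ValueError(f"expected a diploid genotype like '0|1', got {gt!r}")
    first, second = (int(a) for a in alleles)
    if first == second:
        # Both alleles match: reference (0) or the same alternate allele.
        zygosity = "homozygous reference" if first == 0 else "homozygous variant"
    else:
        zygosity = "heterozygous"
    return {
        "alleles": [first, second],
        "phased": "|" in gt,  # "|" marks phased alleles, "/" unphased
        "zygosity": zygosity,
    }
```

For instance, under these assumptions `describe_genotype("0|0")` reports a homozygous reference genotype and `describe_genotype("0|1")` reports a phased heterozygous genotype.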
[0034] As further used herein, the term “nucleobase call” (or simply “base call”) refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a sample genome. In particular, a nucleobase call can indicate (i) a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls) or (ii) a determination or prediction of the type of nucleobase that is present at a genomic coordinate or region within a genome, including a variant call or a non-variant call in a digital output file. In some cases, for a nucleotide read, a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell). Alternatively, a nucleobase call includes a determination or a prediction of a nucleobase from chromatogram peaks or electrical current changes resulting from nucleotides passing through a nanopore of a nucleotide-sample slide. By contrast, a nucleobase call can also include a final prediction of a nucleobase at a genomic coordinate of a sample genome for a variant call file (VCF) or another base-call-output file, based on nucleotide reads corresponding to the genomic coordinate. Accordingly, a nucleobase call can include a base call corresponding to a genomic coordinate and a reference genome, such as an indication of a variant or a non-variant at a particular location corresponding to the reference genome. Indeed, a nucleobase call can refer to a variant call, including but not limited to, a single nucleotide variant (SNV), an insertion or a deletion (indel), or a base call that is part of a structural variant. As suggested above, a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.
[0035] Relatedly, as used herein, the term “nucleotide read” refers to an inferred sequence of one or more nucleobases (or nucleobase pairs) from all or part of a sample nucleotide sequence (e.g., a sample genomic sequence, complementary DNA). In particular, a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample. For example, in some embodiments, the dual-variant-type call recalibration system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell. In some cases, a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads). In these or other cases, another type of nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
[0036] As noted above, in some embodiments, the dual-variant-type call recalibration system determines sequencing metrics for nucleobase calls of nucleotide reads. As used herein, the term “sequencing metric” refers to a quantitative measurement or score indicating a degree to which an individual nucleobase call (or a sequence of nucleobase calls) aligns, compares, or quantifies with respect to a genomic coordinate or genomic region of a reference genome, with respect to nucleobase calls from nucleotide reads, or with respect to external genomic sequencing or genomic structure. For instance, a sequencing metric includes a quantitative measurement or score indicating a degree to which (i) individual nucleobase calls align, map, or cover a genomic coordinate or reference base of a reference genome; (ii) nucleobase calls compare to reference or alternative nucleotide reads in terms of mapping, mismatch, base call quality, or other raw sequencing metrics; or (iii) genomic coordinates or regions corresponding to nucleobase calls demonstrate mappability, repetitive base call content, DNA structure, or other generalized metrics.
[0037] Along these lines, the dual-variant-type call recalibration system determines various types of sequencing metrics from different sources, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics. As used herein, the term “read-based sequencing metrics” refers to sequencing metrics derived from nucleotide reads of a sample nucleotide sequence. For example, read-based sequencing metrics include sequencing metrics determined by applying statistical tests to detect differences between a reference sequence and nucleotide reads. In some embodiments, read-based sequencing metrics can include a comparative-mapping-quality-distribution metric that indicates a comparison between mapping qualities or a comparative-mismatch-count metric that indicates a comparison between mismatch counts. In some cases, read-based sequencing metrics can correspond to genotype calls generated from different read types, such as assembled nucleotide reads and/or SBS reads.
[0038] By contrast, “externally sourced sequencing metrics” refer to sequencing metrics identified or obtained from one or more external databases. For example, externally sourced sequencing metrics include metrics relating to mappability of nucleotides, replication timing, or DNA structure that are available outside of the dual-variant-type call recalibration system.
[0039] Further, the term “call-model-generated sequencing metrics” refers to internal, model-specific sequencing metrics generated or extracted by a call generation model. For example, call-model-generated sequencing metrics include variant calling sequencing metrics extracted or determined via variant caller components of a call generation model and mapping-and-alignment sequencing metrics extracted or determined via mapping-and-alignment components of a call generation model. As indicated above, call-model-generated sequencing metrics can include alignment metrics that quantify a degree to which nucleotide reads align with genomic coordinates of a reference genome or other example nucleic acid sequence, such as deletion-size metrics or mapping-quality metrics. Further, call-model-generated sequencing metrics can include depth metrics that quantify the depth of nucleobase calls for nucleotide reads at genomic coordinates of a reference genome or other example nucleic acid sequence, such as forward-reverse-depth metrics or normalized-depth metrics. Call-model-generated sequencing metrics can also include call-quality metrics that quantify a quality or accuracy of nucleobase calls, such as nucleobase-call-quality metrics, callability metrics, or somatic-quality metrics.
[0040] As used herein, the term “base-call-quality metric” refers to a specific score or other measurement indicating an accuracy of a nucleobase call. In particular, a base-call-quality metric comprises a value indicating a likelihood that one or more predicted nucleobase calls for a genomic coordinate contain errors. For example, in certain implementations, a base-call-quality metric can comprise a Q score (e.g., a PHil’s Read EDitor (PHRED) quality score) predicting the error probability of any given nucleobase call. To illustrate, a quality score (or Q score) may indicate that a probability of an incorrect nucleobase call at a genomic coordinate is equal to 1 in 100 for a Q20 score, 1 in 1,000 for a Q30 score, 1 in 10,000 for a Q40 score, etc.
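The relationship between a Q score and an error probability follows the standard Phred definition, Q = -10 · log10(P). The following sketch applies that definition; the function names are illustrative and do not appear in this disclosure.

```python
import math


def phred_to_error_probability(q_score: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q_score / 10.0)


def error_probability_to_phred(error_probability: float) -> float:
    """Convert an error probability in (0, 1] to a Phred-scaled quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error_probability must be in the interval (0, 1]")
    return -10.0 * math.log10(error_probability)


# Matches the examples in the text: Q20 -> 1 in 100, Q30 -> 1 in 1,000,
# and Q40 -> 1 in 10,000.
examples = {q: phred_to_error_probability(q) for q in (20, 30, 40)}
```

Each increase of 10 in the Q score therefore corresponds to a tenfold decrease in the probability of an incorrect nucleobase call.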
[0041] As further used herein, the term “genomic coordinate” (or sometimes simply “coordinate”) refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome). In some cases, a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome. For instance, a genomic coordinate or coordinates may include a number, name, or other identifier for a chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870). In some cases, a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the dual-variant-type call recalibration system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome. Further, in certain implementations, a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). By contrast, in certain cases, a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
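As a non-limiting illustration of the coordinate notations listed above, the following sketch parses a coordinate string into an optional chromosome or source identifier and a position or range of positions. The class name, function name, and regular expression are assumptions made for this example and are not part of the described system.

```python
import re
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    source: Optional[str]  # chromosome or reference source (e.g., "chr1", "mt"); None if absent
    start: int             # first position
    end: int               # last position; equals start for a single position


# Optional "<source>:" prefix, then a position, then an optional "-<end>" suffix.
_COORDINATE = re.compile(
    r"^(?:(?P<source>[^:\s]+)\s*:\s*)?(?P<start>\d+)(?:\s*-\s*(?P<end>\d+))?$"
)


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Parse strings such as 'chr1:1234570', 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    match = _COORDINATE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError("range end precedes range start")
    return GenomicCoordinate(match["source"], start, end)


if __name__ == "__main__":
    for example in ("chr1:1234570", "chr1:1234570-1234870", "SARS-CoV-2:29001", "29727"):
        print(example, "->", parse_coordinate(example))
```

Because the source identifier is matched up to the first colon, identifiers that themselves contain hyphens (such as SARS-CoV-2) are not confused with position ranges.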
[0042] Relatedly, as used herein, the term “multiallelic genomic coordinate” refers to a genomic coordinate associated with three or more alleles. For example, a multiallelic genomic coordinate includes a genomic coordinate of a nucleotide sequence where nucleotide reads indicate three or more possible alleles corresponding to the coordinate, such as a reference allele, a first alternate allele, a second alternate allele, and so forth. In some cases, a multiallelic genomic coordinate corresponds to a genomic coordinate where a read pileup occurs or where an insertion occurs. For instance, a multiallelic genomic coordinate can exhibit a multiallelic genotype, such as a 1/2 genotype, where the first allele at the coordinate corresponds to an allele from a first alternate nucleotide sequence and the second allele corresponds to an allele from a second alternate nucleotide sequence.
[0043] As indicated above, genomic coordinates within a nucleotide sequence can exhibit different genotypes. For example, a “homozygous reference genotype” refers to a genotype where both nucleobases at a given coordinate of a sample nucleotide sequence match a reference nucleobase of a reference sequence or a reference genome (represented as 0/0). As another example, a “homozygous alternate genotype” refers to a genotype at a given coordinate where both nucleobases differ from a reference nucleobase of a reference sequence or a reference genome (represented as 1/1). As a further example, a “heterozygous genotype” refers to a genotype where the nucleobases at a given coordinate are not the same. In some cases, a heterozygous genotype includes a genotype in which one nucleobase matches a reference nucleobase and the other nucleobase differs from the reference nucleobase (represented as 0/1 or 1/0). For multiallelic genomic coordinates, genotypes can exhibit nucleobases from more than one alternate nucleobase differing from a reference nucleobase of a reference genome. For instance, a multiallelic heterozygous genotype can be represented as 1/2, where one nucleobase call matches a first alternate nucleobase differing from a reference nucleobase and the other nucleobase call matches a second alternate nucleobase differing from the reference nucleobase.
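The genotype notations above (0/0, 0/1, 1/0, 1/1, and 1/2) can be mapped to the named genotype categories with a short sketch. The function name and returned labels are hypothetical, and the handling of the “|” separator assumes the VCF convention for phased genotypes, which this paragraph does not itself discuss.

```python
def classify_genotype(genotype: str) -> str:
    """Classify a diploid genotype string such as '0/0', '0/1', '1/1', or '1/2'.

    Allele index 0 denotes the reference allele; indices 1, 2, ... denote
    alternate alleles. A '|' separator (phased, per VCF convention) is
    treated the same as '/'.
    """
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or not all(allele.isdigit() for allele in alleles):
        raise ValueError(f"expected a diploid genotype such as '0/1', got {genotype!r}")
    first, second = (int(allele) for allele in alleles)
    if first == second:
        return "homozygous reference" if first == 0 else "homozygous alternate"
    if 0 in (first, second):
        return "heterozygous"
    # Two different alternate alleles, as at a multiallelic genomic coordinate.
    return "multiallelic heterozygous"


if __name__ == "__main__":
    for gt in ("0/0", "0/1", "1/0", "1/1", "1/2"):
        print(gt, "->", classify_genotype(gt))
```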
[0044] As noted above, a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome. As used herein, the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species. For example, a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium. As a further example, a reference genome may include a reference graph genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
[0045] As used herein, a “genomic region” refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome.
[0046] As suggested above, the dual-variant-type call recalibration system can utilize a machine-learning model to modify sequencing metrics and update a genotype call. As used herein, the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data. For example, a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine-learning models include various types of decision trees, support vector machines, Bayesian networks, or neural networks. In some cases, the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm), while in other cases the call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression.
[0047] In some cases, the dual-variant-type call recalibration system utilizes a call-recalibration machine-learning model to generate outputs for confirming, modifying, or updating a genotype call based on sequencing metrics. As used herein, the term “variant-call-recalibration machine-learning model” refers to a machine-learning model that generates variant-call classifications (e.g., genotype probabilities). For example, in some cases, the variant-call-recalibration machine-learning model is trained to generate variant-call classifications indicating various probabilities or predictions for genotype calls (e.g., variant calls) based on the aforementioned sequencing metrics. Accordingly, in some cases, a call-recalibration machine-learning model is a variant-call-recalibration machine-learning model. In some cases, the call-recalibration machine-learning model is a series of gradient boosted decision trees (e.g., XGBoost algorithm or treelite algorithm for an ensemble of decision trees), while in other cases the variant-call-recalibration machine-learning model is a random forest model, a multilayer perceptron, a linear regression, a support vector machine, a deep tabular learning architecture, a deep learning transformer (e.g., self-attention-based-tabular transformer), or a logistic regression. In certain embodiments, a variant-call-recalibration machine-learning model includes multiple sub-models or operates in tandem with another call-recalibration machine-learning model. For instance, a first call-recalibration machine-learning model (e.g., an ensemble of gradient boosted trees) generates a first set of variant-call classifications and a second call-recalibration machine-learning model (e.g., a random forest) generates a second set of variant-call classifications. In another example, a variant-call-recalibration machine-learning model includes a first machine-learning sub-model configured to generate genotype probabilities for genotype calls corresponding to germline variants only and a second machine-learning sub-model configured to generate genotype probabilities for genotype calls corresponding to somatic mosaic variants (in some cases, in addition to genotype probabilities for genotype calls corresponding to germline variants).
[0048] Relatedly, the term “variant-call classification” refers to a predicted classification from a variant-call-recalibration machine-learning model that indicates a probability, score, or other quantitative measurement associated with some aspect of a genotype or variant call based on one or more sequencing metrics. A variant-call classification can include a specialized prediction depending on the application of a call-recalibration machine-learning model. In some cases, variant-call classifications for a biallelic genomic coordinate include (i) a false-positive probability that a genotype call is a false positive, (ii) a genotype-error probability that a genotype for the genotype call is incorrect, and (iii) a true-positive probability that the genotype call is a true positive. As a further example, in embodiments for generating genotype calls for multiallelic genomic coordinates, variant-call classifications can include: (i) a reference probability that a genotype call comprises a homozygous reference genotype at a multiallelic genomic coordinate, (ii) a zygosity-error probability that the genotype call comprises a genotype-zygosity error at a multiallelic genomic coordinate, and (iii) a true-positive variant probability that the genotype call constitutes a true positive variant at a multiallelic genomic coordinate.
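To illustrate how the three biallelic variant-call classifications above might be consumed downstream, the following sketch normalizes the three values, selects the most probable class, and derives a PHRED-scaled call quality from the false-positive probability. The mapping from the false-positive probability to a call quality, the cap of 99, and all names are assumptions made for illustration; this passage does not specify them.

```python
import math
from typing import Dict, Union


def phred_scale(probability: float, cap: float = 99.0) -> float:
    """Return -10 * log10(probability), capped so that a probability of 0 stays finite."""
    if probability <= 0:
        return cap
    return min(cap, -10 * math.log10(probability))


def summarize_classifications(
    false_positive: float, genotype_error: float, true_positive: float
) -> Dict[str, Union[str, float, Dict[str, float]]]:
    """Normalize three class scores and report the most likely class.

    Assumption: the call quality is the PHRED-scaled probability that the
    call is a false positive. This is a hypothetical mapping.
    """
    total = false_positive + genotype_error + true_positive
    if total <= 0:
        raise ValueError("at least one classification score must be positive")
    probabilities = {
        "false_positive": false_positive / total,
        "genotype_error": genotype_error / total,
        "true_positive": true_positive / total,
    }
    label = max(probabilities, key=probabilities.get)
    return {
        "label": label,
        "call_quality": phred_scale(probabilities["false_positive"]),
        "probabilities": probabilities,
    }


if __name__ == "__main__":
    print(summarize_classifications(0.001, 0.009, 0.99))
```

In this sketch, a call with a false-positive probability of about 0.001 receives a call quality of about 30, consistent with the PHRED scale.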
[0049] In embodiments for generating genotype calls for a haploid genomic coordinate, variant-call classifications can include: (i) a first genotype probability of a first genotype at the genomic coordinate and (ii) a second genotype probability of a second genotype at the genomic coordinate. As suggested above, the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype, and the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype. In these or other embodiments, such as embodiments for generating genotype calls for genomic coordinates indicated to exhibit homozygous reference genotypes, variant-call classifications can include: (i) a false-positive probability or a homozygous reference classification indicating a probability that a genotype call is a false positive or a homozygous reference genotype, respectively; (ii) a zygosity-error probability or a heterozygous genotype classification indicating a probability that a genotype (e.g., an indication of a heterozygous or homozygous genotype for a variant call at a particular location) is incorrect or a heterozygous genotype, respectively; and/or (iii) a true-positive classification or a homozygous alternate classification indicating a probability that a genotype call is a true positive or a homozygous alternate genotype, respectively. In some cases, the variant-call classifications accordingly represent intermediate scoring metrics and/or a predicted probability that a genotype for a genotype call is accurate.
[0050] Accordingly, as further used herein, the term “genotype probability” refers to a likelihood, probability, or score of a particular genotype at a genomic coordinate or genomic region. For instance, a genotype probability includes a likelihood of a homozygous reference genotype, a likelihood of a heterozygous variant genotype, or a likelihood of a homozygous variant genotype at one or more genomic coordinates. In some cases, a genotype probability can refer to a posterior genotype probability. Accordingly, in some cases, a genotype probability determined by a variant-call-recalibration machine-learning model can be presented in (or modified to be presented in) a posterior genotype probability (GP) field of a VCF or other sequencing data file, such as a recalibrated VCF or other recalibrated sequencing data file. A genotype probability can include a specialized prediction depending on the application of a call-recalibration machine-learning model, such as for predicting SNPs.
[0051] As mentioned, in some embodiments, the variant-call-recalibration machine-learning model can be a neural network. The term “neural network” refers to a machine-learning model that can be trained and/or tuned based on inputs to determine classifications or approximate unknown functions. For example, a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., generated digital images) based on a plurality of inputs provided to the neural network. In some cases, a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions in data. For example, a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
[0052] As noted above, the dual-variant-type call recalibration system can generate variant-call classifications that indicate or reflect a likelihood of identifying a variant corresponding to a germline variant or a somatic mosaic variant (i.e., two variant types) at a genomic coordinate. As used herein, the term “variant” refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome. For example, a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence. Relatedly, the term “germline variant” refers to a variant or mutation inherited by a sample organism from biological parents or present within germ cells. In particular, a germline variant is a heritable variant that tends to be present in every somatic and germline cell of offspring. By contrast, the term “somatic mosaic variant” refers to a variant or mutation introduced or derived from a post-zygotic event. In particular, a somatic mosaic variant includes a variant or mutation that was (i) introduced after zygote formation during cell development (e.g., 1 of 4 early cells, 1 of 32 early cells), but is (ii) not inherited from a sample organism’s biological parents and (iii) not introduced by a form of cancer or tumor in the given sample organism.
[0053] Along these lines, a “variant call” (or “variant nucleobase call”) refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference. In particular, a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome. Conversely, a “non-variant call” (or “non-variant nucleobase call” or “reference call”) refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference. In particular, a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
[0054] Relatedly, as used herein, the term “allele frequency,” at least when used in reference to a variant or variant call, refers to a frequency at which a variant occurs within a particular genomic sample or a population. In some cases, a variant allele frequency (sometimes referred to as a variant allele fraction (VAF)) is the percentage (or fraction) of sequence reads observed at a genomic coordinate or region that match a particular variant. Accordingly, a variant with an allele frequency of approximately 50% (or 0.5) or 100% (or 1.0) is more likely to be a germline variant, whereas a variant with a relatively low allele frequency (in particular, an allele frequency below 50% (or 0.5)) is more likely to be a somatic mosaic variant. As mentioned above, variants of relatively low allele frequency within a sample (e.g., somatic mosaic variants) can be relatively difficult to distinguish within a genomic sample due to, for example, lack of read coverage, GC bias, sequencing specific errors (SSEs), mapping inaccuracies, and so forth.
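The variant allele fraction described above can be computed directly from read counts, as in the following sketch. The tolerance of 0.1 around 0.5 and 1.0 and the returned labels are hypothetical values chosen only to illustrate the heuristic in this paragraph; the described system relies on a trained machine-learning model over many sequencing metrics rather than on a fixed threshold.

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a genomic coordinate that match the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


def suggest_variant_type(vaf: float, tolerance: float = 0.1) -> str:
    """Rough heuristic only: a VAF near 0.5 or 1.0 suggests a germline variant,
    and a lower VAF suggests a somatic mosaic variant. The tolerance is an
    assumed, illustrative value."""
    if abs(vaf - 0.5) <= tolerance or abs(vaf - 1.0) <= tolerance:
        return "germline-like"
    if vaf < 0.5:
        return "somatic-mosaic-like"
    return "ambiguous"


if __name__ == "__main__":
    for alt, total in ((5, 100), (48, 100), (100, 100)):
        vaf = variant_allele_fraction(alt, total)
        print(f"{alt}/{total} reads -> VAF {vaf:.2f} -> {suggest_variant_type(vaf)}")
```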
[0055] In one or more embodiments, the dual-variant-type call recalibration system identifies and/or stores sequencing metrics within one or more sequencing data files. As used herein, the term “sequencing data file” refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
[0056] In some embodiments, the dual-variant-type call recalibration system modifies data fields corresponding to a genotype-call data file, such as a variant call file. As used herein, the term “genotype-call data file” refers to a digital file that indicates or represents one or more genotype calls (e.g., including reference and/or variant calls) compared to a reference genome along with other information pertaining to the genotype calls (e.g., variant calls). For example, a genotype-call data file can include a variant call file, such as but not limited to a variant call format (VCF) file (as well as a genomic variant call format (gVCF) file). Alternatively, as a further example, a genotype-call data file can include a General Feature Format (GFF) file, a Genome Variant Format (GVF) file, or other suitable data file comprising genotype calls for a sample nucleotide sequence.
[0057] As further used herein, a “variant call file” refers to a particular genotype-call data file that comprises a text file format that contains information about variants at specific genomic coordinates. For instance, a variant call file can include meta-information lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant). As described further below, the dual-variant-type call recalibration system can generate different versions of genotype-call data files, including a pre-filter variant call file comprising variant genotype calls that either pass or fail a quality filter for base-call-quality metrics or a post-filter variant call file comprising variant genotype calls that pass the quality filter but excluding variant genotype calls that fail the quality filter.
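The structure of a variant call file described above (meta-information lines, a header line, and data lines) and the distinction between pre-filter and post-filter files can be sketched as follows. The sketch assumes the standard tab-delimited VCF layout with its eight fixed columns; the example records and the filter name “LowQual” are invented for illustration and do not come from the described system.

```python
from typing import Dict, Iterable, List

# The eight fixed columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dictionary per data line, skipping meta-information and header lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # "##" meta-information or "#CHROM" header
            continue
        fields = line.split("\t")
        records.append(dict(zip(VCF_COLUMNS, fields[:8])))
    return records


def post_filter(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records that passed the quality filter (FILTER == 'PASS')."""
    return [record for record in records if record["FILTER"] == "PASS"]


# Hypothetical pre-filter content: one passing call and one failing call.
EXAMPLE_PRE_FILTER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1234570\t.\tA\tG\t48.2\tPASS\t.",
    "chr1\t1234870\t.\tC\tT\t3.1\tLowQual\t.",
]

if __name__ == "__main__":
    pre = parse_vcf(EXAMPLE_PRE_FILTER)
    print("pre-filter calls:", len(pre))
    print("post-filter calls:", len(post_filter(pre)))
```

Here the pre-filter set retains both calls together with their filter status, while the post-filter set retains only the passing call.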
[0058] As also mentioned, in one or more embodiments, one or more sequencing data files in which the dual-variant-type call recalibration system identifies or stores sequencing metrics include an alignment data file containing information from a read processing and mapping procedure. As used herein, the term “alignment data file” refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence. For example, an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
[0059] In some embodiments, the dual-variant-type call recalibration system modifies data fields corresponding to metrics of a genotype call associated with a variant call file, such as fields for call quality, genotype, and genotype quality. As used herein, the term “call quality” when used with respect to a data field in a variant call file refers to a measure or an indication of a likelihood or a probability that a variant exists at a given location. Accordingly, a call quality field (or QUAL field) corresponding to a VCF file may include a base-call-quality metric, such as a PHRED-scaled quality or Q score, representing a probability that a genomic coordinate of a sample genome includes a variant. Similarly, a “genotype quality” when used with respect to a field refers to a likelihood or a probability that a particular predicted genotype for a nucleobase call is correct.
[0060] As noted, in some embodiments, the dual-variant-type call recalibration system utilizes a call generation model to determine initial genotype calls or initial variant calls. As used herein, the term “call generation model” refers to a probabilistic model that generates sequencing data from nucleotide reads of a sample nucleotide sequence, including nucleobase calls, variant calls, and/or genotype calls along with associated metrics. Accordingly, in some cases, a call generation model may be a variant call generation model. For example, in some cases, a call generation model refers to a Bayesian probability model that generates variant calls based on nucleotide reads of a sample nucleotide sequence. Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses including foreign reads, missing reads, joint detection, and more. A call generation model may likewise include multiple components, including, but not limited to, different software applications or components for mapping and aligning, sorting, duplicate marking, computing read pileup depths, and variant calling. In some cases, a call generation model refers to an ILLUMINA DRAGEN model for variant calling functions and mapping and alignment functions (e.g., a DRAGEN variant caller or “DRAGEN VC”).
[0061] The following paragraphs describe the dual-variant-type call recalibration system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a system environment (or “environment”) 100 in which a dual-variant-type call recalibration system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes one or more server device(s) 102 connected to a client device 108 and a sequencing device 114 via a network 112. While FIG. 1 shows an embodiment of the dual-variant-type call recalibration system 106, this disclosure describes alternative embodiments and configurations below.
[0062] As shown in FIG. 1, the server device(s) 102, the client device 108, and the sequencing device 114 can communicate with each other via the network 112. The network 112 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 10.
[0063] As indicated by FIG. 1, the sequencing device 114 comprises a device for sequencing one or more nucleic acid polymers. In some embodiments, the sequencing device 114 analyzes nucleic acid segments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems (described herein) either directly or indirectly on the sequencing device 114. More particularly, the sequencing device 114 receives and analyzes, within nucleotide-sample slides (e.g., flow cells), nucleic acid sequences extracted from genomic samples. In one or more embodiments, the sequencing device 114 utilizes SBS to sequence nucleic acid polymers (e.g., clusters of oligonucleotides) into nucleotide reads. In addition or in the alternative to communicating across the network 112, in some embodiments, the sequencing device 114 bypasses the network 112 and communicates directly with the client device 108.
[0064] As further indicated by FIG. 1, the server device(s) 102 may generate, receive, analyze, store, and transmit digital data, such as data for determining nucleobase calls or sequencing nucleic acid polymers. As shown in FIG. 1, the sequencing device 114 may send (and the server device(s) 102 may receive) call data from the sequencing device 114. The server device(s) 102 may also communicate with the client device 108. In particular, the server device(s) 102 can send data to the client device 108, including sequencing data files, such as genotype-call data files or alignment data files, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics associated with nucleobase calls or genotype calls.
[0065] In some embodiments, the server device(s) 102 comprise a distributed collection of servers where the server device(s) 102 include a number of server devices distributed across the network 112 and located in the same or different physical locations. Further, the server device(s) 102 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. In some cases, the server device(s) 102 are located at a same physical location as the sequencing device 114.
[0066] As further shown in FIG. 1, the server device(s) 102 can include a sequencing system 104. Generally, the sequencing system 104 analyzes call data, such as sequencing metrics received from the sequencing device 114, to determine nucleobase sequences for nucleic acid polymers. For example, the sequencing system 104 can receive raw data from the sequencing device 114 and can determine a nucleobase sequence for a sample nucleotide sequence (e.g., a genomic sample). In some embodiments, the sequencing system 104 determines the sequences of nucleobases in DNA and/or RNA segments or oligonucleotides. In addition to processing and determining sequences for nucleic acid polymers, the sequencing system 104 also generates a genotype-call data file, such as a variant call file, indicating one or more genotype calls and/or variant calls for one or more genomic coordinates.
[0067] As mentioned, and as illustrated in FIG. 1, the dual-variant-type call recalibration system 106 analyzes call data, such as sequencing metrics from the sequencing device 114, to recalibrate genotype calls for sample nucleotide sequences that were previously generated (e.g., by a call generation model). The dual-variant-type call recalibration system 106 includes a variant-call-recalibration machine-learning model trained to identify variant calls at genomic regions corresponding to germline variants and somatic mosaic variants. In some embodiments, the dual-variant-type call recalibration system 106 determines sequencing metrics for sample nucleotide sequences based on information stored in existing sequencing data files, such as alignment data files.
[0068] Based on data derived or prepared from the sequencing metrics, the dual-variant-type call recalibration system 106 trains and applies a variant-call-recalibration machine-learning model to confirm or recalibrate genotype calls for the sample sequence at genomic coordinates corresponding to candidate germline variants and candidate somatic mosaic variants. The dual-variant-type call recalibration system 106 further utilizes the variant-call-recalibration machine-learning model to generate sets of variant-call classifications (e.g., genotype probabilities) to update or modify the genotype calls (e.g., variant calls). Based on such genotype probabilities or other variant-call classifications, for example, the dual-variant-type call recalibration system 106 can update data fields corresponding to a genotype-call data file, such as a variant call file, to update a genotype call (e.g., a variant call) for improved accuracy. In some embodiments, the dual-variant-type call recalibration system 106 outputs an updated variant call file (or other format of genotype-call data file) with the modified or updated genotype calls and/or variant calls.
[0069] As further illustrated and indicated in FIG. 1, the client device 108 can generate, store, receive, and send digital data. In particular, the client device 108 can receive sequencing metrics from the sequencing device 114. Furthermore, the client device 108 may communicate with the server device(s) 102 to receive sequencing data comprising genotype calls and/or other metrics, such as a call-quality, a genotype indication, and a genotype quality. The client device 108 can accordingly present or display information pertaining to the genotype call within a graphical user interface to a user associated with the client device 108.
[0070] The client device 108 illustrated in FIG. 1 may comprise various types of client devices. For example, in some embodiments, the client device 108 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 108 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 108 are discussed below with respect to FIG. 10.
[0071] As further illustrated in FIG. 1, the client device 108 includes a sequencing application 110. The sequencing application 110 may be a web application or a native application stored and executed on the client device 108 (e.g., a mobile application, desktop application). The sequencing application 110 can include instructions that (when executed) cause the client device 108 to receive data from the dual-variant-type call recalibration system 106 and present, for display at the client device 108, data from a variant call file and/or an updated variant call file. Furthermore, the sequencing application 110 can instruct the client device 108 to display a visualization of sequencing metrics of a nucleobase call or genotype call.
[0072] As further illustrated in FIG. 1, the dual-variant-type call recalibration system 106 may be located on the client device 108 as part of the sequencing application 110 or on the sequencing device 114. Accordingly, in some embodiments, the dual-variant-type call recalibration system 106 is implemented by (e.g., located entirely or in part on) the client device 108. In yet other embodiments, the dual-variant-type call recalibration system 106 is implemented by one or more other components of the computing system 100, such as the sequencing device 114. In particular, the dual-variant-type call recalibration system 106 can be implemented in a variety of different ways across the server device(s) 102, the network 112, the client device 108, and the sequencing device 114. For example, the dual-variant-type call recalibration system 106 can be downloaded from the server device(s) 102 to the client device 108 and/or to the sequencing device 114 where all or part of the functionality of the dual-variant-type call recalibration system 106 is performed at each respective device within the computing system 100.
[0073] As further illustrated in FIG. 1, the computing system 100 includes a database 116. The database 116 can store information, such as sequencing data files, sample nucleotide sequences, nucleotide reads, nucleobase calls, genotype calls (e.g., variant calls), and sequencing metrics. In some embodiments, the server device(s) 102, the client device 108, and/or the sequencing device 114 communicate with the database 116 (e.g., via the network 112) to store and/or access information, such as sequencing data files, sample nucleotide sequences, nucleotide reads, nucleobase calls, genotype calls (e.g., variant calls), and sequencing metrics. In some cases, the database 116 also stores one or more models, such as a variant-call-recalibration machine-learning model.
[0074] Though FIG. 1 illustrates the components of computing system 100 communicating via the network 112, in certain implementations, the components of computing system 100 can also communicate directly with each other, bypassing the network 112. For instance, and as previously mentioned, in some implementations, the client device 108 communicates directly with the sequencing device 114. Additionally, in some embodiments, the client device 108 communicates directly with the dual-variant-type call recalibration system 106. Moreover, the dual-variant-type call recalibration system 106 can access one or more databases housed on or accessed by the server device(s) 102 or elsewhere in the computing system 100.
[0075] As indicated above, the dual-variant-type call recalibration system 106 can determine genotype calls for germline and somatic mosaic variants based on genotype probabilities generated by a variant-call-recalibration machine-learning model. In particular, the dual-variant-type call recalibration system 106 can determine genotype probabilities for variants of various allele frequencies based on sequencing metrics associated with nucleotide reads for a genomic sample. For example, FIG. 2 illustrates an overview of the dual-variant-type call recalibration system 106 utilizing a variant-call-recalibration machine-learning model to determine genotype probabilities based on sequencing metrics and, based on such probabilities, generating genotype calls in accordance with one or more embodiments.
[0076] As illustrated in FIG. 2, the dual-variant-type call recalibration system 106 performs an act 202 to determine sequencing metrics. In particular, the dual-variant-type call recalibration system 106 determines sequencing metrics, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics. For example, the dual-variant-type call recalibration system 106 determines sequencing metrics that indicate various attributes or data in relation to various nucleobase calls of nucleotide reads from a sample nucleotide sequence from a genomic sample. Additional detail regarding determining the various types of sequencing metrics is provided below in reference to FIGS. 3A-3C.
[0077] As mentioned above, the dual-variant-type call recalibration system 106 can determine sequencing metrics for nucleotide reads comprising (or including supporting evidence for) variants of various allele frequencies within a given genomic sample. In the illustrated example of FIG. 2 with nucleotide reads in relation to the act 202, for instance, a first variant relative to a reference genome (cytosine (C) in lieu of thymine (T)) has an allele frequency less than or equal to 0.35, whereas a second variant (thymine (T) in lieu of adenine (A)) has an allele frequency greater than or equal to 0.5. Indeed, the allele frequency of the second variant indicates that the second variant is more likely a germline variant, while the relatively low allele frequency of the first variant indicates that the first variant may be a somatic mosaic variant. While FIG. 2 illustrates a particular allele frequency of the first variant (i.e., less than or equal to 0.35), variants within a genomic sample may occur at any variant allele frequency. In some cases, such variant allele frequencies are accounted for in the determined sequencing metrics.
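As a minimal sketch of the allele-frequency reasoning above, the following Python computes a variant allele frequency from read counts and labels it using the two illustrative values from FIG. 2 (0.35 and 0.5). The function names and the use of fixed cutoffs are assumptions for illustration only; the described system relies on a machine-learning model and sequencing metrics rather than fixed thresholds.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a genomic coordinate that support the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def candidate_variant_type(vaf: float,
                           mosaic_max: float = 0.35,    # illustrative value from FIG. 2
                           germline_min: float = 0.5):  # illustrative value from FIG. 2
    """Hypothetical labeling helper; not the patent's classification method."""
    if vaf <= 0.0:
        return "reference"
    if vaf <= mosaic_max:
        return "candidate somatic mosaic"
    if vaf >= germline_min:
        return "candidate germline"
    return "ambiguous"


# Example: 9 of 30 reads support the alternate allele.
vaf = variant_allele_frequency(9, 30)
label = candidate_variant_type(vaf)
```

Because variants may occur at any variant allele frequency, a fixed cutoff of this kind leaves an ambiguous band between the two values, which is one reason such frequencies may instead be accounted for within the sequencing metrics.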
[0078] Based on the determined sequencing metrics, as further illustrated in FIG. 2, the dual-variant-type call recalibration system 106 performs an act 204 to generate genotype probabilities. More specifically, the dual-variant-type call recalibration system 106 generates (or updates or refines) variant call classifications, such as genotype probabilities, from sequencing metrics utilizing a variant-call-recalibration machine-learning model. To elaborate, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model to process or analyze one or more sequencing metrics associated with one or more nucleotide reads to generate a set of classifications (e.g., predicted probabilities associated with genotypes). As shown in FIG. 2, for instance, the dual-variant-type call recalibration system 106 generates, utilizing the variant-call-recalibration machine-learning model, certain genotype probabilities associated with a candidate genotype indicated by the sequencing metrics, including genotype probabilities for variants within genomic regions corresponding to both candidate germline variants and candidate somatic mosaic variants.
[0079] Based on the generated genotype probabilities, as further illustrated in FIG. 2, the dual-variant-type call recalibration system 106 also performs an act 206 to determine genotype calls, such as a reference call or a variant call corresponding to a germline variant or a somatic mosaic variant. More particularly, the dual-variant-type call recalibration system 106 confirms, determines, or updates a preliminary genotype call by a call generation model (e.g., a Bayesian probabilistic-based variant caller) for a sample nucleotide sequence at a genomic coordinate within a reference genome. To determine or generate the final genotype call, in some embodiments, the dual-variant-type call recalibration system 106 determines initial genotype calls utilizing a call generation model and edits or updates certain initial genotype calls based on the genotype probabilities generated by the variant-call-recalibration machine-learning model.
[0080] In the illustrated example, for instance, the dual-variant-type call recalibration system 106 outputs genotype calls corresponding to the nucleotide reads described above in relation to the act 202. Accordingly, FIG. 2 shows, in relation to the act 206, a first genotype call at a first genomic coordinate (i.e., position or POS) indicating an alternate allele (cytosine) with an allele frequency of 0.3 and a second genotype call at a second genomic coordinate indicating an alternate allele (thymine) with an allele frequency of 0.5.
[0081] To elaborate, in some embodiments, the dual-variant-type call recalibration system 106 utilizes a call generation model to process or analyze sequencing metrics (e.g., one or more of the same sequencing metrics used to generate the genotype probabilities in act 204) to determine genotype calls (e.g., initial genotype calls) indicating that a genomic sample comprises reference bases or variants at certain genomic coordinates. For example, in some embodiments, the dual-variant-type call recalibration system 106 applies a number of Bayesian probabilistic models or algorithms to derive various probabilities for different reference bases or variant bases, quality metrics, mapping metrics, joint metrics, and other data occurring within the sample nucleotide sequence to include within a variant call file. From the probabilistic models, the dual-variant-type call recalibration system 106 determines genotype calls (e.g., calls indicating differences or likenesses to reference bases from a reference genome) that indicate predicted nucleobases for the sample genome at corresponding genomic coordinates.
[0082] Having generated an initial genotype call using a call generation model, the dual-variant-type call recalibration system 106 can confirm or update the initial genotype call (and/or corresponding sequencing metrics in various fields) based on the genotype probabilities from the variant-call-recalibration machine-learning model. As described further below with respect to FIG. 4A, based on the output genotype probabilities, the dual-variant-type call recalibration system 106 can modify one or more of a base-call-quality metric, a genotype-probability metric, a genotype metric, a genotype-likelihood metric, or a genotype-quality metric for the initial genotype call, including calls corresponding to germline variants and somatic mosaic variants.
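One way such an update could look, sketched in Python under stated assumptions: given model-output genotype probabilities, pick the most probable genotype and derive a PHRED-scaled genotype-quality value as -10·log10 of the probability that the chosen genotype is wrong. The function name, the cap of 99, and the rounding are hypothetical choices, not details taken from this disclosure.

```python
import math
from typing import Dict, Tuple


def recalibrated_gt_and_gq(genotype_probs: Dict[str, float],
                           max_phred: int = 99) -> Tuple[str, int]:
    """Return (most probable genotype, PHRED-scaled genotype quality).

    genotype_probs maps genotype strings (e.g., "0/0", "0/1", "1/1") to
    model-output probabilities; they are renormalized to sum to 1.
    """
    total = sum(genotype_probs.values())
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    normalized = {gt: p / total for gt, p in genotype_probs.items()}
    best_gt = max(normalized, key=normalized.get)
    p_error = 1.0 - normalized[best_gt]
    if p_error <= 0.0:
        return best_gt, max_phred  # avoid log10(0); cap is an assumption
    gq = min(max_phred, round(-10.0 * math.log10(p_error)))
    return best_gt, gq


# Example: the model strongly favors a heterozygous call.
gt, gq = recalibrated_gt_and_gq({"0/0": 0.01, "0/1": 0.98, "1/1": 0.01})
```

If the recalibrated genotype differs from the initial call, the GT field would be updated as well; if it matches, only the quality-related fields would change.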
[0083] As mentioned above, in certain described embodiments, the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics for nucleobase calls or genotype calls at particular genomic coordinates, such as genomic coordinates corresponding to candidate germline variants and/or candidate somatic mosaic variants. In particular, the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics, such as read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics from sequence data (e.g., one or more sequencing data files) for calls corresponding to nucleotide reads of a sample nucleotide sequence. FIGS. 3A-3C illustrate determining sequencing metrics in accordance with one or more embodiments. Specifically, FIG. 3A illustrates determining read-based sequencing metrics, while FIG. 3B illustrates determining call-model-generated sequencing metrics, and FIG. 3C illustrates determining externally sourced sequencing metrics.
[0084] As illustrated in FIG. 3A, the dual-variant-type call recalibration system 106 accesses, retrieves, or otherwise obtains nucleotide reads 302. In particular, in some embodiments, the nucleotide reads 302 are generated utilizing the sequencing device 114, the nucleotide reads 302 comprising nucleobase calls for regions from a sample nucleotide sequence (e.g., sample genome). For example, the nucleotide reads 302 can be generated utilizing sequencing-by-synthesis (SBS) techniques and/or Sanger sequencing techniques to determine nucleobase calls for oligonucleotide clusters from wells in a flow cell and/or via fluorescent tagging. More specifically, the nucleotide reads 302 are generated utilizing cluster generation and SBS chemistry to sequence millions or billions of clusters in a flow cell. During SBS chemistry, for each cluster, the nucleobase calls from the nucleotide reads 302 are stored for every cycle of sequencing via real-time analysis (RTA) software and, in some embodiments, provided directly to the dual-variant-type call recalibration system 106.
[0085] As further illustrated in FIG. 3A, in some embodiments, the dual-variant-type call recalibration system 106 performs read processing and mapping 304. For example, the read processing and mapping 304 can include utilizing real-time analysis (RTA) software to store base call data in the form of individual base call data files (or BCLs). In some cases, the read processing and mapping 304 further includes converting the BCL files into sequence data (e.g., via BCL to FASTQ conversion) to be analyzed by a call-generation model 310 to determine genotype calls for the nucleotide reads 302.
[0086] In particular, in certain embodiments, the read processing and mapping 304 includes aligning the nucleotide reads 302 with a reference genome or receiving information pertaining to the read alignment. Specifically, the read processing and mapping 304 determines which nucleobase(s) of a given read align with which genomic coordinate of a reference sequence (or receives information indicating alignment). Different reads have different lengths and include different nucleobases. Accordingly, in some cases, the read processing and mapping 304 includes analysis of each nucleotide of each read to determine (or receives information indicating) where the read “fits” in relation to a reference sequence — e.g., where the bases within the read align with bases in the reference. In some cases, the read processing and mapping 304 includes alignment of many reads at a single genomic coordinate, thus resulting in a read pileup.
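The idea of a read pileup can be sketched in a few lines of Python. The sketch assumes each read is already aligned without gaps (no indels or soft clipping) and is given as a hypothetical `(start, bases)` pair with a 0-based start coordinate; a real aligner would report gaps and clipping in its alignment records.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_pileup(reads: Iterable[Tuple[int, str]]) -> Dict[int, Counter]:
    """Map each genomic coordinate to counts of the bases observed there.

    Each read is (start, bases), where `start` is the 0-based reference
    coordinate of the read's first base and alignment is assumed gapless.
    """
    pileup: Dict[int, Counter] = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pileup[start + offset][base] += 1
    return pileup


# Three overlapping reads; coordinate 102 is covered by all of them.
reads = [(100, "ACGT"), (101, "CGTA"), (102, "TTAC")]
pileup = build_pileup(reads)
depth_at_102 = sum(pileup[102].values())
```

In this example, two reads report G and one reports T at coordinate 102, so the pileup at that coordinate carries both the read depth and the per-allele support that later metrics draw on.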
[0087] In certain embodiments, the dual-variant-type call recalibration system 106 performs additional statistical tests to determine or detect differences between metrics associated with a reference nucleotide sequence (e.g., within a reference genome) and metrics associated with alternative supporting nucleotide reads. Through these statistical tests, the dual-variant-type call recalibration system 106 re-engineers raw sequencing metrics to determine read-based sequencing metrics. In some cases, the dual-variant-type call recalibration system 106 determines raw sequencing metrics that include one or more of (i) alignment metrics for quantifying alignment of sample nucleotide sequences with genomic coordinates of an example nucleotide sequence (e.g., a reference genome or a nucleotide sequence from an ancestral haplotype), (ii) depth metrics for quantifying depth of nucleobase calls for sample nucleotide sequences at genomic coordinates of the example nucleotide sequence, or (iii) call-quality metrics for quantifying quality of nucleobase calls for sample nucleotide sequences at genomic coordinates of the example nucleotide sequence. For instance, the dual-variant-type call recalibration system 106 determines mapping-quality metrics (e.g., the MAPQ metrics indicated in FIG. 3A), soft-clipping metrics, or other alignment metrics that measure an alignment of sample sequences with a reference genome. As another example, the dual-variant-type call recalibration system 106 extracts forward-reverse-depth metrics (or other such depth metrics) or callability metrics for variant genotype calls (or other such call-quality metrics).
[0088] As just mentioned, in some embodiments, the dual-variant-type call recalibration system 106 re-engineers the raw sequencing metrics to generate read-based sequencing metrics that are more informative for comparing metrics associated with a reference nucleotide sequence with metrics associated with various supporting alternative nucleotide reads. For example, the dual-variant-type call recalibration system 106 determines various metrics for a sample sequence in relation to a reference sequence and further determines various metrics for the sample sequence in relation to alternative supporting sequences. In addition, in some embodiments, the dual-variant-type call recalibration system 106 performs comparative analyses between metrics associated with the reference sequence and the metrics associated with the alternative supporting reads.
[0089] For instance, the dual-variant-type call recalibration system 106 compares how nucleobases of a sample nucleotide sequence (e.g., sample genome) map to a reference sequence with how the nucleobases map to various alternative supporting reads. In some cases, the dual-variant-type call recalibration system 106 determines mapping qualities associated with the reference sequence to compare with mapping qualities associated with alternative supporting reads. For example, the dual-variant-type call recalibration system 106 determines mapping quality statistics reflecting differences in the distribution of reads supporting a reference sequence versus reads supporting alternative alleles.
[0090] In these or other cases, the dual-variant-type call recalibration system 106 determines mismatch counts between the sample sequence and the reference sequence and between the reference sequence and alternative supporting reads. The dual-variant-type call recalibration system 106 further compares the mismatch counts to determine a comparative-mismatch-count metric. Further, the dual-variant-type call recalibration system 106 determines soft-clipping metrics for the sample sequence in relation to the reference sequence and further determines soft-clipping metrics in relation to alternative supporting reads. The dual-variant-type call recalibration system 106 also compares the soft-clipping metrics between the reference sequence and the alternative supporting reads to generate a comparative-soft-clipping metric. Further still, the dual-variant-type call recalibration system 106 compares base-call-quality metrics in relation to the reference sequence and alternative supporting reads and/or compares query positions of the sample sequence in relation to the reference sequence with those in relation to alternative supporting reads.
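The comparisons described above can be illustrated with a small Python sketch. This disclosure does not specify which statistic is used for the comparison, so the difference of means below is purely an assumption, as are the function name and the example values; the same helper could be applied to mapping qualities, mismatch counts, or soft-clipped base counts.

```python
from statistics import mean
from typing import Optional, Sequence


def comparative_metric(ref_values: Sequence[float],
                       alt_values: Sequence[float]) -> Optional[float]:
    """Mean of a per-read metric over alternate-supporting reads minus the
    mean over reference-supporting reads (an assumed comparison statistic).

    Returns None when either group has no reads, since no comparison exists.
    """
    if not ref_values or not alt_values:
        return None
    return mean(alt_values) - mean(ref_values)


# Hypothetical per-read mapping qualities (MAPQ) at one genomic coordinate.
ref_mapq = [60, 60, 58, 60]
alt_mapq = [20, 25, 60, 15]
comparative_mapq = comparative_metric(ref_mapq, alt_mapq)
```

A strongly negative value here means the alternate-supporting reads map with much lower confidence than the reference-supporting reads, which is the kind of signal that may point to a mis-mapping rather than a true variant.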
[0091] As further illustrated in FIG. 3A, the dual-variant-type call recalibration system 106 utilizes the comparisons and/or other statistical tests to generate the read-based sequencing metrics 306, including, for example: (i) a comparative-mapping-quality-distribution metric indicating a mapping quality distribution comparing mapping qualities in relation to the reference sequence and mapping qualities in relation to alternative supporting reads, (ii) a comparative-secondary-mapping-alignment metric indicating a comparison between secondary mapping in relation to bases in the reference sequence and bases in alternative supporting reads, (iii) a comparative-mismatch-count metric indicating a comparison between mismatched nucleobases in relation to the reference sequence and mismatched bases in relation to alternative supporting reads, (iv) a comparative-soft-clipping metric indicating a comparison between soft-clipping metrics in relation to the reference sequence and soft-clipping metrics in relation to alternative supporting reads, (v) one or more comparative-read-depth metrics indicating comparisons between read depths of the nucleotide reads 302 and one or more average read depths (e.g., local average read depths at a particular genomic coordinate and global average read depths across a number of genomic coordinates in a region), (vi) one or more comparative-base-quality metrics indicating comparisons between base qualities in relation to the reference sequence and base qualities in relation to alternative supporting reads (e.g., for overall base quality, early base quality, and late base quality in the nucleotide reads 302), (vii) a comparative-query-position metric indicating a comparison between query positions in relation to the reference sequence and query positions in relation to alternative supporting reads, (viii) one or more contextual-information metrics indicating homopolymers and periodicity of nucleobase calls, (ix) a strand-bias metric indicating a strand bias associated with one or more of the nucleotide reads 302, and (x) a read-direction-bias metric indicating a read direction bias associated with the nucleotide reads 302. In some cases, the dual-variant-type call recalibration system 106 determines or re-engineers additional or alternative read-based sequencing metrics 306.
[0092] In addition to the read-based sequencing metrics 306, as illustrated in FIG. 3B, the dual-variant-type call recalibration system 106 generates call-model-generated sequencing metrics 312 utilizing a call-generation model 310. In particular, the dual-variant-type call recalibration system 106 generates the call-model-generated sequencing metrics 312 from sequence data 308 utilizing the call-generation model 310. For example, the dual-variant-type call recalibration system 106 extracts or determines sequence data 308 based on the read processing and mapping 304 described in relation to FIG. 3A. In some cases, the dual-variant-type call recalibration system 106 generates the sequence data 308 as part of one or more digital files, such as BCL and FASTQ files.
[0093] To generate such files, in some embodiments, the sequencing device 114 (or the call-generation model 310) utilizes cluster generation and SBS chemistry to sequence millions or billions of clusters in a flow cell. During SBS chemistry, for each cluster, the sequencing device 114 (or the call-generation model 310) stores nucleobase calls from the nucleotide reads 302 for every cycle of sequencing via real-time analysis (RTA) software. The sequencing device 114 (or the call-generation model 310) utilizes RTA software to further store base call data in the form of individual base call data files (or BCLs). In some cases, the sequencing device 114 (or the call-generation model 310) further converts the BCL files into sequence data (e.g., via BCL to FASTQ conversion). For instance, the sequencing device 114 (or the call-generation model 310) generates a FASTQ file from the nucleotide reads 302, where the FASTQ file includes the sequence data 308 (or a portion thereof).
[0094] In some cases, the call-generation model 310 generates the sequence data 308 for each cluster that passes an initial quality filter from a sample sequence. For example, the call-generation model 310 generates entries for each cluster, where each entry includes four lines (or four items of sequence data): (i) a sequence identifier with information about the sequencing run and the cluster, (ii) nucleobase calls that make up the sequence (e.g., a sequence of A, C, T, G, and/or N calls), (iii) a separator (e.g., a “+” sign), and (iv) base-call-quality metrics indicating probabilities of correctness for the nucleobase calls (Phred +33 encoded).
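The four-line entry described above can be parsed with a short Python sketch. It decodes Phred+33 quality characters into quality scores Q and the corresponding error probabilities 10^(-Q/10). The function name and the example entry are hypothetical; the layout and encoding follow the four items listed in the paragraph.

```python
from typing import List, Sequence, Tuple


def parse_fastq_entry(lines: Sequence[str]) -> Tuple[str, str, List[int], List[float]]:
    """Parse one four-line FASTQ entry.

    Returns (identifier, bases, phred_scores, error_probabilities).
    """
    if len(lines) != 4:
        raise ValueError("a FASTQ entry has exactly four lines")
    identifier, bases, separator, quality = (line.rstrip("\n") for line in lines)
    if not identifier.startswith("@"):
        raise ValueError("sequence identifier line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if len(bases) != len(quality):
        raise ValueError("one quality character is required per nucleobase call")
    # Phred+33: quality score = ASCII code of the character minus 33.
    phred = [ord(ch) - 33 for ch in quality]
    # A Phred score Q corresponds to an error probability of 10^(-Q/10).
    error_probs = [10 ** (-q / 10) for q in phred]
    return identifier[1:], bases, phred, error_probs


# Hypothetical entry: four confident calls followed by a low-quality N call.
entry = ["@example_read_1", "ACGTN", "+", "IIII#"]
name, bases, phred, error_probs = parse_fastq_entry(entry)
```

Here the character `I` decodes to a quality score of 40 (an error probability of about 1 in 10,000), while `#` decodes to a score of 2.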
[0095] As further illustrated in FIG. 3B, the dual-variant-type call recalibration system 106 implements, utilizes, or applies the call-generation model 310 to process or analyze the sequence data 308 to generate genotype calls. Indeed, in some embodiments, the dual-variant-type call recalibration system 106 determines the call-model-generated sequencing metrics 312 by utilizing the call-generation model 310 to re-engineer raw sequencing metrics (e.g., raw sequencing metrics within the sequence data 308). In particular, the call-generation model 310 includes mapping-and-alignment components to map and align nucleobase calls from the sequence data 308. In addition, the call-generation model 310 includes variant calling components to generate genotype calls (e.g., reference-base calls such as variant calls or non-variant calls) from the sequence data 308. In some cases, the dual-variant-type call recalibration system 106 determines the call-model-generated sequencing metrics 312 that have been generated utilizing the mapping-and-alignment components and the variant calling components of the call-generation model 310.
[0096] To illustrate examples of the call-model-generated sequencing metrics 312, in some cases, the dual-variant-type call recalibration system 106 generates variant calling metrics including one or more of: (i) genotype metrics corresponding to a GT field of a VCF file and indicating a genotype of a genomic coordinate, (ii) base-call-quality metrics (e.g., DRAGEN QUAL scores) indicating quality scores for genotype calls generated via the call-generation model 310, (iii) genotype quality metrics (e.g., a GQ score) indicating a measure of confidence or quality of a predicted genotype for a genomic coordinate, (iv) genotype probability metrics indicating one or more probabilities of various genotypes occurring at a genomic coordinate, (v) PHRED-scaled-likelihood metrics or non-PHRED-scaled-likelihood metrics indicating probabilities of errors associated with genotype calls, (vi) a call-model-generated-foreign-read-detection metric (e.g., foreign read detection (FRD) score) indicating a probability that one or more of the nucleotide reads 302 in a pileup might be foreign reads (e.g., their true location is elsewhere in the reference sequence), (vii) a call-model-generated-base-quality-dropoff metric (e.g., base quality dropoff (BQD) score) indicating a probability of base quality dropoff based on one or more of strand bias, error position in a read, or low mean base quality over a subset of the nucleotide reads 302, (viii) an average read depth, (ix) a normalized read depth, (x) a read depth with mapq0 reads (e.g., a zero mapping quality metric), (xi) a read depth without mapq0 reads, (xii) indel statistics (e.g., a polymerase chain reaction or “PCR” curve), (xiii) hidden Markov model (HMM) statistics (e.g., posterior genotype probabilities), (xiv) a secondary-alignment metric indicating a probability that a secondary genotype call is correct, (xv) a base-context metric indicating contextual information for nucleotides around a genotype call, (xvi) a nearby-call metric indicating calls nearby (e.g., adjacent to or within a threshold degree of separation from) a genotype call, (xvii) a joint-detection metric indicating a probability of detecting a joint corresponding to two or more overlapping genotype calls, and/or (xviii) read-filtering metrics indicating threshold quality metrics or other metrics for filtering out genotype calls with low mapping quality, base quality, or other quality metrics, among others. In some cases, the dual-variant-type call recalibration system 106 generates the call-model-generated sequencing metrics 312 from internal (e.g., proprietary and model-specific) variables that reflect interacting processing paths, corner cases, and difficult predictions/decisions.
[0097] As indicated above, in some cases, the dual-variant-type call recalibration system 106 determines FRD scores according to the methods described in U.S. Patent Application No. 16/280,022 to Eric Jon Ojard, entitled System and Method for Correlated Error Event Mitigation for Variant Calling, filed February 19, 2019, which is incorporated by reference herein in its entirety. In certain implementations, the dual-variant-type call recalibration system 106 also (or alternatively) determines BQD scores, FRD scores, HMM statistics, and/or other variant calling metrics according to the methods described in U.S. Patent Application Nos. 17/165,828, 15/643,381, and 14/811,836, which are incorporated by reference herein in their entireties.
[0098] As illustrated in FIG. 3B, the call-model-generated sequencing metrics 312 include, but are not limited to, variant calling metrics determined via the variant calling components of the call-generation model 310. In addition or in the alternative to the examples of the call-model-generated sequencing metrics described above, in some cases, the dual-variant-type call recalibration system 106 determines or generates (e.g., via metric re-engineering) variant calling metrics including one or more of: (i) a number of samples in a population, (ii) a number of reads processed for generating genotype calls, (iii) a number of variants (e.g., SNPs, indels, and MNPs), (iv) a number of biallelic sites (e.g., genomic coordinates that contain two observed alleles), (v) a number of multiallelic sites (e.g., a number of sites in a variant call file that contain three or more observed alleles), (vi) a number of SNPs, (vii) numbers of different types of indels (e.g., homozygous insertions, heterozygous insertions, and heterozygous deletions), (viii) a total number of heterozygous indels (e.g., insertion + deletion, insertion + SNP, or deletion + SNP), (ix) a number of de novo SNPs (e.g., SNPs with de novo quality metrics that satisfy a threshold level), (x) a number of de novo indels (e.g., indels with de novo quality metrics that satisfy a threshold level), (xi) a number of de novo MNPs (e.g., MNPs with de novo quality metrics that satisfy a threshold level), (xii) a number of SNPs in a first chromosome divided by a number of SNPs in a second chromosome, (xiii) a number of SNP transitions, (xiv) a number of SNP transversions, (xv) a number of heterozygous variants, (xvi) a number of homozygous variants, (xvii) a ratio between the number of heterozygous variants and the number of homozygous variants, (xviii) a number of variants detected within a dbSNP reference file, and/or (xix) a total number of variants minus the number detected within the dbSNP file.
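A few of the counts listed above (SNP transitions, SNP transversions, and the heterozygous-to-homozygous ratio) can be sketched in Python as follows. Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) substitutions, and all other single-base substitutions are transversions. The function names and the input shapes (reference/alternate base pairs and VCF-style GT strings) are assumptions for illustration.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_snp_classes(snps: Sequence[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (transitions, transversions) for (ref, alt) single-base pairs.

    Assumes every pair is a valid SNP (ref != alt, both in A/C/G/T).
    """
    transitions = sum(1 for ref, alt in snps
                      if (ref.upper(), alt.upper()) in TRANSITIONS)
    return transitions, len(snps) - transitions


def het_hom_ratio(genotypes: Iterable[str]) -> Optional[float]:
    """Ratio of heterozygous to homozygous variant genotypes from GT strings.

    Homozygous-reference and missing genotypes are skipped; returns None
    when there are no homozygous variant genotypes.
    """
    het = hom = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or all(a == "0" for a in alleles):
            continue  # missing call or homozygous reference: not a variant
        if len(set(alleles)) > 1:
            het += 1
        else:
            hom += 1
    return het / hom if hom else None


ti, tv = count_snp_classes([("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")])
ratio = het_hom_ratio(["0/1", "1/1", "0|1", "0/0", "./."])
```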
[0099] Additionally, the call-model-generated sequencing metrics can include mapping-and-alignment sequencing metrics determined via the mapping-and-alignment components of the call-generation model 310. For instance, the dual-variant-type call recalibration system 106 determines or generates (e.g., via metric re-engineering) mapping-and-alignment metrics including one or more of: (i) a number of total input reads, (ii) a number of duplicate marked reads, (iii) a number of duplicate marked and mate reads removed, (iv) a number of unique reads, (v) a number of reads with mate sequenced, (vi) a number of reads without mate sequenced, (vii) indications of reads that fail quality checks, (viii) indications of mapped reads, (ix) a number of unique and mapped reads, (x) a number of unmapped reads, (xi) a number of singleton reads (e.g., where the read is mapped but the paired mate could not be read), (xii) a number of paired reads, (xiii) a number of properly paired reads (e.g., where both reads in a pair are mapped and fall within an acceptable range from each other based on an estimated insert length distribution), (xiv) a number of discordant reads (e.g., not properly paired reads), (xv) a number of paired reads mapped to different chromosomes, (xvi) a number of paired reads mapped to different chromosomes that also have a mapping-quality metric of 10 or greater, (xvii) percentages of reads within indels R1 and R2, (xviii) percentages of bases in R1 and R2 that are soft clipped, (xix) a number of mismatched bases in R1 and R2, (xx) a number of bases with a base quality of at least 30 (e.g., total and/or in R1 or R2), (xxi) a number of alignments (e.g., total alignments, secondary alignments, and/or supplementary alignments), (xxii) an estimated read length, and (xxiii) an estimated sample contamination.
[0100] Turning now to FIG. 3C, in some cases, the dual-variant-type call recalibration system 106 generates, extracts, or determines externally sourced sequencing metrics 316. In particular, the dual-variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 from one or more databases external to the dual-variant-type call recalibration system 106, such as a sequencing information database 314 (e.g., the database 116). For example, the dual-variant-type call recalibration system 106 accesses sequencing metrics that are generic or applicable to sequencing nucleotides generally. In addition, the dual-variant-type call recalibration system 106 accesses or determines sequencing information about a particular reference sequence (e.g., stored within the sequencing information database 314).
[0101] In some cases, the dual-variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 including: (i) a mappability metric indicating an ease or difficulty of mapping a particular nucleotide sequence (or a particular nucleotide read or nucleobase call), (ii) a guanine-cytosine-content metric indicating a count (or a dropout or a mean) of guanine-cytosine content in a reference nucleotide sequence (e.g., reference genome), (iii) a replication-timing metric indicating a time required to replicate a particular number of nucleotides from a reference sequence, (iv) one or more DNA-structure metrics indicating DNA structures of a reference sequence (e.g., reference genome), (v) a conservation metric indicating a measure of sequence conservation across multiple species (e.g., a measure of change relative to an average), (vi) a confidence classification indicating a degree to which nucleobases at the one or more genomic coordinates can be accurately determined, (vii) a repeat classification indicating a category of repetitive genomic region for the one or more genomic coordinates, (viii) a cytosine quadruplex indicator indicating that one or more genomic coordinates are part of a cytosine quadruplex, (ix) a guanine quadruplex indicator indicating that one or more genomic coordinates are part of a guanine quadruplex, (x) a homopolymer indicator indicating that one or more genomic coordinates are part of a homopolymer within a reference genome, and/or others.
[0102] In some embodiments, the dual-variant-type call recalibration system 106 determines the externally sourced sequencing metrics 316 by analyzing one or more genomic regions of a reference genome corresponding to (or aligning with) the one or more genomic coordinates for an initial genotype call. Many challenging variant calls occur in low complexity genomic regions of the reference genome. In some cases, these genomic regions are characterized by some combination of multiple instances of long repeat sequences (e.g., more than 50 base pairs), a very high number (e.g., more than 10) of shorter repeat sequences (e.g., 4-8 repeated bases), and, on occasion, a composition limited to a subset of the bases (e.g., As and Ts but no Cs or Gs). The nucleotide reads that are aligned correctly to such low complexity genomic regions often have portions or fragments that map to a more unique sequence flanking a repeat-heavy region. Alternatively, a reference genome or genomic sample may include some intermediate breaks (e.g., single bases in between the primary repeat pattern that break the repetitiveness) that help with alignment of nucleotide reads with a low complexity genomic region of a reference genome. However, when such regions are combined with SNPs, indels, and sequencing errors, aligning reads and collecting reads with sufficient evidence to compare reference versus alternate allele support becomes problematic. Thus, in some embodiments, the dual-variant-type call recalibration system 106 monitors externally sourced sequencing metrics 316 (associated with complexity), which can be augmented with read-based sequencing metrics to provide an overall assessment of the likelihood of the presence of a variant (for both Bayesian and machine-learning approaches).
[0103] For example, the dual-variant-type call recalibration system 106 accesses or determines sequencing information about a particular reference genome (e.g., stored within the sequencing information database 314). In some cases, the dual-variant-type call recalibration system 106 determines externally sourced sequencing metrics 316 including a tandem repeat length in nucleobases of a target genomic region within a reference genome corresponding to a candidate region of a genomic sample. Specifically, the dual-variant-type call recalibration system 106 analyzes portions of a reference genome that correspond to variant regions of a genomic sample to identify tandem repeats (e.g., sequences of two or more bases that are repeated numerous times in a head-to-tail manner) and to further determine lengths (e.g., numbers of base pairs) within the tandem repeats.
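The tandem repeat search described above can be sketched as a brute-force scan. This is an illustration only: the function name `longest_tandem_repeat`, the unit-length bounds, and the minimum copy count are assumptions and are not taken from the described system.

```python
def longest_tandem_repeat(seq: str, min_unit: int = 2, max_unit: int = 6,
                          min_copies: int = 2):
    """Return (total_length, start, unit, copies) for the longest
    head-to-tail repeat in `seq`, or None if no unit repeats.

    Unit-length bounds and min_copies are illustrative assumptions.
    """
    seq = seq.upper()
    best = None
    for unit_len in range(min_unit, max_unit + 1):
        for start in range(len(seq) - unit_len + 1):
            unit = seq[start:start + unit_len]
            copies = 1
            pos = start + unit_len
            # Count consecutive copies of the unit immediately downstream.
            while seq[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            total = copies * unit_len
            if copies >= min_copies and (best is None or total > best[0]):
                best = (total, start, unit, copies)
    return best
```

For the window `TTACACACACGG`, the sketch reports the unit `AC` repeated four times starting at offset 2, for a tandem repeat length of 8 bases.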
[0104] In certain embodiments, the dual-variant-type call recalibration system 106 determines an externally sourced sequencing metric in the form of a repetitiveness metric or homopolymer metric. Indeed, one indicator of a likelihood of a mis-mapping that needs to be corrected (e.g., a mis-mapping that results in a false positive) is based on repetitiveness of bases within a reference sequence. Thus, the dual-variant-type call recalibration system 106 can utilize various sequencing metrics to measure this repetitiveness, including: (i) a maximum repeat pattern length that indicates the maximum length of a sequence of bases that is repeated at least two times over the span of the (reference genome corresponding to the) candidate region, (ii) a maximum repeat length percentage that indicates the percentage of the (portion of the reference genome corresponding to the) region that is consumed or occupied by the maximum repeat pattern length, and (iii) a maximum homopolymer length that indicates the length of the longest sequence of the same base in the (portion of the reference genome corresponding to the) candidate region.
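The three repetitiveness metrics listed above can be illustrated with a short sketch. The function names are hypothetical. Two interpretive assumptions are made: occurrences of the repeated pattern may overlap, and the percentage is taken as the pattern length divided by the region length.

```python
from itertools import groupby


def max_homopolymer_length(seq: str) -> int:
    """Length of the longest run of a single base."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())),
               default=0)


def max_repeat_pattern_length(seq: str) -> int:
    """Length of the longest substring that occurs at least twice.

    Assumption: the two occurrences are allowed to overlap.
    """
    seq = seq.upper()
    for length in range(len(seq) - 1, 0, -1):
        seen = set()
        for i in range(len(seq) - length + 1):
            kmer = seq[i:i + length]
            if kmer in seen:
                return length
            seen.add(kmer)
    return 0


def max_repeat_length_percentage(seq: str) -> float:
    """Assumed definition: 100 * (max repeat pattern length) / (region length)."""
    if not seq:
        return 0.0
    return 100.0 * max_repeat_pattern_length(seq) / len(seq)
```

For the region `ACGTACGT`, the longest repeated pattern is `ACGT` (length 4), which under the assumed definition occupies 50 percent of the region.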
[0105] In addition or in the alternative to a repetitiveness metric, in some cases, the dual-variant-type call recalibration system 106 determines an externally sourced sequencing metric in the form of a permutation entropy of nucleobases. For example, the dual-variant-type call recalibration system 106 determines a measure of randomness of nucleotide sequences, which can be predictive of mapping/alignment accuracy. In some cases, the dual-variant-type call recalibration system 106 determines a permutation entropy by determining an entropy over permutations of a nucleotide sequence of a given length. For instance, the dual-variant-type call recalibration system 106 can determine permutation entropy according to the following formula:
$$S_1 \in \{A, C, G, T\}$$
$$S_2 \in \{AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT\}$$
$$S_3 \in \{AAA, AAC, AAG, AAT, ACT, \ldots, TTA, TTC, TTG, TTT\}$$
$$S_4 \in \{AAAA, AAAC, AAAG, AAAT, AACA, \ldots, TTGT, TTTA, TTTC, TTTG, TTTT\}$$
where $S_N$ is a set of all permutations of length $N$ base sequences, and where:
$$|S_N| = 4^N$$
such that the probability of permutation element $s_{N,k}$ occurring from set $S_N$ is given by:
$$p_{N,k} = \frac{c_k}{M - N + 1}$$
where $c_k$ is the number of occurrences of permutation element $s_{N,k}$ in a sequence of length $M$. In some cases, the dual-variant-type call recalibration system 106 normalizes the permutation entropy as:
[equation not recoverable from the source text]
where $K \subseteq \{0, \ldots, 4^N - 1\}$ is the set of indices such that $p_{N,k} > 0$.
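A minimal sketch of this computation follows, using the probability definition above (occurrence count divided by M - N + 1 windows) and summing only over permutation elements with nonzero probability. The normalization by the logarithm of 4^N (the maximum possible entropy over the set) is an assumption, since the normalized expression is not legible in this text, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_permutation_entropy(seq: str, n: int) -> float:
    """Shannon entropy over length-n substrings of `seq`, scaled to [0, 1].

    ASSUMPTION: the entropy is normalized by log(4**n); the source's own
    normalization is not recoverable. Ambiguous bases (e.g., N) are
    counted like any other symbol in this sketch.
    """
    seq = seq.upper()
    windows = len(seq) - n + 1  # M - N + 1
    if n < 1 or windows < 1:
        raise ValueError("need 1 <= n <= len(seq)")
    counts = Counter(seq[i:i + n] for i in range(windows))
    # p_{N,k} = c_k / (M - N + 1); Counter holds only elements with c_k > 0.
    probabilities = [c / windows for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return entropy / math.log(4 ** n)
```

Under this assumed normalization, a homopolymer such as `AAAAAAAA` yields an entropy of zero, while `ACGT` evaluated with N = 1 yields the maximum value of one.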
[0106] As mentioned above, the dual-variant-type call recalibration system 106 can further determine an externally sourced sequencing metric in the form of identifying a presence or absence of a cytosine quadruplex (C-quadruplex) or a guanine quadruplex (G-quadruplex) in a target genomic region. To elaborate, the dual-variant-type call recalibration system 106 determines counts of cytosine calls and guanine calls within a target genomic region of a reference genome corresponding to a variant region of a genomic sample or genomic region under consideration for an initial variant call. To identify a cytosine quadruplex, the dual-variant-type call recalibration system 106 identifies occurrences (within the target genomic region) of four or more instantiations of three consecutive cytosine bases separated by one or more different nucleobases (e.g., a pattern of CCC A CCC A CCC A CCC). Similarly, to identify a guanine quadruplex, the dual-variant-type call recalibration system 106 identifies occurrences (within the target genomic region) of four or more instantiations of three consecutive guanine bases separated by one or more different nucleobases (e.g., a pattern of GGG T GGG T GGG T GGG).
[0107] In one or more embodiments, the dual-variant-type call recalibration system 106 identifies a C-quadruplex or a G-quadruplex where up to a threshold number of nucleobases (e.g., up to 7 nucleobases) occur between instantiations of triple Cs or triple Gs. For instance, the dual-variant-type call recalibration system 106 identifies GGGTACC GGGTGTACA GGG AAGTCT GGG as a G-quadruplex. In some cases, G-quadruplexes (and C-quadruplexes) are known to cause issues with sequencing. Accordingly, the dual-variant-type call recalibration system 106 uses the presence of such sequences to adjust the confidence in the mapping and alignment of reads and the accuracy of subsequent contiguous sequence construction.
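The quadruplex pattern described in the two preceding paragraphs can be sketched with a regular expression: four runs of three identical bases, with one to `max_gap` intervening bases between consecutive runs. The function name `has_quadruplex` is hypothetical, and allowing the intervening bases to include the run base itself is an assumption drawn from the example sequence, whose spacers contain G.

```python
import re


def has_quadruplex(sequence: str, base: str, max_gap: int = 7) -> bool:
    """Return True if `sequence` contains four triplets of `base`
    separated by 1..max_gap intervening bases.

    Assumption: spacer bases may be any of A, C, G, T.
    """
    if base not in ("C", "G"):
        raise ValueError("base must be 'C' or 'G'")
    seq = "".join(sequence.split()).upper()  # drop display spaces
    run = base * 3
    pattern = run + f"(?:[ACGT]{{1,{max_gap}}}{run}){{3}}"
    return re.search(pattern, seq) is not None
```

Applied to the example above, `has_quadruplex("GGGTACC GGGTGTACA GGG AAGTCT GGG", "G")` returns `True`, whereas a sequence with only three G triplets does not match.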
[0108] In certain embodiments, the dual-variant-type call recalibration system 106 determines a data compression metric as part of the externally sourced sequencing metrics 316. In particular, the dual-variant-type call recalibration system 106 determines a data compression metric that quantifies a measure of randomness of a sequence using one or more data compression algorithms. One such data compression algorithm for lossless compression is the Lempel-Ziv-Welch algorithm. Using this algorithm, the dual-variant-type call recalibration system 106 builds a dictionary of unique k-mers starting with a length of one and comes up with an encoding for each entry in the dictionary. The dual-variant-type call recalibration system 106 can utilize the number of keys in the dictionary for the structural variant and the flanking regions in the reference genome as a sequencing metric.
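A minimal sketch of the dictionary-size metric follows, using the standard Lempel-Ziv-Welch dictionary-building step. The function name `lzw_dictionary_size` is hypothetical, and seeding the dictionary with the four bases is an assumption consistent with starting from k-mers of length one.

```python
def lzw_dictionary_size(seq: str, alphabet: str = "ACGT") -> int:
    """Number of dictionary keys after an LZW pass over `seq`.

    The dictionary starts with the single-base entries (an assumption)
    and grows by one entry each time an unseen k-mer is encountered.
    """
    dictionary = {base: code for code, base in enumerate(alphabet)}
    current = ""
    for symbol in seq.upper():
        if symbol not in dictionary:
            # Symbols outside the alphabet (e.g., N) get their own entry.
            dictionary[symbol] = len(dictionary)
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate  # keep extending the known k-mer
        else:
            dictionary[candidate] = len(dictionary)
            current = symbol
    return len(dictionary)
```

A repetitive sequence yields fewer keys than a more varied sequence of the same length: `AAAAAAAA` produces 7 keys, while `ACGTACGT` produces 9.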
[0109] In addition or in the alternative to the externally sourced sequencing metrics 316 noted above, in some embodiments, the dual-variant-type call recalibration system 106 determines a structural variant sequence alignment metric as part of the externally sourced sequencing metrics 316. For instance, the dual-variant-type call recalibration system 106 uses gapless alignment scoring and Smith-Waterman alignment scoring of a proposed deletion sequence against the left/right flanking genomic regions in the reference. If there are multiple alignments that score above a threshold gapless alignment score and/or a threshold Smith-Waterman alignment score, the variant-call-integration machine-learning model may process a variant sequence alignment metric as an indicator that there is a higher likelihood of an imprecise variant call.
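The two scoring schemes named above can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap -2), the function names, and the threshold-counting helper are assumptions for illustration and are not parameters stated in the source.

```python
def smith_waterman_score(query: str, target: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (linear gap penalty, assumed values)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gapless_scores(query: str, target: str, match: int = 1,
                   mismatch: int = -1) -> list:
    """Score of `query` at every full-length ungapped placement in `target`."""
    n = len(query)
    return [
        sum(match if a == b else mismatch
            for a, b in zip(query, target[i:i + n]))
        for i in range(len(target) - n + 1)
    ]


def count_high_scoring_placements(deleted_seq: str, flank: str,
                                  threshold: int) -> int:
    """Hypothetical metric: number of gapless placements at or above threshold."""
    return sum(score >= threshold
               for score in gapless_scores(deleted_seq, flank))
```

In this sketch, a count greater than one indicates that the proposed deletion sequence fits several positions in the flank equally well, which corresponds to the ambiguous case described above.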
[0110] Further, the dual-variant-type call recalibration system 106 can also determine a simulated read alignment metric as an externally sourced sequencing metric. Assuming that the contiguous sequence representing or including a variant is accurate, there should theoretically be many nucleotide reads with good alignment to the contiguous sequence, even for heterozygous deletions. However, for low evidence true-positive cases of variants, there is a likelihood of missing reads because the reads corresponding to the structural variant (SV) region were either mapped elsewhere or unmapped. The dual-variant-type call recalibration system 106 can thus determine a likelihood of missing reads by simulating reads.
[0111] Specifically, the dual-variant-type call recalibration system 106 chooses segments from the contiguous sequence equal in length to the SBS reads. The dual-variant-type call recalibration system 106 chooses segments of the contiguous sequence that cross the breakend(s), that are equivalent to SBS read length, and that are aligned to the reference sequence in the SV region. For cases where alignment is ambiguous, alternate alignment scores will be higher and can serve as a possible guide for expected read depth. The dual-variant-type call recalibration system 106 can further use the segment of the contiguous sequence equivalent to read length that is symmetric about the breakend to obtain the highest alignment scores. The dual-variant-type call recalibration system 106 can further determine additional offsets from this symmetric point to check alternate alignment scores for a range of overlaps.
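The segment selection described above can be sketched as follows; the subsequent alignment of each segment to the reference is not shown. The function name, the convention that `breakend` is the 0-based index of the first base to the right of the junction, and the offset values are all assumptions.

```python
def simulated_read_segments(contig: str, breakend: int, read_length: int,
                            offsets=(0,)) -> dict:
    """Map each offset to a read-length segment of `contig` that crosses
    the breakend. Offset 0 is the segment symmetric about the breakend.

    Assumption: `breakend` indexes the first base right of the junction.
    """
    segments = {}
    for offset in offsets:
        start = breakend - read_length // 2 + offset
        end = start + read_length
        if start < 0 or end > len(contig):
            continue  # segment would run off the contiguous sequence
        if not (start < breakend < end):
            continue  # segment no longer spans the junction
        segments[offset] = contig[start:end]
    return segments
```

For a contiguous sequence `AAAAACCCCC` with the junction between the As and Cs, a read length of 4 and offsets of -1, 0, and 1 produce the segments `AAAC`, `AACC`, and `ACCC`, each of which could then be aligned to check alternate alignment scores.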
[0112] In one or more embodiments, the dual-variant-type call recalibration system 106 determines, receives, or extracts additional or alternative sequencing metrics, including read-based sequencing metrics, call-model-generated sequencing metrics, and/or externally sourced sequencing metrics. For example, the dual-variant-type call recalibration system 106 determines, extracts, or receives the sequencing metrics in the following table, where each of the metrics belongs to one or more of the read-based sequencing metrics, call-model-generated sequencing metrics, and/or externally sourced sequencing metrics.
[0113] As mentioned above, in one or more embodiments, the dual-variant-type call recalibration system 106 generates genotype probabilities for variants in genomic regions corresponding to germline and somatic mosaic variants. In particular, the dual-variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model to generate genotype probabilities corresponding to various genomic coordinates. In addition, the call recalibration system 106 updates or modifies a genotype call by generating an updated genotype-call data file, such as a variant call file (e.g., a recalibrated variant call file), based on the genotype probabilities and/or the variant-call classifications. Moreover, the dual-variant-type call recalibration system 106 determines output genotype calls and generates or modifies a genotype-call file, such as a variant call file, with various information corresponding to the output genotype calls.
[0114] In accordance with one or more embodiments, FIGS. 4A-4B illustrate the dual-variant-type call recalibration system 106 generating genotype probabilities and determining output genotype calls. In certain described embodiments, the dual-variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model together with a call-generation model to generate genotype calls in genomic regions corresponding to germline and/or somatic mosaic variants. In particular, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model to modify data fields corresponding to a variant call file representing one or more genotype calls. FIG. 4A illustrates generating variant calls by modifying a variant call file utilizing a variant-call-recalibration machine-learning model and a call-generation model in accordance with one or more embodiments.
[0115] As illustrated in FIG. 4A, the dual-variant-type call recalibration system 106 accesses a sequencing information database 402 (e.g., the sequencing information database 314), a reference sequence 403, and sequence data 404 (e.g., the sequence data 308) extrapolated from one or more nucleotide reads (e.g., the nucleotide reads 302). Indeed, the dual-variant-type call recalibration system 106 performs sequencing-metric extraction 410 to extract or re-engineer sequencing metrics as described above in relation to FIGS. 3A-3C. For example, the dual-variant-type call recalibration system 106 generates read-based sequencing metrics, externally sourced sequencing metrics, and call-model-generated sequencing metrics. In some cases, the dual-variant-type call recalibration system 106 utilizes mapping-and-alignment components 406 of a call-generation model 420 (e.g., the call-generation model 310) to determine mapping-and-alignment sequencing metrics as described above. In addition, the dual-variant-type call recalibration system 106 utilizes variant-caller components 408 of the call-generation model 420 to generate variant calling metrics as described above. Further, the dual-variant-type call recalibration system 106 determines read-based sequencing metrics and externally sourced sequencing metrics as well (e.g., from sequencing information database 402 and/or the reference sequence 403).
[0116] As further illustrated in FIG. 4A, the dual-variant-type call recalibration system 106 generates genotype probabilities 414. More specifically, the dual-variant-type call recalibration system 106 utilizes a variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 from the sequencing metrics extracted via the sequencing-metric extraction 410. For example, the variant-call-recalibration machine-learning model 412 generates genotype probabilities 414 for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. While not shown in FIG. 4A, in some embodiments, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 412 (or a different machine-learning model) to generate variant-call classifications in place of the genotype probabilities 414 (e.g., when identifying indels and/or variants at multiallelic genomic coordinates).
[0117] As mentioned, in some embodiments, the dual-variant-type call recalibration system 106 generates genotype probabilities for genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. For example, as shown in FIG. 4A, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 for a candidate genomic coordinate (e.g., “chr5:4”), including: (i) a first genotype probability that the genomic sample includes a homozygous reference genotype (e.g., “L(0/0)@chr5:4”) at the candidate genomic coordinate, (ii) a second genotype probability that the candidate genomic coordinate includes a heterozygous variant genotype (e.g., “L(0/1)@chr5:4”), and (iii) a third genotype probability that the candidate genomic coordinate includes a homozygous variant genotype (e.g., “L(1/1)@chr5:4”). Specifically, in the example illustrated, the first genotype probability indicates a likelihood of 0.10 that the genotype call is a homozygous reference genotype, the second genotype probability indicates a likelihood of 0.76 that the genotype call is a heterozygous variant genotype, and the third genotype probability indicates a likelihood of 0.14 that the genotype call is a homozygous variant genotype.
[0118] As indicated above, genotype probabilities output by the variant-call-recalibration machine-learning model 412 (e.g., the genotype probabilities 414) account for both candidate germline variants and candidate somatic mosaic variants. For example, in some cases and for certain genomic coordinates of a genomic sample, the variant indicated by a “1” in a candidate genotype (e.g., “0/1” or “1/1”) for a genotype probability corresponds to a germline variant. In contrast, in some cases and depending on the genomic sample and the candidate genomic coordinate, the variant indicated by a “1” in a candidate genotype (e.g., “0/1” or “1/1”) for a genotype probability corresponds to a somatic mosaic variant. Indeed, across different genomic coordinates, both types of variants (i.e., germline variants and somatic mosaic variants) are accounted for by the genotype probabilities 414 output by the variant-call-recalibration machine-learning model 412.
[0119] While FIG. 4A shows a particular formatting of the genotype probabilities 414, alternative embodiments can include additional or alternative formatting, such as genotype probabilities for candidate genotypes in which the format specifies a candidate germline variant at a particular position(s) and a candidate somatic mosaic variant at another position. For instance, the output genotype probabilities may take the form of probabilities for candidate genotypes represented by three positions (e.g., “L(0/1/0)” or “L(0/0/1)”) in which (i) the initial two positions of binary code separated by a slash represent a candidate germline genotype corresponding to either a reference base (i.e., designated as “0” in either the first or second position) or a germline variant (i.e., designated as “1” in the first or second position) and (ii) the last position of binary code following the second slash represents a presence or absence of a candidate somatic mosaic variant (i.e., no somatic mosaic variant designated as “0” and a somatic mosaic variant designated as “1” in the third position).
[0120] As indicated above, in some embodiments, the dual-variant-type call recalibration system 106 generates a genotype call for a haploid genomic coordinate. In some such embodiments, the variant-call-recalibration machine-learning model 412 generates the genotype probabilities 414 for a haploid genomic coordinate as follows: (i) a first genotype probability of a first genotype at the genomic coordinate and (ii) a second genotype probability of a second genotype at the genomic coordinate. As mentioned previously, the first genotype probability can be a probability that a genotype at a genomic coordinate is a haploid reference genotype, and the second genotype probability can be a probability that a genotype at the genomic coordinate is a haploid alternate genotype. In these or other embodiments, such as embodiments for generating genotype calls for genomic coordinates indicated to exhibit homozygous reference genotypes, the variant-call-recalibration machine-learning model 412 can also generate variant-call classifications including: (i) a false-positive probability or a homozygous reference classification indicating a probability that a genotype call is a false positive or a homozygous reference genotype, respectively; (ii) a zygosity-error probability or a heterozygous genotype classification indicating a probability that a genotype (e.g., an indication of a heterozygous or homozygous genotype for a variant call at a particular location) is incorrect or a heterozygous genotype, respectively; and/or (iii) a true-positive classification or a homozygous alternate classification indicating a probability that a genotype call is a true positive or a homozygous alternate genotype, respectively. In some cases, the generated variant-call classifications accordingly represent intermediate scoring metrics and/or a predicted probability that a genotype for a genotype call is accurate.
[0121] In some cases, the variant-call-recalibration machine-learning model 412 is an ensemble of gradient boosted trees that processes the sequencing metrics to generate the genotype probabilities 414. For instance, the variant-call-recalibration machine-learning model 412 can include a series of weak learners such as non-linear decision trees that are trained in a logistic regression to generate the genotype probabilities 414. In some cases, the variant-call-recalibration machine-learning model 412 includes metrics within various trees that define how the variant-call-recalibration machine-learning model 412 processes the sequencing metrics to generate the genotype probabilities 414. Additional detail regarding the training of the variant-call-recalibration machine-learning model 412 is provided below with reference to FIGS. 5A-5B and 6.
[0122] In certain embodiments, the variant-call-recalibration machine-learning model 412 is a different type of machine learning model such as a neural network, a support vector machine, or a random forest. For example, in cases where the variant-call-recalibration machine-learning model 412 is a neural network, the variant-call-recalibration machine-learning model 412 includes one or more layers each with neurons that make up the layer for processing the sequencing metrics. In some cases, the variant-call-recalibration machine-learning model 412 generates the genotype probabilities 414 by extracting latent vectors from the sequencing metrics, passing the latent vectors from layer to layer (or neuron to neuron) to manipulate the vectors until utilizing an output layer (e.g., one or more fully connected layers) to generate the genotype probabilities.
[0123] As an example of generating the genotype probabilities 414, in some embodiments, the dual-variant-type call recalibration system 106 utilizes statistics to summarize a mapping quality distribution of reference supporting reads and alternative supporting reads (e.g., for a comparative-mapping-quality-distribution metric). The dual-variant-type call recalibration system 106 can determine and utilize the mean of the MAPQ for reads supporting an alternative allele from SBS reads and from assembled nucleotide reads. In these or other embodiments, the variant-call-recalibration machine-learning model 412 learns from the data that, when the MAPQ of an alternative allele (indicated by SBS reads or assembled nucleotide reads) is low and a depth metric is high relative to other MAPQ and depth metrics in distributions, a resultant genotype call is more likely to be a false positive. Indeed, as the probability of a false positive increases, the MAPQ metrics would likely decrease.
[0124] As a further example, in some cases, the dual-variant-type call recalibration system 106 compares a mapping quality (e.g., MAPQ) associated with an SBS read and/or an assembled nucleotide read with a mapping-quality threshold. For instance, the dual-variant-type call recalibration system 106 utilizes a mapping-quality threshold such as a threshold difference between best and second-best alignment scores. Upon determining that one or more of the mapping qualities for the different read types do not satisfy the threshold, the dual-variant-type call recalibration system 106 adjusts one or more of the genotype probabilities 414 accordingly (e.g., to select a read with a higher MAPQ).
[0125] In addition (or in the alternative), the dual-variant-type call recalibration system 106 can determine the genotype probabilities 414 by utilizing an accumulation of statistical analyses over complex functions (depending on the architecture of the variant-call-recalibration machine-learning model 412) to determine how to best fit the data. For example, as described above, the dual-variant-type call recalibration system 106 trains the variant-call-recalibration machine-learning model 412 to minimize a loss generated from a number of (different types of) sequencing metrics to determine weights and biases that best fit the data (e.g., that result in a reduced or minimized loss).
[0126] As further illustrated in FIG. 4A, in addition to generating the genotype probabilities 414, the dual-variant-type call recalibration system 106 performs data field generation 416. More specifically, the dual-variant-type call recalibration system 106 generates or modifies data fields for a variant call file 418. To generate (or modify) the variant call file 418, the dual-variant-type call recalibration system 106 utilizes the variant-caller components 408 of the call-generation model 420 and modifies or maintains values for such data fields based on the genotype probabilities 414 generated by the variant-call-recalibration machine-learning model 412.
[0127] For instance, the dual-variant-type call recalibration system 106 modifies various metrics such as quality metrics, mapping metrics, or other metrics associated with the genotype call. As mentioned, in some cases, the dual-variant-type call recalibration system 106 selects metrics associated with nucleotide reads and/or associated with the genotype probabilities 414. In other cases, the dual-variant-type call recalibration system 106 generates new metrics from the data generated by the call-generation model 420 and/or the variant-call-recalibration machine-learning model 412. In certain embodiments, the genotype call is represented or defined by the variant call file 418 which includes metrics corresponding to the data fields, such as a call-quality metric corresponding to a call-quality field, a genotype metric corresponding to a genotype field, and a genotype-quality metric corresponding to a genotype-quality field.
[0128] In some embodiments, the dual-variant-type call recalibration system 106 indicates variant calls within the variant call file 418 (or other sequencing data file) without an indication of a germline variant or a somatic mosaic variant. Alternatively, in some embodiments, the dual-variant-type call recalibration system 106, having predicted or otherwise determined that an identified variant call corresponds to a somatic mosaic variant, includes an indication within the variant call file 418 that the identified variant is a somatic mosaic variant. Such an indicator may take the form of an acronym (e.g., “GV” for germline variant and “SMV” for somatic mosaic variant), a color (e.g., green for germline variant and red for somatic mosaic variant), a code (e.g., “7” for germline variant and “9” for somatic mosaic variant), or any combination or other suitable indicator.
[0129] By contrast, in some embodiments, the dual-variant-type call recalibration system 106 generates and reports variant calls in a VCF or other sequencing data file (e.g., a first variant call at a first genomic coordinate and a second variant call at a second genomic coordinate different than the first genomic coordinate) but does not include a specific indication that a particular variant call is either a germline variant or a somatic mosaic variant. For instance, the VCF or other sequencing data file includes no acronym, color, or code indicating that a particular variant call is a germline variant or a somatic mosaic variant.
[0130] In certain embodiments, the dual-variant-type call recalibration system 106 generates (data fields for) a genotype call utilizing the variant-caller components 408 together with the genotype probabilities 414. For instance, the dual-variant-type call recalibration system 106 generates, for inclusion within the variant call file 418, data fields for various metrics of a genotype call such as nucleotide(s) included in the call, a call quality (QUAL), a genotype (GT), a genotype quality (GQ), one or more normalized PHRED-scale likelihoods (PL), a genotype probability (GP), an allele frequency (AF), allele count (AC), and/or total number of alleles (AN). In some embodiments, for example, the allele frequency (AF) can indicate whether a variant call corresponds to a germline variant (e.g., with an AF of approximately 0.5 or 1.0) or a somatic mosaic variant (e.g., with a relatively low AF, such as an AF less than 0.5). In some embodiments, for instance, the dual-variant-type call recalibration system 106 can require a threshold AF before including a particular call within the variant call file 418, wherein the threshold AF is sufficiently high to allow for identification of somatic mosaic variants of relatively low AF (e.g., a threshold AF of 0.05, 0.1, or 0.15).
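The relationship between allele frequency and variant type described above can be illustrated with a simple rule-based sketch. This is not the machine-learning approach of the described system: the function name, the `tolerance` band around 0.5 and 1.0, and the "undetermined" label are assumptions. The "GV" and "SMV" labels and the example threshold of 0.05 follow the examples given in the text.

```python
def label_by_allele_frequency(af: float, min_af: float = 0.05,
                              tolerance: float = 0.1):
    """Illustrative AF-based label; thresholds are assumptions.

    Returns None when the call falls below the reporting threshold.
    """
    if not 0.0 <= af <= 1.0:
        raise ValueError("allele frequency must be within [0, 1]")
    if af < min_af:
        return None  # below the threshold AF; call is not reported
    if abs(af - 0.5) <= tolerance or abs(af - 1.0) <= tolerance:
        return "GV"  # consistent with a heterozygous or homozygous germline variant
    if af < 0.5:
        return "SMV"  # relatively low AF, consistent with a somatic mosaic variant
    return "undetermined"
```

Under these assumed thresholds, an AF of 0.12 is labeled "SMV", an AF of 0.48 is labeled "GV", and an AF of 0.02 is excluded.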
[0131] In one or more embodiments, the dual-variant-type call recalibration system 106 recalibrates or modifies a genotype call (or generates a new genotype call) using the genotype probabilities 414 from the variant-call-recalibration machine-learning model 412. As described, the dual-variant-type call recalibration system 106 modifies the genotype call by modifying or recalibrating data fields for one or more of the metrics associated with the genotype call (e.g., as included within the variant call file 418).
[0132] To update or recalibrate the call-quality metric (QUAL) associated with a genotype call, for instance, the dual-variant-type call recalibration system 106 determines how each of the genotype probabilities 414 impacts or affects the base-call-quality metric. For example, the dual-variant-type call recalibration system 106 determines that a high probability for a genotype error results in a lower overall genotype quality and possibly a different overall call quality. As another example, the dual-variant-type call recalibration system 106 determines that a high probability for a false positive variant results in a lower overall call quality. As yet another example, the dual-variant-type call recalibration system 106 determines that a high probability for a true positive variant results in a higher overall (variant) call quality. The dual-variant-type call recalibration system 106 accordingly updates the genotype along with the genotype quality and the call quality associated with the genotype call.
[0133] In one or more implementations, the dual-variant-type call recalibration system 106 generates a combination (e.g., a weighted combination or an average) of the genotype probabilities 414 to recalibrate the call-quality metric. In particular, the dual-variant-type call recalibration system 106 weights the various predictions of the genotype probabilities 414 according to their respective impact on (variant) call quality. In some cases, the dual-variant-type call recalibration system 106 weights each genotype probability equally, while in other cases the dual-variant-type call recalibration system 106 determines different weights for each. In any event, the dual-variant-type call recalibration system 106 determines a weighted combination or a weighted average of the genotype probabilities 414 to recalibrate (increase or decrease) a call-quality metric for a genotype call (e.g., an initial variant call).
[0134] To update or recalibrate the genotype metric (e.g., within the GT field of the variant call file 418) associated with a genotype call, the dual-variant-type call recalibration system 106 utilizes one or more of the genotype probabilities 414. For example, the dual-variant-type call recalibration system 106 compares the various constituent predictions of each to determine which of the genotype probabilities 414 has a highest probability. In some cases, the dual-variant-type call recalibration system 106 utilizes the genotype probability with the highest probability to recalibrate the genotype metric (e.g., from 0 as corresponding to the reference base to 1 as corresponding to a first alternative supporting read).
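The highest-probability selection described above can be sketched in a few lines. The function name and the dictionary representation of the genotype probabilities are assumptions; the example values are those given for the coordinate chr5:4 in FIG. 4A.

```python
def recalibrated_genotype(genotype_probabilities: dict) -> str:
    """Return the candidate genotype with the highest probability.

    Hypothetical representation: keys are GT strings such as "0/1".
    """
    if not genotype_probabilities:
        raise ValueError("at least one genotype probability is required")
    return max(genotype_probabilities, key=genotype_probabilities.get)


# Example values from the description of FIG. 4A.
probabilities = {"0/0": 0.10, "0/1": 0.76, "1/1": 0.14}
genotype = recalibrated_genotype(probabilities)
```

With the example probabilities, the heterozygous genotype "0/1" is selected for the GT field.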
[0135] To update or recalibrate the genotype-quality metric (e.g., within the GQ field of the variant call file 418) associated with a genotype call, the dual-variant-type call recalibration system 106 utilizes one or more of the genotype probabilities 414. More specifically, the dual-variant-type call recalibration system 106 determines how each of the genotype probabilities 414 affects the genotype-quality metric. The dual-variant-type call recalibration system 106 recalibrates the genotype-quality metric accordingly (e.g., by increasing or decreasing the quality score from 0 to 10 or 0 to 100 or on some other scale). For example, the dual-variant-type call recalibration system 106 determines that a higher genotype error probability (generally) indicates a lower genotype-quality metric, and the dual-variant-type call recalibration system 106 reduces the metric accordingly.
[0136] In some cases, the dual-variant-type call recalibration system 106 determines a combination (e.g., a weighted combination or a weighted average) of the genotype probabilities 414 to modify the genotype-quality metric. For example, the dual-variant-type call recalibration system 106 determines a combined effect that the genotype probabilities 414 have on the genotype-quality metric. As another example, the dual-variant-type call recalibration system 106 determines individual impacts that each constituent prediction of the genotype probabilities 414 has on the genotype-quality metric and weights each accordingly. The dual-variant-type call recalibration system 106 further recalibrates the genotype-quality metric by increasing or decreasing its value based on the indicated probabilities.
[0137] As described, the dual-variant-type call recalibration system 106 generates an output genotype call from the same set of sequencing metrics (or a subset of the sequencing metrics that are shared between the variant-call-recalibration machine-learning model 412 and the call-generation model 420). Indeed, the dual-variant-type call recalibration system 106 can operate the variant-call-recalibration machine-learning model 412 in parallel with the call-generation model 420 to generate metrics for an output genotype call and genotype probabilities 414 for recalibrating the generated metrics.
[0138] In one or more implementations, the dual-variant-type call recalibration system 106 updates or otherwise modifies the data fields for the variant call file 418 according to particular algorithms. After modifying such data fields, the dual-variant-type call recalibration system 106 can generate the variant call file 418 (e.g., a post-filter variant call file) to include metrics reflecting the updated data fields. For instance, in some cases, the dual-variant-type call recalibration system 106 updates the QUAL field for every variant based on the probability of a false positive variant. As indicated above, in some cases, QUAL indicates the probability that there is some kind of variant (or other nucleobase call) at a given location, measured on the Phred scale.
[0139] As suggested above, in some embodiments, the dual-variant-type call recalibration system 106 increases or decreases a base-call-quality metric (e.g., Q score) for a genotype call. Based on the genotype probabilities 414, for example, the dual-variant-type call recalibration system 106 increases base-call-quality metrics for genotype calls that would not have previously passed a quality filter and determines that the increased base-call-quality metrics now pass the quality filter. In some such cases, the dual-variant-type call recalibration system 106 includes genotype calls with such increased base-call-quality metrics (passing the quality filter) in a post-filter variant call file. By contrast, in other cases, the dual-variant-type call recalibration system 106 decreases base-call-quality metrics for genotype calls that previously would have passed a quality filter and determines that the decreased base-call-quality metrics now fail the quality filter. In some such cases, the dual-variant-type call recalibration system 106 excludes genotype calls with decreased base-call-quality metrics (failing the quality filter) from a post-filter variant call file but includes the genotype calls with such decreased base-call-quality metrics in a pre-filter variant call file.
[0140] For example, the dual-variant-type call recalibration system 106 can remove false positive variant calls and recover false negative variant calls by changing corresponding base-call-quality metrics. To remove a false positive, in some cases, the dual-variant-type call recalibration system 106 decreases the base-call-quality metric of a genotype call that initially passed a quality filter, based on the genotype probabilities 414 from the variant-call-recalibration machine-learning model 412. Based on determining the decreased base-call-quality metric falls below a threshold metric (e.g., a Q score of 3.0 or 10.0), the dual-variant-type call recalibration system 106 determines that the genotype call no longer passes the quality filter. The dual-variant-type call recalibration system 106 thus filters out, or removes, the false-positive genotype call that initially passed the filter by changing its base-call-quality metric.
[0141] In addition to removing false positive variant calls based on changes to base-call-quality metrics, the dual-variant-type call recalibration system 106 can remove false positive variant calls based on changes to genotype. To remove a false positive, in some cases, the dual-variant-type call recalibration system 106 changes a genotype of an initial genotype call indicating a different nucleobase than a reference base (e.g., GT = 1 or 2) to a genotype of an updated genotype call indicating a same nucleobase as the reference base (e.g., GT = 0). Based on the genotype being the same as the reference base, the dual-variant-type call recalibration system 106 does not identify the genotype call as a variant and, in some cases, excludes data for the genotype call from the variant call file 418. For instance, the dual-variant-type call recalibration system 106 can use a null-data indicator for a genotype call (or a particular field) of the variant call file 418. In some cases, the dual-variant-type call recalibration system 106 uses a null-data indicator in cases where a certain sequencing metric does not apply to a particular variant call or VCF field (e.g., where SBS-based calls use different metrics than assembled-nucleotide-read-based calls).
[0142] To recover a false negative, the dual-variant-type call recalibration system 106 increases the base-call-quality metric of a genotype call that initially failed a quality filter. Based on determining the increased base-call-quality metric exceeds a threshold metric, the dual-variant-type call recalibration system 106 determines that the genotype call passes the quality filter. The dual-variant-type call recalibration system 106 thus recovers a false-negative genotype call that was initially filtered out by changing its base-call-quality metric.
[0143] In addition to recovering false negative variant calls based on changes to base-call-quality metrics, the dual-variant-type call recalibration system 106 can recover false negative variant calls based on changes to genotype. To recover a false negative, in some cases, the dual-variant-type call recalibration system 106 changes a genotype of an initial genotype call indicating the same nucleobase as a reference base (e.g., GT = 0) to a different genotype of an updated genotype call indicating a different nucleobase than the reference base (e.g., GT = 1 or 2). Based on the differing genotype of the updated genotype call and a passing base-call-quality metric, the dual-variant-type call recalibration system 106 identifies the genotype call as a variant and includes the genotype call within the variant call file 418.
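The removal and recovery logic of the preceding paragraphs can be sketched as a comparison of a call's status before and after recalibration. This is a hedged illustration only: the `Call` class, the function names, the status labels, the default threshold of 20.0, and the choice to treat a quality equal to the threshold as passing are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Call:
    gt: int         # 0 = same as reference base; 1 or 2 = alternate allele
    quality: float  # quality metric compared against the filter threshold


def is_reported(call, threshold):
    """A call is reported only if it is a variant and passes the filter."""
    return call.gt != 0 and call.quality >= threshold


def filter_transition(initial, updated, threshold=20.0):
    """Describe how recalibration changed a call's reporting status."""
    before = is_reported(initial, threshold)
    after = is_reported(updated, threshold)
    if before and not after:
        return "removed"         # treated as a false positive
    if not before and after:
        return "recovered"       # treated as a false negative
    return "kept" if after else "still_excluded"
```

Under these assumptions, lowering a call's quality from 25 to 8 or changing its GT from 1 to 0 both yield "removed", while raising the quality from 12 to 30 or changing GT from 0 to 1 with a passing quality yields "recovered".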
[0144] Indeed, in some implementations, the dual-variant-type call recalibration system 106 operates in a specific sequential order utilizing the call-generation model 420 and the variant-call-recalibration machine-learning model 412. For example, the dual-variant-type call recalibration system 106 generates a FASTQ file by converting a BCL file to FASTQ. In addition, the dual-variant-type call recalibration system 106 (subsequently) utilizes the mapping-and-alignment components 406 of the call-generation model 420 to map and align nucleobases from a sample nucleotide sequence. In some cases, the dual-variant-type call recalibration system 106 maps and aligns the nucleobases of the sample sequence in relation to the reference sequence 403 (e.g., reference genome) and/or various alternative supporting reads.
[0145] After mapping and aligning, as described herein, the dual-variant-type call recalibration system 106 utilizes the variant-caller components 408 of the call-generation model 420 to generate an initial genotype call for the sample sequence corresponding to a particular genomic coordinate, based on various sequencing metrics. Afterward or at the same time, the dual-variant-type call recalibration system 106 also applies the variant-call-recalibration machine-learning model 412 to generate the genotype probabilities 414 from sequencing metrics extracted via the mapping and aligning, the variant calling, and/or from other sources as described above. Based on the genotype probabilities 414, the dual-variant-type call recalibration system 106 recalibrates the genotype call (e.g., by modifying various data fields corresponding to specific metrics of the nucleobase call, such as QUAL, GT, GQ, GP, AF, and/or PL), as described above.
[0146] In some cases, the dual-variant-type call recalibration system 106 further applies a quality filter to the genotype call to determine whether the genotype call passes the quality filter (e.g., a hard pass filter of Q20 or other Q score). The dual-variant-type call recalibration system 106 subsequently identifies a subset of genotype calls that represent variants from reference bases and pass the quality filter. The dual-variant-type call recalibration system 106 further generates a modified or updated variant call file (e.g., the variant call file 418) that includes the subset of genotype calls and recalibrated metrics for the subset of genotype calls, such as updated QUAL metrics, updated GT metrics, updated GQ metrics, updated GP metrics, and/or updated PL metrics.
[0147] As suggested above, in some embodiments, the dual-variant-type call recalibration system 106 can utilize multiple call-recalibration machine-learning sub-models together to generate genotype probabilities and/or genotype calls for nucleotide reads within regions corresponding to germline and somatic mosaic variants. For example, FIG. 4B illustrates the dual-variant-type call recalibration system 106 utilizing a machine-learning sub-model 424a (e.g., a germline-specific sub-model) of a variant-call-recalibration machine-learning model 422 to generate a first set of genotype probabilities 430a and a machine-learning sub-model 424b (i.e., a dual-variant-type sub-model or a somatic-mosaic-specific sub-model) of the variant-call-recalibration machine-learning model 422 (e.g., with the same or a different architecture) to generate a second set of genotype probabilities 430b.
[0148] As shown in FIG. 4B, for example, the dual-variant-type call recalibration system 106 utilizes two (or more) different call-recalibration machine-learning models in parallel, each trained with different truth datasets, resulting in different genotype probabilities from the same sequencing metrics. For instance, in some implementations, the dual-variant-type call recalibration system 106 trains the machine-learning sub-model 424a to identify germline variants within sample nucleotide sequences by training the machine-learning sub-model 424a with truth datasets comprising ground truth germline variants. Further, in some implementations, the dual-variant-type call recalibration system 106 trains the machine-learning sub-model 424b to identify variants corresponding to germline variants and somatic mosaic variants by training the machine-learning sub-model 424b with truth datasets comprising ground truth germline variants and ground truth somatic mosaic variants. Alternatively, in one or more embodiments, the dual-variant-type call recalibration system 106 trains the machine-learning sub-model 424b to exclusively identify somatic mosaic variants. Additional details with respect to training a variant-call-recalibration machine-learning model (or sub-models thereof) are provided below in relation to FIGS. 5A-5B and 6.
[0149] As further shown in FIG. 4B, the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics 426 from sequence data 428 corresponding to one or more nucleotide reads of a genomic sample (e.g., as described above in relation to FIGS. 3A-3C and 4A). In some embodiments, for example, the dual-variant-type call recalibration system 106 determines or extracts sequencing metrics 426 including read-based metrics, call-model-generated metrics, and externally sourced metrics from the sequence data 428 and utilizes the variant-call-recalibration machine-learning model 422 to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants based on the determined or extracted sequencing metrics 426.
[0150] In particular, the variant-call-recalibration machine-learning model 422 utilizes the machine-learning sub-model 424a to generate the first set of genotype probabilities 430a corresponding to one or more germline variants and, based on the first set of genotype probabilities 430a, determines one or more genotype calls 432a comprising one or more germline variants. Further, the variant-call-recalibration machine-learning model 422 utilizes the machine-learning sub-model 424b to generate the second set of genotype probabilities 430b corresponding to one or more germline variants and/or one or more somatic mosaic variants and, based on the second set of genotype probabilities 430b, determines one or more genotype calls 432b comprising one or more somatic mosaic variants and, in some cases, one or more germline variants. As mentioned, in one or more embodiments, the variant-call-recalibration machine-learning model 422 utilizes the second machine-learning sub-model 424b to exclusively determine genotype calls corresponding to somatic mosaic variants.
[0151] As also shown, the dual-variant-type call recalibration system 106 can generate (or modify) a genotype call file, such as a variant call file 434, to include genotype calls from the first set of genotype calls 432a and/or the second set of genotype calls 432b. In some embodiments, for example, the variant call file 434 includes germline variants from the first set of genotype calls 432a and somatic mosaic variants and/or germline variants from the second set of genotype calls 432b. Additionally, or alternatively, the dual-variant-type call recalibration system 106 can compare the genotype probabilities 430a generated by the machine-learning sub-model 424a with the genotype probabilities 430b generated by the machine-learning sub-model 424b to identify one or more genotype calls as somatic mosaic variant calls 432c.
[0152] As shown in FIG. 4B, for example, the dual-variant-type call recalibration system 106 compares, for a genotype call at a candidate genomic coordinate, a first genotype probability of the genotype probabilities 430a generated by the machine-learning sub-model 424a with a second genotype probability of the genotype probabilities 430b generated by the machine-learning sub-model 424b. Based on the comparison, the dual-variant-type call recalibration system 106 identifies the genotype call comprising a variant at the candidate genomic coordinate as a somatic mosaic variant. For instance, the dual-variant-type call recalibration system 106 may determine the second genotype probability of the genotype probabilities 430b exceeds (or exceeds by a threshold percentage or fixed number) the first genotype probability of the genotype probabilities 430a and, therefore, identify the genotype call comprising a variant at the candidate genomic coordinate as a somatic mosaic variant. In some embodiments, having identified one or more genotype calls as the somatic mosaic variant calls 432c, the dual-variant-type call recalibration system 106 can specifically identify the somatic mosaic variant calls 432c within the variant call file 434. Conversely, the dual-variant-type call recalibration system 106 may determine the first genotype probability of the genotype probabilities 430a exceeds (or exceeds by a threshold percentage or fixed number) the second genotype probability of the genotype probabilities 430b and, therefore, identify the genotype call comprising a variant at the candidate genomic coordinate as a germline variant. In some embodiments, having identified one or more genotype calls as germline variants, the dual-variant-type call recalibration system 106 can specifically identify the germline variants within the variant call file 434.
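The sub-model comparison described above can be sketched as follows, assuming a fixed-number margin. The function name `classify_variant`, the label strings, and the "undetermined" outcome for differences within the margin are hypothetical additions for illustration; the disclosure does not specify how ties or small differences are handled.

```python
def classify_variant(p_germline_model, p_dual_model, margin=0.0):
    """Label a variant call by comparing two sub-model probabilities.

    p_germline_model: genotype probability from the germline-specific
        sub-model (424a in the description).
    p_dual_model: genotype probability from the dual-variant-type or
        somatic-mosaic-specific sub-model (424b in the description).
    margin: assumed fixed amount by which one probability must exceed
        the other before a label is assigned.
    """
    diff = p_dual_model - p_germline_model
    if diff > margin:
        return "somatic_mosaic"
    if -diff > margin:
        return "germline"
    # Hypothetical fallback: neither probability exceeds the other by the margin.
    return "undetermined"
```

With a margin of 0.1, probabilities of 0.30 and 0.85 from the first and second sub-models label the call as somatic mosaic, whereas 0.90 and 0.95 leave it undetermined under this sketch.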
[0153] As mentioned, in some embodiments, the dual-variant-type call recalibration system 106 further generates a combined set of genotype probabilities from the different genotype probabilities generated via the different sub-models of the variant-call-recalibration machine-learning model 422. In some cases, the dual-variant-type call recalibration system 106 selects genotype probabilities from the set of genotype probabilities 430a generated by the machine-learning sub-model 424a and the set of genotype probabilities 430b generated by the machine-learning sub-model 424b. For instance, in some embodiments, the dual-variant-type call recalibration system 106 determines an average or a weighted combination of the respective sets of genotype probabilities to generate combined genotype probabilities for recalibrating a genotype call. In some embodiments, the dual-variant-type call recalibration system 106 determines a mean for each genotype probability across each sub-model of the variant-call-recalibration machine-learning model 422 and renormalizes the mean genotype probability. In other embodiments, the dual-variant-type call recalibration system 106 learns linear weights and adapts the weights to minimize overall error or loss for the genotype probabilities. In still other embodiments, the dual-variant-type call recalibration system 106 weights the genotype probabilities for each sub-model based on the inverse of average error across the models.
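Two of the combination strategies named above, the renormalized mean and the inverse-error weighting, can be sketched in a few lines. The function names, the genotype labels, and the representation of each sub-model's output as a dictionary are assumptions for illustration; the learned-linear-weights variant is omitted because it would require a training loop.

```python
def _renormalize(probs):
    """Scale probabilities so they sum to 1."""
    total = sum(probs.values())
    return {gt: p / total for gt, p in probs.items()}


def mean_combine(prob_sets):
    """Average each genotype probability across sub-models, then renormalize."""
    n = len(prob_sets)
    means = {gt: sum(ps[gt] for ps in prob_sets) / n for gt in prob_sets[0]}
    return _renormalize(means)


def inverse_error_combine(prob_sets, avg_errors):
    """Weight each sub-model by the inverse of its average error.

    avg_errors holds one positive average error per sub-model, assumed
    to have been measured beforehand on held-out truth data.
    """
    weights = [1.0 / e for e in avg_errors]
    weight_sum = sum(weights)
    combined = {
        gt: sum(w * ps[gt] for w, ps in zip(weights, prob_sets)) / weight_sum
        for gt in prob_sets[0]
    }
    return _renormalize(combined)
```

As a usage example, combining {0.2, 0.7, 0.1} and {0.4, 0.5, 0.1} by mean gives a heterozygous probability of 0.6, while inverse-error weighting with average errors of 0.1 and 0.3 weights the first sub-model three times as heavily and gives 0.65.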
[0154] Moreover, in some embodiments, the dual-variant-type call recalibration system 106 provides a selectable option to a user for adjusting a variant sensitivity of the variant-call-recalibration machine-learning model 422. For instance, the variant sensitivity of the variant-call-recalibration machine-learning model in generating genotype probabilities can be adjusted to implement detection of candidate somatic mosaic variants. Upon selection or user input, such variant sensitivity may be set to detect and report variants that equal or exceed a particular genotype probability (e.g., 0.45 or 0.50) and/or that equal or exceed a particular allele frequency (e.g., 0.15 or 0.20) as supported by nucleotide reads covering a genomic coordinate. In particular, in implementations where the variant-sensitivity option corresponding to detection of candidate somatic mosaic variants is not selected by the user, the dual-variant-type call recalibration system 106 can exclusively generate genotype probabilities corresponding to germline variants (e.g., by solely utilizing the machine-learning sub-model 424a of the variant-call-recalibration machine-learning model 422). By contrast, when the dual-variant-type call recalibration system 106 receives an indication of a user selection of the variant-sensitivity option corresponding to detection of candidate somatic mosaic variants, the dual-variant-type call recalibration system 106 executes the variant-call-recalibration machine-learning model 422 to generate the genotype probabilities corresponding to candidate germline variants and candidate somatic mosaic variants (e.g., by executing the machine-learning sub-model 424b or both machine-learning sub-models of the variant-call-recalibration machine-learning model 422 as described above).
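One way to picture the variant-sensitivity option is as a switch that chooses which sub-models run, paired with reporting thresholds. In the sketch below, the function names, the sub-model labels, the choice to run both sub-models when the option is selected, and the `require_both` flag (standing in for the "and/or" in the description) are assumptions; the default thresholds are taken from the example values given above.

```python
def submodels_to_run(detect_mosaic):
    """Choose sub-models based on the user's variant-sensitivity selection.

    Running both sub-models when the option is selected is one of the
    alternatives described; running only the second is another.
    """
    if detect_mosaic:
        return ["germline", "dual_variant_type"]
    return ["germline"]


def meets_sensitivity(genotype_prob, allele_freq,
                      min_prob=0.50, min_af=0.15, require_both=True):
    """Check a candidate variant against the reporting thresholds."""
    prob_ok = genotype_prob >= min_prob   # equals or exceeds the probability
    af_ok = allele_freq >= min_af         # equals or exceeds the allele frequency
    return (prob_ok and af_ok) if require_both else (prob_ok or af_ok)
```

Under these assumed defaults, a candidate with a genotype probability of 0.6 and an allele frequency of 0.10 is reported only when `require_both` is set to `False`.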
[0155] As mentioned above, the dual-variant-type call recalibration system 106 can utilize various sources of truth data to train a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. For example, FIG. 5A illustrates the dual-variant-type call recalibration system 106 generating training data by synthetically modifying existing sequencing files to include somatic mosaic variants at various allele frequencies. In some cases, the dual-variant-type call recalibration system 106 synthetically modifies existing sequencing files to include variants in nucleotide reads at different read depths to simulate the relatively lower depths of somatic mosaic variants in read data and/or to include somatic mosaic variants in exome regions that may include somatic mosaic variants.
[0156] As shown in FIG. 5A, the dual-variant-type call recalibration system 106 identifies or receives a genome sample 502 comprising sample nucleotide reads 504. In some embodiments, for example, the sample nucleotide reads 504 include one or more known germline variants usable as ground truth for training a call-recalibration machine-learning model to identify such variants within sample nucleotide sequences. To generate ground truth data for training the variant-call-recalibration machine-learning model to identify somatic mosaic variants, the dual-variant-type call recalibration system 106 generates multiple synthetic nucleotide reads 506 within the genome sample 502 by altering a portion of the sample nucleotide reads 504. In particular, the dual-variant-type call recalibration system 106 generates the synthetic nucleotide reads 506 within the genome sample 502 at one or more allele frequencies representative of one or more ground-truth somatic mosaic variants 512. The synthetic nucleotide reads 506 can likewise be generated by methods that use or do not use polymerase chain reaction (PCR). Further, in some embodiments, the dual-variant-type call recalibration system 106 generates the synthetic nucleotide reads 506 at one or more read depths (e.g., 10X, 15X) to mimic the relatively lower read depths of somatic mosaic variants relative to germline variants. Also, in some embodiments, the dual-variant-type call recalibration system 106 can generate the synthetic nucleotide reads 506 in one or more exome regions.
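The idea of altering a portion of the sample reads to reach a target allele frequency can be illustrated with a toy example. The sketch below is not the interface of any editing tool; it assumes reads are simple (start, sequence) tuples with 0-based starts and no gaps, and the function name `spike_in_variant` and its parameters are hypothetical.

```python
import random


def spike_in_variant(reads, position, alt_base, target_vaf, seed=0):
    """Alter a fraction of the reads covering `position` to carry `alt_base`.

    reads: list of (start, sequence) tuples, 0-based start, ungapped
        (a simplifying assumption; real alignments are more complex).
    target_vaf: desired fraction of covering reads that carry the variant.
    Returns a new list in which the altered reads play the role of
    synthetic reads and the rest are left unchanged.
    """
    covering = [
        i for i, (start, seq) in enumerate(reads)
        if start <= position < start + len(seq)
    ]
    n_alter = round(target_vaf * len(covering))
    rng = random.Random(seed)  # seeded so the example is reproducible
    chosen = set(rng.sample(covering, n_alter))
    modified = []
    for i, (start, seq) in enumerate(reads):
        if i in chosen:
            offset = position - start
            seq = seq[:offset] + alt_base + seq[offset + 1:]
        modified.append((start, seq))
    return modified
```

For 20 reads covering a position and a target allele frequency of 0.2, four reads are altered and the remaining 16 are kept as unaltered sample reads; the altered position and allele then serve as the ground-truth somatic mosaic variant in this toy setting.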
[0157] In some embodiments, for example, the dual-variant-type call recalibration system 106 utilizes one or more editing tools (e.g., BAMSurgeon or similar applications/tools) to add simulated variants to existing alignment data files (e.g., binary alignment map (BAM) files or similar files). Indeed, by utilizing such editing tools, the dual-variant-type call recalibration system 106 can add single nucleotide variants (SNVs), insertions or deletions (INDELs), and/or several forms of structural variants (SVs) to existing alignment data files to generate sample data and corresponding ground truth for training of a variant-call-recalibration machine-learning model, as further illustrated in FIG. 5A.
[0158] As also shown in FIG. 5A, the dual-variant-type call recalibration system 106 generates modified sample sequencing data 508, including the synthetic nucleotide reads 506 and at least a portion of the original sample nucleotide reads 504 (i.e., any remaining unaltered reads of the sample nucleotide reads 504). Further, the dual-variant-type call recalibration system 106 determines or extracts sample sequencing metrics 510 based on the modified sample sequencing data 508 (e.g., such as described above in relation to FIGS. 3A-3C). Thus, in some embodiments, the sample sequencing metrics 510 include sample-read-based sequencing metrics based on the remaining unaltered reads of the sample nucleotide reads 504, as well as synthetic-read-based sequencing metrics based on the synthetic nucleotide reads 506. Accordingly, the dual-variant-type call recalibration system 106 can utilize the modified sample sequencing data 508, the ground-truth somatic mosaic variants 512, and any ground-truth germline variants remaining after modifying the genome sample 502 to train a variant-call-recalibration machine-learning model to generate genotype probabilities for germline and somatic mosaic variants. The dual-variant-type call recalibration system 106 can also utilize other sample nucleotide reads and corresponding sequencing data, which have not been synthetically modified, as ground truth training data for germline variants.
[0159] As mentioned, in one or more embodiments, the dual-variant-type call recalibration system 106 utilizes an admixture of germline truth sets to simulate somatic mosaicisms in ground truth data for training a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. For example, FIG. 5B illustrates the dual-variant-type call recalibration system 106 determining subsets (e.g., percentages) of sample genomic sequences from a combination of male and female genomic samples that together simulate variant-allele frequencies of a genome sample with somatic mosaicism.
[0160] As shown in FIG. 5B, for instance, the dual-variant-type call recalibration system 106 determines subsets of sample nucleotide sequences from different genomic samples forming an admixture genome. When the corresponding subsets are mixed together, the admixture genome simulates a genomic sample with somatic mosaicism. To simulate such a genomic sample with somatic mosaicism, for instance, the dual-variant-type call recalibration system 106 determines a percentage of sample nucleotide sequences 522a from a first genome sample 520a and a percentage of sample nucleotide sequences 522b from a second genome sample 520b that, when mixed together, simulate variant-allele frequencies of a genomic sample exhibiting characteristics of somatic mosaicism. As part of determining the subsets of sample nucleotide sequences 522a and 522b, the dual-variant-type call recalibration system 106 estimates the variant-allele frequencies of different subset mixtures (or percentage mixtures) from truth set bases of Platinum Genomes for the first genome sample 520a and the second genome sample 520b.
[0161] FIG. 5B illustrates an example of the dual-variant-type call recalibration system 106 determining subsets of sample nucleotide sequences for one such admixture genome and determining corresponding variant allele frequencies. As depicted in FIG. 5B, the dual-variant-type call recalibration system 106 determines the variant-allele frequencies for SNPs of both heterozygous and homozygous alleles for an admixture genome. According to the percentages reflected by the subset of sample nucleotide sequences 522a (here, 60%) and the subset of sample nucleotide sequences 522b (here, 40%), the dual-variant-type call recalibration system 106 determines or predicts the relevant variant allele frequencies by referencing the truth set bases of the first genome sample 520a (e.g., NA12877) and the second genome sample 520b (e.g., NA12878) from Platinum Genomes. While FIG. 5B depicts variant allele frequencies for SNPs from an admixture genome, the dual-variant-type call recalibration system 106 can determine admixture genomes and variant allele frequencies for other specific variant types, such as insertions, deletions, or structural variants. Also, the dual-variant-type call recalibration system 106 can determine admixture genomes utilizing nucleotide reads from sample library fragments generated by various means, such as nucleotide reads generated using or not using PCR techniques.
[0162] As shown in an allele-frequency table 524 presented in FIG. 5B, for instance, the dual-variant-type call recalibration system 106 determines that unique homozygous alleles and unique heterozygous alleles from the second genome sample 520b occur at variant allele frequencies of 0.4 and 0.2, respectively, in the admixture genome. As further shown, the dual-variant-type call recalibration system 106 determines that unique homozygous alleles and unique heterozygous alleles from the first genome sample 520a occur at variant allele frequencies of 0.6 and 0.3, respectively, in the admixture genome. By contrast, the dual-variant-type call recalibration system 106 determines that common alleles present in the 60%-and-40% admixture genome as homozygous-homozygous combinations, heterozygous-homozygous combinations, homozygous-heterozygous combinations, and heterozygous-heterozygous combinations (according to the corresponding allele zygosities in the second genome sample 520b and the first genome sample 520a) occur at variant allele frequencies of 1.0, 0.8, 0.7, and 0.5, respectively.
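The variant allele frequencies listed for the 60%-and-40% admixture genome follow from a simple weighted sum, which the following sketch reproduces. It assumes diploid samples, equal sequencing coverage per mixed read, and no copy-number differences; the function name `admixture_vaf` and the zygosity labels are hypothetical.

```python
# Fraction of a sample's reads expected to carry the alternate allele.
ALT_FRACTION = {"absent": 0.0, "het": 0.5, "hom": 1.0}


def admixture_vaf(frac_first, zyg_first, zyg_second):
    """Expected variant allele frequency in a two-sample admixture.

    frac_first: fraction of reads drawn from the first genome sample;
        the remainder is drawn from the second genome sample.
    zyg_first, zyg_second: zygosity of the allele in each sample
        ("absent", "het", or "hom").
    """
    frac_second = 1.0 - frac_first
    return (frac_first * ALT_FRACTION[zyg_first]
            + frac_second * ALT_FRACTION[zyg_second])


# 60% first sample, 40% second sample, as in the example of FIG. 5B.
unique_first_hom = admixture_vaf(0.6, "hom", "absent")   # 0.6
unique_first_het = admixture_vaf(0.6, "het", "absent")   # 0.3
unique_second_hom = admixture_vaf(0.6, "absent", "hom")  # 0.4
unique_second_het = admixture_vaf(0.6, "absent", "het")  # 0.2
common_het_second_hom_first = admixture_vaf(0.6, "hom", "het")  # 0.8
common_hom_second_het_first = admixture_vaf(0.6, "het", "hom")  # 0.7
```

For example, an allele that is heterozygous in the second genome sample and homozygous in the first contributes 0.4 × 0.5 + 0.6 × 1.0 = 0.8, matching the second of the common-allele frequencies above.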
[0163] To select a suitable admixture genome representative of a genome sample with somatic mosaicism, the dual-variant-type call recalibration system 106 can determine variant allele frequencies from truth set bases of various combinations (and percentages) of genome samples in a given admixture genome. In addition to the variant allele frequencies present in the 60%-and-40% admixture genome depicted in FIG. 5B, in some embodiments, the dual-variant-type call recalibration system 106 determines variant allele frequencies for other possible admixture genomes to simulate a genomic sample with somatic mosaicism. For example, the dual-variant-type call recalibration system 106 determines that 30% of sample nucleotide sequences from the first genome sample 520a and 70% of sample nucleotide sequences from the second genome sample 520b would produce unique homozygous alleles from the second genome sample 520b and from the first genome sample 520a at variant allele frequencies of 0.7 and 0.3, respectively, as well as unique heterozygous alleles from the second genome sample 520b and from the first genome sample 520a at variant allele frequencies of 0.35 and 0.15, respectively. By contrast, the dual-variant-type call recalibration system 106 determines or predicts that common alleles present in such a 30%-and-70% admixture genome as homozygous-homozygous combinations, heterozygous-homozygous combinations, homozygous-heterozygous combinations, and heterozygous-heterozygous combinations (according to the same 30% and 70% admixture) would produce variant allele frequencies of 1.0, 0.85, 0.65, and 0.5, respectively.
[0164] In addition to determining various admixture genomes from the first genome sample 520a and the second genome sample 520b, in certain implementations, the dual-variant-type call recalibration system 106 determines variant allele frequencies from combinations of different sample genomes to identify a suitable admixture genome simulating a genomic sample with somatic mosaicism. By determining variant allele frequencies for a variety of admixture genomes, the dual-variant-type call recalibration system 106 can select the admixture genome that more closely (or most closely) simulates the variant allele frequencies of a target somatic mosaicism (e.g., a somatic mosaic variant in a genomic region of interest) and use data from such a simulated genomic sample as ground truth data for training a variant-call-recalibration machine-learning model. Further, in some embodiments, as with the synthetic nucleotide reads 506 within the genome sample 502, the dual-variant-type call recalibration system 106 selects an admixture genome that includes somatic mosaic variants at relatively lower depths and in particular exome regions.
[0165] As mentioned, in addition to generating truth data including somatic mosaic variants at various variant allele frequencies for training a variant-call-recalibration machine-learning model, the dual-variant-type call recalibration system 106 can generate training data implementing various other features for intelligently training the variant-call-recalibration machine-learning model. For example, the dual-variant-type call recalibration system 106 can generate training data (e.g., according to the methods described above in relation to FIGS. 5A and 5B) to include simulated read data of varying depth, read data with variants mimicking somatic mosaic variants in exome regions, and so forth.
[0166] As indicated above, in certain embodiments, the dual-variant-type call recalibration system 106 trains or tunes a variant-call-recalibration machine-learning model (e.g., the variant-call-recalibration machine-learning model 412 or one or more sub-models of the variant-call-recalibration machine-learning model 422). In particular, the dual-variant-type call recalibration system 106 utilizes an iterative training process to fit a variant-call-recalibration machine-learning model by adjusting or adding decision trees or learning parameters that result in accurate genotype probabilities (e.g., genotype probabilities 414, 430a, or 430b). For example, FIG. 6 illustrates the dual-variant-type call recalibration system 106 training a variant-call-recalibration machine-learning model in accordance with one or more embodiments.
[0167] As illustrated in FIG. 6, the dual-variant-type call recalibration system 106 accesses modified sample sequencing data 603 and determines or extracts sample sequencing metrics 604 from the modified sample sequencing data 603 and receives or obtains some metrics (e.g., externally sourced metrics) from a database 602 (e.g., the database 116, the sequencing information database 314, or the sequencing information database 402). As described above, the dual-variant-type call recalibration system 106 can access the modified sample sequencing data 603 in the form of the modified sample sequencing data 508 generated in FIG. 5A or in the form of simulated admixture data generated in FIG. 5B. From such modified sample sequencing data, the dual-variant-type call recalibration system 106 determines or extracts the sample sequencing metrics 604 as part of training a variant-call-recalibration machine-learning model 606. For example, the dual-variant-type call recalibration system 106 determines or extracts sample sequencing metrics in the form of sample read-based metrics, sample externally sourced sequencing metrics, and sample call-model-generated sequencing metrics.
[0168] In some cases, the modified sample sequencing data 603 has corresponding ground truth data 616 indicating ground truth genotype calls corresponding to somatic mosaic variants and germline variants. In some embodiments, the ground truth data 616 also includes various ground truth metrics that result from the set of sample sequencing metrics 604. In addition to the source described above for simulated ground truth data for somatic mosaic variants, the dual-variant-type call recalibration system 106 also accesses or extracts sample sequencing metrics from genomic data comprising ground truth germline variants. For instance, the dual-variant-type call recalibration system 106 utilizes ground truth data from a training dataset from the Food and Drug Administration, called the PrecisionFDA dataset, for ground truth data comprising germline variants alongside ground truth data corresponding to synthesized somatic mosaic variants (e.g., such as described above in relation to FIGS. 5A or 5B).
[0169] As further illustrated in FIG. 6, the dual-variant-type call recalibration system 106 generates predicted genotype probabilities 608 based on the determined or extracted sample sequencing metrics 604. Specifically, the dual-variant-type call recalibration system 106 utilizes the variant-call-recalibration machine-learning model 606 to generate the predicted genotype probabilities 608. Indeed, in some embodiments, the variant-call-recalibration machine-learning model 606 generates a set of three predicted genotype probabilities 608, as described above (e.g., probabilities for homozygous reference calls, homozygous variant calls, or heterozygous calls at a given genomic coordinate). The predicted genotype probabilities 608 can accordingly take the form of any of the variant-call classifications described above.
[0170] Based on the predicted genotype probabilities 608, the dual-variant-type call recalibration system 106 determines one or more predicted genotype calls 610 and, in some implementations, data field entries corresponding to predicted genotype calls 610. As indicated above, the dual-variant-type call recalibration system 106 can utilize (i) existing genotype calls generated by a call generation model and included with the modified sample sequencing data 603 and (ii) the variant-call-recalibration machine-learning model 606 to modify data fields corresponding to a variant call file (e.g., data fields corresponding to initial genotype calls of the modified sample sequencing data 603). Such modified or recalibrated values are output by, for example, the variant-call-recalibration machine-learning model 606. For example, the dual-variant-type call recalibration system 106 determines recalibrated values for particular metrics corresponding to the predicted genotype calls 610, including a base-call-quality metric (QUAL), a genotype metric (GT), a genotype-quality metric (GQ), allele frequency (AF), allele count (AC), total number of alleles (AN), and so forth.
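One way to picture how three genotype probabilities could be turned into recalibrated data fields is the following sketch. It is an illustration under stated assumptions rather than the disclosed implementation: the genotype ordering, the Phred-style scaling of GQ and QUAL, the cap of 99, and the function names are all assumptions introduced here.

```python
import math

# Assumed ordering: homozygous reference, heterozygous, homozygous variant.
GENOTYPES = ("0/0", "0/1", "1/1")


def phred(p_error, cap=99.0):
    """Phred-scale an error probability: -10 * log10(p), capped (cap is an assumption)."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


def recalibrate_fields(probs):
    """Derive illustrative GT, GQ, and QUAL values from three genotype probabilities."""
    total = sum(probs)
    probs = [p / total for p in probs]            # normalize defensively
    best = max(range(3), key=lambda i: probs[i])  # most probable genotype
    return {
        "GT": GENOTYPES[best],
        # GQ: confidence that the chosen genotype is correct.
        "GQ": round(phred(1.0 - probs[best])),
        # QUAL: confidence that the site is not homozygous reference.
        "QUAL": round(phred(probs[0]), 2),
    }


fields = recalibrate_fields([0.01, 0.98, 0.01])
```

In this sketch a heterozygous call with probability 0.98 yields a GT of 0/1, with GQ and QUAL derived from the residual probabilities; a production variant caller may define these fields differently.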
[0171] As further illustrated in FIG. 6, the dual-variant-type call recalibration system 106 performs a comparison 612. Specifically, the dual-variant-type call recalibration system 106 performs the comparison 612 between (i) predicted genotype calls 610 and/or corresponding data fields output by the variant-call-recalibration machine-learning model 606 and (ii) genotype calls and/or corresponding data fields in the ground truth data 616. In some embodiments, the dual-variant-type call recalibration system 106 utilizes a loss function 614 to compare genotype calls and/or corresponding data fields (e.g., to determine an error or a measure of loss between them). For instance, in cases where the variant-call-recalibration machine-learning model 606 is an ensemble of gradient boosted trees, the dual-variant-type call recalibration system 106 utilizes a mean squared error loss function (e.g., for regression) and/or a logarithmic loss function (e.g., for classification) as the loss function 614.
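The two loss functions named above can be sketched as follows. This is a generic illustration of mean squared error and multiclass logarithmic loss, not code from the disclosed system; the function names and the clipping constant are assumptions.

```python
import math


def log_loss(pred_probs, true_labels, eps=1e-15):
    """Mean multiclass logarithmic loss.

    pred_probs: one list of class probabilities per genotype call.
    true_labels: index of the ground-truth class for each call.
    """
    total = 0.0
    for probs, label in zip(pred_probs, true_labels):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)


def mean_squared_error(predicted, truth):
    """Mean squared error between predicted and ground-truth numeric fields."""
    return sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)


# Two calls, three genotype classes each; labels index the true genotype.
classification_loss = log_loss([[0.5, 0.5, 0.0], [0.1, 0.8, 0.1]], [0, 1])
# A regression-style comparison of a numeric field such as allele frequency.
regression_loss = mean_squared_error([0.10, 0.30], [0.10, 0.25])
```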
[0172] By contrast, in embodiments where the variant-call-recalibration machine-learning model 606 is a neural network, the dual-variant-type call recalibration system 106 can utilize a cross entropy loss function, an L1 loss function, or a mean squared error loss function as the loss function 614. For example, the dual-variant-type call recalibration system 106 utilizes the loss function 614 to determine a difference between predicted genotype calls and/or corresponding data fields and the ground truth data 616.
[0173] As further illustrated in FIG. 6, the dual-variant-type call recalibration system 106 performs model fitting 618. In particular, the dual-variant-type call recalibration system 106 fits the variant-call-recalibration machine-learning model 606 based on the comparison 612. For instance, the dual-variant-type call recalibration system 106 performs modifications or adjustments to the variant-call-recalibration machine-learning model 606 to reduce the measure of loss from the loss function 614 for a subsequent training iteration.
[0174] For gradient boosted trees or treelite, for example, the dual-variant-type call recalibration system 106 trains the variant-call-recalibration machine-learning model 606 on the gradients of the errors determined by the loss function 614. For instance, the dual-variant-type call recalibration system 106 solves a convex optimization problem (e.g., of infinite dimensions) while regularizing the objective to avoid overfitting. In certain implementations, the dual-variant-type call recalibration system 106 scales the gradients to emphasize corrections to under-represented classes (e.g., where there are significantly more true positives than false positive variant calls).
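The gradient scaling mentioned above can be illustrated with a short sketch. The weighting rule used here (each class weighted by the total count divided by the number of classes times the class count) is a common "balanced" heuristic and is an assumption, since the text does not specify how the gradients are scaled; the function names are likewise hypothetical.

```python
import math
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_weights(labels):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_gradients(scores, labels):
    """Gradient of binary log loss w.r.t. the raw score, scaled per class.

    For log loss the unweighted gradient is sigmoid(score) - label.
    """
    weights = class_weights(labels)
    return [weights[y] * (sigmoid(s) - y) for s, y in zip(scores, labels)]


# Four true-positive calls (label 1) and one false-positive call (label 0).
labels = [1, 1, 1, 1, 0]
grads = weighted_gradients([0.0] * 5, labels)
```

With this weighting, the single under-represented false-positive example contributes a gradient four times larger in magnitude than each true-positive example, so the next tree is pushed harder to correct it.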
[0175] In some embodiments, the dual-variant-type call recalibration system 106 adds a new weak learner (e.g., a new boosted tree) to the variant-call-recalibration machine-learning model 606 for each successive training iteration as part of solving the optimization problem. For example, the dual-variant-type call recalibration system 106 finds a feature (e.g., a sequencing metric) that minimizes a loss from the loss function 614 and either adds the feature to the current iteration’s tree or starts to build a new tree with the feature.
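The idea of adding one weak learner per iteration can be sketched with depth-one trees (stumps) fit to the residuals of a squared-error loss. This is a deliberately simplified illustration, not the disclosed training procedure: the stump learner, the learning rate, and the toy features (variant allele frequency and read depth) are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the (feature, threshold) split that best fits the residuals.

    Returns (feature_index, threshold, left_value, right_value).
    """
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({row[j] for row in xs}):
            left = [r for row, r in zip(xs, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(xs, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lv, rv)
    return best[1:]


def predict(trees, row, learning_rate):
    """Sum the shrunken outputs of every stump added so far."""
    return sum(learning_rate * (lv if row[j] <= t else rv) for j, t, lv, rv in trees)


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fit to the current residuals."""
    trees, losses = [], []
    for _ in range(n_rounds):
        preds = [predict(trees, row, learning_rate) for row in xs]
        # Residuals are the negative gradient of one-half squared error.
        residuals = [y - p for y, p in zip(ys, preds)]
        losses.append(sum(r * r for r in residuals) / len(ys))
        trees.append(fit_stump(xs, residuals))
    return trees, losses


# Toy features per call: (variant allele frequency, read depth); target 1.0 = mosaic.
xs = [[0.05, 30], [0.10, 45], [0.45, 40], [0.50, 35], [0.98, 50], [1.00, 42]]
ys = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
trees, losses = boost(xs, ys)
```

Each round selects the single feature and threshold that most reduce the remaining error, which mirrors, in miniature, the feature-selection step described above.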
[0176] In addition or in the alternative to gradient boosted decision trees, the dual-variant-type call recalibration system 106 trains a logistic regression to learn parameters for generating one or more genotype probabilities and/or other variant call classifications. To avoid overfitting, the dual-variant-type call recalibration system 106 further regularizes based on hyperparameters such as the learning rate, stochastic gradient boosting, the number of trees, the tree-depth(s), complexity penalization, and L1/L2 regularization.
[0177] In some embodiments where the variant-call-recalibration machine-learning model 606 is a neural network, the dual-variant-type call recalibration system 106 performs the model fitting 618 by modifying internal parameters (e.g., weights) of the variant-call-recalibration machine-learning model 606 to reduce the measure of loss for the loss function 614. Indeed, the dual-variant-type call recalibration system 106 modifies how the variant-call-recalibration machine-learning model 606 analyzes and passes data between layers and neurons by modifying the internal network parameters. Thus, over multiple iterations, the dual-variant-type call recalibration system 106 improves the accuracy of the variant-call-recalibration machine-learning model 606.
[0178] Indeed, in some cases, the dual-variant-type call recalibration system 106 repeats the training process illustrated in FIG. 6 for multiple iterations. For example, the dual-variant-type call recalibration system 106 repeats the iterative training by selecting a new set of sequencing metrics for each genotype call along with a corresponding ground truth genotype call in corresponding ground truth data. The dual-variant-type call recalibration system 106 further generates a new set of predicted genotype probabilities for each iteration. As described above, the dual-variant-type call recalibration system 106 also compares genotype calls and/or corresponding data fields from each iteration with the corresponding genotype calls and/or data fields from the corresponding ground truth data and further performs model fitting 618. The dual-variant-type call recalibration system 106 repeats this process until the variant-call-recalibration machine-learning model 606 generates predicted genotype probabilities that result in variant calls that satisfy a threshold measure of loss.
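The repeat-until-threshold pattern described above can be sketched with a one-feature logistic regression trained by gradient descent. The model choice, learning rate, loss threshold, iteration cap, and toy data are assumptions made for illustration; the sketch only shows the stopping logic of predicting, comparing against ground truth, and fitting until the loss satisfies a threshold.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(w, b, xs, ys, eps=1e-12):
    """Average binary log loss of the model sigmoid(w * x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def train_until(xs, ys, loss_threshold=0.3, learning_rate=1.0, max_iters=5000):
    """Iterate predict -> compare -> fit until the loss satisfies the threshold."""
    w = b = 0.0
    for iteration in range(max_iters):
        loss = mean_log_loss(w, b, xs, ys)           # comparison with ground truth
        if loss <= loss_threshold:                   # threshold measure of loss met
            return w, b, loss, iteration
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = sum(errors) / len(xs)
        w -= learning_rate * grad_w                  # model fitting step
        b -= learning_rate * grad_b
    return w, b, mean_log_loss(w, b, xs, ys), max_iters


# Toy data: variant allele frequency per call; label 1 = ground-truth mosaic variant.
vafs = [0.05, 0.10, 0.45, 0.50]
is_mosaic = [1, 1, 0, 0]
w, b, final_loss, n_iters = train_until(vafs, is_mosaic)
```

The iteration cap is a practical safeguard added in this sketch so that the loop terminates even if the threshold is never reached.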
[0179] As mentioned above, in certain described embodiments, the dual-variant-type call recalibration system 106 provides improvements in flexibility and accuracy over existing systems. In particular, in certain implementations, the dual-variant-type call recalibration system 106 provides the flexibility of calling variants corresponding to germline variants and somatic mosaic variants while identifying somatic mosaic variants with increased accuracy. To illustrate, FIGS. 7A-7F show experimental results of the dual-variant-type call recalibration system 106 in identifying somatic mosaic variants within sample genomic sequences.
[0180] For example, FIGS. 7A-7B show graphs illustrating experimental results of utilizing the dual-variant-type call recalibration system 106 to identify mosaic variants within two modified whole genome sequence (WGS) PrecisionFDA datasets (specifically, HG002 and HG003) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies. In particular, the two modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7B. Indeed, as shown in FIG. 7B, the dual-variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets.
[0181] Similarly, FIGS. 7C-7D show graphs illustrating experimental results of utilizing the dual-variant-type call recalibration system 106 to identify mosaic variants within four modified whole exome sequence (WES) PrecisionFDA datasets (specifically, HG002 from four different exome libraries) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies. In particular, the four modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7D. Indeed, as shown in FIG. 7C, the dual-variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets, with additional improvements in accuracy when analyzing WES sequences (in comparison with WGS sequences as shown in FIG. 7A).
[0182] Likewise, FIGS. 7E-7F show graphs illustrating experimental results of utilizing the dual-variant-type call recalibration system 106 to identify mosaic variants within four additional modified whole exome sequence (WES) PrecisionFDA datasets (specifically, HG003 from four different exome libraries) comprising synthesized nucleotide reads simulating somatic mosaic variants at various allele frequencies. In particular, the four modified datasets comprise synthetic nucleotide reads simulating various SNPs at allele frequencies between 5% and 25%, as particularly shown in FIG. 7F. Indeed, as shown in FIG. 7E, the dual-variant-type call recalibration system 106 recalls a significant percentage of somatic mosaic variants within the modified datasets, with additional improvements in accuracy when analyzing WES sequences (in comparison with WGS sequences as shown in FIG. 7A). To further illustrate, the table below illustrates numerical results corresponding to the results shown in FIGS. 7C-7F.
[0183] As also mentioned, in certain embodiments, the dual-variant-type call recalibration system 106 improves the computing efficiency with which somatic mosaic variants are identified within a genomic sequence. In particular, in certain implementations, the dual-variant-type call recalibration system 106 generates accurate variant calls corresponding to somatic mosaic variants with increased speed and requiring fewer computational resources compared to existing sequencing systems. To illustrate, FIG. 8 shows experimental results of the dual-variant-type call recalibration system 106 identifying variants within a genomic dataset comprising mosaic variants of various variant allele frequencies (VAF).
[0184] For example, FIG. 8 includes a bar graph 800 illustrating a number of variants within a genomic dataset (indicated as “SetA M3 -12”) with variant allele frequencies between 4% and 32%. Specifically, as shown in the bar graph 800, the genomic dataset comprises approximately 6,000 variants with a corresponding allele frequency of 0.04 (4%), approximately 350 variants with a corresponding allele frequency of 0.08 (8%), approximately 2,750 variants with a corresponding allele frequency of 0.096 (9.6%), approximately 2,100 variants with a corresponding allele frequency of 0.016 (1.6%), approximately 100 variants with a corresponding allele frequency of 0.192 (19.2%), and approximately 150 variants with a corresponding allele frequency of 0.32 (32%). The variants represented by the bar graph 800 accordingly exhibit relatively low variant allele frequencies consistent with (or that mimic) those of somatic mosaic variants.
[0185] In addition to the bar graph 800, FIG. 8 includes a table 810 of experimental results for run time required by the dual-variant-type call recalibration system 106, in comparison with two existing deep-learning-based sequencing systems (indicated as “Prior System A” and “Prior System B”), to identify variants within the genomic dataset represented by the bar graph 800. As represented in the table 810, the run-time results for the dual-variant-type call recalibration system 106 include time for mapping and alignment of reads from the genomic dataset, as well as variant calling of variants, whereas the run-time results provided for the two existing sequencing systems only include the computation time utilized for mosaic variant calling. Significantly, each of the three listed sequencing systems utilized the same read alignment method and candidate genomic coordinates for the genomic dataset to determine their respective variant calls. In addition, the table 810 indicates the computation hardware upon which each respective sequencing system was implemented. As shown in FIG. 8, the dual-variant-type call recalibration system 106 determines variant calls for the provided dataset within a significantly reduced computational run-time compared to the existing deep-learning-based sequencing systems.
[0186] Because the run-time results for the dual-variant-type call recalibration system 106 also include the time for read mapping and read alignment, whereas the run-time results for the two existing sequencing systems exclude such time, the run time of approximately 0.3 hours in the table 810 underestimates the superior speed with which the dual-variant-type call recalibration system 106 determines somatic mosaic variants relative to the 5.5 hours and 12.8 hours consumed by the existing sequencing systems to determine somatic mosaic variants. As noted above, the extensive statistical data analysis performed by such existing deep-learning-based sequencing systems requires excessive computation time relative to the dual-variant-type call recalibration system 106.
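As a quick arithmetic check on the run times quoted above, the ratios can be computed directly. The values are the approximate figures stated in the text; because the 0.3-hour figure also covers read mapping and alignment, the ratios are best read as conservative lower bounds on the variant-calling speedup. The variable names are illustrative.

```python
# Approximate run times in hours, as quoted from the table 810.
recalibration_system_hours = 0.3      # includes mapping, alignment, and variant calling
prior_system_hours = [5.5, 12.8]      # mosaic variant calling only

# Ratio of each prior run time to the recalibration system's run time.
speedups = [t / recalibration_system_hours for t in prior_system_hours]
rounded = [round(s, 1) for s in speedups]
```

Under these figures the existing systems take roughly 18 and 43 times longer, respectively.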
[0187] Turning now to FIG. 9, this figure illustrates an example flowchart of a series of acts of generating variant calls corresponding to germline variants and somatic mosaic variants in accordance with one or more embodiments. While FIG. 9 illustrates acts according to one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 9. The acts of FIG. 9 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 9. In still further embodiments, a system comprises at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 9.
[0188] As shown in FIG. 9, the series of acts 900 includes an act 902 of determining sequencing metrics for nucleotide reads, an act 904 of generating genotype probabilities for variants corresponding to candidate germline variants and candidate somatic mosaic variants, and an act 906 of generating a first variant call corresponding to a germline variant and a second variant call corresponding to a somatic mosaic variant. For example, the series of acts 900 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A method comprising: determining sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generating, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generating, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
CLAUSE 2. The method of clause 1, further comprising: generating, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generating, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
CLAUSE 3. The method of any of clauses 1-2, further comprising generating, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
CLAUSE 4. The method of any of clauses 1-3, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
CLAUSE 5. The method of any of clauses 1-4, wherein the first variant call and the second variant call correspond to a same genomic coordinate of the genomic sample.
CLAUSE 6. The method of any of clauses 1-5, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
CLAUSE 7. The method of any of clauses 1-6, further comprising: generating, utilizing a germline-variant-call-recalibration machine-learning model and based on the sequencing metrics, additional genotype probabilities for germline variants within the genomic regions corresponding to the candidate germline variants; and generating, for the genomic regions and based on the additional genotype probabilities, one or more additional candidate variant calls corresponding to one or more germline variants in the genomic sample.
CLAUSE 8. The method of any of clauses 1-7, further comprising: comparing, for a genomic coordinate for the second variant call, a genotype probability generated by the variant-call-recalibration machine-learning model with an additional genotype probability generated by the germline-variant-call-recalibration machine-learning model; and identifying the second variant call as a somatic mosaic variant based on a comparison of the genotype probability and the additional genotype probability.
CLAUSE 9. The method of any of clauses 1-8, wherein the variant-call-recalibration machine-learning model comprises a first machine-learning sub-model configured to generate a first type of genotype probabilities accounting for a set of candidate germline variants and a second machine-learning sub-model configured to generate a second type of genotype probabilities accounting for a set of candidate somatic mosaic variants.
CLAUSE 10. The method of any of clauses 1-9, further comprising: accessing, based on user input, sequencing data comprising sample nucleotide reads and synthetic nucleotide reads comprising modified nucleobases representing ground-truth somatic mosaic variants; determining the sequencing metrics for the sequencing data by determining sample-read-based sequencing metrics for the sample nucleotide reads and synthetic-read-based sequencing metrics for the synthetic nucleotide reads; and training the variant-call-recalibration machine-learning model to generate, based on the sample-read-based sequencing metrics and the synthetic-read-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants based on comparisons of variant calls and the ground-truth somatic mosaic variants.
CLAUSE 11. The method of any of clauses 1-10, further comprising causing the system to generate the synthetic nucleotide reads by modifying existing nucleotide reads to include the ground-truth somatic mosaic variants at one or more variant allele frequencies representative of one or more somatic mosaic variants.
CLAUSE 12. The method of any of clauses 1-11, further comprising: identifying an admixture of genomic samples that simulates variant-allele frequencies of ground-truth somatic mosaic variants and ground-truth germline variants; accessing a mixture of nucleotide reads comprising a first set of nucleotide reads from a first genomic sample of the admixture of genomic samples and a second set of nucleotide reads from a second genomic sample of the admixture of genomic samples; determining the sequencing metrics for the nucleotide reads by determining admixture-based sequencing metrics for the mixture of nucleotide reads; and training the variant-call-recalibration machine-learning model to generate, based on the admixture-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants and germline variants based on comparisons of predicted variant calls with the ground-truth somatic mosaic variants and the ground-truth germline variants.
CLAUSE 13. The method of any of clauses 1-12, further comprising: receiving an indication of a user selection of a variant-sensitivity option corresponding to detection of the candidate somatic mosaic variants; and executing the variant-call-recalibration machine-learning model to generate the genotype probabilities instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
CLAUSE 14. The method of any of clauses 1-13, wherein the variant-call-recalibration machine-learning model comprises one or more of a gradient boost decision tree or a random forest model.
[0189] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0190] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0191] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used, as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible, as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
[0192] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0193] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) "A sequencing method based on real-time pyrophosphate." Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of a nucleotide at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
[0194] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed, and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0195] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
[0196] In particular embodiments, some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0197] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0198] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label or is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
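By way of illustration only, the exemplary two-channel embodiment above (dATP detected in the first channel, dCTP in the second channel, dTTP in both channels, and dGTP in neither) can be sketched as a simple decoding rule. The sketch assumes normalized channel intensities and a single on/off threshold; the function name and the threshold value of 0.5 are hypothetical and are not specified by the disclosure.

```python
def call_base_two_channel(ch1, ch2, threshold=0.5):
    """Decode one base from two channel intensities for a single feature.

    The mapping follows the exemplary embodiment: first channel only is A,
    second channel only is C, both channels is T, neither channel is G.
    The threshold is an assumed cutoff separating signal from background.
    """
    on1 = ch1 >= threshold
    on2 = ch2 >= threshold
    if on1 and on2:
        return "T"
    if on1:
        return "A"
    if on2:
        return "C"
    return "G"
```

Because the fourth nucleotide type is inferred from the absence (or minimal detection) of signal, such a rule relies on the threshold being set above background fluorescence.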
[0199] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
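By way of illustration only, the one-dye approach above can be sketched as a lookup on whether a feature shows signal in the first image and in the second image of a cycle. The sketch assumes that each image has already been reduced to a per-feature signal/no-signal value; the names used are hypothetical, and the nucleotide types are left as ordinal labels because the passage does not assign specific bases to them.

```python
# Keys are (signal in first image, signal in second image).
ONE_DYE_PATTERNS = {
    (True, False): "first",    # labeled, then label removed after image 1
    (False, True): "second",   # labeled only after image 1 is generated
    (True, True): "third",     # label retained in both images
    (False, False): "fourth",  # unlabeled in both images
}


def call_type_one_dye(signal_in_image1, signal_in_image2):
    """Return which nucleotide type matches the two-image signal pattern."""
    return ONE_DYE_PATTERNS[(bool(signal_in_image1), bool(signal_in_image2))]
```

Two images of a single channel thus provide four distinguishable patterns, which is sufficient to separate four nucleotide types.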
[0200] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0201] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis". Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0202] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.
[0203] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0204] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
[0205] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
[0206] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
[0207] The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device. As defined herein, "sample" and its derivatives are used in their broadest sense and include any specimen, culture and the like that is suspected of including a target. In some embodiments, the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids. The sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids. The term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen. It is also envisioned that the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0208] The nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another embodiment, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some embodiments, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some embodiments, the sample can be an epidemiological, agricultural, forensic or pathogenic sample. In some embodiments, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another embodiment, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus. In some embodiments, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0209] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and/or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0210] The components of the dual-variant-type call recalibration system 106 can include software, hardware, or both. For example, the components of the dual-variant-type call recalibration system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 108). When executed by the one or more processors, the computer-executable instructions of the dual-variant-type call recalibration system 106 can cause the computing devices to perform the methods described herein. Alternatively, the components of the dual-variant-type call recalibration system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the dual-variant-type call recalibration system 106 can include a combination of computer-executable instructions and hardware.
[0211] Furthermore, the components of the dual-variant-type call recalibration system 106 performing the functions described herein with respect to the dual-variant-type call recalibration system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, components of the dual-variant-type call recalibration system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the dual-variant-type call recalibration system 106 may be implemented in any application that provides sequencing services including, but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
[0212] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0213] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0214] Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0215] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0216] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0217] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0218] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0219] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0220] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0221] FIG. 10 illustrates a block diagram of a computing device 1000 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1000 may implement the dual-variant-type call recalibration system 106 and the sequencing system 104. As shown by FIG. 10, the computing device 1000 can comprise a processor 1002, a memory 1004, a storage device 1006, an I/O interface 1008, and a communication interface 1010, which may be communicatively coupled by way of a communication infrastructure 1012. In certain embodiments, the computing device 1000 can include fewer or more components than those shown in FIG. 10. The following paragraphs describe components of the computing device 1000 shown in FIG. 10 in additional detail.
[0222] In one or more embodiments, the processor 1002 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1002 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1004, or the storage device 1006 and decode and execute them. The memory 1004 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1006 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0223] The I/O interface 1008 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1000. The I/O interface 1008 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. The I/O interface 1008 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 1008 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0224] The communication interface 1010 can include hardware, software, or both. In any event, the communication interface 1010 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1000 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1010 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
[0225] Additionally, the communication interface 1010 may facilitate communications with various types of wired or wireless networks. The communication interface 1010 may also facilitate communications using various communication protocols. The communication infrastructure 1012 may also include hardware, software, or both that couples components of the computing device 1000 to each other. For example, the communication interface 1010 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0226] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0227] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims
1. A system comprising: at least one processor; and a non-transitory computer readable medium storing instructions that, when executed by the at least one processor, cause the system to: determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generate, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generate, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generate, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
4. The system of claim 1, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
5. The system of claim 1, wherein the first variant call and the second variant call correspond to a same genomic coordinate of the genomic sample.
6. The system of claim 1, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, utilizing a germline-variant-call-recalibration machine-learning model and based on the sequencing metrics, additional genotype probabilities for germline variants within the genomic regions corresponding to the candidate germline variants; and generate, for the genomic regions and based on the additional genotype probabilities, one or more additional candidate variant calls corresponding to one or more germline variants in the genomic sample.
8. The system of claim 7, further comprising instructions that, when executed by the at least one processor, cause the system to: compare, for a genomic coordinate for the second variant call, a genotype probability generated by the variant-call-recalibration machine-learning model with an additional genotype probability generated by the germline-variant-call-recalibration machine-learning model; and identify the second variant call as a somatic mosaic variant based on a comparison of the genotype probability and the additional genotype probability.
9. The system of claim 1, wherein the variant-call-recalibration machine-learning model comprises a first machine-learning sub-model configured to generate a first type of genotype probabilities accounting for a set of candidate germline variants and a second machine-learning sub-model configured to generate a second type of genotype probabilities accounting for a set of candidate somatic mosaic variants.
10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: access, based on user input, sequencing data comprising sample nucleotide reads and synthetic nucleotide reads comprising modified nucleobases representing ground-truth somatic mosaic variants; determine the sequencing metrics for the sequencing data by determining sample-read-based sequencing metrics for the sample nucleotide reads and synthetic-read-based sequencing metrics for the synthetic nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the sample-read-based sequencing metrics and the synthetic-read-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants based on comparisons of variant calls and the ground-truth somatic mosaic variants.
11. The system of claim 10, further comprising instructions that, when executed by the at least one processor, cause the system to generate the synthetic nucleotide reads by modifying existing nucleotide reads to include the ground-truth somatic mosaic variants at one or more variant allele frequencies representative of one or more somatic mosaic variants.
12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify an admixture of genomic samples that simulates variant-allele frequencies of ground-truth somatic mosaic variants and ground-truth germline variants; access a mixture of nucleotide reads comprising a first set of nucleotide reads from a first genomic sample of the admixture of genomic samples and a second set of nucleotide reads from a second genomic sample of the admixture of genomic samples; determine the sequencing metrics for the nucleotide reads by determining admixture-based sequencing metrics for the mixture of nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the admixture-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants and germline variants based on comparisons of predicted variant calls with the ground-truth somatic mosaic variants and the ground-truth germline variants.
13. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: receive an indication of a user selection of a variant-sensitivity option corresponding to detection of the candidate somatic mosaic variants; and execute the variant-call-recalibration machine-learning model to generate the genotype probabilities instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
14. The system of claim 1, wherein the variant-call-recalibration machine-learning model comprises one or more of a gradient boost decision tree or a random forest model.
15. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generate, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generate, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
16. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generate, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
17. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
18. The non-transitory computer-readable medium of claim 15, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
19. The non-transitory computer-readable medium of claim 15, wherein the first variant call and the second variant call correspond to a same genomic coordinate of the genomic sample.
20. The non-transitory computer-readable medium of claim 15, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
21. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate, utilizing a germline-variant-call-recalibration machine-learning model and based on the sequencing metrics, additional genotype probabilities for germline variants within the genomic regions corresponding to the candidate germline variants; and generate, for the genomic regions and based on the additional genotype probabilities, one or more additional candidate variant calls corresponding to one or more germline variants in the genomic sample.
22. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: compare, for a genomic coordinate for the second variant call, a genotype probability generated by the variant-call-recalibration machine-learning model with an additional genotype probability generated by the germline-variant-call-recalibration machine-learning model; and identify the second variant call as a somatic mosaic variant based on a comparison of the genotype probability and the additional genotype probability.
23. The non-transitory computer-readable medium of claim 15, wherein the variant-call-recalibration machine-learning model comprises a first machine-learning sub-model configured to generate a first type of genotype probabilities accounting for a set of candidate germline variants and a second machine-learning sub-model configured to generate a second type of genotype probabilities accounting for a set of candidate somatic mosaic variants.
24. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: access, based on user input, sequencing data comprising sample nucleotide reads and synthetic nucleotide reads comprising modified nucleobases representing ground-truth somatic mosaic variants; determine the sequencing metrics for the sequencing data by determining sample-read-based sequencing metrics for the sample nucleotide reads and synthetic-read-based sequencing metrics for the synthetic nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the sample-read-based sequencing metrics and the synthetic-read-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants based on comparisons of variant calls and the ground-truth somatic mosaic variants.
25. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the synthetic nucleotide reads by modifying existing nucleotide reads to include the ground-truth somatic mosaic variants at one or more variant allele frequencies representative of one or more somatic mosaic variants.
26. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: identify an admixture of genomic samples that simulates variant-allele frequencies of ground-truth somatic mosaic variants and ground-truth germline variants; access a mixture of nucleotide reads comprising a first set of nucleotide reads from a first genomic sample of the admixture of genomic samples and a second set of nucleotide reads from a second genomic sample of the admixture of genomic samples; determine the sequencing metrics for the nucleotide reads by determining admixture-based sequencing metrics for the mixture of nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the admixture-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants and germline variants based on comparisons of predicted variant calls with the ground-truth somatic mosaic variants and the ground-truth germline variants.
27. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: receive an indication of a user selection of a variant-sensitivity option corresponding to detection of the candidate somatic mosaic variants; and execute the variant-call-recalibration machine-learning model to generate the genotype probabilities instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
28. The non-transitory computer-readable medium of claim 15, wherein the variant-call-recalibration machine-learning model comprises one or more of a gradient boost decision tree or a random forest model.
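By way of illustration only, and not limitation, the comparison recited in claims 8 and 22 can be sketched in Python as follows. The sketch assumes that each model has already produced a variant probability for a genomic coordinate. The names SiteProbabilities, classify_site, CALL_THRESHOLD, and MOSAIC_MARGIN, and the numeric values assigned to the thresholds, are hypothetical and do not appear in the disclosure; the decision rule shown is only one possible comparison and is not asserted to be the claimed implementation.

```python
from dataclasses import dataclass

# Hypothetical thresholds chosen for illustration; the claims give no numeric values.
CALL_THRESHOLD = 0.5   # minimum probability to emit any variant call
MOSAIC_MARGIN = 0.2    # how much the two models must disagree to label a call mosaic


@dataclass
class SiteProbabilities:
    coordinate: str            # genomic coordinate, e.g. "chr1:100" (illustrative format)
    p_variant_dual: float      # probability from the model covering germline and mosaic candidates
    p_variant_germline: float  # probability from the germline-only recalibration model


def classify_site(site: SiteProbabilities) -> str:
    """Return 'no_call', 'germline', or 'somatic_mosaic' for one coordinate."""
    if site.p_variant_dual < CALL_THRESHOLD:
        return "no_call"
    # A variant that the germline-only model scores much lower is treated here
    # as a somatic mosaic variant (assumed rule, for illustration only).
    if site.p_variant_dual - site.p_variant_germline >= MOSAIC_MARGIN:
        return "somatic_mosaic"
    return "germline"


if __name__ == "__main__":
    for site in (
        SiteProbabilities("chr1:100", 0.95, 0.90),
        SiteProbabilities("chr1:200", 0.90, 0.10),
        SiteProbabilities("chr1:300", 0.20, 0.10),
    ):
        print(site.coordinate, classify_site(site))
```

In this sketch, a coordinate scored highly by both models is labeled a germline variant, while a coordinate scored highly only by the model that accounts for candidate somatic mosaic variants is labeled a somatic mosaic variant. Other comparisons (for example, ratios of genotype probabilities) could equally be used.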
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202363511605P | 2023-06-30 | 2023-06-30 | |
US63/511,605 | 2023-06-30 | | |
US202363607446P | 2023-12-07 | 2023-12-07 | |
US63/607,446 | 2023-12-07 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2025006874A1 (en) | 2025-01-02 |
Family
ID=91961715
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2024/036003 WO2025006874A1 (en) | 2023-06-30 | 2024-06-28 | Machine-learning model for recalibrating genotype calls corresponding to germline variants and somatic mosaic variants |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2025006874A1 (en) |
Patent Citations (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1991006678A1 (en) | 1989-10-26 | 1991-05-16 | Sri International | Dna sequencing |
US6172218B1 (en) | 1994-10-13 | 2001-01-09 | Lynx Therapeutics, Inc. | Oligonucleotide tags for sorting and identification |
US6306597B1 (en) | 1995-04-17 | 2001-10-23 | Lynx Therapeutics, Inc. | DNA sequencing by parallel oligonucleotide extensions |
US6210891B1 (en) | 1996-09-27 | 2001-04-03 | Pyrosequencing Ab | Method of sequencing DNA |
US6258568B1 (en) | 1996-12-23 | 2001-07-10 | Pyrosequencing Ab | Method of sequencing DNA based on the detection of the release of pyrophosphate and enzymatic nucleotide degradation |
US20050100900A1 (en) | 1997-04-01 | 2005-05-12 | Manteia Sa | Method of nucleic acid amplification |
US6969488B2 (en) | 1998-05-22 | 2005-11-29 | Solexa, Inc. | System and apparatus for sequential processing of analytes |
US6274320B1 (en) | 1999-09-16 | 2001-08-14 | Curagen Corporation | Method of sequencing a nucleic acid |
US7001792B2 (en) | 2000-04-24 | 2006-02-21 | Eagle Research & Development, Llc | Ultra-fast nucleic acid sequencing device and a method for making and using the same |
US7329492B2 (en) | 2000-07-07 | 2008-02-12 | Visigen Biotechnologies, Inc. | Methods for real-time single molecule sequence determination |
US7211414B2 (en) | 2000-12-01 | 2007-05-01 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
US7427673B2 (en) | 2001-12-04 | 2008-09-23 | Illumina Cambridge Limited | Labelled nucleotides |
US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
US20060188901A1 (en) | 2001-12-04 | 2006-08-24 | Solexa Limited | Labelled nucleotides |
US20070166705A1 (en) | 2002-08-23 | 2007-07-19 | John Milton | Modified nucleotides |
WO2004018497A2 (en) | 2002-08-23 | 2004-03-04 | Solexa Limited | Modified nucleotides for polynucleotide sequencing |
US20060240439A1 (en) | 2003-09-11 | 2006-10-26 | Smith Geoffrey P | Modified polymerases for improved incorporation of nucleotide analogues |
WO2005065814A1 (en) | 2004-01-07 | 2005-07-21 | Solexa Limited | Modified molecular arrays |
US7315019B2 (en) | 2004-09-17 | 2008-01-01 | Pacific Biosciences Of California, Inc. | Arrays of optical confinements and uses thereof |
WO2006064199A1 (en) | 2004-12-13 | 2006-06-22 | Solexa Limited | Improved method of nucleotide detection |
US20060281109A1 (en) | 2005-05-10 | 2006-12-14 | Barr Ost Tobias W | Polymerases |
WO2007010251A2 (en) | 2005-07-20 | 2007-01-25 | Solexa Limited | Preparation of templates for nucleic acid sequencing |
US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
US20100111768A1 (en) | 2006-03-31 | 2010-05-06 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
WO2007123744A2 (en) | 2006-03-31 | 2007-11-01 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
US20080108082A1 (en) | 2006-10-23 | 2008-05-08 | Pacific Biosciences Of California, Inc. | Polymerase enzymes and reagents for enhanced nucleic acid sequencing |
US20090026082A1 (en) | 2006-12-14 | 2009-01-29 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
US20090127589A1 (en) | 2006-12-14 | 2009-05-21 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
US20100282617A1 (en) | 2006-12-14 | 2010-11-11 | Ion Torrent Systems Incorporated | Methods and apparatus for detecting molecular interactions using fet arrays |
US20100137143A1 (en) | 2008-10-22 | 2010-06-03 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes |
US20120270305A1 (en) | 2011-01-10 | 2012-10-25 | Illumina Inc. | Systems, methods, and apparatuses to image a sample for biological or chemical analysis |
US20130079232A1 (en) | 2011-09-23 | 2013-03-28 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
US20130260372A1 (en) | 2012-04-03 | 2013-10-03 | Illumina, Inc. | Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing |
CA3065784A1 (en) * | 2018-04-12 | 2019-10-17 | Illumina, Inc. | Variant classifier based on deep neural networks |
US20210257050A1 (en) * | 2018-08-13 | 2021-08-19 | Roche Sequencing Solutions, Inc. | Systems and methods for using neural networks for germline and somatic variant calling |
US20220415443A1 (en) * | 2021-06-29 | 2022-12-29 | Illumina, Inc. | Machine-learning model for generating confidence classifications for genomic coordinates |
Non-Patent Citations (14)
Title |
---|
COCKROFT, S. L.; CHU, J.; AMORIN, M.; GHADIRI, M. R.: "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution", J. AM. CHEM. SOC., vol. 130, 2008, pages 818 - 820, XP055097434, DOI: 10.1021/ja077082c |
DEAMER, D. W.; AKESON, M.: "Nanopores and nucleic acids: prospects for ultrarapid sequencing", TRENDS BIOTECHNOL, vol. 18, 2000, pages 147 - 151, XP004194002, DOI: 10.1016/S0167-7799(00)01426-8 |
DEAMER, D.; BRANTON, D.: "Characterization of nucleic acids by nanopore analysis", ACC. CHEM. RES, vol. 35, 2002, pages 817 - 825, XP002226144, DOI: 10.1021/ar000138m |
HEALY, K: "Nanopore-based single-molecule DNA analysis", NANOMED, vol. 2, 2007, pages 459 - 481, XP009111262, DOI: 10.2217/17435889.2.4.459 |
KORLACH, J. ET AL.: "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures", PROC. NATL. ACAD. SCI. USA, vol. 105, 2008, pages 1176 - 1181 |
LEVENE, M. J. ET AL.: "Zero-mode waveguides for single-molecule analysis at high concentrations", SCIENCE, vol. 299, 2003, pages 682 - 686, XP002341055, DOI: 10.1126/science.1079700 |
LI, J.; GERSHOW, M.; STEIN, D.; BRANDIN, E.; GOLOVCHENKO, J. A.: "DNA molecules and configurations in a solid-state nanopore microscope", NAT. MATER, vol. 2, 2003, pages 611 - 615, XP009039572, DOI: 10.1038/nmat965 |
LUNDQUIST, P. M. ET AL.: "Parallel confocal detection of single molecules in real time", OPT. LETT, vol. 33, 2008, pages 1026 - 1028, XP001522593, DOI: 10.1364/OL.33.001026 |
METZKER, GENOME RES, vol. 15, 2005, pages 1767 - 1776 |
RONAGHI, M: "Pyrosequencing sheds light on DNA sequencing", GENOME RES, vol. 11, no. 1, 2001, pages 3 - 11, XP000980886, DOI: 10.1101/gr.11.1.3 |
RONAGHI, M.; KARAMOHAMED, S.; PETTERSSON, B.; UHLEN, M.; NYREN, P.: "Real-time DNA sequencing using detection of pyrophosphate release", ANALYTICAL BIOCHEMISTRY, vol. 242, no. 1, 1996, pages 84 - 9, XP002388725, DOI: 10.1006/abio.1996.0432 |
RONAGHI, M.; UHLEN, M.; NYREN, P.: "A sequencing method based on real-time pyrophosphate", SCIENCE, vol. 281, no. 5375, 1998, pages 363, XP002135869, DOI: 10.1126/science.281.5375.363 |
RUPAREL ET AL., PROC NATL ACAD SCI USA, vol. 102, 2005, pages 5932 - 7 |
SONI, G. V.; MELLER, A.: "Progress toward ultrafast DNA sequencing using solid-state nanopores", CLIN. CHEM., vol. 53, 2007, pages 1996 - 2001, XP055076185, DOI: 10.1373/clinchem.2007.091231 |