
WO2024129844A1 - Techniques for designing patient-specific panels and methods of use thereof for detecting minimal residual disease - Google Patents

Techniques for designing patient-specific panels and methods of use thereof for detecting minimal residual disease

Info

Publication number
WO2024129844A1
Authority
WO
WIPO (PCT)
Prior art keywords
patient
variant
tsvs
tsv
data
Prior art date
Application number
PCT/US2023/083809
Other languages
French (fr)
Inventor
Peter Matthew DEFORD
Laura Anne JOHNSON
Aaron Timothy GARNETT
Nirav MALANI
Original Assignee
Invitae Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Invitae Corporation
Publication of WO2024129844A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Definitions

  • MRD minimal residual disease
  • ctDNA circulating tumor DNA
  • the method may comprise: using at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants present in tumor cells of the patient, the variant data being derived from at least one biological sample obtained from the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; and identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least
  • the method further comprises identifying primers for use in detecting presence, in a biological sample, of at least some variants in the subset of the plurality of TSVs.
  • obtaining the variant data indicative of the plurality of variants of the patient comprises: obtaining at least one data structure encoding variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant
  • variant sequence context data comprises sequence context homopolymer data, sequence context splice site data, sequence context mutation data, and/or sequence context conservation data.
  • obtaining variant data indicative of a plurality of variants of the patient comprises obtaining the variant data previously generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient, optionally wherein obtaining variant data comprises sequencing the at least one biological sample obtained from the patient and analyzing sequencing data produced by the sequencing.
  • the variant data indicative of a plurality of variants present in tumor cells of the patient comprises data characterizing a variant derived from sequencing data from a sample comprising genomic material derived from tumor cells of the patient.
  • sequencing the at least one biological sample comprises sequencing using whole genome sequencing (WGS) or whole exome sequencing (WES).
  • obtaining variant data comprises obtaining sequence data of a tumor cell sample and a non-tumor cell sample of the patient.
  • the tumor cell sample comprises melanoma cells or lung cancer cells.
  • obtaining the variant data indicative of the plurality of variants of the patient comprises using at least one variant caller to identify the plurality of variants.
  • obtaining the variant data indicative of the plurality of variants of the patient comprises analyzing sequence data generated by sequencing the tumor cells obtained from the patient and using at least one variant caller to identify the plurality of variants.
  • identifying the plurality of TSVs comprises: selecting variants from among the plurality of variants using at least one feature selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency.
  • identifying the plurality of TSVs comprises identifying the plurality of TSVs in a biological sample of a tumor comprising the tumor cells of the patient.
  • identifying the plurality of TSVs comprises selecting variants using at least two features described herein.
  • identifying the plurality of TSVs comprises selecting variants using at least three features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least four features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least five features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using all the features in the group consisting of variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency.
  • identifying the plurality of TSVs comprises selecting variants using variant bi-directional support, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, the threshold number of times is between 2 and 15. In some embodiments, identifying the plurality of TSVs comprises selecting variants using the healthy population variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant has a variant allele frequency in a healthy population, as defined by at least one genomic database, of less than a threshold percentage.
  • the threshold percentage is 1%.
  • identifying the plurality of TSVs comprises selecting variants using sequence context homopolymer size, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether a homopolymer sequence exceeding a threshold size is present between the variant and a binding site of a primer designed to detect presence of the variant.
  • selecting variants using sequence context homopolymer size comprises selecting variants using sequence data derived from a biological sample of a tumor comprising the tumor cells of the patient.
  • the threshold size is 6 nucleotides.
  • identifying the plurality of TSVs comprises selecting variants using sequence coverage in non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold.
  • the threshold is between 45X and 100X.
  • identifying the plurality of TSVs comprises selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the ratio of the variant exceeds a threshold ratio.
  • identifying the plurality of TSVs comprises determining the ratio of variant allele frequency between sequence data of a biological sample of a tumor comprising the tumor cells of the patient and sequence data of non-tumor cells of the patient.
  • the threshold ratio of tumor cell variant allele frequency to non-tumor cell variant allele frequency is between 10:1 and 20:1.
  • identifying the plurality of TSVs comprises selecting variants using the tumor cell variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold.
  • selecting variants using the tumor cell variant allele frequency comprises selecting using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
  • the threshold is a tumor cell variant allele frequency between 0.05 and 0.1.
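The six TSV selection criteria listed above can be combined into a single pass/fail filter. The following is a minimal sketch, not the applicant's implementation: the `Variant` record, its field names, and the `FilterThresholds` defaults are hypothetical, and each default is simply one value taken from the ranges stated above (2 to 15 strand observations, 1% population frequency, 6-nucleotide homopolymer, 45X to 100X non-tumor coverage, 10:1 to 20:1 allele frequency ratio, 0.05 to 0.1 tumor allele frequency). Treating a long homopolymer as a reason to exclude a variant is also an assumption, since the text only says its presence is determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical per-variant record; field names are illustrative only."""
    name: str
    plus_strand_obs: int      # variant observations in plus-strand reads
    minus_strand_obs: int     # variant observations in minus-strand reads
    population_vaf: float     # allele frequency in a healthy-population database
    max_homopolymer_len: int  # longest homopolymer between variant and primer site
    normal_coverage: int      # sequencing depth at the site in non-tumor cells
    tumor_vaf: float          # variant allele frequency in tumor cells
    normal_vaf: float         # variant allele frequency in non-tumor cells


@dataclass(frozen=True)
class FilterThresholds:
    """Assumed defaults, each picked from a range quoted in the text."""
    min_strand_obs: int = 2
    max_population_vaf: float = 0.01
    max_homopolymer_len: int = 6
    min_normal_coverage: int = 45
    min_vaf_ratio: float = 10.0
    min_tumor_vaf: float = 0.05


def is_tumor_specific(v: Variant, t: FilterThresholds = FilterThresholds()) -> bool:
    """Return True when the variant passes all six criteria."""
    bidirectional = min(v.plus_strand_obs, v.minus_strand_obs) >= t.min_strand_obs
    rare_in_population = v.population_vaf < t.max_population_vaf
    no_long_homopolymer = v.max_homopolymer_len <= t.max_homopolymer_len
    covered_in_normal = v.normal_coverage > t.min_normal_coverage
    # A variant never observed in non-tumor cells has an unbounded ratio and passes.
    enriched_in_tumor = (
        v.normal_vaf == 0 or v.tumor_vaf / v.normal_vaf > t.min_vaf_ratio
    )
    abundant_in_tumor = v.tumor_vaf > t.min_tumor_vaf
    return all((
        bidirectional,
        rare_in_population,
        no_long_homopolymer,
        covered_in_normal,
        enriched_in_tumor,
        abundant_in_tumor,
    ))
```

Keeping the thresholds in their own record makes it straightforward to try other values within the stated ranges without touching the filter logic.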
  • generating the set of features comprises generating: at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • the plurality of TSVs comprises a first TSV, wherein generating the respective set of features comprises generating a first set of features for the first TSV, and wherein generating the first set of features for the first TSV comprises generating at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • generating the first set of features for the first TSV comprises generating the at least one sequencing coverage feature for the first TSV, and wherein generating the at least one sequencing coverage feature comprises determining sequencing depth of coverage of plus strands and minus strands for the first TSV, and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV. In some embodiments, generating the at least one sequencing coverage feature for the first TSV comprises generating the at least one sequencing coverage feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
  • generating the first set of features for the first TSV comprises generating the at least one allele frequency feature, and wherein generating the at least one allele frequency feature comprises determining non-tumor cell depth coverage for the first TSV, a number of observations of the first TSV in tumor cells of the patient, and/or a tumor allele frequency of the first TSV.
  • generating the at least one allele frequency feature comprises generating the at least one allele frequency feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
  • generating the first set of features for the first TSV comprises generating the at least one primer feature, and wherein generating the at least one primer feature comprises determining a distance between a first TSV and a binding site for a primer designed to detect the first TSV. In some embodiments, generating the at least one primer feature comprises determining a distance between a first TSV and a PCR primer designed to amplify a portion of a polynucleotide comprising the first TSV.
  • generating the at least one primer feature comprises determining a maximum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a maximum distance between the first TSV and binding site for a second primer, different from the first primer, designed to detect the first TSV. In some embodiments, generating the at least one primer feature comprises determining a minimum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a minimum distance between the first TSV and binding site for a second primer designed to detect the first TSV.
  • a first primer and/or a second primer are PCR primers designed to amplify a portion of a polynucleotide comprising the first TSV.
  • generating the first set of features for the first TSV comprises generating the at least one sequence context feature, and wherein generating the at least one sequence context feature comprises determining a conservation score of a polynucleotide of the patient comprising the first TSV, a distance between the first TSV and a nearest splice site on the polynucleotide, and/or a splice site score of the polynucleotide.
  • generating the conservation score comprises generating a phastCons conservation score and/or a phyloP conservation score.
  • generating the first set of features for the first TSV comprises determining: the sequencing depth of coverage of plus strands and minus strands for the first TSV, the non-tumor cell depth coverage for the first TSV, the number of observations of the first TSV in tumor cells of the
  • the method further comprises determining one or more of the maximum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV, the tumor allele frequency of the first TSV, the phastCons conservation score of the first TSV, the maximum distance between the first TSV and a binding site for the first primer designed to detect the first TSV, the distance between the first TSV and the nearest splice site on a polynucleotide of the patient comprising the first TSV, and a phyloP conservation score.
  • the method further comprises determining one or more of the C to A variant mutation feature, the minimum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the splice site score of the polynucleotide, the minimum distance between the first TSV and the binding site for the second primer designed to detect the first TSV.
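Several of the per-TSV features described above reduce to simple arithmetic on read depths and coordinates. The sketch below illustrates three of them under stated assumptions: all function names are hypothetical, positions are integer coordinates on one reference strand, a primer binding site is a closed `(start, end)` interval, and the "minimum" and "maximum" primer distances are taken across a list of candidate binding sites for one primer (the text does not say how these extremes are defined).

```python
def strand_coverage_features(plus_depth: int, minus_depth: int) -> dict:
    """Depth on each strand, the smaller of the two, and the plus/minus ratio."""
    ratio = plus_depth / minus_depth if minus_depth else float("inf")
    return {
        "plus_depth": plus_depth,
        "minus_depth": minus_depth,
        "min_strand_coverage": min(plus_depth, minus_depth),
        "strand_ratio": ratio,
    }


def primer_distances(variant_pos: int, binding_sites: list) -> tuple:
    """Return (min, max) distance in bases from the variant to each binding site.

    Distance is measured to the nearest edge of the site and is 0 when the
    variant lies inside it. Assumes binding_sites is non-empty.
    """
    distances = []
    for start, end in binding_sites:
        if start <= variant_pos <= end:
            distances.append(0)
        else:
            distances.append(min(abs(variant_pos - start), abs(variant_pos - end)))
    return min(distances), max(distances)


def is_c_to_a(ref: str, alt: str) -> bool:
    """Flag a C to A substitution, assuming alleles are reported on one strand."""
    return ref.upper() == "C" and alt.upper() == "A"
```

For example, `primer_distances(100, [(60, 80), (130, 150)])` measures 20 bases to the first site and 30 to the second, so it returns `(20, 30)`.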
  • processing the plurality of sets of features using the trained machine learning model to obtain a corresponding plurality of scores comprises processing the plurality of sets of features using a trained nonlinear classification model.
  • the trained nonlinear classification model comprises a random forest model.
  • the trained machine learning model comprises a plurality of parameters having respective values and wherein processing a set of features of the plurality of sets of features comprises computing a score using the set of features and the respective values of the plurality of parameters.
  • the score is the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
  • selecting the TSVs for inclusion into the subset of the plurality of TSVs comprises selecting a threshold number of TSVs based on their respective scores. In some embodiments, selecting a threshold number of TSVs based on their respective scores comprises selecting TSVs with the highest scores. In some embodiments, selecting a threshold number of TSVs based on their respective scores comprises selecting 50 TSVs with the highest scores.
  • the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer and is predictive of the likelihood of detecting a TSV in a biological sample from a MRD positive patient having a second cancer that is different from the first cancer.
  • the first cancer is lung cancer and the second cancer is melanoma.
  • the method further comprises: synthesizing primers corresponding to at least some of the TSVs in the subset of the plurality of TSVs.
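The scoring and selection steps above amount to scoring each TSV's feature set and keeping the highest-scoring TSVs. A minimal sketch follows, assuming a hypothetical `score_fn` callable that stands in for the trained model; with a fitted scikit-learn `RandomForestClassifier`, for instance, the positive-class column of `predict_proba` could play that role. The default panel size of 50 is the figure given in the text; the function and parameter names are illustrative.

```python
import heapq


def select_panel(features_by_tsv: dict, score_fn, panel_size: int = 50) -> list:
    """Score every TSV and return up to panel_size (tsv_id, score) pairs.

    features_by_tsv maps a TSV identifier to its feature set; score_fn is any
    callable that turns one feature set into a detectability score.
    Results are ordered from highest to lowest score.
    """
    scored = [(score_fn(features), tsv_id)
              for tsv_id, features in features_by_tsv.items()]
    top = heapq.nlargest(panel_size, scored)
    return [(tsv_id, score) for score, tsv_id in top]
```

When a patient has fewer candidate TSVs than `panel_size`, every candidate is returned, ranked by score.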
  • this disclosure describes a method of training a machine learning model to generate a score indicative of the predicted detectability of a tumor-specific variant (TSV) in a biological sample of a minimal residual disease (MRD) positive patient, the machine learning model comprising a plurality of parameters, the method comprising: obtaining training data, the training data derived from data collected during previously performed monitoring for presence of a plurality of TSVs in a plurality of biological samples collected from MRD positive patients, the training data comprising: for each TSV in the plurality of TSVs and each biological sample in which the TSV was previously monitored, (i) variant data associated with the TSV; and (ii) an indication of whether the TSV was present or absent in the biological sample; and training the machine learning model by using the training data to estimate values of the plurality of parameters to obtain
  • obtaining training data comprises obtaining variant data associated with each TSV, the variant data comprising at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • obtaining training data comprises obtaining an indication of whether the TSV is present or absent in the biological sample, the indication determined based on the TSV being present in the biological sample at an allele frequency that exceeds a threshold.
  • training a machine learning model to predict a score indicative of detectability of a TSV in a biological sample comprises training the machine learning model to predict a likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
  • the MRD positive patients comprise patients that have been previously diagnosed with lung cancer and/or patients that have been previously diagnosed with melanoma.
  • the plurality of TSVs comprises at least 200 TSVs.
  • the MRD positive patients comprise at least 50 MRD positive patients.
  • the MRD positive patients comprise at least 500 MRD positive patients.
  • training the machine learning model comprises training a nonlinear machine learning model.
  • training the machine learning model comprises training a nonlinear regression machine learning model.
  • training the machine learning model comprises training a nonlinear
  • training the machine learning model comprises training a random forest model.
  • training the machine learning model to estimate values of the plurality of parameters comprises estimating the values of 5 parameters.
  • training the machine learning model comprises training the trained machine learning model as described herein.
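The training data described above pairs the variant data of each monitored TSV with a present/absent label for each biological sample, where presence is decided by an allele frequency threshold. The following sketch shows one way such rows could be assembled. The `Observation` record and `build_training_rows` are hypothetical names, and the threshold is left as a required argument because the text does not state its value.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One TSV monitored in one biological sample (illustrative fields)."""
    tsv_id: str
    sample_id: str
    features: tuple     # variant data associated with the TSV
    observed_af: float  # allele frequency of the TSV measured in the sample


def build_training_rows(observations, af_threshold: float):
    """Return (X, y): feature rows and 1/0 labels for present/absent.

    A TSV counts as present in a sample when its observed allele frequency
    exceeds af_threshold.
    """
    X, y = [], []
    for obs in observations:
        X.append(list(obs.features))
        y.append(1 if obs.observed_af > af_threshold else 0)
    return X, y
```

The resulting `X` and `y` are in the row-per-example layout that common classifier libraries accept, so they could be passed to a random forest or another nonlinear model for parameter estimation.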
  • this disclosure describes a method for determining whether patient-specific panel data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in detecting a subset of a plurality of TSVs using the method described herein; generating sequence data from the biological sample of the patient, the generating comprising contacting the biological sample with the primers; detecting TSVs using the sequence data; and determining, using the detected TSVs, whether the biological sample provides an indication of MRD.
  • the biological sample is a blood, serum or plasma sample of the patient.
  • detecting the TSVs using the sequence data comprises determining the allele frequency of the TSVs in the biological sample.
  • determining whether the biological sample provides an indication of MRD comprises determining whether the allele frequency of at least some of the TSVs exceeds an error rate of generating sequencing data of the biological sample.
  • the method further comprises administering a therapeutic when the patient has a positive indication of MRD or continuing to collect biological samples from the patient for use in monitoring the patient for MRD when the patient has a negative indication of MRD.
  • administering a therapeutic comprises administering a therapeutic to treat a cancer and/or tumor associated with the indication of MRD.
  • determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with sensitivity greater than a 0.85 probability of detecting MRD in a patient that has MRD.
  • determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a 0.98 probability of not detecting MRD in a patient that does not have MRD.
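The MRD determination above compares TSV allele frequencies against the sequencing error rate, and its performance is stated as sensitivity and specificity. A minimal sketch of both calculations follows. The function names are hypothetical, and `min_positive_tsvs` is an assumed parameter standing in for the unspecified "at least some of the TSVs".

```python
def call_mrd(tsv_allele_freqs: dict, error_rate: float,
             min_positive_tsvs: int = 1) -> bool:
    """Return True when enough TSVs are observed above the error rate.

    tsv_allele_freqs maps each panel TSV to its allele frequency in the sample.
    """
    above_error = sum(1 for af in tsv_allele_freqs.values() if af > error_rate)
    return above_error >= min_positive_tsvs


def sensitivity_specificity(calls: list, truth: list) -> tuple:
    """Compute (sensitivity, specificity) from paired boolean calls and truth.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Assumes truth contains at least one positive and one negative patient.
    """
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    tn = sum(1 for c, t in zip(calls, truth) if not c and not t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    return tp / (tp + fn), tn / (tn + fp)
```

Under these definitions, the figures quoted above correspond to a sensitivity greater than 0.85 and a specificity greater than 0.98.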
  • this disclosure describes a system for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient.
  • the system may comprise: at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the
  • the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, as described herein.
  • this disclosure describes at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of
  • the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, as described herein.
  • FIG.1 is a diagram depicting an illustrative technique 100 for using variant data from tumor cells and non-tumor cells of a patient to design a patient-specific panel for detecting MRD in the patient, according to some embodiments of the technology described herein.
  • FIG.2A is a flowchart of an illustrative process 200 for identifying a subset of a plurality of tumor specific variants (TSVs) for use in a patient-specific panel for identifying MRD, and optionally identifying and/or synthesizing one or more primers for inclusion in a patient-specific panel, according to some embodiments of the technology described herein. Steps enclosed with dashed lines are optional.
  • TSVs tumor specific variants
  • FIG.2B is a flowchart of an illustrative process 250 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein.
  • FIG.3 is a diagram depicting an illustrative technique 300 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein.
  • FIG.4 is a flowchart of an illustrative process 400 for identifying the subset of the plurality of TSVs for use in a patient-specific panel, according to some embodiments of the technology described herein.
  • FIG.5 is a diagram depicting an illustrative technique 500 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using variants identified by sequencing non-tumor cells and tumor cells of the patient to identify TSVs and exclude non-tumor-specific variants, scoring the TSVs using a trained machine learning model, and selecting TSVs for the patient-specific panel using the scores, according to some embodiments of the technology described herein.
  • FIG.6 is a scatter plot showing SHapley Additive exPlanations (SHAP) values of TSV features included when training and testing the machine learning model, according to some embodiments of the technology described herein.
  • FIG.7 is a table of TSV features selected for use in a trained machine learning model, according to some embodiments of the technology described herein.
  • FIG.8 is a beeswarm plot of the SHAP values of each feature of FIG.7, where a broader SHAP value distribution for a given feature indicates a greater impact of that feature on the scores of the trained machine learning model, according to some embodiments of the technology described herein.
  • FIG.9A is a scatter plot comparing variant max splice site score to SHAP values of the variant max splice site score where each point is colored by Random Forest (RF) Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9B is a scatter plot comparing minimum primer 1 distance to SHAP values of minimum primer 1 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9C is a scatter plot comparing minimum primer 2 distance to SHAP values of the minimum primer 2 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9D is a scatter plot comparing maximum primer 2 distance to SHAP values of maximum primer 2 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9E is a scatter plot comparing tumor cell alternate observations to SHAP values of tumor cell alternate observations where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9F is a scatter plot comparing maximum primer 1 distance to SHAP values of maximum primer 1 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9G is a scatter plot comparing phyloP conservation score to SHAP values of phyloP conservation score where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9H is a scatter plot comparing phastCons conservation score to SHAP values of the phastCons conservation score where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9I is a scatter plot comparing tumor cell allele frequency (FAF) to SHAP values of FAF where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9J is a scatter plot comparing non-tumor cell depth coverage to SHAP values of non-tumor cell depth coverage where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9K is a scatter plot comparing strand bias to SHAP values of strand bias where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9L is a scatter plot comparing minimum strand coverage to SHAP values of minimum strand coverage where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9M is a scatter plot comparing error rate corrected error bins to SHAP values of the error rate corrected error bins where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9N is a scatter plot comparing C to A mutations to SHAP values of the C to A mutations where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.9O is a scatter plot comparing distance to nearest splice site to SHAP values of the distance to nearest splice site where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
  • FIG.10A shows box and whisker plots of the sensitivity and specificity of a rules-based algorithm and the trained machine learning model for predicting MRD in lung and melanoma cancer patients, according to some embodiments of the technology described herein.
  • FIG.10B shows bar charts of the sensitivity and specificity of a rules-based algorithm and the trained machine learning model for predicting MRD in melanoma cancer patients using an iteration of the model trained solely on lung cancer data, according to some embodiments of the technology described herein.
  • FIG.11 depicts an illustrative implementation of a computer system that may be used in connection with some embodiments of the technology described herein.
  • FIG.12 is a diagram depicting an illustrative technique 1200 for training the trained machine learning model to generate a score indicative of the predicted detectability of a TSV, according to some embodiments of the technology described herein.
  • FIG.13 is a flowchart of an illustrative process 1300 for training the trained machine learning model, according to some embodiments of the technology described herein.
  • DETAILED DESCRIPTION
  • Early detection of cancer relapse/recurrence is an important aspect of effective cancer treatment.
  • One strategy for detecting cancer relapse comprises using patient-specific panels (e.g., a selection of tumor specific variants and/or one or more primers used to detect them) to detect minimal residual disease (MRD) using biological samples (e.g., samples containing circulating tumor DNA (ctDNA)) collected from a patient, often after administration of a cancer therapy.
  • Circulating tumor DNA often comprises wild-type nucleic acid sequences (e.g., comprising somatic and/or germline mutations) as well as nucleic acid sequences comprising tumor-specific variants (TSVs), which are often indicative of MRD.
  • This strategy may be implemented for a patient via a two-stage “panel design” process.
  • the first stage may involve identifying tumor-specific variants for a patient, for example, by sequencing a biological sample (e.g., a sample comprising tumor cells and/or non-tumor cells) obtained from the patient and analyzing the sequencing results.
  • the second stage may involve creating a customized panel (e.g., patient-specific panel) for that patient and may comprise a suitable technique for detecting the TSVs (e.g., untargeted sequencing, targeted sequencing, polynucleotide probes, polymerase chain reaction amplification of the TSVs, qPCR, hybrid/array capture, the like or a combination thereof).
  • a patient-specific panel is used to select and/or detect TSVs in a biological sample of a patient.
  • TSVs are detected using a suitable amplification method.
  • the detection may be performed by contacting a biological sample (or polynucleotides from the sample) with primers or probes (depending on the technique) for detecting the TSVs of a patient-specific panel.
  • amplicons resulting from an amplification may be analyzed using a suitable method.
  • amplicons of the selected polynucleotides may be sequenced (e.g., using next-generation sequencing) to detect TSVs.
  • a positive indication of MRD may be found when the total number of TSVs detected exceeds a suitable threshold (e.g., the threshold may be an expected number of TSVs to be detected due to error associated with sample preparation and sequencing).
  • a positive indication of MRD may also
  • fluorescent polynucleotide probes may be used to detect polynucleotides comprising TSVs in the biological sample (e.g., polynucleotides extracted from the biological sample).
  • a positive indication of MRD may be found when the fluorescent signal from the fluorescent polynucleotide probes exceeds a threshold (e.g., the threshold may be the expected fluorescent background signal).
  • the degree to which a patient-specific panel is effective in detecting MRD may be quantified using measures such as panel sensitivity and specificity. Sensitivity refers to the true positive rate of detecting MRD in a patient.
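  • The two panel-quality measures named above can be illustrated with a minimal sketch. It assumes that each patient has a known MRD status and a panel-based MRD call, both stored as booleans; the function name and input layout are hypothetical and are not taken from the disclosure.

```python
def sensitivity_specificity(true_status, predicted_status):
    """Return (sensitivity, specificity) for paired boolean MRD labels.

    true_status[i] is True when patient i actually has MRD;
    predicted_status[i] is True when the panel called MRD for patient i.
    """
    pairs = list(zip(true_status, predicted_status))
    tp = sum(1 for t, p in pairs if t and p)          # MRD present, detected
    fn = sum(1 for t, p in pairs if t and not p)      # MRD present, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # no MRD, none called
    fp = sum(1 for t, p in pairs if not t and p)      # no MRD, falsely called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


sens, spec = sensitivity_specificity(
    [True, True, True, False, False],
    [True, True, False, False, True],
)
```

  • Here sensitivity is the true positive rate and specificity is the true negative rate, so a panel that misses one of three MRD-positive patients has a sensitivity of 2/3.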
  • some panel design processes involve selecting tumor-specific variants using manually-designed rules (e.g., selecting TSVs for which features, such as allele frequency in tumor cells and/or non-tumor cells, sequencing coverage, and/or sequencing depth, exceed respective manually-set thresholds) and then designing a panel to detect the selected tumor-specific variants.
  • TSV selection rules encode subjective assumptions about the importance of TSV features in selecting TSVs that will actually help to detect MRD.
  • the rules may not accurately and faithfully represent the complex (e.g., non-linear and heterogeneous) relationship between various TSV characteristics and the likelihood that such TSVs can be subsequently detected in ctDNA of a patient with high sensitivity and specificity.
  • the inventors have developed a new patient-specific panel design process that improves upon previous panel design techniques in that it produces patient-specific panels that have higher sensitivity and specificity as compared to patient-specific panels produced using previous panel design techniques. The precise improvement can be quantified and is described in greater detail herein including with reference to FIGs.10A-10B. Additionally, the inventors have used objective and data driven criteria to select features for inclusion in the model that are predictive of the detectability of TSVs rather than rule-based criteria. Notably, the new panel design process involves using machine learning technology (e.g., instead of subjective rules) to select tumor-specific variants for inclusion in
  • the machine learning technology involves a machine learning model that is trained to represent the (e.g., non-linear and heterogeneous) relationship between various features of a TSV (see e.g., the features shown in FIG.7) and the likelihood that such a TSV will be detected in the circulating nucleic acids (e.g., ctDNA) of the patient during subsequent monitoring.
  • The new panel design process involves three stages: (1) identifying variants by analyzing sequence data obtained by sequencing one or more biological samples obtained from a patient; (2) identifying, among the identified variants, a set of tumor-specific variants for the patient; and (3) evaluating the tumor-specific variants using a trained machine learning model (e.g., a random forest model, a non-linear mixed-effects model, a logistic regression model, a support vector machine model, etc.) to identify a subset of the plurality of tumor-specific variants to use for the patient-specific panel.
  • primers corresponding to at least some (e.g., all) of the TSVs in the identified subset may be synthesized and used for analyzing (e.g., amplifying and/or detecting, e.g., sequencing) another biological sample obtained from the patient at a later time (e.g., in part by contacting nucleic acids in the biological sample with the synthesized primers). Subsequent sequencing results may be analyzed, for example, to detect MRD.
  • this disclosure provides evidence that the relationships between TSV features (e.g., sequence context, allele frequency in tumor cells vs. healthy cells, etc.) and a TSV being indicative of MRD are much more complex (e.g., non-linear and heterogeneous).
  • Some embodiments provide for a computer-implemented method of designing a patient-specific panel (e.g., a panel for use in detecting tumor-specific variants of the patient) for use in detecting minimal residual disease (MRD) in a patient (e.g., a human patient).
  • the method comprises: (A) obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient (e.g., the plurality of variants may include germline variants, somatic variants, and/or tumor-specific somatic variants); (B) identifying, using the variant data and from among the plurality of variants, a
  • (C) identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: (i) generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features (e.g., a set of features for each of the at least some of the TSVs of the plurality of TSVs); (ii) processing the plurality of sets of features using a trained machine learning model (e.g., a trained random forest model) to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides (e.g., circulating-tumor DNA) of the
  • The method further comprises identifying (e.g., designing or accessing previously-designed) primers for use in amplifying and/or detecting presence (or absence), in a biological sample (e.g., another biological sample obtained at a later time), of at least some variants in a subset of the plurality of the TSVs. This may be done, for example, by designing and/or generating primers that are designed to amplify portions of a patient’s ctDNA which include TSVs in the subset of the plurality of TSVs.
  • primers may be identified after the plurality of TSVs for the patient is identified (e.g., by designing or accessing previously-designed primers for each of at least some of the TSVs in the plurality of TSVs).
  • information about the primer(s) identified for a TSV may be used to evaluate the TSV for inclusion into the subset of the plurality of TSVs for use in a patient-specific panel.
  • The primers may be identified for the TSVs in the subset of the plurality of TSVs after the subset of the plurality of TSVs is selected (e.g., in embodiments where information about the primers is not used to evaluate TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel). Further discussion of identifying primers can be found herein including with reference to FIG.2A.
  • obtaining variant data indicative of a plurality of variants of a patient comprises: obtaining one or more data structures encoding variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data. All these types
  • obtaining variant data may include obtaining at least one data structure encoding variant sequence context data.
  • Obtaining sequence context data may comprise obtaining one or more of sequence context homopolymer data (e.g., data indicative of the location and size of homopolymers within a threshold distance of a given variant), sequence context splice site data (e.g., data indicative of the location of any splice site within a threshold distance of a given variant), sequence context mutation data (e.g., data indicative of the location and type of mutations within a threshold distance of a given variant), and/or sequence context conservation data (e.g., data indicative of the degree of conservation of the ctDNA sequence within a threshold distance of a given variant).
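  • As an illustration of the sequence context homopolymer data described above, the following sketch lists homopolymer runs that lie within a threshold distance of a variant. The function name, the 0-based coordinates, and the `min_size` default are assumptions made for this example and are not specified in the disclosure.

```python
from itertools import groupby


def homopolymers_near(sequence, variant_pos, max_distance, min_size=3):
    """Return (start, length, base) for homopolymer runs near a variant.

    sequence is a reference string and variant_pos a 0-based index into it.
    A run qualifies when it spans at least min_size bases and its nearest
    base lies within max_distance of variant_pos.
    """
    runs = []
    start = 0
    for base, group in groupby(sequence):
        length = len(list(group))
        end = start + length - 1
        # Distance from the variant to the closest base of this run.
        if variant_pos < start:
            distance = start - variant_pos
        elif variant_pos > end:
            distance = variant_pos - end
        else:
            distance = 0
        if length >= min_size and distance <= max_distance:
            runs.append((start, length, base))
        start += length
    return runs


nearby = homopolymers_near("ACGTTTTTGCAAAAC", variant_pos=9, max_distance=2)
```

  • Each returned tuple records the location and size of a homopolymer, which is the information the text describes as homopolymer context data.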
  • the variant data indicative of a plurality of variants present in tumor cells of the patient may comprise data characterizing a variant derived from sequencing data from a sample comprising genomic material derived from tumor cells of the patient.
  • obtaining variant data indicative of a plurality of variants of the patient comprises obtaining variant data previously-generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient (e.g., the tumor cells obtained from the patient).
  • obtaining variant data indicative of a plurality of variants of a patient comprises sequencing (e.g., using whole genome sequencing or whole exome sequencing) the at least one biological sample obtained from the patient (e.g., melanoma cells, lung cancer cells, or cells of any other type of cancer that the patient may have and/or may be monitored for) and analyzing sequencing data produced by the sequencing.
  • Obtaining variant data may comprise obtaining sequence data of a tumor cell sample and/or a non-tumor cell sample of the patient.
  • obtaining variant data indicative of a plurality of variants of a patient comprises generating the variant data or accessing (e.g., importing, downloading) previously-generated variant data.
  • variant data may be generated, in some embodiments, using at least one suitable variant caller to identify a plurality of variants (e.g., as described in Koboldt, D. C. (2020) Genome Med 12:91) and generate various information about the variants (e.g., variant genomic location data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data and/or variant primer data).
  • generating the variant data may comprise obtaining sequence data corresponding to the at least one biological sample obtained from the patient
  • this method comprises identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient. Identifying a plurality of TSVs may comprise identifying the plurality of TSVs in a biological sample of a tumor comprising the tumor cells of the patient.
  • Identifying the plurality of TSVs may comprise: selecting variants from among the plurality of variants using at least one feature (e.g., at least two features, at least three features, at least four features, at least five features, or all the features) selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency.
  • Identifying the plurality of TSVs may comprise selecting variants using variant bi-directional support, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data (e.g., 2-15 times). Additional methods for selecting variants using variant bi-directional support are described herein including in the section “Variant Bi-directional Support”.
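  • The bi-directional support check described above can be sketched as follows. The function name and the default threshold of 2 reads per strand (the low end of the 2-15 example range) are assumptions for illustration only.

```python
def has_bidirectional_support(plus_alt_reads, minus_alt_reads, min_reads_per_strand=2):
    """True when the variant is seen at least the threshold number of
    times on both the plus strand and the minus strand."""
    return (
        plus_alt_reads >= min_reads_per_strand
        and minus_alt_reads >= min_reads_per_strand
    )


# Hypothetical per-variant counts of supporting reads: (plus, minus).
candidates = {"var1": (6, 4), "var2": (9, 0), "var3": (2, 2)}
supported = [
    name
    for name, (plus, minus) in candidates.items()
    if has_bidirectional_support(plus, minus)
]
```

  • A variant supported by many reads on one strand only (such as `var2`) is rejected, which is the purpose of requiring support in both directions.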
  • Identifying the plurality of TSVs may comprise selecting variants using the healthy population variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant has a variant allele frequency in a healthy population, as defined by at least one genomic database, of less than a threshold percentage (e.g., 1%). Additional methods for selecting variants using healthy population variant allele frequency are described herein including the section “Healthy Population Variant Allele Frequency”.
  • Identifying the plurality of TSVs may comprise selecting variants using sequence context homopolymer size, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether a homopolymer sequence exceeding a threshold size (e.g., nucleotides) is present between the variant and a binding site of a primer designed to detect presence of the variant (e.g., in the genome of the tumor cells of the
  • Identifying a plurality of TSVs may comprise selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the ratio of the variant exceeds a threshold ratio (e.g., a ratio between 20:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency).
  • The ratio of VAF may be determined using sequence data of the tumor cells and sequence data of the non-tumor cells.
  • Identifying the plurality of TSVs may comprise selecting variants using the tumor cell variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold (e.g., a threshold tumor cell variant allele frequency between 0.05 and 0.1).
  • Selecting variants using the tumor cell variant allele frequency may comprise selecting the variants using sequence data derived from a biological sample of a tumor comprising the tumor cells of the patient.
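  • The allele-frequency criteria described above (healthy population VAF, tumor to non-tumor VAF ratio, and tumor cell VAF) can be combined in a short sketch. The default thresholds are taken from the example ranges in the text (1%, 10:1, and 0.05); the function name and the treatment of a variant that is absent from non-tumor reads are assumptions for illustration.

```python
def passes_vaf_filters(
    tumor_vaf,
    normal_vaf,
    population_vaf,
    min_tumor_vaf=0.05,
    min_ratio=10.0,
    max_population_vaf=0.01,
):
    """Apply three allele-frequency checks to one candidate variant."""
    # Healthy population VAF must be below the threshold percentage.
    if population_vaf >= max_population_vaf:
        return False
    # Tumor cell VAF must exceed its threshold.
    if tumor_vaf <= min_tumor_vaf:
        return False
    # Assumption: a variant never seen in non-tumor cells has an
    # unbounded tumor:non-tumor ratio and therefore passes.
    if normal_vaf == 0:
        return True
    return tumor_vaf / normal_vaf > min_ratio


keep = passes_vaf_filters(tumor_vaf=0.2, normal_vaf=0.01, population_vaf=0.0)
```

  • Ordering the checks from cheapest to most specific is a design choice of this sketch; the disclosure does not prescribe an order.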
  • the method comprises identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient.
  • The plurality of TSVs comprises a first TSV, wherein generating the respective set of features (e.g., features to be provided as input into the trained machine learning model) comprises generating a first set of features for the first TSV, and wherein generating the first set of features for the first TSV comprises generating at least one sequencing coverage feature (e.g., sequencing depth of coverage of plus strands and minus strands for the first TSV (e.g., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV (e.g., strand bias)), at least one allele frequency feature (e.g., non-tumor cell depth coverage for the first TSV, a number of observations of the first TSV in tumor cells of the patient (e.g., tumor cell alternate observations), and/or a tumor allele frequency of the first TSV), a trinucleotide context (TNC) error rate feature (e.g., error rate in error corrected bins), a C to A variant mutation feature (e.g., the variant comprises a C to A mutation), at
  • Generating these features may comprise generating the features using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. Additional methods for generating features are described herein including the section “Subset of the Plurality of Tumor Specific Variants” and with reference to FIG.2B. As discussed above, in some embodiments, generating the first set of features for the first TSV comprises generating at least one primer feature.
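  • A minimal sketch of how some of the coverage and allele frequency features named above might be computed from per-strand read depths. The exact formulas are not given in this passage, so the definitions here (minimum of the two strand depths, plus-to-minus depth ratio, and alternate observations divided by total depth) are assumptions, as are the function and key names.

```python
def strand_features(plus_depth, minus_depth, tumor_alt_observations):
    """Build a small feature dictionary for one TSV (illustrative only)."""
    total_depth = plus_depth + minus_depth
    return {
        # Assumed definition: the lower of the two per-strand depths.
        "min_strand_coverage": min(plus_depth, minus_depth),
        # Assumed definition: plus-strand depth over minus-strand depth.
        "strand_bias": plus_depth / minus_depth if minus_depth else float("inf"),
        "tumor_alt_observations": tumor_alt_observations,
        # Assumed definition: alternate reads over total tumor depth.
        "tumor_allele_frequency": (
            tumor_alt_observations / total_depth if total_depth else 0.0
        ),
    }


features = strand_features(plus_depth=60, minus_depth=40, tumor_alt_observations=25)
```

  • A dictionary keyed by feature name keeps the feature set easy to extend with primer and sequence context features before it is converted to a model input vector.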
  • Generating the at least one primer feature may comprise determining a maximum distance between a first TSV and a binding site for a first primer (e.g., a PCR primer) designed to detect the first TSV (e.g., max primer 1 distance) and/or a maximum distance between the first TSV and binding site for a second primer (e.g., max primer 2 distance), different from the first primer, designed to detect the first TSV.
  • generating the at least one primer feature may comprise determining a minimum distance between a first TSV and a binding site for a first primer designed to detect the first TSV (e.g., minimum primer 1 distance) and/or a minimum distance between the first TSV and binding site for a second primer (e.g., a PCR primer) designed to detect the first TSV (e.g., minimum primer 2 distance).
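  • The primer distance features described above can be sketched as follows. The sketch assumes that each TSV has one or more candidate binding sites for each of two primers, expressed as genomic coordinates on the same reference as the variant; this data layout and all names are hypothetical.

```python
def primer_distance_features(variant_pos, primer1_sites, primer2_sites):
    """Return min/max distances between a TSV and primer binding sites."""

    def min_max(sites):
        distances = [abs(site - variant_pos) for site in sites]
        if not distances:
            return None, None  # no primer could be designed for this TSV
        return min(distances), max(distances)

    min1, max1 = min_max(primer1_sites)
    min2, max2 = min_max(primer2_sites)
    return {
        "min_primer1_distance": min1,
        "max_primer1_distance": max1,
        "min_primer2_distance": min2,
        "max_primer2_distance": max2,
    }


primer_features = primer_distance_features(1000, [960, 975], [1040, 1100])
```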
  • generating a first set of features for a first TSV comprises generating the at least one sequence context feature.
  • Generating the at least one sequence context feature may include generating a conservation score (e.g., generating a phastCons conservation score and/or a phyloP conservation score).
  • Generating the at least one sequence context feature may also comprise generating a distance to the nearest splice site and/or a variant max splice site score.
  • this disclosure describes generating specific combinations of features for use in identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient.
  • generating the first set of features for the first TSV may comprise determining: the sequencing depth of coverage of plus strands and minus strands for the first TSV, the non-tumor cell depth coverage for the first TSV, the number of observations of the first TSV in tumor cells of the patient, and/or the trinucleotide context (TNC) error rate feature.
  • Generating a first set of features for a first TSV further comprises determining one or more of a distance (e.g., a minimum and/or maximum distance) between the first TSV and a binding site for a primer (e.g., a first primer and/or a second primer) designed to detect the first TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV, the tumor allele frequency of the first TSV, the phastCons conservation score, the distance between the first TSV and the nearest splice site on the polynucleotide, and/or a phyloP conservation score.
  • Generating the first set of features for the first TSV further comprises determining one or more of the C to A variant mutation feature, the minimum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, and/or the splice site score of the polynucleotide.
  • processing the plurality of sets of features using the trained machine learning model to obtain a corresponding plurality of scores comprises processing the plurality of sets of features using a trained nonlinear ML model (e.g., a random forest, a support-vector machine, or a neural network).
  • Non-linear ML models like these are expected to capture the nonlinear relationships between the variant features described herein and the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
  • the non-linear model may be a non-linear regression model (e.g., a model configured to output an estimated value, such as a likelihood or probability, in the 0-1 range).
  • The non-linear model may be a non-linear classification model (e.g., a model configured to output an indication of one or multiple discrete classes, for example, where each of the classes corresponds to a respective bin of likelihood or probability values in the 0-1 range). Additional methods for processing the plurality of sets of features using the trained machine learning model are described herein including with reference to FIG.2B.
  • the trained machine learning model comprises a plurality of parameters having respective values and wherein processing a set of features of the plurality of sets of features comprises computing a score using the set of features and the respective values of the plurality of parameters.
  • the score may represent the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient. Selecting the TSVs for inclusion into the subset of the plurality of the TSVs may comprise selecting a threshold number of TSVs based on their respective scores (e.g., selecting a subset of the plurality of the TSVs with high scores (e.g., the top 50 high scores)).
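  • The score-based selection described above can be sketched as a simple ranking step. The default panel size of 50 follows the example in the text; the function name and the dictionary input are assumptions for illustration.

```python
def select_top_tsvs(scores, panel_size=50):
    """Return the identifiers of the highest-scoring TSVs.

    scores maps a TSV identifier to the model score for that TSV.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tsv_id for tsv_id, _ in ranked[:panel_size]]


# Hypothetical scores for three candidate TSVs, keeping the best two.
panel = select_top_tsvs({"tsv_a": 0.9, "tsv_b": 0.2, "tsv_c": 0.7}, panel_size=2)
```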
  • the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer and is predictive of the likelihood of detecting a TSV in a biological sample from a MRD positive patient having a second cancer that is different from the first cancer.
  • the first cancer may be lung cancer and the second cancer may be melanoma.
  • The trained machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having the same cancer as the one or more types of cancer that the model was trained on. In some embodiments, the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having a different type of cancer than the one or more types of cancer that the model was trained on.
  • selecting the TSVs for inclusion into the subset of the plurality of the TSVs further comprises: synthesizing primers corresponding to at least some of the TSVs in the subset of the plurality of TSVs (e.g., using a suitable primer synthesis method).
  • Some embodiments further provide a method of training a machine learning model (e.g., a nonlinear machine learning model described herein) to generate a score indicative of the predicted detectability of a tumor-specific variant (TSV) (e.g., a likelihood) in a biological sample (e.g., plasma) of a minimal residual disease (MRD) positive patient, the machine learning model comprising a plurality of parameters (e.g., 5 parameters), the method comprising: obtaining training data, the training data derived from data collected during previously performed monitoring for presence of a plurality of TSVs (e.g., at least 200 TSVs in each MRD positive patient) in a plurality of biological samples collected from MRD
  • the training data comprising: for each TSV in the plurality of TSVs and each biological sample in which the TSV was previously monitored, (i) variant data associated with the TSV (e.g., variant data comprising the features described herein); and (ii) an indication of whether the TSV was present or absent in the biological sample (e.g., allele frequency of the variant exceeds a threshold); and training the machine learning model by using the training data to estimate values of the plurality of parameters to obtain a trained machine learning model.
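  • The training procedure described above (pair each monitored TSV's variant data with a present/absent label, then estimate parameter values) can be sketched with a small pure-Python example. Logistic regression fitted by stochastic gradient descent is used here only as a simple stand-in for the model; the record layout, the key names, and the hyperparameters are all assumptions for illustration.

```python
import math


def build_training_rows(observations, presence_threshold):
    """Turn monitoring records into (feature_vector, label) pairs.

    A TSV is labeled present (1) when its allele frequency in the
    monitored sample exceeds presence_threshold, otherwise absent (0).
    """
    rows = []
    for obs in observations:
        present = obs["plasma_allele_frequency"] > presence_threshold
        rows.append((obs["features"], 1 if present else 0))
    return rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, epochs=500, learning_rate=0.1):
    """Estimate weights and bias by per-sample gradient descent."""
    weights, bias = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        for features, label in rows:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias


def score(features, weights, bias):
    """Predicted likelihood (0-1) that a TSV will be observed."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Hypothetical records: one feature (tumor allele frequency) per TSV.
observations = [
    {"features": [0.05], "plasma_allele_frequency": 0.0},
    {"features": [0.10], "plasma_allele_frequency": 0.0},
    {"features": [0.60], "plasma_allele_frequency": 0.01},
    {"features": [0.80], "plasma_allele_frequency": 0.02},
]
rows = build_training_rows(observations, presence_threshold=0.001)
weights, bias = train_logistic(rows)
```

  • A non-linear model such as a random forest would replace `train_logistic` and `score` while the construction of the labeled rows stays the same.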
  • the machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer (e.g., lung cancer) and the machine learning model is predictive of the probability of detecting a TSV in a biological sample from a MRD positive patient having a second cancer (e.g., melanoma) that is different from the first cancer.
  • the machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the probability of detecting a TSV in a biological sample from a MRD positive patient having the same cancer as the one or more types of cancer that the model was trained on.
  • The machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the probability of detecting a TSV in a biological sample from an MRD positive patient having a different type of cancer than the one or more types of cancer that the model was trained on.
  • Training a machine learning model to predict a score indicative of detectability of a TSV in a biological sample may comprise training the machine learning model to predict a likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
  • Some embodiments further provide for a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in amplifying and/or detecting a subset of a plurality of TSVs using the methods described herein; generating sequence data from the biological sample of the patient (e.g., a bodily fluid, like plasma), the generating comprising contacting (e.g., in a polymerase chain reaction solution) the biological sample with the primers; detecting TSVs using the sequence data (e.g., using any suitable method, for example Illumina® sequencing); and determining, using the detected TSVs, whether the biological sample provides an indication of MRD (e.g., based on an abundance of each TSV detected using the patient-specific panel and the expected error in identifying TSVs).
  • Some embodiments provide a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in amplifying and/or detecting a selected subset of a plurality of patient-specific TSVs; amplifying polynucleotides of the patient sample using the identified primers; generating sequence data (e.g., using any suitable method, for example Illumina® sequencing) from the amplified polynucleotides; and determining whether the biological sample provides an indication of MRD in the patient according to the sequence data (e.g., based on a presence, absence and/or amount of one or more, or all of the selected subset of TSVs in the biological sample).
  • Determining whether the biological sample provides an indication of MRD may comprise determining whether the allele frequency of at least some of the TSVs exceeds an error rate associated with generating sequencing data of the biological sample. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with a sensitivity greater than 0.80, greater than 0.85, greater than 0.90, greater than 0.95, or greater than 0.98, where sensitivity is the probability of detecting MRD in a patient that has MRD. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with a specificity greater than 0.98, where specificity is the probability of not detecting MRD in a patient that does not have MRD.
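As a hedged illustration of the sensitivity and specificity figures above, the following sketch computes both quantities from counts of correct and incorrect MRD calls. The function names and the example cohort counts are hypothetical and are not taken from the source.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Probability of detecting MRD in a patient that has MRD."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Probability of not detecting MRD in a patient that does not have MRD."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical validation cohort: 100 MRD positive and 100 MRD negative patients.
cohort_sensitivity = sensitivity(true_positives=96, false_negatives=4)
cohort_specificity = specificity(true_negatives=99, false_positives=1)
print(cohort_sensitivity > 0.95, cohort_specificity > 0.98)
```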
  • administering a therapeutic comprises administering a therapeutic to treat a disease (e.g., cancer) associated with the MRD.
  • Patient A “patient” refers to an animal (e.g., a human) that has or is suspected of having a disease (e.g., a cancer).
  • the patient may be a mammal (e.g., a human, a non-human primate, a dog, a cat, a horse, a goat, a sheep, a mouse, or a rat), a bird, a reptile, an amphibian, a fish, or a laboratory model organism (e.g., mice and rats).
  • the patient may be a human.
  • the patient may be an adult human (e.g., older than 18 years of age), a human child, or a human infant.
  • the patient may be a patient that has been treated for a disease.
  • the patient may have been treated for any type of cancer.
  • the patient may have been treated for lung cancer (e.g., non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), or lung adenocarcinoma), brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, melanoma, etc.
  • the patient may be in remission from a disease.
  • the patient may be in remission from cancer.
  • the patient may be in remission from lung cancer (e.g., non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), or lung adenocarcinoma), brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, melanoma, etc.
  • the patient may be in remission of a cancer selected from NSCLC, colorectal cancer (CRC), bladder cancer, pancreatic cancer, head and neck squamous cell carcinomas (HNSCC), breast cancer, and hematological cancers (e.g., leukemia, lymphoma, and multiple myeloma). These cancers may be particularly likely to release nucleic acids (e.g., RNA or DNA) in bodily fluids.
  • a patient has been previously treated for a disease (e.g., cancer).
  • a patient may have been previously treated using one or more therapeutics such as surgery, chemotherapy, radiation therapy, immunotherapy, and/or hormone therapy.
  • treating a patient comprises removing a tumor.
  • the patient may be in remission from cancer.
  • Patient-Specific Panel A “patient-specific panel” may refer to a collection (e.g., a set or subset) of tumor specific variants (TSVs) selected for use in detecting MRD in a patient or to a technique for detecting the selected TSVs (e.g., untargeted sequencing, targeted sequencing, polynucleotide probes, polymerase chain reaction amplification of the TSVs, and/or qPCR), depending on the context.
  • a patient-specific panel comprises a selected subset of TSVs.
  • each selected TSV of a patient-specific panel is predicted to have a high likelihood or probability of being detected in ctDNA derived from a patient.
  • Minimal Residual Disease (MRD)
  • Minimal residual disease may refer to any remaining disease (e.g., diseased cell or ctDNA) that may be present in a patient after the patient has received and/or completed a treatment for the disease.
  • minimal residual disease associated with cancer may be detected when cancer cells, or tumor-derived polynucleotides (e.g., tumor RNA, cell free tumor DNA and/or circulating tumor DNA (ctDNA)) are present in a patient after treatment.
  • MRD may be detected based on ctDNA detection before cancer relapse is detected using standard surveillance imaging (e.g., computerized tomography (CT), magnetic resonance imaging (MRI), or Positron Emission Tomography (PET)).
  • Some cancer types may shed DNA (e.g., ctDNA), which may end up in the bloodstream of a patient.
  • minimal residual disease may be monitored based on sequencing of ctDNA from biological samples (e.g., plasma).
  • the likelihood or probability of determining an indication of minimal residual disease may increase over time (e.g., cancer cells that survive treatment may continue to replicate and/or metastasize, which may result in additional ctDNA shedding).
  • Determining an indication of MRD may be based on the number and/or frequency of TSVs of the subset of the plurality of TSVs detected using the patient-specific panel.
  • An indication of MRD may be an estimate of the likelihood and/or probability that MRD is present in the ctDNA plasma sample of a patient.
  • the estimate of the likelihood and/or probability may be based on a statistical test.
  • the statistical test may be a Poisson test, a Binomial test, a T-Test, or any other suitable statistical test.
  • determining an indication of MRD comprises determining if the number of times each TSV is observed (e.g., the number of times each TSV of a patient specific panel is detected) in sequence data of a biological sample of the patient exceeds the number of TSV observations expected due to error associated with sample preparation (e.g., DNA extraction and amplification with primers of the patient-specific panel) and/or detection (e.g., sequencing).
  • a positive indication of MRD may indicate that MRD is present in a patient (i.e., an MRD positive patient).
  • a positive indication of MRD may be determined when TSVs of a patient-specific panel are detected in a biological sample of the patient.
  • a positive indication of MRD may be determined and/or confirmed using standard surveillance imaging (e.g.,
  • a positive indication of MRD may be determined when the number of TSVs identified in a patient exceeds the number of TSVs expected to be observed due to error associated with sample preparation and/or detection.
  • a positive indication of MRD is determined when at least 1 TSV (e.g., at least 5 TSVs, at least 10 TSVs, at least 15 TSVs, at least 20 TSVs, at least 25 TSVs, at least 30 TSVs, at least 35 TSVs, at least 40 TSVs, at least 45 TSVs, or at least 50 TSVs) of the patient specific panel is detected in a biological sample of the patient.
  • a negative indication of MRD may indicate that MRD is not present in a patient (i.e., an MRD negative patient).
  • a negative indication of MRD may be determined in the absence of a positive indication of MRD.
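The comparison between observed TSV reads and the reads expected from error can be sketched with a one-sided binomial test, one of the statistical tests named above. This is a minimal illustration only: pooling reads across the panel, the function names, the per-base error rate, and the significance threshold `alpha` are all assumptions and are not taken from the source.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return max(0.0, 1.0 - below_k)


def mrd_indication(variant_reads, depths, error_rate, alpha=0.01):
    """Return True (positive indication) when the pooled TSV read count is
    unlikely to be explained by the assumed error rate alone."""
    observed = sum(variant_reads)   # reads supporting any panel TSV
    total = sum(depths)             # reads covering the panel loci
    p_value = binomial_tail(observed, total, error_rate)
    return p_value < alpha


# Hypothetical three-TSV panel sequenced to a depth of 1,000 reads per locus.
depths = [1000, 1000, 1000]
print(mrd_indication([12, 9, 15], depths, error_rate=1e-3))
print(mrd_indication([0, 0, 1], depths, error_rate=1e-3))
```

With these hypothetical inputs, the first call pools 36 supporting reads against roughly 3 expected from error, while the second pools a single read, which is consistent with error alone.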
  • Sequence data may refer to data generated by sequencing nucleic acids in a biological sample (e.g., by using next-generation sequencing (NGS), nanopore-based sequencing or sequencing by synthesis) or obtaining sequence data of a biological sample by other means (e.g., quantitative polymerase chain reaction or hybridization of oligonucleotide probes).
  • Sequence data may be collected using a suitable sequencing method and/or suitable sequencing equipment, which includes but is not limited to equipment manufactured by Illumina®, SOLiD®, Ion Torrent®, PacBio®, nanopore-based, Sanger sequencing or 454™.
  • sequencing data is generated using an NGS method.
  • Sequence data may be collected using fluorescent probes that are designed to bind to target polynucleotides (e.g., a polynucleotide comprising a TSV).
  • Sequence data may comprise whole exome sequence data (WES) or whole genome sequence data (WGS).
  • Sequence data may comprise sequence reads of polynucleotide sequences in a biological sample derived from a patient (e.g., reads covering the plus strand and the minus strand of the polynucleotide sequences). Sequence reads may be encoded in any suitable format.
  • a sequence read may encode a polynucleotide sequence that the sequence read represents.
  • a sequence read may encode a polynucleotide sequence in any suitable way (e.g., as a sequence of characters, with characters representing respective nucleotides in the polynucleotide sequence, as a sequence of numbers, with numbers representing respective nucleotides in the polynucleotide sequence, etc.), as aspects of the technology described herein are not limited in this respect.
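The two encodings mentioned above (characters or numbers representing nucleotides) can be illustrated as follows. This is a sketch under stated assumptions: the particular integer codes and the function names are hypothetical, and the source does not prescribe any specific encoding.

```python
# Hypothetical mapping from nucleotide characters to integer codes.
NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_NUCLEOTIDE = {code: base for base, code in NUCLEOTIDE_CODES.items()}


def encode_read(read: str) -> list[int]:
    """Convert a character-encoded read into a sequence of numbers."""
    return [NUCLEOTIDE_CODES[base] for base in read.upper()]


def decode_read(codes: list[int]) -> str:
    """Convert a number-encoded read back into characters."""
    return "".join(CODE_TO_NUCLEOTIDE[code] for code in codes)


read_as_characters = "GATTACA"
read_as_numbers = encode_read(read_as_characters)
print(read_as_numbers)
print(decode_read(read_as_numbers))
```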
  • the sequence data may comprise sequence reads of any suitable polynucleotide of the biological sample.
  • the sequence data may comprise sequence reads of tumor-derived polynucleotides of the biological sample.
  • the sequence data may comprise sequence reads of RNA of the biological sample.
  • the sequence data may comprise sequence reads of DNA of the biological sample.
  • the sequence data may comprise sequence reads of tumor DNA or tumor RNA of the biological sample.
  • the sequence data may comprise sequence reads of cell free DNA (e.g., from healthy cells and/or tumor cells).
  • the sequence data may comprise sequence reads of circulating tumor DNA (ctDNA) of the biological sample.
  • the sequence data may comprise sequence reads of whole exome sequencing of the biological sample.
  • the sequence data may comprise sequence reads of whole genome sequencing of the biological sample.
  • the sequence data may comprise sequence reads that cover TSVs (e.g., TSVs of the subset of the plurality of TSVs).
  • the sequence data may comprise sequence reads that were obtained using a targeted gene sequencing panel.
  • sequence data may refer to data that is used when identifying TSVs for use in a patient-specific panel.
  • sequence data may be whole genome sequencing data or whole exome sequence data (e.g., untargeted sequencing).
  • such sequence data may be advantageous when identifying TSVs for use in a patient-specific panel at least because these types of sequence data broadly cover sequences from across the genome or exome and thus are favorable in identification of unknown TSVs in a patient and selecting TSVs for use in a patient-specific panel.
  • sequence data may refer to sequence data obtained using a patient-specific panel (e.g., targeted sequencing).
  • sequence data may be obtained by (1) amplifying polynucleotides of a biological sample of the patient using primers of a patient-specific panel to produce amplicons and (2) sequencing the amplicons (e.g., using next-generation sequencing).
  • sequence data may be advantageous in determining an indication of MRD because the sequencing is focused on detecting known TSVs in a biological sample of a patient using a targeted approach (e.g., the patient-specific panel may be used to amplify specific polynucleotides that are expected to contain TSVs when MRD is present), which may increase sequencing depth and in turn may increase the probability of observing a TSV that is at a low allele frequency.
  • a sequence read does not include a physical molecule but data representing the same.
  • a reference to a nucleotide in a sequence read is a reference to information about a nucleotide (e.g., information representing the type of nucleotide – for example “A”, or “G”,
  • a sequencing read of the sequence data may comprise hundreds to thousands of nucleotides, depending on the sequencing technique used. Sequence data may comprise tens of thousands to billions of sequencing reads. For example, sequence data may comprise at least 50,000 reads (e.g., at least 100,000 sequencing reads, at least 250,000 sequencing reads, at least 500,000 sequencing reads, at least 1,000,000 sequencing reads, at least 2,000,000 sequencing reads, at least 4,000,000 sequencing reads, at least 8,000,000 sequencing reads, at least 16,000,000 sequencing reads, at least 50,000,000 sequencing reads, at least 100,000,000 sequencing reads, at least 500,000,000 sequencing reads, or at least 1,000,000,000 sequencing reads). The sequence data may comprise at least 50,000 reads.
  • sequence data may comprise 50,000-250,000 reads.
  • Obtaining sequence data may involve accessing at least one data structure, in memory, storing the sequence reads that are part of the sequence data.
  • Sequence data may comprise sequence data of polynucleotides amplified using a patient-specific panel.
  • a patient-specific panel may be used to specifically sequence only certain polynucleotides from the biological sample (e.g., polynucleotides comprising TSVs of the subset of the plurality of TSVs) by using primers of the patient-specific panel to amplify specific polynucleotides (e.g., polynucleotides associated with loci that may comprise TSVs of the subset of the plurality of TSVs) and then sequencing the amplified polynucleotides.
  • Sequencing data may be generated by sequencing nucleic acids in at least one biological sample. In some embodiments, sequencing data is generated by sequencing nucleic acids derived from two biological samples.
  • Non-limiting examples of nucleic acids in a sample include tumor-derived polynucleotides, circulating nucleic acids (e.g., cellular or acellular nucleic acids), cellular nucleic acids, acellular or cell-free nucleic acids, circulating cell-free nucleic acids, RNA (e.g., mRNA), cell-free RNA (cfRNA), circulating cfRNA, cell-free DNA (cfDNA), circulating cfDNA, tumor RNA, cell-free tumor RNA, circulating cell-free tumor RNA, tumor DNA, cell-free tumor DNA, circulating cell-free tumor DNA, circulating tumor DNA (ctDNA), the like and combinations thereof.
  • sequencing data is generated by sequencing nucleic acids derived from one or more tumor cells, and/or nucleic acids derived from one or more normal cells. In some embodiments, sequence data is obtained by a suitable method comprising hybrid capture or array capture.
  • Biological Samples
  • Biological sample(s) may refer to one or more specimens collected from the patient.
  • the biological sample(s) may comprise any cell, tissue, biological fluid, and/or bone from a patient, or any other suitable biological sample from the patient.
  • the biological sample(s) may comprise tumor cells and/or non-tumor cells from the patient (e.g., a tumor cell sample and separate non-tumor cell sample).
  • the tumor cells may be collected from any tumor or cancer of the patient including, but not limited to lung cancer, brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, or colon cancer.
  • the tumor cells may be collected from a solid tumor.
  • the tumor cells may be collected from a melanoma tumor.
  • the tumor cells may be collected from a lung tumor.
  • the non-tumor cells may be collected from healthy tissue of the same type as the tumor. For example, if the tumor cells collected are liver tumor cells then the healthy tissue is healthy liver.
  • the non-tumor cells may be collected from a healthy tissue that is different from the tumor tissue collected. For example, if the tumor cells collected are liver tumor cells then the healthy tissue may be a healthy lung.
  • the non-tumor cells may be collected from a blood sample (e.g., plasma).
  • the tumor cells may be collected from the patient and the non-tumor cells may be collected from a healthy subject (e.g., collecting non-tumor cells from a healthy subject by a third party).
  • the biological sample may comprise tumor cells of the patient (e.g., comprise a portion of a tumor of the patient) and/or the biological sample may comprise non-tumor cells of the patient.
  • the tumor cells of the patient are expected to comprise TSVs, thus sequencing these cells is expected to identify TSVs for use in the patient-specific panel.
  • the biological sample may be a sample that is expected to comprise tumor-derived polynucleotides (e.g., as described herein). Less invasive methods of monitoring MRD may be advantageous to promote patient comfort and simplify biological sample collection.
  • biological samples for use in methods of determining an indication of MRD may be collected from bodily fluids, as described herein.
  • biological samples for use in methods of determining an indication of MRD may be collected from blood (e.g., plasma).
  • the biological sample may be stored using cryopreservation.
  • Non-limiting examples of cryopreservation include step-down freezing, blast freezing, direct plunge freezing, snap freezing, slow freezing using a programmable freezer, and vitrification.
  • the biological sample may be stored using lyophilization.
  • a biological sample may be placed into a container that already contains a preservant (e.g., RNALater to preserve RNA) and then frozen (e.g., by snap-freezing), after the collection of the biological sample from the patient. In some embodiments, such storage in frozen state may be done immediately after collection of the biological sample.
  • a biological sample may be kept at either room temperature or 4 °C for some time (e.g., up to an hour, up to 8 h, or up to 1 day, or a few days) in a preservant or in a buffer without a preservant, before being frozen.
  • Non-limiting examples of preservants include formalin solutions, formaldehyde solutions, RNALater or other equivalent solutions, TriZol or other equivalent solutions, DNA/RNA Shield or equivalent solutions, EDTA (e.g., Buffer AE (10 mM Tris·Cl; 0.5 mM EDTA, pH 9.0)) and other anticoagulants, and Acid-Citrate-Dextrose (e.g., for blood specimens).
  • a vacutainer may be used to store blood.
  • a vacutainer may comprise a preservant (e.g., a coagulant, or an anticoagulant).
  • a container in which a biological sample is preserved may be contained in a secondary container, for the purpose of better preservation, or for the purpose of avoiding contamination. Any of the biological samples from a patient described herein may be stored under any condition that preserves stability of the biological sample. In some embodiments, the biological sample may be stored at a temperature that preserves stability of the biological sample.
  • the sample may be stored between 18 and 28 °C (e.g., 25 °C). In some embodiments, the sample may be stored under refrigeration (e.g., 4 °C). In some embodiments, the sample is stored under freezing conditions (e.g., -20 °C). In some embodiments, the sample may be stored under ultralow temperature conditions (e.g., -50 °C to -80 °C). In some embodiments, the sample may be stored under liquid nitrogen (e.g., -170 °C). In some embodiments, a biological sample may be stored at -60 °C to -80 °C (e.g., -70 °C) for up to 5 years.
  • a biological sample may be stored at -60°C to -80°C (e.g., -70°C) for up to 1 month, up to 2 months, up to 3 months, up to 4 months, up to 5 months, up to 6 months, up to 7 months, up to 8 months, up to 9 months, up to 10 months, up
  • a biological sample may be stored as described by any of the methods described herein for up to 5 years, up to 10 years, up to 15 years, or up to 20 years.
  • Variant A “variant” may refer to a mutation or genetic variation present, or suspected to be present (e.g., suspected based on analysis of sequencing data) in a first genome compared to a second genome.
  • a variant is a mutation in a genome present in a patient (which genome may include nucleic acids derived from tumor cells and/or non-tumor cells) compared to a standard genome or reference genome (e.g., GRCh38 or hg19, or the like).
  • a variant is a mutation in a genome of a tumor cell of a patient as compared to the genome of healthy cells or non-cancerous cells of the patient.
  • a variant is a tumor-specific variant.
  • a variant is not a tumor-specific variant.
  • variant data may falsely indicate a presence of a variant in a genome of a tumor, where the variant was introduced by a polymerase error or a sequencing read error.
  • variant data may indicate a presence of a variant (e.g., a single nucleotide difference) in a genome of a tumor derived from a patient compared to a reference genome, where the variant is not tumor-specific because the same variant is also present in a non-tumor cell derived from the patient.
  • Variants may be of different variant types.
  • variant types include single nucleotide mutations, two or more single nucleotide mutations (e.g., 2, 3, 4 or more single nucleotide mutations), insertions, deletions, translocations, inversions, duplications, or a mutation resulting from a combination thereof.
  • a single nucleotide mutation is a single nucleotide substitution, single nucleotide deletion or single nucleotide insertion.
  • a single nucleotide mutation is a somatic mutation.
  • a variant may be a genetic variation or mutation having a length of less than 1000 base pairs (bp), less than 500 bp, less than 250 bp or less than 50 bp.
  • a variant is a genetic variation or mutation having a length in a range of 1 to 50 bp, 1 to 20 bp or 1 to 10 bp. In some embodiments, a variant comprises two or more mutations that are immediately adjacent, and/or that are separated by 1 or more intervening nucleotides. Variants may be identified from sequence data derived from a biological sample(s) of a patient using any suitable method including GNUmap, GATK, SOAPsnp, SAMtools, SNVer, Strelka, EBCall, MuTect, ADIScan1, ADIScan2 and SomaticSniper e.g., as
  • variant data may refer to data (e.g., genetic data) indicating a presence or absence of variants in a biological sample of a patient and may comprise various types of data and/or information about the variants.
  • variant data may include variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, variant primer data, and/or any other suitable type of data about the variants.
  • variant genomic location data includes, for each variant, data indicative of the location of the variant in a genome (e.g., the location in a standard genome or the genome of the patient).
  • variant genomic location data may include a chromosomal location of the variant or a locus of the variant.
  • Variant genomic location data may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
  • variant type data includes, for each variant, data indicating the type of the variant (e.g., a single nucleotide mutation, an insertion, a deletion, a translocation, an inversion, a duplication, or any other type of mutation resulting from a combination thereof).
  • Variant type data may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
  • variant sequence data includes, for each variant, data indicating a sequence of the variant.
  • a single nucleotide mutation variant sequence may be indicated by a wildtype trinucleotide context (AAA) and a mutant sequence (ATA), where the variant is an A>T mutation.
  • variant sequence context data may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
  • variant sequence context data includes, for each variant, data indicating the sequence context surrounding the variant (e.g., sequence context conservation and splice sites in the sequence context).
  • variant sequence context data may include sequences of the polynucleotides comprising the variants (e.g., the sequence contexts associated with each variant and/or the loci associated with each variant).
  • variant sequence depth includes, for each variant, data indicating the number of sequencing reads covering the locus comprising a given variant in sequence data obtained from a sample comprising tumor cells and/or non-tumor cells of the patient.
  • variant allele frequency includes, for each variant, data indicating the number of times a variant is observed in sequence data (e.g., sequence data of a biological sample comprising tumor cells and/or non-tumor cells) at a given locus divided by the number of times the locus is observed in the sequence data (e.g., the number of times any allele is observed at the locus).
  • variant sequencing error rate includes, for each variant, an error rate of the sequencing apparatus during generation of the sequence data.
  • variant primer data includes, for each variant, data about primers that are designed to amplify a polynucleotide comprising that variant.
  • Variant primer data may include primer binding site genomic location; primer sequence; primer length; primer melting temperature (e.g., for a portion of a primer or entire length of a primer); primer propensity for secondary structure; a score from a primer design algorithm (e.g., Primer3 and Primer-BLAST); and/or a site for a primer designed to detect the TSV.
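The variant allele frequency definition above (variant observations at a locus divided by all observations of that locus) reduces to a single ratio. The sketch below is illustrative only; the function name and the choice to return 0.0 for an uncovered locus are assumptions not stated in the source.

```python
def variant_allele_frequency(variant_observations: int,
                             locus_observations: int) -> float:
    """Number of times the variant is observed at a locus divided by the
    number of times any allele is observed at that locus."""
    if locus_observations == 0:
        return 0.0  # assumption: report 0.0 when the locus has no coverage
    return variant_observations / locus_observations


# Hypothetical locus covered by 200 reads, 4 of which carry the variant.
vaf = variant_allele_frequency(variant_observations=4, locus_observations=200)
print(vaf)
```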
  • Generating “Generating” as used in the context of generating a feature may refer to calculating or determining the feature (e.g., determining using variant data), obtaining the feature (e.g., from the variant data, a primer design algorithm or any other suitable source) or obtaining a previously determined and stored value.
  • Sequence Context A “sequence context” may refer to the nucleotides on either side of the TSV in the primary sequence of the genome of the patient within a given range of nucleotides.
  • a sequence context may refer to the nucleotides in the same locus as the TSV.
  • a sequence context may refer to the nucleotides within 1, 2, 3, 5, 10, 20, 50, 100, 150, 200, 250, 300, 350, 400, 450 or more nucleotides (upstream and/or downstream) of a variant.
  • a sequence context may refer to the nucleotides within 50 nucleotides (upstream and/or downstream) of a variant.
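To make the notion of a sequence context concrete, the sketch below extracts the nucleotides upstream and downstream of a single-nucleotide variant. It assumes a 0-based variant position within a sequence held as a string; the function name and the toy sequence are hypothetical.

```python
def sequence_context(sequence: str, variant_pos: int,
                     flank: int = 50) -> tuple[str, str]:
    """Return (upstream, downstream) nucleotides within `flank` positions
    of a single-nucleotide variant at 0-based index `variant_pos`."""
    upstream = sequence[max(0, variant_pos - flank):variant_pos]
    downstream = sequence[variant_pos + 1:variant_pos + 1 + flank]
    return upstream, downstream


# Toy example with a 3-nucleotide flank around position 4.
upstream, downstream = sequence_context("ACGTACGTAC", variant_pos=4, flank=3)
print(upstream, downstream)
```

Near the ends of the sequence the returned flanks are simply shorter than `flank`, which is one possible design choice among several.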
  • Tumor Specific Variants (TSVs)
  • a TSV may be a variant that is present in tumor cells and not present in non-tumor cells.
  • a TSV may be a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in a biological sample of tumor cells of a patient as compared to a biological sample of non-tumor cells of the patient.
  • a TSV is a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in tumor cells derived from a patient as compared to non-tumor cells derived from the patient.
  • a TSV may be a variant that is present in tumor cells and not present in a genomic database of healthy individuals (e.g., gnomAD, 1000 Genomes, and ExAC populations). Therefore, in some embodiments, a TSV may not be a polymorphism found within a population of healthy individuals (e.g., a single nucleotide polymorphism (SNP)).
  • a TSV may be a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in tumor cells of a patient as compared to a healthy population allele frequency (e.g., as determined from a genomic database of healthy individuals, such as gnomAD, 1000 Genomes, and ExAC populations).
  • TSVs for a patient may be identified by: (1) obtaining variant data; and (2) identifying TSVs from among the variants in the variant data using features.
  • Such features may include one or more of variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence context indel, neighboring variants, static variants, primer flags, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency.
  • the values of the features may be used to select the TSVs. For example, the values may be compared to certain thresholds or otherwise used (e.g., as part of more complex logic, such as rules or even machine learning models) to select the TSVs.
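The threshold-based use of feature values described above might look like the following sketch, which combines a handful of the listed features into a simple rule. Every threshold value, field name, and the rule itself are hypothetical placeholders; the source does not disclose specific values, and a real implementation could use more complex logic or a machine learning model instead.

```python
from dataclasses import dataclass


@dataclass
class VariantFeatures:
    bidirectional_support: int
    healthy_population_af: float
    tumor_vaf: float
    normal_vaf: float
    neighboring_variants: int


# Hypothetical thresholds chosen only for illustration.
MIN_BIDIRECTIONAL_SUPPORT = 3
MAX_HEALTHY_POPULATION_AF = 0.001
MIN_TUMOR_TO_NORMAL_VAF_RATIO = 5.0
MAX_NEIGHBORING_VARIANTS = 2


def is_tsv(f: VariantFeatures) -> bool:
    """Apply simple threshold rules to decide whether a variant is a TSV."""
    if f.normal_vaf > 0:
        ratio = f.tumor_vaf / f.normal_vaf
    else:
        ratio = float("inf") if f.tumor_vaf > 0 else 0.0
    return (
        f.bidirectional_support > MIN_BIDIRECTIONAL_SUPPORT
        and f.healthy_population_af <= MAX_HEALTHY_POPULATION_AF
        and ratio >= MIN_TUMOR_TO_NORMAL_VAF_RATIO
        and f.neighboring_variants < MAX_NEIGHBORING_VARIANTS
    )


candidate = VariantFeatures(8, 0.0, 0.25, 0.01, 0)
common_polymorphism = VariantFeatures(8, 0.05, 0.25, 0.01, 0)
print(is_tsv(candidate), is_tsv(common_polymorphism))
```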
  • Variant Bi-directional support may refer to the number of times a variant is observed in plus strand sequencing reads and minus strand sequencing reads of the variant data of the tumor cells (e.g., a biological sample of the tumor cells of the patient). Variant bi-directional support may be calculated by determining the minimum of (1) the number of plus strand reads that cover a variant in the tumor cell sample of the patient and (2) the number of minus strand reads that cover the variant in the tumor cell sample of the patient. For example, if there are 8 plus strand reads covering the variant and 10 minus strand reads covering the variant then variant bi-directional support would be 8.
  • the variant may be identified as a TSV when variant bi-directional support exceeds a threshold.
  • Variant bi-directional support may be indicative of the detectability of the variant by sequencing, specifically detectability when using both plus strand reads and minus strand reads, which may increase confidence of detection.
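The minimum-of-strands calculation for variant bi-directional support can be written directly. The function names below are illustrative, and the threshold used in the usage lines is a hypothetical value that the source does not specify.

```python
def bidirectional_support(plus_strand_reads: int, minus_strand_reads: int) -> int:
    """Minimum of the plus strand and minus strand read counts covering a variant."""
    return min(plus_strand_reads, minus_strand_reads)


def exceeds_support_threshold(plus_strand_reads: int, minus_strand_reads: int,
                              threshold: int) -> bool:
    """True when bi-directional support exceeds the (hypothetical) threshold."""
    return bidirectional_support(plus_strand_reads, minus_strand_reads) > threshold


# The worked example from the text: 8 plus strand reads and 10 minus strand reads.
support = bidirectional_support(plus_strand_reads=8, minus_strand_reads=10)
print(support)  # 8
print(exceeds_support_threshold(8, 10, threshold=5))
```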
  • Healthy Population Variant Allele Frequency may refer to the allele frequency of a variant in a healthy population, as defined by at least one genomic database. Healthy population variant allele frequency may be determined by obtaining the allele frequency of a variant in a database of healthy individuals (e.g., gnomAD, 1000 Genomes, and ExAC populations). Healthy population allele frequency may be used to identify a variant(s) that is likely not a TSV because the variant is found in a healthy population above a threshold allele frequency (indicating that the variant is not tumor specific).
  • Sequence Context Homopolymer Size may refer to the length and/or location of a homopolymer sequence (e.g., whether the homopolymer is located between a variant and a binding site of a primer designed to detect presence of the variant in the genome of the tumor cells).
  • a homopolymer may refer to a series of consecutive nucleotides in a polynucleotide all of the same type (e.g., AAAAA represents a homopolymer of 5 nucleotides).
  • Sequence context homopolymer size may be used to identify variants that are not TSVs based on location of the homopolymer and the length of the homopolymer exceeding a threshold. Homopolymers above a given length may interfere with amplification and/or sequencing of the polynucleotide comprising the variant and thus affect the detectability of a variant. Additional description of sequence context homopolymer size can be found herein and with reference to FIG.2A.
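  • The homopolymer check can be illustrated as follows. The sketch assumes 0-based coordinates on a single sequence string and only inspects the bases strictly between the primer binding site and the variant, which is a simplification; the function names and the length threshold are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Yield (start, base, length) for each run of identical bases in seq."""
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        yield pos, base, length
        pos += length


def has_long_homopolymer_between(seq, primer_end, variant_pos, max_len):
    """True if a run longer than max_len lies between the primer site and the variant."""
    lo, hi = sorted((primer_end, variant_pos))
    region = seq[lo:hi]
    return any(length > max_len for _, _, length in homopolymer_runs(region))


seq = "ACGTAAAAAAGTCA"  # contains a run of six A bases
print(has_long_homopolymer_between(seq, 2, 12, max_len=5))
```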
  • Sequence Context Indel may refer to the distance (e.g., in nucleotides) between (1) an insertion or deletion (indel) mutation that is located within the sequence context of the variant, and (2) the variant.
  • a sequence context indel may be calculated by determining the distance between the 3’ or 5’ end of the indel and the variant (if the variant is a single nucleotide variant) or the 3’ or 5’ end of the variant (if the variant comprises more than one nucleotide).
  • the variant may be a TSV when either (1) no indel is located within the sequence context of the variant and/or (2) an indel is not located within a threshold distance of the variant.
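  • One way to express the sequence context indel criterion is sketched below. Variants and indels are represented as inclusive (start, end) coordinate pairs, and the distance is measured between the nearest ends of the two intervals; the representation, function names, and threshold are assumptions made for illustration.

```python
def indel_distance(variant_start, variant_end, indel_start, indel_end):
    """Nucleotide distance between the nearest ends of a variant and an indel (0 if they overlap)."""
    if indel_end < variant_start:
        return variant_start - indel_end
    if indel_start > variant_end:
        return indel_start - variant_end
    return 0


def passes_indel_filter(variant, indels, min_distance):
    """True when no indel lies within min_distance of the variant (or there are no indels)."""
    return all(indel_distance(*variant, *indel) > min_distance for indel in indels)


snv = (100, 100)  # a single nucleotide variant at position 100
print(indel_distance(*snv, 90, 92))
print(passes_indel_filter(snv, [(90, 92)], min_distance=10))
```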
  • Neighboring Variants may refer to the number of neighboring variants (i.e., other variants that are not the variant currently being potentially identified as TSV) that are within the sequence context of the variant or within a threshold distance (e.g., in nucleotides) of the variant.
  • a sequence context of a variant may comprise two neighboring variants within 50 nucleotides of the variant.
  • Neighboring variants may be calculated by counting the number of variants within the sequence context of the variant or within a specified distance of the variant.
  • the variant may be a TSV when the number of neighboring variants is less than a threshold.
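  • Counting neighboring variants reduces to a windowed count over positions. The sketch below assumes all positions are on the same chromosome and uses the 50 nucleotide window from the example above as a default; the function name is hypothetical.

```python
def count_neighboring_variants(position, other_positions, window=50):
    """Count other variants within `window` nucleotides of `position`."""
    return sum(
        1 for p in other_positions
        if p != position and abs(p - position) <= window
    )


# Two of the three other variants fall within 50 nucleotides of position 1000.
print(count_neighboring_variants(1000, [960, 1030, 1200]))
```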
  • Static variants may refer to the number of normal samples (i.e., sequencing data of normal samples) in which the variant is observed. For example, if the variant is observed in sequence data collected from two normal samples, then the static variants value would be two. The variant may be a TSV when the number of static variants is less than a threshold. Additional description of static variants can be found herein and with reference to FIG.2A.
  • Primer Flags may refer to characteristics of a primer (e.g., a primer of a patient-specific panel) that may indicate the detectability of a variant (e.g., TSV) using the primer. Primer flags may include, but are not limited to:
  • a homopolymer sequence that exceeds a threshold length found in the primer sequence; a homopolymer sequence that exceeds a threshold length located between the binding site of the primer and the variant (e.g., TSV) (e.g., see Sequence Context Homopolymer Size); “TA” nucleotide repeats that exceed a threshold number of consecutive repeats present in the sequence expected to be amplified by the primer in a PCR reaction (e.g.
  • a variant may be a TSV when the number of primer flags for a primer for use in detecting a variant is less than a threshold. Additional description of primer flags can be found herein and with reference to FIG.2A.
  • Sequence coverage in non-tumor cells may refer to the number of sequencing reads covering a locus of a variant in the non-tumor cells of the patient (e.g., the number of sequencing reads covering the locus regardless of the allele in the locus). Sequence coverage in non-tumor cells may be calculated by determining the number of sequencing reads that cover a polynucleotide at a locus that has been observed to comprise a variant of the patient (e.g., in non-tumor cells of the patient). For example, 50X coverage refers to 50 sequencing reads covering a locus (e.g., a locus in a tumor cell of a patient observed to comprise a variant).
  • the variant may be a TSV when sequence coverage in non-tumor cells exceeds a threshold. Sequencing coverage in non-tumor cells may be indicative of the detectability of the variant because low coverage in non-tumor cells may indicate difficulty in amplifying and/or sequencing the variant. Additional description of sequence coverage in non-tumor cells can be found herein and with reference to FIG.2A.
  • Ratio of Variant Allele Frequency between Tumor Cells and Non-tumor Cells
  • The “ratio of variant allele frequency between tumor cells and non-tumor cells” may be calculated by (1) determining the allele frequency of the variant in the tumor cells (e.g., a biological sample comprising tumor cells of the patient), (2) determining the allele frequency
  • allele frequency may be calculated by dividing the total number of times a specific allele is observed at a locus (e.g., an allele comprising the variant), by the total number of times that locus is observed in the sequence data (e.g., the number of observations of the allele comprising the variant plus the number of observations of all the other alleles at that locus).
  • the variant may be a TSV when the ratio of variant allele frequency between tumor cells and non-tumor cells exceeds a threshold.
  • Tumor Cell Variant Allele Frequency may be calculated based on (1) the total number of times the variant allele is observed in sequence data of the tumor cells (e.g., a biological sample comprising tumor cells of the patient) and (2) the total number of times the locus comprising the variant allele is observed in sequence data of the tumor cells of the patient (e.g., the variant allele observations plus all the other allele observations at that locus). Tumor cell variant allele frequency may be calculated by dividing (1) by (2) above. In some embodiments, the variant may be a TSV when tumor cell variant allele frequency exceeds a threshold.
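  • The allele frequency and ratio calculations can be sketched as below. The handling of zero denominators (returning 0.0 for an unobserved locus and infinity when the variant is absent from non-tumor cells) is an assumption of this sketch and is not specified in the text; the function names are hypothetical.

```python
import math


def allele_frequency(variant_obs: int, locus_obs: int) -> float:
    """Variant allele observations divided by all observations of the locus."""
    if locus_obs == 0:
        return 0.0  # assumption: an unobserved locus contributes no frequency
    return variant_obs / locus_obs


def tumor_to_normal_af_ratio(tumor_af: float, normal_af: float) -> float:
    """Ratio of variant allele frequency in tumor cells to that in non-tumor cells."""
    if normal_af == 0:
        # assumption: a variant absent from non-tumor cells gets an unbounded ratio
        return math.inf if tumor_af > 0 else 0.0
    return tumor_af / normal_af


tumor_af = allele_frequency(30, 100)   # 30 variant reads out of 100 at the locus
normal_af = allele_frequency(1, 100)   # 1 variant read out of 100 in non-tumor cells
print(tumor_af, normal_af, tumor_to_normal_af_ratio(tumor_af, normal_af))
```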
  • designing a patient-specific panel comprises (1) obtaining variant data, (2) identifying a plurality of TSVs and (3) identifying a subset of the plurality of TSVs for inclusion in the patient-specific panel.
  • TSVs of the subset of the plurality of TSVs may be identified using features associated with the TSVs. For example, the features may include at least one sequence coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • features corresponding to at least some of the TSVs may be processed using a trained ML model to produce scores that are indicative of the detectability of TSVs, which in turn may be used to select the subset of the plurality of TSVs for use in the patient-specific panel.
  • Sequencing coverage features may refer to features that are based on the number of sequencing reads (e.g., raw Illumina® sequencing reads) covering a variant, a locus of a variant, and/or a TSV (e.g., in a biological sample of normal cells or tumor cells of the patient). Sequence coverage features may include, but are not limited to, sequencing depth of coverage of plus strands and minus strands for a TSV (i.e., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for a TSV (i.e., strand bias).
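  • The two strand-based coverage features named above can be computed from per-strand depths as in the following sketch. The dictionary keys, the function name, and the choice to report an infinite ratio when the minus strand depth is zero are assumptions made for illustration.

```python
import math


def strand_coverage_features(plus_depth: int, minus_depth: int) -> dict:
    """Compute minimum strand coverage and a plus/minus depth ratio (strand bias)."""
    min_strand_coverage = min(plus_depth, minus_depth)
    if minus_depth == 0:
        # assumption: report an unbounded ratio when no minus-strand reads exist
        strand_bias = math.inf if plus_depth > 0 else 0.0
    else:
        strand_bias = plus_depth / minus_depth
    return {"min_strand_coverage": min_strand_coverage, "strand_bias": strand_bias}


print(strand_coverage_features(40, 10))
```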
  • Allele frequency features may refer to features that are based on or derivative of the allele frequency of a variant or TSV in tumor cells of the patient, non-tumor cells of the patient, or a database comprising genome sequences from healthy individuals and/or individuals having a disease (e.g., cancer). Allele frequency may be calculated by dividing the number of times a variant allele is present at a locus in the variant data by the number of times the locus (with any variant) is observed in variant data.
  • Allele frequency features may include, but are not limited to, non-tumor cell depth coverage of a TSV, number of observations of a TSV in tumor cells of the patient (i.e. tumor cell alternate observations), and/or a tumor allele frequency of the TSV. Additional description of allele frequency features can be found herein and in the section “Generating Allele Frequency Features”.
  • methods herein comprise determining a sequencing error rate (e.g., a value representing the rate of an incorrect nucleotide being identified at a position; incorrect nucleotides may be identified at a position due to events that take place during sample collection, preparation, sequencing, post-sequence analysis or any other occasion in which the sample or data is manipulated) by monitoring error rates in nucleotides or groups of nucleotides (i.e. nucleotide context (NC)).
  • generating a set of features (e.g., a first set of features) for a first TSV comprises determining for each TSV of a subset of the plurality of TSVs a nucleotide context error rate.
  • a nucleotide context refers to a series of sequential nucleic acids with specific bases in a nucleic acid sequence or a sequence read.
  • error rates in single nucleotides are monitored.
  • error rates in groups of two nucleotides are monitored.
  • error rates in groups of trinucleotide context are monitored as described herein.
  • the estimated sequencing error rate may be compared to the actual number of mutations observed in the positions being monitored for mutations to determine an indication of MRD.
  • this technique involves estimating sequencing error from sequencing results at positions not being monitored for cancer-associated mutations (the collection of such sequence read positions may be termed “background regions” herein).
  • coverage and/or resolution play a significant role in determining an optimal context size (e.g., NC) for determining error rates, for example where coverage refers to a maximum number of observations for an error rate context, on average, given a depth of sequencing for the sample, and where resolution refers to a total number of error rate contexts of a given size.
  • a trinucleotide context has a theoretical potential to detect error rates down to 1/52 (1.9%) on average while still increasing the overall resolution vs. di- or mono-nucleotide contexts. While any suitable NC length can be used for a method herein, the inventors herein have found that a trinucleotide context is often an optimal context size that yields acceptable detectable error rates across many sequencing depths.
  • a “TNC error rate feature” or “Error rate in error corrected bins” may refer to the estimated probability of observing a TSV in the middle position of the TNC due to errors introduced during sample preparation and/or sequencing of a biological sample (e.g., the biological sample of the patient).
  • a TNC may refer to a series of three sequential nucleotides in a sequence read (e.g., AAA, TAT, GTA, etc.).
  • the TNC may comprise a variant (e.g., a TSV) in the middle position of the TNC.
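  • As an illustration of how a per-context error rate might be estimated from background regions, the sketch below tallies, for each reference trinucleotide, how often the middle base is read as something other than the reference base. The input format (pairs of reference trinucleotide and observed middle base) and the function name are hypothetical, and the sketch is not the estimator used in the disclosure.

```python
from collections import defaultdict


def tnc_error_rates(background_observations):
    """Estimate error rates keyed by (reference TNC, erroneous middle base).

    background_observations: iterable of (ref_tnc, observed_middle_base) pairs,
    one per read base at a position that is not monitored for mutations.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ref_tnc, observed in background_observations:
        totals[ref_tnc] += 1
        if observed != ref_tnc[1]:  # middle base differs from the reference
            errors[(ref_tnc, observed)] += 1
    return {key: count / totals[key[0]] for key, count in errors.items()}


# 100 observations of the ACG context, two of which read A in the middle position.
background = [("ACG", "C")] * 98 + [("ACG", "A")] * 2
print(tnc_error_rates(background))
```

  • With three-nucleotide contexts there are 4**3 = 64 possible reference contexts, so the background observations are spread over more bins than with single or dinucleotide contexts, which is the coverage versus resolution trade-off noted above.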
  • C to A variant mutation feature may refer to an indicator (e.g., a binary indicator) for whether the variant is a C to A mutation. For example, a variant mutation from C to A may be indicated with a “1” and any other mutation (e.g., C to T) may be indicated by a “0”. Additional description of the C to A variant mutation feature can be found herein and with reference to the section entitled “Generating a C to A Variant Mutation Feature”.
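  • The binary indicator can be written directly. The sketch assumes the reference and alternate bases are reported on the same strand as the variant call and does not treat the complementary G to T change as equivalent; the function name is hypothetical.

```python
def c_to_a_feature(ref: str, alt: str) -> int:
    """Return 1 for a C to A single nucleotide change and 0 for any other mutation."""
    return 1 if (ref.upper(), alt.upper()) == ("C", "A") else 0


print(c_to_a_feature("C", "A"), c_to_a_feature("C", "T"))
```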
  • Primer features may refer to one or more features associated with a set of primers designed to detect (e.g., amplify for detection) a variant or TSV.
  • Primer features may include, but are not limited to, primer genome location; primer sequence; primer melting temperature; primer propensity for secondary structure; a score from a primer design algorithm; a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a first primer designed to detect the TSV (e.g., the distance between the TSV and the 3’ end or 5’ end of the binding site for a first primer); and a distance (e.g., measured in number of nucleotides) between the TSV and a binding site for a second primer (e.g., the distance between the TSV and the 3’ end or 5’ end of the binding site for a second primer), different from the first primer, designed to detect the TSV.
  • one or more primer features are determined for all or a portion of a primer (e.g., a portion of a primer that initially anneals to a target or gene sequence). Additional description of the primer features can be found herein and with reference to the section entitled “Generating Primer Features”.
  • a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for, a method herein may comprise a suitable length. A suitable length may be determined by a method described herein.
  • a portion of a primer configured to initially anneal to a target sequence comprises a length in a range of 8 to 60 nucleotides, 10 to 50 nucleotides, 15 to 45 nucleotides, or 18 to 41 nucleotides.
  • a primer comprises a 5' tail or one or more additional 5' sequences (e.g., barcode, identifier sequences, random sequences, adaptor sequences, common primer sites, sequencing primer sites, and/or the like).
  • an additional sequence of a primer comprises a length in a range of 1 to 60 nucleotides. In some embodiments, an entire length of a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for a method herein is in a range of 10 to 150 nucleotides, 20 to 100 nucleotides, or 30 to 75 nucleotides.
  • a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for, a method herein may comprise a suitable Tm.
  • a suitable Tm may be determined by a method described herein.
  • a portion of a primer configured to initially anneal to a target sequence comprises a Tm in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C.
  • a primer comprises a 5' tail or one or more additional 5' sequences (e.g., barcode, identifier sequences, random sequences, adaptor sequences, common primer sites, sequencing primer sites, and/or the like).
  • an additional sequence of a primer comprises a Tm in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C.
  • the Tm of an entire length of a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for, a method herein is in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C.
  • a subset of primers or all primers of a set of primers identified by, or used for, a method herein may comprise the same Tm or similar melting temperatures as determined for the entire length of, or target-specific portions of the primers.
  • a subset of primers or all primers of a set of primers identified by, or used for, a method herein may have an average Tm with a standard deviation of no more than 20°C, 10°C, 5°C, or 2°C.
  • any one primer of a subset or set of primers identified by, or used for, a method herein comprises a Tm that differs by no more than 10°C, 5°C, or 2°C from any other primer in the subset or set of primers.
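  • The Tm uniformity criteria above (a bounded standard deviation and a bounded difference between any two primers) can be checked as sketched below. Tm values are assumed to have been computed elsewhere; the function names are hypothetical, the default limits of 5 °C and 10 °C are two of the values listed above, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def tm_summary(tms):
    """Summarize melting temperatures (°C) for a set of primers."""
    return {
        "mean": mean(tms),
        "stdev": pstdev(tms),                 # population standard deviation
        "max_pairwise_diff": max(tms) - min(tms),
    }


def tms_compatible(tms, max_stdev=5.0, max_diff=10.0):
    """True when the Tm spread satisfies both limits."""
    summary = tm_summary(tms)
    return summary["stdev"] <= max_stdev and summary["max_pairwise_diff"] <= max_diff


print(tm_summary([68.0, 70.0, 72.0]))
print(tms_compatible([68.0, 70.0, 72.0]))
```

  • The largest difference between any two primers equals the maximum Tm minus the minimum Tm, so no explicit pairwise loop is needed.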
  • Sequence context features may refer to sequence features that are within the sequence context of a TSV. Sequence context features may include, but are not limited to, a conservation score of the sequence context comprising the TSV, a distance between the TSV and a nearest splice site in the sequence context, and/or a splice site score of the sequence context (e.g., a score indicating that a splice site is located within the sequence context). Sequence context features may indicate an ability to amplify the TSV for detection. For example, a sequence context comprising a splice site indicates that different size amplicons comprising different sequences may be produced using the same set of primers due to alternative splicing.
  • Tumor-derived Polynucleotide may refer to a polynucleotide that was or is part of a tumor cell (e.g., a tumor cell of the patient).
  • a tumor-derived polynucleotide may include, but is not limited to tumor RNA, cell-free tumor RNA, circulating cell-free tumor RNA, tumor DNA, cell-free tumor DNA, circulating cell-free tumor DNA, and circulating tumor DNA (ctDNA).
  • a tumor-derived polynucleotide may be present in any tissue and/or fluid of the patient.
  • a tumor-derived polynucleotide may be present in blood and/or blood-derived products of the patient (e.g., serum and plasma).
  • a tumor-derived polynucleotide may also be present in saliva, semen, vaginal secretions, urine, feces, nasal mucus, sweat, ear wax, and spinal fluid.
  • a tumor-derived polynucleotide may be identified based on the presence of the one or more TSVs in the tumor-derived polynucleotide. The presence of a tumor-derived polynucleotide may be indicative of MRD.
  • Circulating Tumor DNA may refer to DNA or DNA fragments derived from tumor cells that have escaped the tumor and are present in the circulatory system.
  • ctDNA may be present in blood (e.g., serum and plasma).
  • ctDNA may be identified based on the presence of the one or more TSVs in the ctDNA. The presence of ctDNA may be indicative of MRD.
  • Locus
  • A “locus” may refer to a set of consecutive nucleotides in a genome (e.g., the genome of a patient) within a threshold distance of a TSV (e.g., within 50, 100, 150, 200, 250, 300,
  • a locus may refer to the nucleotides encoding a gene (e.g., a gene comprising a TSV); however, a locus may also refer to a non-coding locus (e.g., a locus that does not encode a gene).
  • Additional Description
  • Additional detailed disclosures of the various concepts and embodiments related to methods and compositions of designing a patient-specific panel are provided below.
  • FIG.1 is a diagram depicting an illustrative technique 100 for using variant data from tumor cells and non-tumor cells of a patient to design a patient-specific panel for detecting MRD in the patient, according to some embodiments of the technology described herein.
  • Technique 100 involves collecting a tumor sample 102 (e.g., tumor cells) and a non-tumor cell sample 104 (e.g., non-tumor cells) from a patient. After collection, DNA is extracted 106 from the samples (e.g., using any suitable method). Extracted DNA may then be sequenced (e.g., using Illumina® sequencing). Sequencing may be whole genome sequencing or whole exome sequencing.
  • Sequencing produces sequence data, which is used in variant calling to identify variants 111, 112, 114 and 115 found in the tumor cell DNA 118 and/or the non-tumor cell DNA 120.
  • identifying a plurality of tumor specific variants (TSVs) 111, 114 and 115 for the patient-specific panel 109 is performed using any suitable method including the methods described herein.
  • Step 109 may include both identifying the plurality of TSVs (111, 114 and 115) and identifying 110 a subset of the plurality of TSVs, 111 and 115.
  • the patient-specific panel is designed 122 for use in detecting the identified TSVs (TSV 111 and TSV 115).
  • step 122, designing a patient specific panel, may comprise designing a pair of primers for each of TSV 111 and TSV 115, each pair of primers being capable of amplifying a polynucleotide comprising the TSV.
  • the patient specific panel 125 is contacted with a plasma sample 126 (e.g., DNA extracted from the plasma that may comprise ctDNA).
  • Contacting the patient-specific panel 125 and the plasma 126 may further comprise performing a PCR reaction and sequencing the amplicons from the PCR reaction.
  • an indication of MRD 124 in the patient may be provided.
  • Illustrative technique 100 involves obtaining a tumor sample 102 (e.g., tumor cells) and a non-tumor sample 104 (e.g., non-tumor cells) from a patient.
  • the tumor sample 102 may be collected from any tumor and/or cancer including, but not limited to a lung cancer,
  • the tumor sample 102 is collected from a solid tumor. In some embodiments, the tumor sample 102 is collected from a melanoma tumor. In some embodiments, the tumor sample 102 is collected from a lung tumor. In some embodiments, the non-tumor sample 104 is collected from healthy tissue of the same type as the tumor sample 102. For example, if the tumor sample 102 is collected from a liver tumor then the non-tumor sample 104 is healthy liver tissue.
  • the non-tumor sample 104 is collected from healthy tissue that is different from the type of tissue that tumor sample 102 is collected from. For example, if the tumor sample 102 is collected from a liver tumor then the non-tumor sample 104 is healthy lung tissue. In some embodiments the non-tumor sample 104 is a blood sample (e.g., plasma).
  • Illustrative technique 100 next involves DNA extraction 106 from the tumor sample 102 and non-tumor sample 104. Methods for extracting DNA from biological samples (e.g., the tumor sample and the non-tumor sample) are well known in the art. Following DNA extraction, DNA may be sequenced using any suitable sequencing technique to produce sequence data that may be used in variant calling 108.
  • Variant calling 108 may refer to identifying DNA variants, from sequence data of the tumor sample 102 and the non-tumor sample 104, that differ from a standard genome (e.g., GRCh38 or hg19). Thus, variant calling may be expected to identify tumor cell variants (e.g., 111, 114, and 115), non-tumor cell variants (e.g., variants that appear after patient conception and are not part of the germline and are also not tumor-specific, such as 112), and germline variants that are not in the standard genome. Any suitable method can be used for variant calling 108 (e.g., as described herein).
  • Variant calling 108 may be performed separately on sequence data from tumor samples (e.g., tumor sample 102) and sequence data from non-tumor samples (e.g., non-tumor sample 104). Obtaining variant data is further described herein including with reference to FIG.2A.
  • identification of tumor specific variants for the patient-specific panel 125 may be performed.
  • identification of tumor specific variants for the patient-specific panel may refer to identifying variants that are specific to the tumor sample 102 over the non-tumor sample 104.
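  • At its simplest, identifying variants specific to the tumor sample over the non-tumor sample is a set difference between the two call sets. The sketch below shows only that first step, with a hypothetical `Variant` record and made-up coordinates; the additional filters described herein (e.g., coverage, allele frequency, and sequence context criteria) are not applied.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def tumor_only_variants(tumor_calls, normal_calls):
    """Variants called in the tumor sample that are absent from the non-tumor sample."""
    return set(tumor_calls) - set(normal_calls)


# Hypothetical calls: the chr2 variant is shared with the non-tumor sample.
tumor = [Variant("chr1", 1000, "C", "A"), Variant("chr2", 500, "G", "T")]
normal = [Variant("chr2", 500, "G", "T")]
print(tumor_only_variants(tumor, normal))
```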
  • Selecting a subset of the plurality of TSVs may involve using a trained machine learning model (not shown) to score TSVs based on one or more features described herein, and selecting TSVs for the subset of the plurality of TSVs based on the scores. Identifying a plurality of TSVs and identifying the subset of the plurality of TSVs are further described herein, and in reference to the sections “Tumor Specific Variants” and “Subset of the Plurality of Tumor Specific Variants” and in reference to FIG.2A.
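  • The score-then-select step can be sketched as a ranking over per-TSV scores. In the sketch, `toy_score` is a hypothetical stand-in for the trained machine learning model, and the feature names and values are invented for illustration; only the ranking and selection logic is the point.

```python
def select_tsv_subset(features_by_tsv, score_fn, panel_size):
    """Score each TSV and return the identifiers of the top-scoring ones."""
    scores = {tsv: score_fn(feats) for tsv, feats in features_by_tsv.items()}
    # Sort by descending score; break ties by identifier for reproducibility.
    ranked = sorted(scores, key=lambda tsv: (-scores[tsv], tsv))
    return ranked[:panel_size]


def toy_score(feats):
    """Hypothetical placeholder for a trained model's detectability score."""
    return feats["tumor_af"] - feats["tnc_error_rate"]


features = {
    "TSV111": {"tumor_af": 0.40, "tnc_error_rate": 0.001},
    "TSV114": {"tumor_af": 0.05, "tnc_error_rate": 0.020},
    "TSV115": {"tumor_af": 0.30, "tnc_error_rate": 0.001},
}
print(select_tsv_subset(features, toy_score, panel_size=2))
```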
  • patient specific panel design 122 may be performed.
  • a patient-specific panel 125 may be designed to detect one or more TSVs of a subset of the plurality of TSVs (e.g., as described herein).
  • the patient-specific panel 125 may be designed to detect the subset of the plurality of TSVs, as described herein.
  • the patient-specific panel 125 may be designed to detect at least 1 (e.g., at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 160, at least 170, at least 180, at least 190, or at least 200) TSVs.
  • the patient-specific panel 125 may be designed to detect at least 10 TSVs, at least 25 TSVs, at least 50 TSVs, at least 75 TSVs, at least 100 TSVs, at least 125 TSVs, at least 150 TSVs, at least 175 TSVs, at least 200 TSVs, at least 250 TSVs, or at least 300 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 10-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 25-200 TSVs.
  • the patient-specific panel 125 may be designed to detect 50-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 75-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 100-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 10-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 25-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 50-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 75-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 100-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 18-200 TSVs.
  • the patient-specific panel 125 may comprise a set of primers (e.g., a pair of primers or nested primers) that are designed to amplify a region of a polynucleotide (e.g., ctDNA of the patient) comprising a TSV (e.g., of the subset of the plurality of TSVs).
  • the patient specific panel may comprise a plurality of sets of primers, wherein at least some of the plurality of sets of primers may be designed to detect different TSVs of the subset of the plurality of TSVs.
  • the patient specific panel may comprise a plurality of sets of primers, wherein at least some of the plurality of sets of primers are each designed to detect a region of a polynucleotide comprising a TSV of the subset of the plurality of TSVs.
  • the patient specific panel 125 may be contacted with ctDNA (e.g., plasma ctDNA 126) as one step in making an MRD call (e.g., determining an indication of MRD).
  • a patient-specific panel may comprise a plurality of sets of primers, wherein at least 2, at least 10, at least 20, at least 50 or at least 100 of the plurality of sets of primers are each designed to detect a region of a polynucleotide comprising a TSV.
  • a patient-specific panel comprises one or more primers, or a plurality of sets of primers that target non-tumor specific loci.
  • a patient-specific panel may comprise one or more, or a plurality of sets of primers that are not designed to detect a region of a polynucleotide comprising a TSV.
  • a patient-specific panel may comprise a plurality of primers designed to detect a region of a polynucleotide comprising a TSV, and a plurality of primers targeting non-tumor specific loci.
  • Non-tumor specific loci may comprise specific nucleotide variants (e.g., SNPs) or random control loci found in both tumor and healthy tissue of a patient or population.
  • Primers targeting non-tumor specific loci may be included in a patient-specific panel for any suitable purpose (e.g., for sample tracking, control or normalization).
  • primers targeting non-tumor specific loci are included in a patient-specific panel to normalize the amount of total primers used in any one method or assay.
  • the inventors herein have determined that normalizing the total number of primers used in a multiplex assay described herein can sometimes help correct for any amplification bias associated with differences in the number of TSV specific primers used among patients or assays.
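  • One possible way to normalize the total number of primer sets across patients is to pad each panel with control primer sets up to a fixed total, as sketched below. The padding strategy, the function name, and the identifiers are assumptions made for illustration and are not stated in the disclosure.

```python
def pad_panel(tsv_primer_sets, control_primer_sets, target_total):
    """Add control primer sets so that every panel has target_total primer sets."""
    n_missing = target_total - len(tsv_primer_sets)
    if n_missing < 0:
        raise ValueError("panel already exceeds the target size")
    if n_missing > len(control_primer_sets):
        raise ValueError("not enough control primer sets to reach the target size")
    return list(tsv_primer_sets) + list(control_primer_sets[:n_missing])


# Hypothetical identifiers: a patient with three TSV primer sets, padded to five.
panel = pad_panel(["tsv_a", "tsv_b", "tsv_c"], ["ctrl_1", "ctrl_2", "ctrl_3"], 5)
print(panel)
```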
  • primers targeting non-tumor specific loci comprise the same or similar features as other primers in a panel designed to detect a region of a polynucleotide comprising a TSV.
  • the plasma 126 may be collected from the blood of the same patient from which tumor sample 102 and non-tumor sample 104 were collected.
  • a biological sample instead of plasma 126 may be collected from any location of the patient that may comprise ctDNA.
  • a biological sample may be collected from saliva, semen, vaginal secretions, urine, feces, nasal mucus, sweat, ear wax, spinal fluid, blood, serum, or plasma from a patient.
  • the biological sample may not comprise detectable ctDNA (e.g., when the patient does not have detectable MRD).
  • multiple biological samples may be collected from a patient and sequenced to obtain sequence data.
  • the multiple biological samples may be sequentially collected from a patient over a specified period of time then sequenced to obtain sequence data.
  • the specified period of time may begin after cancer treatment ends and may continue for the remainder of the patient’s life.
  • the frequency with which biological samples are collected from a patient may be any suitable frequency for monitoring MRD.
  • biological samples may be collected from a patient weekly.
  • biological samples may be collected from a patient about twice a month.
  • biological samples may be collected from a patient about once a month.
  • biological samples may be collected from a patient about once every three months.
  • biological samples may be collected from a patient about once every six months. In some embodiments, biological samples may be collected from a patient at least twice a month. In some embodiments, biological samples may be collected from a patient at least once a month. In some embodiments, biological samples may be collected from a patient at least once every three months. In some embodiments, biological samples may be collected from a patient at least once every six months. In some embodiments, the frequency with which biological samples may be collected from the patient may be based on the type of disease the patient is being monitored for (e.g., type of cancer), the expected likelihood of recurrence, and the rate of disease progression after recurrence.
  • Technique 100 next optionally proceeds to an MRD call 124 (i.e., determining an indication of MRD in the patient).
  • Determining an indication of MRD may comprise contacting the plasma ctDNA 126 with the patient-specific panel 125.
  • Determining an indication of MRD may further comprise amplifying ctDNA using primers of the patient-specific panel to produce amplicons.
  • Determining an indication of MRD may further comprise sequencing one or more amplicons produced by amplifying the ctDNA to generate sequence data.
  • Determining an indication of MRD may further comprise analyzing sequence
  • FIG.2A is a flowchart of an illustrative process 200 for identifying a subset of a plurality of tumor specific variants (TSVs) for use in a patient-specific panel for identifying MRD, and optionally identifying and/or synthesizing primers for inclusion in the patient-specific panel, according to some embodiments of the technology described herein.
  • Process 200 involves obtaining variant data 202 (e.g., variants called using a tumor cell sample and a non-tumor cell sample of the patient); identifying a plurality of TSVs 204 using the variant data 202; optionally identifying primers 206 (e.g., primer sequences) for use in detecting the TSVs; identifying a subset of the plurality of TSVs for use in the patient specific panel 208 (e.g., TSVs that are more likely to be detectable using the patient-specific panel, and/or provide an indication of MRD); and optionally synthesizing primers for detecting the subset of the plurality of TSVs 210 (e.g., the synthesized primers for use in the patient-specific panel).
  • Process 200 begins at act 202 where variant data is obtained.
  • Obtaining variant data may refer to obtaining DNA variants associated with the tumor cells (e.g., biological sample of tumor cells) and/or non-tumor cells (e.g., biological sample of non-tumor cells) of the patient.
  • Obtaining variant data may comprise obtaining variant data using a variant caller as described herein.
  • Obtaining variant data may comprise obtaining sequence data of the tumor sample and the non-tumor sample and using a variant caller to identify variants.
  • obtaining variant data comprises generating sequence data of the tumor sample and the non-tumor sample.
  • obtaining variant data comprises obtaining the variants and additional data indicative of the variants (e.g., variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data) as described herein including in the section “Variant Data.”
  • obtaining variant genomic location data comprises obtaining the location in the genome where a variant is located (e.g., the genomic locus).
  • obtaining variant type data comprises obtaining the type of mutation that generated a variant (e.g., single nucleotide change (e.g., C to A, A to G, etc.), insertion or deletion).
  • obtaining variant sequence data comprises obtaining the sequence of a variant.
  • obtaining variant sequence context data comprises obtaining data describing the polynucleotide sequence surrounding a variant (e.g., within 10, 50, 100, 150, 200, 250, 300, 350, 400, 450 or more nucleotides of a variant).
  • obtaining variant sequence context data comprises obtaining sequence context homopolymer data (e.g., the location of a homopolymer relative to the variant and the homopolymer length), sequence context splice site data (e.g., the location and type of any predicted splice sites in the sequence context), sequence context mutation data (e.g., variants identified in the sequence context), and/or sequence context conservation data (e.g., a score describing the degree of conservation of the sequence context (e.g., a conservation score generated by PhyloP or phastCons)).
  • obtaining variant sequencing coverage data comprises obtaining the sequencing coverage (e.g., Illumina® sequencing coverage) of a variant in tumor cells (e.g., a biological sample of tumor cells) and non-tumor cells (e.g., a biological sample of non-tumor cells) of a patient.
  • obtaining variant allele frequency data comprises obtaining the frequency of a variant in the tumor sample and/or the frequency in the non-tumor sample.
  • obtaining variant allele frequency data further comprises obtaining the allele frequency of the variant in healthy individuals of a genomic database (e.g., gnomAD, 1000 Genomes, and ExAC populations).
  • Obtaining variant sequencing error rate data may comprise obtaining data concerning the sequencing error rate associated with preparing a sample for sequencing and sequencing the sample as part of obtaining variant data.
  • obtaining variant primer data comprises obtaining primer sequences designed to amplify tumor cell variants or tumor specific variants.
  • Obtaining variant primer data may further comprise obtaining primer length, binding location, melting temperature, distance from TSV data and/or primer score from a primer design algorithm.
  • process 200 may include identifying a plurality of tumor specific variants 204 using the data obtained in step 202.
  • identifying the plurality of TSVs can be performed using any suitable method (e.g., GATK, FreeBayes, DeepVariant, SpeedSeq).
  • identifying TSVs is based on a plurality of TSV features and corresponding thresholds. In some embodiments, identifying a plurality of tumor specific variants comprises using at least one (e.g., at least 2, at least 3, at least 4, at least 5) feature(s) selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-
  • identifying a plurality of tumor specific variants comprises using features selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence context indel, neighboring variants, static variants, primer flags, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency.
  • identifying a plurality of TSVs 204 may comprise selecting variants using variant bi-directional support, the selecting determining, for each variant of at least some of a plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data.
  • the threshold number of times is 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 times in plus strand sequencing reads and minus strand sequencing reads of the variant data.
  • the threshold number of times is 2, 8, or 15 times in plus strand sequencing reads and minus strand sequencing reads of the variant data.
  • the threshold number of times is between 2 and 15 in plus strand sequencing reads and minus strand sequencing reads of the variant data.
  • a variant may not be selected as a TSV when variant bi-directional support does not meet the threshold number.
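  • As an illustration only, the bi-directional support check (the variant is observed at least a threshold number of times in both plus strand and minus strand reads) can be sketched as follows. The function name, argument names, and the default threshold of 2 are assumptions chosen for this sketch; 2 is one of the example thresholds listed above.

```python
def has_bidirectional_support(plus_alt_reads: int,
                              minus_alt_reads: int,
                              threshold: int = 2) -> bool:
    """Return True when the variant is seen at least `threshold` times
    on the plus strand AND at least `threshold` times on the minus strand.

    `plus_alt_reads` / `minus_alt_reads` are hypothetical inputs: the number
    of variant-supporting reads on each strand in the variant data.
    """
    return plus_alt_reads >= threshold and minus_alt_reads >= threshold
```

  • Under this sketch, a variant with 5 supporting plus strand reads and 1 supporting minus strand read would not pass with a threshold of 2, because support is required on both strands.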
  • identifying the plurality of TSVs 204 may comprise selecting variants using healthy population variant allele frequency. The selecting may determine, for each variant of at least some of a plurality of variants, whether the variant has a variant allele frequency of less than a threshold percentage in a healthy population, as defined by at least one genomic database (e.g., gnomAD, 1000 Genomes, and/or ExAC populations).
  • the threshold percentage may be 0.1%, 0.5%, 1%, 1.5%, 2%, or 3% variant allele frequency in a healthy population, as defined by at least one genomic database.
  • the threshold percentage may also be between 0.1%-3%, 0.5%-2%, or 0.75%-1.5% variant allele frequency in a healthy population, as defined by at least one genomic database.
  • the threshold percentage may be between 0.5% and 2% variant allele frequency in a healthy population, as defined by at least one genomic database.
  • the threshold percentage is 1% variant allele frequency in a healthy population, as defined by at least one genomic database.
  • a variant may not be selected as a TSV when the healthy population variant allele frequency exceeds the threshold percentage.
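  • A minimal sketch of the healthy population filter is shown below. It assumes the allele frequencies have already been looked up and are supplied as a mapping from database name to frequency (as a fraction, so 1% is 0.01). Treating a missing entry as "not observed" and requiring the frequency to be below the threshold in every database consulted are assumptions of this sketch, not requirements stated above.

```python
from typing import Mapping, Optional


def passes_population_af(af_by_database: Mapping[str, Optional[float]],
                         threshold: float = 0.01) -> bool:
    """Return True when the variant's healthy-population allele frequency
    is below `threshold` in every database that reports the variant.

    A value of None means the database has no record of the variant
    (assumption: such a variant is treated as rare).
    """
    observed = [af for af in af_by_database.values() if af is not None]
    return all(af < threshold for af in observed)
```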
  • identifying the plurality of TSVs 204 may comprise selecting variants using sequence context homopolymer size, the selecting determining, for each variant of at least some of a plurality of variants, whether a homopolymer sequence exceeding a threshold size is present between the variant and a binding site of a primer designed to detect the presence of the variant (e.g., in the genome of the tumor cells of the patient).
  • a homopolymer refers to a series of consecutive nucleotides in a polynucleotide all of the same type (e.g., AAAAA represents a homopolymer of 5 nucleotides).
  • the threshold size may be a homopolymer of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 or more nucleotides present between the variant and a binding site of a primer designed to detect the presence of the variant.
  • the threshold size may be a homopolymer between 4 nucleotides and 8 nucleotides in length present between the variant and a binding site of a primer designed to detect presence of the variant.
  • the threshold size is a homopolymer of 6 nucleotides present between the variant and a binding site of a primer designed to detect presence of the variant.
  • a variant may not be selected as a TSV when sequence context homopolymer size exceeds the threshold size.
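  • The homopolymer check can be sketched as below, assuming the sequence between the variant and the primer binding site is available as a string. The helper names are hypothetical, and reading "exceeding a threshold size" as strictly greater than the threshold (default 6 nucleotides) is an assumption of this sketch.

```python
from itertools import groupby


def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    return max((sum(1 for _ in run) for _, run in groupby(sequence.upper())),
               default=0)


def exceeds_homopolymer_threshold(intervening_sequence: str,
                                  threshold: int = 6) -> bool:
    """True when a homopolymer longer than `threshold` lies in the sequence
    between the variant and the primer binding site."""
    return longest_homopolymer(intervening_sequence) > threshold
```

  • For example, `longest_homopolymer("ACGTAAAAAGT")` evaluates to 5, matching the AAAAA example above.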
  • identifying the plurality of TSVs 204 may comprise selecting variants using sequence context indel, the selecting determining, for each variant of at least some of a plurality of variants, whether (1) an indel is located within the sequence context of the variant, (2) the indel is located between a primer designed to detect the variant and the variant, and/or (3) an indel is located within a threshold distance (e.g., nucleotides) from the variant.
  • the threshold distance may be 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more nucleotides from the variant.
  • the threshold distance may be 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 or 40 nucleotides from the variant.
  • the threshold distance may be between 10-100, 10-75, 10-50, 15-75, or 15-50 nucleotides from the variant.
  • the threshold distance may be 25 nucleotides from the variant.
  • a variant may not be selected as a TSV when the sequence context indel meets one or more of criteria 1-3.
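  • The three indel criteria can be illustrated with the sketch below. It simplifies every feature (variant, indel, primer binding site) to a single genomic coordinate on the same chromosome, which is an assumption; the function name, the 100-nucleotide context radius, and the 25-nucleotide threshold distance default are likewise illustrative choices taken from the example ranges above.

```python
from typing import Iterable, Optional, Set


def indel_criteria_met(variant_pos: int,
                       indel_positions: Iterable[int],
                       primer_pos: Optional[int] = None,
                       context_radius: int = 100,
                       threshold_distance: int = 25) -> Set[int]:
    """Return the set of criteria (1, 2, 3) met by any nearby indel.

    1: an indel lies within the sequence context of the variant
    2: an indel lies between the primer binding site and the variant
    3: an indel lies within `threshold_distance` of the variant
    """
    met: Set[int] = set()
    for pos in indel_positions:
        distance = abs(pos - variant_pos)
        if distance <= context_radius:
            met.add(1)
        if primer_pos is not None:
            low, high = sorted((primer_pos, variant_pos))
            if low < pos < high:
                met.add(2)
        if distance <= threshold_distance:
            met.add(3)
    return met
```

  • A caller could then exclude a variant whenever the returned set is non-empty, or only when particular criteria are present.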
  • identifying the plurality of TSVs 204 may comprise selecting variants using neighboring variants, the selecting determining for each variant of at least some of a plurality of variants, whether a threshold number of other variants are located within the sequence context of the variant.
  • the threshold number of other variants may be 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 variants.
  • the threshold number of variants may be 1-5 or 1-10 variants.
  • the threshold number of variants may be 2 variants.
  • the sequence context may comprise nucleotides within 25, 50, 75 or 100 nucleotides of the variant (e.g., 25, 50, 75 or
  • identifying the plurality of TSVs 204 may comprise selecting variants using static variants, the selecting determining for each variant of at least some of a plurality of variants, whether the variant is observed in a threshold number of normal samples (e.g., sequencing data of normal samples).
  • the threshold number of normal sample observations may be 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 observations.
  • the threshold number of normal sample observations may be 2 observations.
  • a variant may be a TSV when the number of static variants is less than a threshold number of normal sample observations.
  • identifying the plurality of TSVs 204 may comprise selecting variants using primer flags, the selecting determining for each variant of at least some of a plurality of variants, whether a primer associated with the variant is identified as having more than a threshold number of primer flags.
  • Primer flags may include, but are not limited to: (1) a homopolymer sequence that exceeds a threshold length found in the primer sequence.
  • the threshold length of the homopolymer sequence may be 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides in length.
  • the threshold length of the homopolymer sequence may be 3, 4 or 5 nucleotides in length.
  • the threshold length of the homopolymer sequence may be 4 nucleotides in length; (2) a homopolymer sequence that exceeds a threshold length located between the binding site of the primer and the variant (e.g., TSV) (e.g., see Sequence Context Homopolymer Size, as described herein).
  • the threshold length of the homopolymer sequence may be 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 nucleotides.
  • the threshold length of the homopolymer sequence may be 5, 6, or 7 nucleotides.
  • the threshold length of the homopolymer sequence may be 6 nucleotides.
  • (3) “TA” nucleotide repeats that exceed a threshold number of consecutive repeats present between a primer binding sequence (which may include the primer binding sequence) and a corresponding variant the primer is designed to detect.
  • the threshold number of “TA” nucleotide repeats may be 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16 consecutive “TA” repeats.
  • the threshold number of “TA” nucleotide repeats may be 6, 7, or 8 consecutive “TA” repeats.
  • the threshold number of “TA” nucleotide repeats may be 7 consecutive “TA” repeats.
  • (4) a percentage of guanine and cytosine nucleotides within a threshold distance of the variant (e.g., upstream and/or downstream).
  • the threshold distance may be between 20-60, 30-50, or 35-45 nucleotides.
  • the threshold distance may be 30, 35, 40, 45, or 50 nucleotides.
  • the threshold distance may be at least 40 nucleotides.
  • the threshold distance may be 40 nucleotides.
  • the threshold percentage of guanine and cytosine nucleotides may be 70%, 75%, 80%, 85%, 90%, or 95%.
  • the threshold percentage of guanine and cytosine nucleotides may be at least 80%. Within the threshold distance of the variant, the threshold percentage of guanine and cytosine nucleotides may be 80%. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 1 primer flag. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 2 primer flags. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 3 primer flags.
  • a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 4 primer flags.
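  • The four primer flags described above can be counted as in the following sketch. All names are hypothetical. The defaults (4, 6, 7, and 80%) are taken from the example thresholds above; treating the homopolymer and “TA” thresholds as strict "exceeds" comparisons and the GC threshold as "at least" is an assumption. The caller is assumed to supply the primer sequence, the sequence between the primer binding site and the variant, and the flanking sequence within the threshold distance of the variant.

```python
import re
from itertools import groupby


def longest_run(seq: str) -> int:
    """Longest homopolymer run in `seq`."""
    return max((len(list(g)) for _, g in groupby(seq.upper())), default=0)


def longest_ta_repeat(seq: str) -> int:
    """Largest number of consecutive "TA" dinucleotide repeats in `seq`."""
    runs = re.findall(r"(?:TA)+", seq.upper())
    return max((len(run) // 2 for run in runs), default=0)


def gc_percent(seq: str) -> float:
    """Percentage of G and C nucleotides in `seq` (0.0 for empty input)."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def count_primer_flags(primer: str, intervening: str, flank: str,
                       primer_hp_max: int = 4, intervening_hp_max: int = 6,
                       ta_max: int = 7, gc_min_percent: float = 80.0) -> int:
    """Count how many of the four primer flags are raised."""
    flags = [
        longest_run(primer) > primer_hp_max,            # flag (1)
        longest_run(intervening) > intervening_hp_max,  # flag (2)
        longest_ta_repeat(intervening) > ta_max,        # flag (3)
        gc_percent(flank) >= gc_min_percent,            # flag (4)
    ]
    return sum(flags)
```

  • A variant could then be excluded when `count_primer_flags(...)` reaches whichever flag count (1, 2, 3, or 4) a given embodiment uses.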
  • Identifying the plurality of TSVs 204 may comprise selecting variants using sequence coverage in non-tumor cells, the selecting determining, for each variant of at least some of a plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold.
  • the threshold is between 10X and 150X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is between 50X and 100X sequencing coverage of the variant in the non-tumor cells of the patient.
  • the threshold is between 45X and 100X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is 45X, 50X, 75X or 100X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is 20X, 30X, 40X, 50X, 60X, 70X, 80X, 90X or 100X. In some embodiments, a variant may not be selected as a TSV when sequencing coverage of the variant does not exceed a threshold coverage.
  • Identifying the plurality of TSVs 204 may comprise selecting variants using a ratio of variant allele frequency between tumor cells and non-tumor cells, the selecting determining, for each variant of at least some of a plurality of variants, whether the ratio of the variant exceeds a threshold ratio.
  • the threshold ratio may be between a ratio of 100:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 30:1 and 10:1 of
  • the threshold ratio is between a ratio of 40:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 50:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 75:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency.
  • the threshold ratio is 5:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, or 100:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency.
  • a variant may not be selected as a TSV when the ratio of variant allele frequency of the variant does not exceed a threshold ratio.
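  • A sketch of the tumor to non-tumor allele frequency ratio check follows. The function names and the default threshold ratio of 10:1 are illustrative. Returning infinity when the variant is absent from the non-tumor sample is an assumption made here to avoid division by zero; the text above does not specify how that case is handled.

```python
def vaf_ratio(tumor_vaf: float, normal_vaf: float) -> float:
    """Ratio of tumor cell VAF to non-tumor cell VAF."""
    if normal_vaf == 0:
        # Assumption: a variant never seen in non-tumor cells has an
        # unbounded ratio, unless it is also absent from the tumor.
        return float("inf") if tumor_vaf > 0 else 0.0
    return tumor_vaf / normal_vaf


def passes_vaf_ratio(tumor_vaf: float, normal_vaf: float,
                     threshold_ratio: float = 10.0) -> bool:
    """True when the tumor:non-tumor VAF ratio exceeds `threshold_ratio`."""
    return vaf_ratio(tumor_vaf, normal_vaf) > threshold_ratio
```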
  • Identifying the plurality of TSVs 204 may comprise selecting variants using the tumor cell variant allele frequency, the selecting determining, for each variant of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold.
  • the threshold is between a 0.05 and a 0.1 tumor cell variant allele frequency.
  • the threshold is between a 0.025 and a 0.2 tumor cell variant allele frequency. In some embodiments, the threshold is between a 0.025 and a 0.5 tumor cell variant allele frequency. In some embodiments, the threshold is 0.025, 0.5, 0.75, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 tumor cell variant allele frequency. In some embodiments, a variant may not be selected as a TSV when tumor cell variant allele frequency of the variant does not exceed a threshold allele frequency.
  • identifying the plurality of TSVs 204 comprises assigning variants to tiers using bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency according to the thresholds in Table 1.
  • identifying the plurality of TSVs 204 comprises assigning variants to tiers using the thresholds of Table 1, wherein a variant that meets all the thresholds of tier 1 is assigned to tier 1; a variant that does not meet the thresholds of tier 1, but meets all the thresholds of tier 2 is assigned to tier 2; a variant that does not meet the thresholds of tiers 1 or 2, but meets all the thresholds of tier 3 is assigned to tier 3; a variant that does not meet the thresholds of tiers 1, 2, or 3 but meets all the thresholds of tier 4 is assigned to tier 4; a variant that does not meet the thresholds of tiers 1, 2, 3 or 4 but meets all the thresholds of tier 5 is assigned to tier 5; and a variant that does not meet any of the thresholds of tiers 1, 2, 3, 4 or 5 is not assigned to any tier.
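  • The tier assignment rule (a variant goes to the first tier whose thresholds it meets in full, or to no tier) can be sketched as below. The data layout and names are hypothetical, and the numeric thresholds in `example_tiers` are placeholders for illustration only; they are not the values of Table 1.

```python
from typing import Callable, Dict, Optional, Sequence

Variant = Dict[str, float]
Check = Callable[[Variant], bool]


def assign_tier(variant: Variant,
                tier_checks: Sequence[Sequence[Check]]) -> Optional[int]:
    """Return the 1-based index of the first tier whose checks all pass,
    or None when the variant meets no tier's thresholds."""
    for tier_number, checks in enumerate(tier_checks, start=1):
        if all(check(variant) for check in checks):
            return tier_number
    return None


# Placeholder thresholds for illustration only (NOT the values of Table 1).
example_tiers = [
    [lambda v: v["tumor_vaf"] > 0.10, lambda v: v["normal_coverage"] > 100],
    [lambda v: v["tumor_vaf"] > 0.05, lambda v: v["normal_coverage"] > 50],
]
```

  • Because tiers are tested in order, a variant that satisfies tier 1 is never assigned to a lower tier even if it also satisfies that tier's thresholds.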
  • a plurality of variants may be selected as the plurality of TSVs according to the tiers. For example, tier 1 variants may be selected as TSVs first, followed by consecutive selection in tiers 2, 3, 4, and 5 until the total number of TSVs of the plurality of TSVs is obtained. For instance, if the plurality of TSVs has 15 TSVs and tier 1 has 12 variants, tier 2 has 6 variants and tier 3 has 3 variants, then 12 TSVs may be selected from tier 1 and 3 TSVs may be selected from tier 2.
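  • The tier-by-tier selection in the example above can be sketched as follows. The function name and the dictionary layout are assumptions, as is taking variants in their given order within a tier; the text above does not say how variants are ordered within a tier.

```python
from typing import Dict, List


def select_by_tier(variants_by_tier: Dict[int, List[str]],
                   total_needed: int) -> List[str]:
    """Fill the TSV list from tier 1 downward until `total_needed` is reached."""
    selected: List[str] = []
    for tier in sorted(variants_by_tier):
        remaining = total_needed - len(selected)
        if remaining <= 0:
            break
        selected.extend(variants_by_tier[tier][:remaining])
    return selected


# The worked example from the text: 15 TSVs wanted, tiers of 12, 6 and 3.
tiers = {
    1: [f"tier1_variant{i}" for i in range(12)],
    2: [f"tier2_variant{i}" for i in range(6)],
    3: [f"tier3_variant{i}" for i in range(3)],
}
picked = select_by_tier(tiers, 15)
```

  • Here `picked` contains all 12 tier 1 variants followed by the first 3 tier 2 variants, and no tier 3 variants.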
  • the number of TSVs being selected may be a multiple (e.g., 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times) of the number of TSVs needed for the patient-specific panels.
  • the number of TSVs being selected may be 2 times the number of TSVs needed for the patient-specific panel.
  • the number of TSVs being selected may be between 50 and 200 variants.
  • the number of TSVs being selected may be at least 50 (e.g., at least 75, at least 100, at least 150, or at least 200) variants.
  • identifying the plurality of TSVs comprises identifying variants from tier 1 or tier 2, but not tier 3, tier 4 or tier 5.
  • process 200 optionally comprises identifying primers for use in detecting TSVs 206.
  • Identifying primers may comprise identifying primer sequences using any suitable method (e.g., Primer3 and Primer-BLAST) including the methods described herein.
  • identifying primers may comprise identifying primers using one or
  • Primers may be completely or partially complementary to a target polynucleotide (e.g., a target polynucleotide comprising a TSV).
  • a primer comprises a portion complementary to a target polynucleotide sequence, and in some embodiments a primer comprises a 5' tail that is not complementary to a target polynucleotide sequence.
  • a primer or primer pair is identified according to a length or Tm of all or a portion of a primer sequence.
  • a primer or primer pair is identified according to a length or Tm of a portion of a primer that is complementary to a target sequence.
  • Identifying primers may comprise identifying primers for use in polymerase chain reaction (PCR).
  • Identifying primers may comprise identifying primers for use in nested PCR or hemi-nested PCR.
  • Identifying primers may comprise identifying primers (e.g., a first and/or a second primer) for use in quantitative polymerase chain reaction (qPCR). If the primers are used in PCR or qPCR, the first and/or second primer may be designed to amplify a region of a polynucleotide comprising a variant (e.g., a TSV).
  • the region may be of a suitable size for sequencing (e.g., Illumina® sequencing) or qPCR detection.
  • a first and second primer are selected for amplification of one strand of a locus comprising a TSV, where the second primer is nested relative to the first primer, and a third and fourth primer are selected for amplification of an opposite strand of a locus comprising a TSV of interest, where the fourth primer is nested relative to the third primer.
  • identifying primers comprises identifying primers for use in Anchored Multiplex PCR (AMP).
  • a first primer e.g., a first target-specific primer, e.g., a first primer targeting a specific TSV of interest
  • a second PCR reaction can be conducted using a second target-specific primer (often nested relative to the first primer) that is paired with the same anchored primer, or a nested anchored primer, to produce a second amplicon comprising the TSV of interest.
  • 1 or 2 primers are designed or identified to amplify a single strand of a nucleic acid comprising a TSV of interest using an AMP method.
  • both the first and an optional second primer are configured to anneal to the same template strand, and 3’ of a TSV of interest.
  • Exemplary AMP based methods are disclosed in US Patent No. 9487828, which is incorporated herein by reference.
  • an AMP based method is used to amplify a complementary strand comprising the TSV of interest.
  • one or two additional primers are designed or identified for target-specific amplification of a complementary strand comprising the TSV of interest.
  • AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a first gene specific primer (e.g., a primer that is designed to amplify a polynucleotide comprising a TSV) in a first PCR reaction; and amplifying the products of the first PCR reaction with the universal primer, a second gene specific primer and a P7 primer, which binds to at least some of the binding site of the second gene specific primer.
  • AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a gene specific primer (e.g., a primer that is designed to amplify a polynucleotide comprising a TSV) in a first PCR reaction; and amplifying the products of the first PCR reaction with the universal primer and a P7 primer in a second PCR reaction, the P7 primer binding to at least some of the binding site of the gene specific primer.
  • AMP or an amplification method comprises ligating an adaptor comprising a universal primer binding site to nucleotide fragments of a sample (e.g., ctDNA) and amplifying the ligated fragments with (i) a universal primer configured to bind to a complement of the adaptor sequence, and (ii) one or more target-specific primers (e.g., TSV specific primers, e.g., one or more primers of a patient-specific panel) where each of the target-specific primers is configured to amplify a polynucleotide comprising a TSV when used with the universal primer.
  • a target-specific primer comprises a 5’-tail.
  • a 5’-tail may comprise a universal priming site, a molecular barcode and/or index sequence, and/or a sequencing primer site (e.g., a P7 primer site).
  • amplicons derived from an amplification reaction using a universal primer and one or more 5’-tailed target-specific primers are further amplified using universal primers to provide amplicons used to obtain sequencing data.
  • AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a gene
  • AMP produces amplified DNA that comprises adaptors for sequencing (e.g., Illumina® sequencing).
  • Obtaining primer sequences may occur at one of multiple different steps in process 200.
  • obtaining variant data 202 may comprise obtaining primer sequences.
  • identifying primers for use in detecting TSVs may occur after obtaining variant data 202 and before identifying a plurality of TSVs 204.
  • identifying primers for use in detecting TSVs may occur after identifying a plurality of TSVs and before identifying a subset of the plurality of TSVs for use in the patient-specific panel. In some embodiments, identifying primers for use in detecting TSVs may occur after identifying a subset of the plurality of TSVs for use in the patient-specific panel.
  • Process 200 next includes identifying a subset of the plurality of TSVs for use in the patient-specific panel 208. Identifying a subset of the plurality of TSVs for use in the patient-specific panel 208 is described herein including with reference to the section entitled “Subset of the plurality of Tumor Specific Variants”, and with reference to FIG. 2B and FIG. 3.
  • Process 200 optionally includes synthesizing the primers 210 associated with the subset of the plurality of TSVs. Synthesizing the primers may comprise synthesizing at least some of the primers for detecting the subset of the plurality of TSVs. In some embodiments, synthesizing the primers comprises synthesizing the primers for detecting each TSV of the subset of the plurality of TSVs. Primer synthesis may be performed using any suitable method.
  • FIG. 2B is a flowchart of an illustrative process 250 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, in accordance with some embodiments of the technology described herein.
  • Process 250 involves generating respective sets of features 252 corresponding to the at least some TSVs of the plurality of TSVs (generated in step 204 of process 200); processing the respective sets of features using a trained machine learning (ML) model to obtain a score for the at least some of the TSVs of the plurality of TSVs 254; and selecting TSVs for inclusion into the subset of the plurality of TSVs using the scores obtained with the trained machine learning model 258 according to process 300 (FIG. 3).
  • Process 250 includes generating, for each of at least some of the plurality of TSVs, a respective set of features to obtain sets of features 252.
  • generating the respective set of features 252 comprises generating at least one sequencing coverage feature
  • Generating sequencing coverage features may refer to generating features that are based on sequence data (e.g., Illumina® sequencing reads) covering a TSV. Generating sequencing coverage features may include, but is not limited to, generating sequencing depth of coverage of plus strands and minus strands for a TSV (i.e., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for a TSV (i.e., strand bias).
  • Minimum strand coverage may be generated based on the minimum sequencing depth of coverage of plus strands and minus strands of a TSV of the subset of the plurality of TSVs in the tumor cell sample (e.g., tumor cells) of the patient. For example, minimum strand coverage may be determined by calculating the minimum sequencing depth of coverage of plus strands and minus strands of a TSV of the subset of the plurality of TSVs in the tumor sample. For example, if the number of plus strands covering the TSV is 10 and the number of minus strands covering the TSV is 11 then the minimum strand coverage would be 10.
  • Strand bias may be generated based on the relative number of sequencing reads of the plus strand and the minus strand that cover the locus comprising a TSV (e.g., reads of the locus with and without the TSV in the tumor cells or normal cells of the patient). For example, strand bias may be determined by dividing plus strand read depth of the locus comprising the TSV by minus strand read depth of the locus comprising the TSV in tumor cells of the patient (e.g., a biological sample of tumor cells). In some embodiments, strand bias may be determined by calculating the log2 ratio between plus strand read depth and minus strand read depth.
  • strand bias may be determined by calculating the absolute value of the log2 of the ratio between plus strand read depth and minus strand read depth. In some embodiments, strand bias may be determined by calculating the absolute value of the log2 of the ratio between plus strand read depth + 1 and minus strand read depth + 1.
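  • The two sequencing coverage features described above can be written compactly, as in the sketch below. The function names are hypothetical. The strand bias shown is the "+ 1" variant, which stays defined when one strand has zero reads:

    $$\text{strand bias} = \left|\log_2\frac{\text{plus depth} + 1}{\text{minus depth} + 1}\right|$$

```python
import math


def minimum_strand_coverage(plus_depth: int, minus_depth: int) -> int:
    """Smaller of the plus strand and minus strand depths covering the TSV."""
    return min(plus_depth, minus_depth)


def strand_bias(plus_depth: int, minus_depth: int) -> float:
    """Absolute log2 ratio of (plus depth + 1) to (minus depth + 1).

    0.0 means perfectly balanced strands; larger values mean more bias,
    regardless of which strand is over-represented.
    """
    return abs(math.log2((plus_depth + 1) / (minus_depth + 1)))
```

  • With 10 plus strand reads and 11 minus strand reads, `minimum_strand_coverage(10, 11)` gives 10, as in the example above. Taking the absolute value makes the feature symmetric, so a 3:0 split and a 0:3 split both give a strand bias of 2.0.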
  • Generating allele frequency features may refer to generating features that are based on the allele frequency of a variant or TSV in tumor cells of a patient, non-tumor cells of a patient, or a database comprising genome sequences from healthy individuals and/or individuals having a disease (e.g., cancer). Generating allele frequency features may include, but is not limited to, generating non-tumor cell depth coverage of a TSV, generating a number
  • Non-tumor cell depth coverage of a TSV may be determined based on the number of sequencing reads covering the locus comprising the TSV (e.g., including alleles with the TSV and alleles without the TSV) in non-tumor cells (e.g., biological sample of the non-tumor cells) of the patient.
  • non-tumor cell depth coverage of a TSV may be the number of sequencing reads covering the locus containing the TSV of the patient (e.g., a 100X coverage of the locus comprising the TSV).
  • Non-tumor cell depth coverage may be indicative of the detectability of the TSV because coverage in non-tumor cells may indicate the relative ease of amplifying or sequencing the locus containing the variant.
  • Tumor cell alternate observations may be determined based on the number of sequencing reads (e.g., reads from WES) containing the TSV in the tumor cells (e.g., a biological sample of tumor cells) of the patient.
  • tumor cell alternate observations may be the number of sequencing reads (e.g., reads from WES) containing the TSV of the tumor cells (e.g., a biological sample of tumor cells) of the patient.
  • Tumor cell alternate observations may be indicative of the detectability of the TSV in ctDNA of the patient.
  • Tumor allele frequency may be calculated based on (1) the total number of sequencing reads covering the TSV in tumor cells (e.g., a biological sample of tumor cells) of the patient and (2) the total number of sequencing reads covering the locus comprising the variant allele (e.g., all the sequencing reads covering the locus including the sequencing reads covering the TSV).
  • Tumor cell variant allele frequency may be calculated by dividing (1) by (2) above.
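  • As a small worked illustration, the tumor cell variant allele frequency is the number of reads containing the TSV divided by the number of all reads covering the locus. The function and argument names below are hypothetical, and raising an error on an uncovered locus is a choice made for this sketch.

```python
def tumor_variant_allele_frequency(tsv_reads: int, locus_reads: int) -> float:
    """Reads containing the TSV divided by all reads covering the locus."""
    if locus_reads <= 0:
        raise ValueError("locus must be covered by at least one read")
    if not 0 <= tsv_reads <= locus_reads:
        raise ValueError("TSV reads must be between 0 and total locus reads")
    return tsv_reads / locus_reads
```

  • For instance, 25 TSV-containing reads at a locus covered by 100 reads gives an allele frequency of 0.25.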
  • a TNC may be a series of three sequential nucleotides in a sequence read (e.g., AAA, TAT, GTA, etc.).
  • the TNC may comprise a variant (e.g., a TSV) in the middle position of the TNC.
  • Generating the TNC error rate feature (i.e., the error rate in error-corrected bins)
  • Generating the TNC error rate feature may comprise generating the estimated probability of observing a TSV in the middle position of the TNC due to errors introduced during sample preparation and/or sequencing of one or more biological samples (e.g., biological samples collected and/or sequenced previously).
  • Generating the TNC error rate feature may comprise obtaining data associated with the TNC error rate observed during sample preparation and/or sequencing (e.g., see Table 2).
  • generating the TNC error rate feature comprises generating an error rate feature comprising error rates that are within 50% of the error rates described in Table 2.
  • Generating a C to A variant mutation feature may refer to generating an indicator (e.g., binary indicator) for whether the variant is a C to A mutation.
  • generating a C to A variant mutation feature may comprise assigning a “1” to a variant mutation from C to A and a “0” to any other mutation type.
  • generating a C to A variant mutation feature may comprise assigning a “0” to a variant mutation from C to A and a “1” to any other mutation type.
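As a hedged illustration of the trinucleotide context and the C to A indicator, the sketch below extracts the three-base window around a variant position, looks up an error rate, and assigns the binary indicator. All names are hypothetical. The error rates in the lookup table are placeholders and are not the values of Table 2. Keying the table by the reference trinucleotide and the alternate base, and ignoring strand normalization, are assumptions made for this sketch.

```python
def trinucleotide_context(reference: str, position: int) -> str:
    """Return the three-base window centered on a 0-based position."""
    if position < 1 or position > len(reference) - 2:
        raise ValueError("position needs one flanking base on each side")
    return reference[position - 1 : position + 2].upper()


def c_to_a_indicator(ref_base: str, alt_base: str) -> int:
    """Assign 1 to a C to A mutation and 0 to any other mutation type."""
    return 1 if (ref_base.upper(), alt_base.upper()) == ("C", "A") else 0


# Placeholder error rates keyed by (reference trinucleotide, alternate base).
# These numbers are invented for illustration and are NOT from Table 2.
HYPOTHETICAL_TNC_ERROR_RATES = {("ACG", "A"): 2e-5, ("TCT", "A"): 5e-5}

reference = "GGACGTT"  # hypothetical reference sequence
tnc = trinucleotide_context(reference, 3)  # the base at index 3 is 'C'
error_rate = HYPOTHETICAL_TNC_ERROR_RATES.get((tnc, "A"))
is_c_to_a = c_to_a_indicator(reference[3], "A")
print(tnc, error_rate, is_c_to_a)
```

The opposite encoding ("0" for C to A and "1" otherwise) would only require swapping the two return values.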
  • Generating primer features may refer to generating a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a primer (e.g., a first primer, a target-specific primer) designed to detect the TSV.
  • Generating primer features may refer to generating a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a primer (e.g., a first primer), optionally different from a second primer, designed to detect the TSV.
  • Generating a primer feature may comprise determining the minimum distance of (1) the binding site of a primer (e.g., a first primer) and the TSV and optionally (2) the binding site of a second primer and the TSV.
  • generating a primer feature may be determining the maximum distance of (1) the binding site of a primer (e.g., a first primer) and the TSV and optionally (2) the binding site of a second primer and the TSV.
  • the primers designed for a TSV comprise two sets of primers (e.g., nested primers): gene specific primer 1 forward (GSP1-F) and gene specific primer 1 reverse (GSP1-R); and gene specific primer 2 forward (GSP2-F) and gene specific primer 2 reverse (GSP2-R).
  • GSP1-F and GSP1-R may be used in a first PCR reaction to amplify a polynucleotide comprising the TSV.
  • GSP2-F and GSP2-R may be used in a subsequent PCR reaction to amplify a region within the polynucleotide amplified in the first reaction (e.g., to amplify a region of the amplicons generated in the first PCR reaction).
  • generating primer features may comprise generating features using the minimum distance between (1) the distance between a primer binding site of GSP1-F and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site) and (2) the distance between a primer binding site of GSP1-R and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site).
  • generating primer features may comprise generating features using the minimum distance between (1) the distance between a primer binding site of GSP2-F and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-F primer binding site) and (2) the distance between a primer binding site of GSP2-R and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-R primer binding site).
  • generating primer features may comprise generating features using the maximum distance between (1) the distance between a primer
  • generating primer features may comprise generating features using the maximum distance between (1) the distance between a primer binding site of GSP2-F and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-F primer binding site) and (2) the distance between a primer binding site of GSP2-R and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-R primer binding site).
  • an absolute, average, mean, minimum, or maximum distance between a primer binding site of one or more primers and a TSV is in a range of 0 to 250, 0 to 150 or 0 to 50 nucleotides. In some embodiments, an absolute, average, mean, minimum or maximum distance between a primer binding site of one or more primers and a TSV is 20 to 40 nucleotides, or about 30 nucleotides.
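The minimum and maximum primer distance features described above can be sketched as follows. The coordinates, function names, and the choice to measure from the nearest end of each binding site are assumptions for illustration; the described embodiments allow measurement from either the 3’ or the 5’ end.

```python
def distance_to_binding_site(tsv_pos: int, site_start: int, site_end: int) -> int:
    """Nucleotides between a TSV and the nearest end of a primer binding site.

    Returns 0 when the TSV lies inside the binding site. Coordinates are
    positions on the same reference sequence, with site_start <= site_end.
    """
    if site_start > site_end:
        raise ValueError("site_start must not exceed site_end")
    if site_start <= tsv_pos <= site_end:
        return 0
    return min(abs(tsv_pos - site_start), abs(tsv_pos - site_end))


def primer_distance_features(tsv_pos: int, forward_site: tuple, reverse_site: tuple) -> dict:
    """Minimum and maximum TSV-to-primer distance for one primer pair."""
    d_forward = distance_to_binding_site(tsv_pos, *forward_site)
    d_reverse = distance_to_binding_site(tsv_pos, *reverse_site)
    return {
        "min_distance": min(d_forward, d_reverse),
        "max_distance": max(d_forward, d_reverse),
    }


# Hypothetical coordinates for one primer pair (e.g., GSP2-F and GSP2-R).
features = primer_distance_features(
    tsv_pos=1000, forward_site=(950, 970), reverse_site=(1040, 1060)
)
print(features)
```

The same function could be applied once to the GSP1 pair and once to the GSP2 pair to obtain separate features for the outer and nested primers.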
  • Generating sequence context features may include, but is not limited to, generating a conservation score of the sequence context comprising the TSV, a distance between the TSV and a nearest splice site in the sequence context, and/or a splice site score of the sequence context (e.g., a score indicating a splice site is located within the sequence context).
  • a sequence context may refer to the nucleotides on either side of the TSV in the primary sequence of the genome of the patient as is further described herein including with reference to the section “Sequence Context Features”.
  • the nucleotides within the sequence context may be the nucleotides between and including a first primer binding site and a second primer binding site.
  • the nucleotides within the sequence context may be the nucleotides between the primers of the outer set of primers (GSP1 primers) including the primer binding sites.
  • the nucleotides within the sequence context may be the nucleotides of the locus comprising the TSV.
  • Generating a context feature may comprise determining a conservation score of a polynucleotide of the patient comprising a TSV, a distance between the TSV and a nearest splice site on the polynucleotide (e.g., using SpliceSiteFinder), and/or a splice site score of the polynucleotide (e.g., using MaxEntScan).
  • generating a sequence conservation score comprises determining the conservation of the sequence (e.g., % conservation between species), e.g., using standard methods including, but not limited to, BLAST, HMMER, OrthologR, and Infernal.
  • generating the conservation score comprises generating a phastCons conservation score and/or a phyloP
  • a conservation score of a polynucleotide comprising a TSV may be determined using any suitable algorithm that determines conservation.
  • generating the first set of features for the first TSV comprises determining for each TSV of the subset of the plurality of TSVs: the sequencing depth of coverage of plus strands and minus strands for the TSV, the non-tumor cell depth coverage for the TSV, the number of observations of the TSV in tumor cells of the patient, and the trinucleotide context (TNC) error rate feature.
  • the method further comprises determining for each TSV of the subset of the plurality of TSVs, the maximum distance between the TSV and a binding site for the second primer designed to detect the TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the TSV, the tumor allele frequency of the TSV, the phastCons conservation score, the TSV and the binding site for the first primer designed to detect the TSV, the distance between the TSV and the nearest splice site on the polynucleotide, and a phyloP conservation score.
  • the method further comprises for each TSV, determining the C to A variant mutation feature, the minimum distance between the TSV and a binding site for the second primer designed to detect the TSV, the splice site score of the polynucleotide, the minimum distance between the TSV and the binding site for the second primer designed to detect the TSV.
  • Process 250 step 254 involves processing the plurality of sets of features using the trained ML model to obtain a corresponding plurality of scores. Processing the plurality of sets of features may comprise processing using a trained machine learning (ML) model.
  • the trained ML model may be a classification model.
  • the trained ML model may be a regression model.
  • the trained ML model may be a linear model.
  • the trained ML model may be a nonlinear model.
  • the trained ML model may be any suitable ML model including, but not limited to, a linear mixed effects model with a linked logistic function, a non-linear mixed effect model, a neural network, a support-vector machine, or a random forest.
  • the trained ML model may be a random forest model.
  • the trained ML model may be a random forest classifier. The random forest classification of each TSV may be indicative of the score of the TSV. The random
  • FIG.12 is a diagram depicting an illustrative technique 1200 for training the trained machine learning model to generate a score indicative of the predicted detectability of a TSV 1218, according to some embodiments of the technology described herein.
  • MRD positive patients 1202 were monitored with previously designed patient-specific panels 1208 (e.g., panels comprising primers that were designed using a different method).
  • the previously designed patient-specific panels provided the TSV detectability 1206 in the corresponding patient 1202 with 1214 indicative of the presence of a TSV and 1212 indicative of the absence of a TSV.
  • the training data 1204 comprises the TSV Detectability 1206 and Variant Data 1210.
  • the training data 1204 is used in training the machine learning model 1216 with the objective of generating scores indicative of the predicted detectability of a TSV in an MRD positive patient 1218 (e.g., in sequencing data of a biological sample of an MRD positive patient).
  • training data 1204 may have been collected from biological samples of a plurality of MRD positive patients 1202 (e.g., as described herein). Patients of the plurality of MRD positive patients 1202 may have been previously diagnosed with cancer as described herein.
  • the plurality of patients may consist of MRD positive patients.
  • the plurality of MRD positive patients 1202 may comprise patients that have been previously diagnosed with a specific cancer type.
  • the plurality of MRD positive patients 1202 may consist essentially of patients that have been previously diagnosed with a specific cancer type.
  • the plurality of MRD positive patients 1202 may consist of patients that have been previously diagnosed with a specific cancer type (e.g., lung cancer or melanoma).
  • Patients previously treated for lung cancer may comprise patients previously treated for non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), and lung adenocarcinoma.
  • the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for the same type of cancer.
  • the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for melanoma. In other embodiments, the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for lung cancer. In these embodiments, the trained machine learning model may reflect cancer type-specific biases in TSV detectability, and achieve a
  • a model trained using data from a plurality of MRD positive patients previously treated for a first cancer type may also be predictive of the detectability of TSVs in a different cancer type (e.g., melanoma), e.g., as described in Example 3.
  • the plurality of MRD positive patients 1202 may comprise patients previously diagnosed with different cancer types (e.g., a first patient may be previously diagnosed with a first cancer type, a second patient may be previously diagnosed with a second cancer type, a third patient may be previously diagnosed with a third cancer type, etc.).
  • the plurality of patients 1202 may comprise patients previously treated for one or more of brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, etc.
  • the plurality of patients 1202 may comprise patients that have been previously diagnosed with lung cancer and patients that have been previously diagnosed with melanoma.
  • a machine learning model may be trained using data from different patients previously diagnosed with different cancer types. These machine learning models may be more generalized, as the features explaining TSV detectability that are common across different cancer types may be prioritized.
  • the fitted model may be beneficial for prioritizing variants in new cancer types for which there is not yet pre-existing data to train on, as well as in rare cancer types for which data are limited in availability.
  • Each TSV (e.g., 1212 and 1214)
  • each biological sample of a plurality of biological samples collected from a plurality of MRD positive patients 1202 may have been monitored using one or more of previously designed patient-specific panels 1208.
  • the number of TSVs of the plurality of TSVs used for training the model may be dependent on the number of TSVs being targeted by the previously designed patient-specific panel 1208.
  • the previously designed patient-specific panel 1208 may target at least 50, at least 75, at least 100, at least 150, at least 200, or at least 250 TSVs.
  • 100-300 TSVs are targeted by the previously designed patient-specific panels 1208.
  • 150-250 TSVs are targeted by the previously designed patient-specific panels 1208.
  • 200 TSVs are targeted by the previously designed patient-specific panels 1208.
  • the plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels 1208 may comprise at least 25 MRD positive patients, at least 50 MRD positive patients, at least 75 MRD positive patients, at least 100 MRD positive patients,
  • the plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 25-500 MRD positive patients.
  • the plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 25-75 MRD positive patients.
  • the plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 50 MRD positive patients.
  • the plurality of MRD positive patients may comprise 499 patients previously treated for melanoma and/or 57 patients previously treated for lung cancer.
  • Generating the trained machine learning model 1216 may comprise training a machine learning model to generate a score indicative of the predicted detectability of a TSV 1218 in a biological sample (e.g., plasma) of a MRD positive patient.
  • Training the trained machine learning model may comprise: obtaining, for a plurality of previously monitored TSVs 1212 and 1214 in each biological sample of a plurality of biological samples collected from a plurality of MRD positive patients 1202, sets of training data, each set of training data comprising: (i) an indication (e.g., a binary indication) of whether the TSV is present or absent 1206 in the biological sample; and (ii) variant data 1210 associated with the TSV (e.g., features derived from the variant data).
  • Training the trained machine learning model may further comprise using the sets of training data to estimate a score indicative of detectability of a TSV in a biological sample from a MRD positive patient.
  • the previously designed patient-specific panels 1208 may be used to monitor at least 50, at least 75, at least 100, at least 150, at least 200, or at least 250 TSVs. In some embodiments, the previously designed patient-specific panels 1208 may be used to monitor 100-300 TSVs. In some embodiments, the previously designed patient-specific panels 1208 may be used to monitor 150-250 TSVs. In some embodiments, the previously designed patient-specific panels 1208 may be used to monitor 200 TSVs. Because the previously designed panels are patient-specific, different panels may not monitor the same TSVs.
  • the indication of whether the TSV is present or absent 1206 in the biological sample may be any suitable indication.
  • the indication may be a negative indication indicating the TSV was not detected (e.g., a value of 0) or a positive indication indicating the TSV was detected (e.g., a value of 1).
  • Detection may be based on a threshold (e.g., the lower limit of detection of the sequencing instrument) thus a detected TSV may be a TSV with an
  • an undetected TSV may be a TSV with an allele frequency in the biological sample of the patient that does not exceed the threshold.
  • the indication may be the allele frequency at which the TSV was detected.
  • the indication of whether the TSV is present or absent may be the TSV being present in the biological sample at an allele frequency that exceeds a threshold.
  • the threshold may be the limit of detection of the patient-specific panel.
  • the threshold may be an allele frequency of at least 0.0001, 0.001, 0.01, 0.1 or higher.
  • the threshold may exceed the expected experimental noise associated with preparation and/or sequencing of the biological sample.
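A threshold-based binary indication of detection may be sketched as follows. The function name and the example allele frequencies are hypothetical. The default threshold of 0.001 is one of the example values listed above and is used here only as a placeholder.

```python
def detection_label(allele_frequency: float, threshold: float = 0.001) -> int:
    """Return 1 if the TSV allele frequency exceeds the threshold, else 0."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    return 1 if allele_frequency > threshold else 0


# Hypothetical allele frequencies observed for three monitored TSVs.
labels = [detection_label(af) for af in (0.0, 0.0005, 0.02)]
print(labels)
```

A strict comparison is used because the text describes a detected TSV as one whose allele frequency exceeds the threshold.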
  • the variant data 1210 associated with each TSV may comprise one or more of the features described herein.
  • the variant data associated with each TSV may comprise one or more of at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • the variant data associated with each TSV may comprise at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
  • the objective of training the model 1218 may be generating a score indicative of the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
  • FIG.13 is a flowchart of an illustrative process 1300 for training the trained machine learning model (e.g., a random forest), according to some embodiments of the technology described herein.
  • Process 1300 begins with obtaining training data 1302.
  • Obtaining training data 1302 may comprise obtaining training data in any suitable way.
  • Obtaining training data 1302 may comprise obtaining data from previously performed monitoring of TSVs in MRD positive patients (as described herein).
  • Obtaining training data 1302 may comprise obtaining, for each previously monitored MRD positive patient, variant data associated with the TSVs of the MRD positive patient and a corresponding indication of whether each TSV is present or absent in a biological sample of the patient.
  • Training data may be obtained from a plurality of MRD positive patients, as described herein.
  • the training data obtained may be used to estimate the trained machine learning model 1304 by estimating the model parameters 1306 based on estimated model hyperparameters 1308. This may be an iterative process where the hyperparameters are estimated (e.g., using a grid search as described herein), then the model parameters are estimated using the estimated hyperparameters and the training data, and the fit of the model to validation data is determined. This may be repeated with adjusted hyperparameters based on the grid search.
  • cross validation may refer to training using a first portion of the training data and then using the remaining training data to assess the accuracy of the model (e.g., in determining an indication of the detectability of a TSV).
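One common way to realize the split described above is k-fold cross validation, sketched below using only index bookkeeping. The function name, the fold count, and the use of contiguous folds are assumptions for illustration and are not stated in the described embodiments.

```python
def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start : start + size]
        train = indices[:start] + indices[start + size :]
        yield train, validation
        start += size


# Hypothetical training set of 10 labeled TSVs split into 5 folds.
splits = list(k_fold_splits(n_items=10, k=5))
for train, validation in splits:
    print(len(train), validation)
```

Each item is used for validation exactly once, and the model would be refit on the corresponding training indices for every fold.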
  • Estimating model parameters 1306 and estimating model hyperparameters 1308 may be performed using known methods (e.g., using scikit-learn).
  • Estimating model hyperparameters 1308 may comprise performing a grid search of the hyperparameters.
  • a grid search may refer to a method where, for each hyperparameter, a set of values encompassing the range of potential values is predefined, and combinations of these values for each hyperparameter are considered (e.g., the model’s parameters are estimated using a given set of hyperparameters and the predictive performance of the model is determined).
  • in an exhaustive grid search, all possible combinations of the predefined values for all hyperparameters may be considered.
  • in a randomized grid search, a subset of the possible combinations may be sampled and considered.
  • a randomized grid search may be used to identify a near optimal set of hyperparameters in potentially less time than an exhaustive grid search (due to the combinatorics of the problem, the grid can potentially be quite large, and take a long time to compute).
  • the grid search may be an exhaustive grid search.
  • the grid search may be a random grid search.
  • the hyperparameters may be determined one at a time.
  • the grid search may comprise adjusting the value of one hyperparameter at a time.
  • the grid search may comprise adjusting the values of at least 1 hyperparameter at a time.
  • Estimating model hyperparameters 1308 may comprise estimating hyperparameters of a random forest model.
  • the hyperparameters may be one or more of the number of trees in the forest, the function to measure the quality of a split, the number of features to consider when looking for the best split, the maximum depth of the tree, the minimum number of samples required to split an internal node, the minimum number of samples in newly created leaves, and/or the maximum number of leaves to grow in the tree.
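The difference between an exhaustive and a randomized grid search can be sketched by enumerating hyperparameter combinations. The grid values below are hypothetical. The hyperparameter names follow the argument names of scikit-learn's `RandomForestClassifier`, which appear to correspond to the hyperparameters listed above; that mapping is an assumption of this sketch.

```python
import itertools
import random

# Hypothetical grid of predefined values for four random forest hyperparameters.
grid = {
    "n_estimators": [100, 300, 500],   # number of trees in the forest
    "criterion": ["gini", "entropy"],  # function measuring split quality
    "max_depth": [None, 5, 10],        # maximum depth of each tree
    "min_samples_split": [2, 5],       # samples required to split a node
}


def exhaustive_grid(grid: dict) -> list:
    """Every combination of the predefined values."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def randomized_grid(grid: dict, n_samples: int, seed: int = 0) -> list:
    """A random subset of the combinations, sampled without replacement."""
    combos = exhaustive_grid(grid)
    rng = random.Random(seed)
    return rng.sample(combos, min(n_samples, len(combos)))


all_combos = exhaustive_grid(grid)   # 3 * 2 * 3 * 2 combinations
sampled = randomized_grid(grid, n_samples=10)
print(len(all_combos), len(sampled))
```

In practice, each combination would be used to fit a model and score it on validation data. scikit-learn's `GridSearchCV` and `RandomizedSearchCV` combine this enumeration with cross validation.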
  • Training the trained machine learning model may also comprise selecting features for use in the trained machine learning model. Selecting features may comprise: training a
  • FIG.3 is a diagram depicting an illustrative technique 300 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein.
  • Diagram 302 represents the TSVs identified previously using methods described herein, with 304 representing specific TSVs, 306 representing the sequence of the tumor sample, and 308 representing the sequence of the non-tumor sample.
  • Diagram 310 represents features associated with each TSV of diagram 302 (e.g., as described above).
  • the TSVs in diagram 302 and features in diagram 310 are processed in trained machine learning model 312 to generate a score indicative of the predicted detectability of each of the TSVs.
  • TSVs are ranked 317 by score 316 and the top X TSVs (e.g., TSVs exceeding a threshold 318, where X may be the number of TSVs desired or needed for inclusion in the patient-specific panel) are selected as the subset of the plurality of TSVs.
  • Technique 300 begins with obtaining data indicative of tumor specific variants (TSVs) 302 and data indicative of features for each TSV 310.
  • TSVs 304 may refer to variants that are present in the tumor sequence 306 (e.g., the genome of the tumor) and absent in the non-tumor sequence 308 (e.g., the genome of non-tumor tissue). However, this need not always be the case.
  • TSVs may also refer to variants that have a greater allele frequency in the tumor sequence 306 than the non-tumor sequence 308.
  • TSVs 302 may refer to TSVs that are identified using any suitable method including, but not limited to the methods described herein including in the section “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset.” Data indicative of TSVs 302 may be stored in any suitable file format.
  • Features for each TSV 310 may refer to any suitable features that are indicative of the detectability of a TSV. For example, these features may include features related to the sequence coverage, allele frequency, sequencing error rate, primer design (e.g., primers
  • Data indicative of these features may be generated by any suitable method, including but not limited to the methods described herein, including with reference to FIG.2B.
  • Data indicative of features for each TSV 310 may be stored in any suitable file format.
  • Technique 300 may continue with, for one or more of the TSVs of 302, inputting the data indicative of a TSV of the one or more TSVs and the features indicative of the TSV into the trained ML model 312, and outputting a score indicative of the predicted detectability of the TSV.
  • the trained ML model 312 may be a ML model of any suitable type.
  • the trained ML model 312 may be a trained ML model as described herein including in reference to FIG.2B.
  • the trained ML model 312 may be a nonlinear model (e.g., a random forest).
  • the trained ML model 312 may be trained using any suitable method including, but not limited to, the methods described herein including with reference to FIG.2B.
  • the trained machine learning model 312 may be trained with data comprising (1) an indication of the detectability of a TSV (e.g., in a patient with or without MRD) and (2) features associated with the TSV, including, but not limited to, as described herein including with reference to FIG.2B.
  • An indication of MRD in the patient may be used as a proxy for the detectability of a TSV.
  • the trained ML model 312 may output scores 316 associated with each TSV (and corresponding features) that are inputted into the model. The score 316 may be indicative of the detectability of a corresponding TSV in a biological sample of the patient.
  • the score 316 may be the predicted likelihood of detecting the TSV in a biological sample (e.g., plasma) of an MRD positive patient (e.g., detecting using a patient-specific panel).
  • the score 316 may be the predicted likelihood of detecting the TSV in a biological sample of an MRD positive patient.
  • a likelihood of detecting the TSV in a biological sample of an MRD positive patient may be determined as follows: Each tree of the random forest may provide a probability defined as the proportion of samples in the terminal leaf that belong to class "1" (i.e., the TSV is detected). These probabilities may be aggregated across all of the trees to determine a likelihood that, given input features, that TSV is detected.
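The aggregation of per-tree probabilities can be sketched as follows. The leaf counts are hypothetical, and the use of a simple average across trees is an assumption (it is how scikit-learn's random forest classifier combines tree probabilities); the text only states that the probabilities may be aggregated.

```python
from statistics import mean


def leaf_probability(n_detected: int, n_total: int) -> float:
    """Proportion of training samples in a terminal leaf that belong to class 1."""
    if n_total <= 0 or not 0 <= n_detected <= n_total:
        raise ValueError("need 0 <= n_detected <= n_total and n_total > 0")
    return n_detected / n_total


def forest_score(leaf_counts: list) -> float:
    """Average the per-tree leaf probabilities to obtain one score for a TSV."""
    return mean(leaf_probability(d, t) for d, t in leaf_counts)


# Hypothetical forest of four trees: (class-1 samples, all samples) in the
# terminal leaf that the TSV's features reach in each tree.
score = forest_score([(8, 10), (3, 4), (1, 2), (9, 10)])
print(round(score, 4))
```

Here the four trees give probabilities of 0.8, 0.75, 0.5, and 0.9, so the aggregated score is their mean.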
  • the scores 316 may be ranked 317 and the top TSVs may be selected for the panel (e.g., the patient specific panel) in step 314.
  • the top TSVs may be selected based on a threshold 318.
  • the threshold may be
  • selecting TSVs for inclusion in the panel comprises selecting the TSVs with the highest scores.
  • selecting TSVs for inclusion in the panel comprises selecting TSVs with scores above a threshold.
  • At least 10 (e.g., at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, or at least 100) TSVs may be selected from the subset of the plurality of TSVs.
  • the number of TSVs (e.g., top X TSVs) selected for the subset of the plurality of TSVs may be determined based on the number of TSVs that need to be monitored to indicate MRD (e.g., to have confidence in an indication of MRD).
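Ranking TSVs by score and selecting the top X above a threshold may be sketched as follows. The TSV identifiers, scores, threshold, and value of X are hypothetical, and combining the threshold with the top-X cutoff in a single function is a choice made for this sketch.

```python
def select_tsvs(scores: dict, top_x: int, min_score: float = 0.0) -> list:
    """Rank TSVs by descending score and keep up to top_x above min_score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tsv for tsv, score in ranked if score > min_score][:top_x]


# Hypothetical detectability scores produced by a trained model.
scores = {"tsv_a": 0.91, "tsv_b": 0.40, "tsv_c": 0.77, "tsv_d": 0.15}
panel = select_tsvs(scores, top_x=2, min_score=0.3)
print(panel)
```

If fewer than X TSVs exceed the threshold, the function returns only those that do, which may signal that the candidate pool is too small for the panel.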
  • FIG.4 is a flowchart of an illustrative process 400 for identifying the subset of the plurality of TSVs for use in a patient-specific panel, according to some embodiments of the technology described herein.
  • Process 400 has the following acts: act 402, obtain the variant data indicative of a plurality of variants of the patient; act 404, identify a plurality of TSVs for the patient based on allele frequencies of the variants; act 406, identify a subset of the plurality of the TSVs for use in the patient-specific panel (which comprises acts 408, 410 and 412); act 408, generate a set of features corresponding to at least some of the TSVs of the subset of the plurality of TSVs; act 410, process the features using a trained ML model to obtain scores for each TSV that are indicative of the detectability of the TSV; and act 412, select, using the scores, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel.
  • Illustrative process 400 begins with obtaining variant data indicative of a plurality of variants of the patient, the variant data previously-generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient 402.
  • Obtaining variant data may comprise obtaining variant data comprising data indicative of the variants identified in at least one biological sample (e.g., in the genome of a tumor cell sample and/or a non-tumor cell sample).
  • Data indicative of the variants may be any suitable data and may include, but is not limited to, the variant data described herein including in the section “Variant Data”.
  • the variant data may comprise the data needed to calculate the features described herein.
  • the variant data may comprise data associated with primer design (e.g., primers designed to detect variants), as described herein.
  • the variant data in act 402 is previously-generated by analyzing sequence data of at least one biological sample obtained from the patient.
  • Sequence data may be any suitable sequence data of at least one biological sample for the patient.
  • the sequence data may be sequence data as described herein including in the section “Sequence data”.
  • Act 402 may also comprise generating the sequence data by sequencing the DNA and/or RNA of the biological sample using any suitable method including, but not limited to, the methods described herein. Additionally, sequencing of the samples and generation of the variant data for one or more of the biological samples may be performed by a third party.
  • the one or more biological samples obtained from the patient may comprise a tumor cell sample (e.g., tumor cells).
  • the one or more biological samples obtained from the patient may comprise a non-tumor cell sample (e.g., non-tumor cells).
  • the biological samples may be any suitable biological samples (e.g., tumor cells or non-tumor cells) including as described herein including in the section “Biological Samples”.
  • the patient may be any patient having diseased tissue (e.g., cancer or a tumor) including but not limited to as described herein including with reference to the section “Patient”.
  • the patient may be a patient having lung cancer or melanoma.
  • Process 400 continues with act 404, identifying a plurality of tumor-specific variants (TSVs) for the patient.
  • a plurality of TSVs may comprise any suitable number of TSVs including as described herein including with reference to FIG.1.
  • the plurality of TSVs may contain at least twice the number of TSVs as needed in the patient-specific panel (e.g., needed to determine an indication of MRD). Having a plurality of TSVs comprising twice the number of TSVs required for the panel may increase the chances that at least the minimum number of suitable TSVs are identified for use in the patient-specific panel.
  • a plurality of TSVs may comprise at least 50, at least 100, at least 150, or at least 200 TSVs.
  • the plurality of TSVs may be selected using any suitable methods including but not limited to as described herein and in reference to the section “Tumor Specific Variants (TSVs) and Features For Selecting TSVs.”
  • the plurality of TSVs may be selected using one or more features that are indicative of the detectability of the TSV (e.g., the detectability using primer amplification and/or next generation sequencing). Identifying TSVs for the plurality of TSVs may comprise applying thresholds based on the features described herein.
  • the healthy population allele frequency features may be used to identify one or more variants as TSVs when the variant does not exceed a threshold allele frequency in a healthy population, as described herein and in the section “Healthy Population Variant Allele Frequency” (indicating that the variant is not tumor associated).
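Applying a healthy population allele frequency threshold may be sketched as a simple filter. The variant records, field names, and the threshold of 0.01 are hypothetical and are not taken from the described embodiments.

```python
def filter_by_population_frequency(variants: list, max_population_af: float = 0.01) -> list:
    """Keep variants whose healthy population allele frequency does not exceed the threshold."""
    return [v for v in variants if v["population_af"] <= max_population_af]


# Hypothetical candidate variants with healthy population allele frequencies.
candidates = [
    {"id": "var_1", "population_af": 0.0},
    {"id": "var_2", "population_af": 0.15},
    {"id": "var_3", "population_af": 0.004},
]
kept_ids = [v["id"] for v in filter_by_population_frequency(candidates)]
print(kept_ids)
```

Variants that are common in a healthy population are dropped from the candidate list, and the remaining variants proceed to the other TSV criteria.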
  • Process 400 next continues with act 406, identifying a subset of the plurality of TSVs for use in the patient-specific panel for detecting MRD in the patient.
  • a subset of the plurality of TSVs may refer to TSVs whose presence in a biological sample (e.g., plasma) is indicative of MRD.
  • a subset of the plurality of TSVs is described herein including in the section, “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset.”
  • a patient-specific panel may be any suitable patient specific panel including as described herein in the section “Patient-specific panel” and with reference to FIG.1.
  • The patient-specific panel may comprise techniques (e.g., PCR primers) for detecting the subset of the plurality of TSVs in a biological sample (e.g., plasma from a patient). Any suitable method for detecting MRD may be used including but not limited to methods described herein including in the section “Minimal Residual Disease (MRD)”.
  • Act 406 of process 400 comprises three acts: 408, 410 and 412.
  • Act 408 comprises generating for each of at least some of the plurality of TSVs a respective set of features to obtain a plurality of sets of features.
  • Features for use in the respective set of features may comprise any suitable features.
  • Suitable features may be features that are indicative of the detectability of a TSV in a patient (e.g., in ctDNA of a patient).
  • the respective set of features may include, but is not limited to, the features described herein including in the section “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset”.
  • Generating for each of at least some of the plurality of TSVs a respective set of features to obtain a plurality of sets of features may comprise generating the respective set of features using any suitable method including but not limited to the methods described herein, including with reference to FIG.2B.
  • Generating the plurality of sets of features may comprise generating using the variant data described herein.
  • Generating the plurality of sets of features may produce a corresponding input for the trained machine learning algorithm comprising features associated with at least some of the TSVs (e.g., each of the TSVs of the plurality of TSVs).
  • Act 410 of process 400 comprises processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores indicative of the predicted detectability of a corresponding TSV.
  • Processing the plurality of sets of features may comprise: (1) inputting a set of features of the plurality of sets of features into the trained ML algorithm and (2) outputting a score indicative of the predicted detectability of a TSV associated with the set of features.
  • Processing the plurality of sets of features may be performed using any suitable trained machine learning model, including but not limited to the machine learning models described herein, including with reference to FIG.2B.
  • For example, processing may be performed using a trained random forest model.
  • the trained machine learning model may be trained using any suitable method including but not limited to the methods described herein including with reference to FIG.2B.
  • the training may be performed using previous MRD data collected from patients using patient-specific panels and sequencing (e.g., patients previously treated for cancer).
  • Act 412 of process 400 comprises selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient specific panel. Selecting the TSVs may be performed using any suitable method including but not limited to the methods described herein including with reference to FIG.2B and FIG.3.
  • Selecting may include ranking the TSVs according to score and then selecting the top-ranking TSVs.
  • The “top” may be determined based on the number of TSVs needed in the patient-specific panel. For example, in a specific patient with a specific cancer/tumor type, 50 TSVs may be needed in the patient-specific panel to produce an accurate indication of MRD. In a different patient, 100 TSVs may be needed. In some embodiments, the number of TSVs may be selected based on the type of cancer. Selecting TSVs may also include selecting TSVs with scores that exceed a threshold as described herein. TSVs above the threshold may all be sufficiently detectable to include in the patient-specific panel.
  • For example, 50 TSVs may be needed for the patient-specific panel, but there may be 100 TSVs with scores that exceed the threshold; in that case, any 50 of the TSVs that exceed the threshold may be selected.
  • the TSVs selected from the TSVs that exceed the threshold may be selected based on the cancer type. In some embodiments, the TSVs selected from the TSVs that exceed the threshold may be selected based on patient-specific attributes.
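  • The selection in act 412 (rank by score, apply a score threshold, keep the number of TSVs the panel needs) can be sketched as below. The function name `select_tsvs`, the TSV labels, the scores, and the threshold value are assumptions made for illustration only.

```python
from typing import Dict, List


def select_tsvs(scores: Dict[str, float], panel_size: int, min_score: float) -> List[str]:
    """Rank TSVs by score and keep up to panel_size TSVs whose score exceeds min_score."""
    eligible = [(tsv, score) for tsv, score in scores.items() if score > min_score]
    ranked = sorted(eligible, key=lambda item: item[1], reverse=True)
    return [tsv for tsv, _ in ranked[:panel_size]]


# Hypothetical detectability scores produced by a trained model.
scores = {"tsv_a": 0.91, "tsv_b": 0.42, "tsv_c": 0.77, "tsv_d": 0.88}

panel = select_tsvs(scores, panel_size=2, min_score=0.5)
print(panel)
```

  • The sketch takes the highest-scoring eligible TSVs. The text also allows any TSVs above the threshold to be chosen, for example based on cancer type or patient-specific attributes, so the tie-breaking rule here is only one possible design choice.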
  • FIG.5 is a diagram depicting an illustrative technique 500 for identifying the subset of the plurality of TSVs for use in a patient-specific panel, according to some embodiments of the technology described herein. Technique 500 uses variants identified by sequencing non-tumor cells and tumor cells of the patient to identify TSVs and exclude non-tumor-specific variants, scores the TSVs using a trained machine learning model, and selects TSVs for the patient-specific panel using the scores.
  • variants from whole exome sequencing of non-tumor cells (502) and whole exome sequencing of tumor cells (504) are obtained.
  • Obtaining variants in steps 502 and 504 may comprise obtaining variant data using any suitable method including methods described herein including in the section “Variant Data” and in reference to FIG.1.
  • obtaining variant data may comprise
  • Non-tumor-specific variants 506 are excluded from consideration 518 and tumor-specific variants 510 are identified.
  • Tumor-specific variants 510 may be identified using any suitable methods, including but not limited to methods described herein including in the section “Tumor Specific Variants (TSVs) and Features For Selecting TSVs.”
  • non-tumor specific variants 506 may be variants that are not tumor specific variants 510. However, this is not always the case.
  • Lower tier variants 508 may refer to variants that were placed into lower tiers when identifying tumor-specific variants as described herein, and were subsequently excluded from consideration with the non-tumor specific variants. For example, variants 502 may be tiered according to detectability of each variant. If 50 TSVs are needed at step 510 then 50 variants may be selected from among the top tiers of variants (e.g., the top 50 variants), whereas the lower tier variants are excluded from consideration 518. Methods for tiering variants and identifying lower-tier variants are described herein including with reference to FIG.2A. After selecting TSVs, technique 500 involves obtaining scored TSVs 512.
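  • The exclusion steps of technique 500 can be sketched as a set difference followed by a tier cut. This is a simplified illustration under stated assumptions: variants are represented as hypothetical string identifiers, a variant seen in non-tumor cells is treated as non-tumor-specific, and tier numbers (1 = best) are invented for the example.

```python
from typing import Dict, List, Set


def tumor_specific(tumor: Set[str], normal: Set[str]) -> Set[str]:
    """Variants observed in tumor cells but not in non-tumor cells."""
    return tumor - normal


def keep_top_tiers(tiers: Dict[str, int], needed: int) -> List[str]:
    """Order variants by tier (1 = best), break ties by name, and keep `needed` of them."""
    ordered = sorted(tiers, key=lambda variant: (tiers[variant], variant))
    return ordered[:needed]


# Hypothetical variant calls from tumor and non-tumor sequencing.
tumor_variants = {"v1", "v2", "v3", "v4"}
normal_variants = {"v2", "v9"}

tsv_candidates = tumor_specific(tumor_variants, normal_variants)

# Hypothetical tier assignments for the remaining candidates.
all_tiers = {"v1": 2, "v3": 1, "v4": 3}
selected = keep_top_tiers({v: all_tiers[v] for v in tsv_candidates}, needed=2)
print(selected)
```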
  • TSV scoring may be performed using any suitable method including but not limited to any method described herein including with reference to FIG.2B.
  • Scoring variants may comprise two steps: (1) generating sets of features associated with at least some of the TSVs and (2) processing each set of features using a trained machine learning model that outputs a score indicative of the predicted detectability of the TSV.
  • Generating features may comprise generating features associated with each TSV (e.g., TSV mutation type, allele frequency etc.) using any suitable method including but not limited to generating features as described herein including as described in reference to FIG.2B.
  • Processing each set of features using a trained machine learning model may comprise processing using a nonlinear trained machine learning model (e.g., a random forest model).
  • the trained machine learning model may be trained using any suitable method including but not limited to the methods described herein, including with reference to FIG.2B.
  • The trained machine learning model may be trained with data comprising patient-specific panel data previously collected when monitoring patients for MRD and data indicating the MRD status of each patient (e.g., positive for MRD or negative for MRD). Processing the set of features using the trained machine learning model may produce scores that are indicative of the detectability of the TSVs in a biological sample (e.g., plasma) of a patient.
  • The scored TSVs may be selected 514 for a panel 516 using any suitable method, including but not limited to the methods described herein, including with reference to FIG.3.
  • the scored TSVs 512 may be ranked according to score and then a TSV may be selected 514 for a panel 516 if the TSV score exceeds a threshold.
  • The TSVs with the top scores may be selected for the patient-specific panel 516.
  • monitoring a patient for MRD using a patient specific panel as disclosed herein comprises sequencing nucleic acids derived from circulating cells. In some embodiments, monitoring a patient for MRD using a patient specific panel as disclosed herein comprises sequencing cfDNA and/or cfRNA. In some embodiments, monitoring a patient for MRD using a patient specific panel as disclosed herein comprises sequencing tumor DNA and/or tumor RNA. In some embodiments, monitoring a patient for MRD using a patient specific panel as disclosed herein comprises sequencing ctDNA.
  • This disclosure describes a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: generating sequence data from the biological sample of the patient, the generating comprising contacting the biological sample with primers identified for the subset of the plurality of TSVs identified using a method of designing a patient-specific panel as described herein; detecting TSVs using the sequence data; and determining, using the detected TSVs, whether the biological sample provides an indication of MRD (e.g., as described herein).
  • the biological sample is a biological sample from the patient that is expected to contain ctDNA when MRD is present.
  • the biological sample is a fluid, secretion, or mucosae of the patient (e.g., as described herein).
  • the biological sample is a blood, serum or plasma sample of the patient.
  • determining whether the biological sample provides an indication of MRD comprises using a statistical test to compare the sequencing error rate and the allele frequency of the TSV in tumor-derived polynucleotides.
  • Determining whether the biological sample provides an indication of MRD comprises using a statistical test to compare the sequencing error rate and the allele frequency of the TSV in circulating tumor DNA (ctDNA) (e.g., as described herein and in reference to the section entitled “Minimal Residual Disease (MRD)”).
  • determining an indication of MRD comprises determining if the total number of times all of the TSVs are observed in sequence data of a biological sample of the patient exceeds the expected number of TSVs to be observed due to error associated with sample preparation (e.g., DNA extraction and amplification with primers of the patient-specific panel) and detection (e.g., sequencing).
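  • One way to test whether the pooled TSV observations exceed what error alone would produce is a one-sided Poisson test, sketched below. The text specifies only that a statistical test is used; the Poisson error model, the significance level `alpha`, the per-read error rate, and the read counts are all assumptions introduced for this example. The tail sum is computed directly and is intended for small counts only.

```python
import math
from typing import Tuple


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def mrd_indicated(observed_tsv_reads: int, reads_at_tsv_sites: int,
                  error_rate: float, alpha: float = 0.01) -> Tuple[bool, float]:
    """Compare pooled TSV observations with the count expected from error alone."""
    expected_from_error = reads_at_tsv_sites * error_rate
    p_value = poisson_upper_tail(observed_tsv_reads, expected_from_error)
    return p_value < alpha, p_value


# Hypothetical sample: 1,000,000 reads over all panel sites, error rate 1e-5 per read,
# so about 10 TSV-supporting reads are expected from error alone.
positive, p_pos = mrd_indicated(30, 1_000_000, 1e-5)
negative, p_neg = mrd_indicated(12, 1_000_000, 1e-5)
print(positive, negative)
```

  • Pooling observations across all TSVs in the panel, rather than testing each TSV separately, is what lets a signal emerge when each individual variant is supported by only a few reads.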
  • the method further comprises administering a therapeutic (e.g., a cancer therapeutic) to a patient with a positive indication of MRD or continuing MRD monitoring (e.g., as described herein) in a patient with a negative indication of MRD.
  • the method further comprises administering a therapeutic (e.g., a cancer therapeutic) to a patient with a positive indication of MRD or collecting one or more additional samples (e.g., blood) from the patient with a negative indication of MRD (e.g., for use in determining an indication of MRD).
  • the one or more additional biological samples may be collected from the patient over a specified time interval. For example, when a first biological sample from a patient does not have an indication of MRD then a second biological sample may be collected 6 months after the first biological sample is collected.
  • The time interval between collecting biological samples may be any suitable time interval. Suitable time intervals may be determined based on the type of MRD being monitored. For example, MRD associated with faster-growing cancers/tumors may be monitored in shorter time intervals than MRD associated with slower-growing cancers/tumors. In some embodiments, the time interval is determined by the skilled person. In some embodiments, the time interval between collecting biological samples may be 1 month, 2 months, 3 months, 6 months, 1 year or more. The time interval need not be consistent over time.
  • a biological sample may be collected every month for the first six months after cancer treatment and then every 6 months thereafter.
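  • The example schedule above (monthly for the first six months, then every six months) can be written as a small helper. The function name, its parameters, and the 24-month horizon are hypothetical; the sketch only shows that the interval need not be constant over time.

```python
from typing import List


def collection_months(dense_months: int, dense_step: int,
                      sparse_step: int, horizon: int) -> List[int]:
    """Months after treatment at which a sample is collected."""
    # Frequent sampling during the initial period.
    months = list(range(dense_step, dense_months + 1, dense_step))
    # Less frequent sampling afterwards, up to the monitoring horizon.
    start = (months[-1] if months else 0) + sparse_step
    months.extend(range(start, horizon + 1, sparse_step))
    return months


schedule = collection_months(dense_months=6, dense_step=1, sparse_step=6, horizon=24)
print(schedule)
```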
  • The method further comprises treating cancer (e.g., by administering an anti-cancer therapeutic that is expected to treat the cancer of the patient) in a patient with a positive indication of MRD (e.g., the method indicates the patient has, may have, or possibly will have disease relapse (e.g., cancer relapse)) or continuing MRD monitoring (e.g., as described herein) in a patient with a negative indication of MRD (e.g., the method indicates that the patient does not have cancer relapse).
  • Determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with sensitivity greater than a threshold probability of detecting MRD in a patient who has MRD.
  • The threshold may be between 0.8 and 1.
  • the threshold may be between 0.85 and 1.
  • the threshold may be between 0.85 and 0.97.
  • the threshold may be 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, or 0.99.
  • probability of detecting MRD in a patient who has MRD is a probability of detecting MRD 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 weeks prior to detection of a new tumor by surveillance imaging.
  • probability of detecting MRD in a patient who has MRD is a probability of detecting MRD at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9 or at least 10 weeks prior to detection of a new tumor by surveillance imaging. In some embodiments, probability of detecting MRD in a patient who has MRD is a probability of detecting MRD at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9 or at least 10 months prior to detection of a new tumor by surveillance imaging.
  • determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a threshold probability of not detecting MRD in a patient that does not have MRD.
  • the threshold may be between 0.95 and 1.
  • the threshold may be between 0.98 and 1.
  • the threshold may be 0.95, 0.96, 0.97, 0.98, 0.99 or 1.
  • Administering the therapeutic comprises administering a therapeutic designed to treat the disease of the patient (e.g., a therapeutic designed, known or expected to treat the cancer of the patient).
  • Anti-cancer therapeutics are well known in the art.
  • Pantziarka et al. describes an open access database of licensed cancer drugs. Pantziarka et al. (2021) Frontiers in Pharmacology 12:627574.
  • The National Cancer Institute maintains a list of approved cancer drugs for treating a variety of different cancers (A to Z List of Cancer Drugs [online] [retrieved on Dec. 5, 2022]; retrieved from the internet <URL:https://www.cancer.gov/about-cancer/treatment/drugs>).
  • administering a therapeutic comprising administering one or more of a chemotherapeutic, an immunotherapeutic (e.g., an antibody), a cellular therapeutic (e.g., a CAR-T cell), a pain
  • administering a therapeutic comprises performing surgery on the patient (e.g., surgery to remove a tumor). Administering the therapeutic may be performed by any suitable means.
  • the method comprises selecting a patient for administration of a therapeutic when the patient has a positive indication of MRD (e.g., as determined using a patient-specific panel as described herein) and repeating the method with one or more further biological samples from the patient for use in monitoring when the patient has a negative indication of MRD.
  • This disclosure provides a method comprising: designing a patient-specific panel as described herein; using the patient-specific panel to determine whether a biological sample of a patient (e.g., plasma) is indicative of MRD; and either (1) administering a therapeutic (e.g., a therapeutic for use in treating the cancer/tumor of the patient) to the patient if the biological sample is indicative of MRD or (2) continuing to monitor the patient for MRD (e.g., using the patient-specific panel).
  • the method comprises treating the patient using a therapeutic.
  • Example 1: Training the Machine Learning Model
  • Data Used for Training
  • The random forest classifier model was trained using patient ctDNA data (data from sequencing circulating DNA comprising ctDNA) from previously generated patient-specific panels (e.g., the patient-specific panels were generated using a different model) targeting up to 200 tumor-specific variants each. Only samples in which MRD was detected were used for training the model. Data included data from 57 patients previously diagnosed with lung cancer and data from 499 patients previously diagnosed with melanoma.
  • Summary of Training Criteria
  • Given that MRD was present in the patient ctDNA data, the model was trained to predict whether each variant was detected above a baseline level, using the set of input features in FIG.6. To reduce the chance of overfitting, cross validation was used where, for each iteration, a portion of the panels was selected to be used for training the model, and then the remaining panels were used to assess either the accuracy of the model in predicting
  • Hyperparameters were initially tuned by varying one hyperparameter at a time across a range of values while holding the other hyperparameters constant at their default values. A grid search was used to explore the space around these initial values, maximizing the balanced accuracy of the model.
  • the hyperparameters tuned in this fashion were: the number of estimators included in the random forest, the minimum number of samples in a node that was required to further split that node, the maximum depth of each tree in the forest, the maximum number of features to consider at each bifurcation in the tree, and the minimum number of samples required to be in a terminal leaf of the tree.
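  • The grid search over hyperparameters, scored by balanced accuracy, can be sketched generically as follows. To keep the example self-contained, a toy threshold rule stands in for the random forest, and its two hyperparameters (`cutoff`, `invert`) are hypothetical; in practice the grid keys would be the five random forest hyperparameters listed above and `evaluate` would train and cross-validate the model. The data are invented.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity; assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return 0.5 * (tp / positives + tn / negatives)


def grid_search(grid: Dict[str, List[object]],
                evaluate: Callable[[Dict[str, object]], float]) -> Tuple[Dict[str, object], float]:
    """Evaluate every combination in the grid and return the best-scoring one."""
    best_params, best_score = {}, float("-inf")
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data: one hypothetical feature per variant, label 1 = variant was detected.
features = [0.1, 0.4, 0.45, 0.8, 0.9, 0.2]
labels = [0, 0, 1, 1, 1, 0]


def evaluate(params: Dict[str, object]) -> float:
    """Toy stand-in for training and validating a model with the given hyperparameters."""
    preds = [1 if x >= params["cutoff"] else 0 for x in features]
    if params["invert"]:
        preds = [1 - p for p in preds]
    return balanced_accuracy(labels, preds)


best_params, best_score = grid_search(
    {"cutoff": [0.3, 0.42, 0.7], "invert": [False, True]}, evaluate
)
print(best_params, best_score)
```

  • Balanced accuracy weights detected and undetected variants equally, which matters when one class is much more common than the other in the training panels.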
  • Feature Selection
  • An important aspect of training a machine learning model to predict the detectability of a TSV in a biological sample (e.g., plasma) of a patient is identifying informative features for use in training the machine learning model.
  • One way to accomplish this is training a machine learning model using a large number of possible features and then testing the model to determine which features are indicative of TSV detectability using SHapley Additive exPlanations (SHAP) values.
  • SHAP values show a given feature’s effect on the predicted outcome for a given sample.
  • The random forest classifier model was trained using a plurality of features that may be predictive of the detectability of a TSV, including features that are used by various rules-based algorithms in determining the detectability of a TSV, as well as features with a biological basis for impacting the detectability of a TSV over time (FIG.6). Training was performed using variant data from cancer patients (e.g., lung cancer patients and melanoma patients).
  • Characterizations of the various mutation types were also expected to be predictive for sequence biases that might be present and interfere with predictions, but were generally found to be less informative for training the machine learning model.
  • Primer quality results from primer design also were less predictive than expected, though primer quality was expected to be a highly predictive feature because primer quality is related to the ability to amplify and detect the TSV. This may be a result of the primer quality requirements established elsewhere in the pipeline.
  • The wildtype max Ent score (i.e., a score indicative of a splice site near the TSV in the genomes of wildtype (non-tumor) cells) was expected to give context to the “variant max Ent score” feature that was also included in the model, but ultimately the wildtype max Ent score did not enhance the performance of the model.
  • 15 features were selected for training the machine learning model (FIG.7).
  • the “FDP” feature was found to be highly correlated with "NormalFDP" and therefore was not included in the model.
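  • Dropping one of a pair of highly correlated features can be sketched with a Pearson correlation check. The 0.95 cutoff, the feature values, and the extra `AF` column are hypothetical; the text does not state which correlation measure or cutoff was used, so this is only one plausible way to detect that “FDP” is redundant given “NormalFDP”.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_correlated(columns: Dict[str, List[float]], threshold: float) -> List[str]:
    """Keep a feature only if it is not highly correlated with any feature already kept."""
    kept: List[str] = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for four variants.
columns = {
    "NormalFDP": [100.0, 150.0, 200.0, 250.0],
    "FDP": [110.0, 160.0, 205.0, 260.0],
    "AF": [0.30, 0.05, 0.22, 0.10],
}

kept = drop_correlated(columns, threshold=0.95)
print(kept)
```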
  • FIG.8 further shows the overall impact of each of the selected features on the SHAP value, where a broader distribution of SHAP values indicates a stronger effect on predictability.
  • the model was explicitly trained to predict whether a TSV would be detected in a sample comprising ctDNA of an MRD positive patient.
  • an additional objective was to rank TSVs for inclusion in a patient-specific panel such that monitoring the TSVs of the patient-specific panel in a biological sample of a patient gives an accurate indication of whether MRD is present or absent in the patient.
  • The final evaluation of model accuracy used a cohort of previously analyzed samples based on panels targeting up to 200 tumor-specific variants. For each of the panels, the trained machine learning model was used to assign a probability to each variant. This probability is the predicted likelihood that the variant will be observed in a biological
  • the variants are then ranked by this probability, and the subset of most likely variants is selected as a “subpanel”. In this case the top 50 variants were used, but other subpanel sizes (for example 16, 100, etc.) could be used.
  • Once the subpanel was generated, the original sample comprising ctDNA was reanalyzed in silico to determine the MRD status based on this subpanel. This MRD result was then compared to the original result. If both the original result from the full panel and the new subpanel result were positive for MRD, the result was considered a true positive. Conversely, if both the original and subpanel results were negative for MRD, the result was considered a true negative.
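  • The comparison of full-panel and subpanel results can be sketched as a concordance tally. The text defines only the true positive and true negative cases; treating a positive full-panel result with a negative subpanel result as a false negative, and the reverse as a false positive, is the conventional reading and is an assumption here. The result pairs are invented.

```python
from typing import Dict, List, Tuple


def classify(full_panel_positive: bool, subpanel_positive: bool) -> str:
    """Label a subpanel call against the full-panel call used as reference."""
    if full_panel_positive and subpanel_positive:
        return "TP"
    if not full_panel_positive and not subpanel_positive:
        return "TN"
    # Assumed definitions for the discordant cases.
    return "FN" if full_panel_positive else "FP"


def sensitivity_specificity(pairs: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    counts: Dict[str, int] = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for full_panel, subpanel in pairs:
        counts[classify(full_panel, subpanel)] += 1
    sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
    specificity = counts["TN"] / (counts["TN"] + counts["FP"])
    return sensitivity, specificity


# Hypothetical (full-panel result, subpanel result) pairs for six samples.
results = [(True, True), (True, True), (True, False),
           (False, False), (False, False), (True, True)]

sensitivity, specificity = sensitivity_specificity(results)
print(sensitivity, specificity)
```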
  • Example 2: Measuring the relationship between feature values and TSV detectability
  • Another important consideration when selecting a machine learning model is selecting a model that can capture the relationship between a feature and the desired prediction (e.g., scoring the detectability of a TSV). For example, the random forest classifier model used in Example 1 is capable of capturing complicated nonlinear relationships.
  • One way to observe the relationship between a feature and a desired prediction is using SHAP plots, which compare the feature values to corresponding SHAP values (FIGs.9A-9O).
  • a positive SHAP value indicates the corresponding feature value contributes to a higher prediction that the TSV will be detectable.
  • a negative SHAP value indicates the corresponding feature value leads the model to predict the TSV is not detectable.
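  • The meaning of a positive or negative SHAP value can be illustrated by computing exact Shapley values for a tiny model. This sketch is not the SHAP library and not the disclosed model: it uses a single hypothetical baseline input in place of a background dataset, a made-up two-feature scoring function, and brute-force enumeration that is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence, Set


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(present: Set[int]) -> float:
        # Features in `present` take their real value; the rest take the baseline value.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(mixed)

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi += weight * (value(present | {i}) - value(present))
        attributions.append(phi)
    return attributions


def toy_score(features: Sequence[float]) -> float:
    """Hypothetical detectability score with an interaction between two features."""
    return 2.0 * features[0] + features[0] * features[1]


phis = shapley_values(toy_score, x=[1.0, 2.0], baseline=[0.0, 0.0])
print(phis)
```

  • The attributions sum to the difference between the score for the variant and the score for the baseline, so a positive value for a feature means that feature value pushed the predicted detectability up, and a negative value means it pushed the prediction down.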
  • Results in FIGs.9A-9O show a variety of relationships between the features and the SHAP values, including a relatively simple monotonically decreasing relationship like that of the phastCons conservation score (FIG.
  • Example 3: Comparing random forest-based and rules-based patient-specific panel design
  • The sensitivity and specificity of the random forest-based patient-specific panel and a rules-based patient-specific panel were compared. Sensitivity and specificity were calculated
  • The rules-based patient-specific panel was produced using similar features as the random forest-based patient-specific panel (e.g., at least some of the features of FIG. 6). Results show that the random forest-based patient-specific panel had significantly higher sensitivity (an average increase of about 6%) and specificity (an average increase of about 1%) than the rules-based patient-specific panel in detecting MRD (FIG.10A). Next it was determined whether a model trained using data from a first type of cancer could be used to predict MRD in samples from a second type of cancer. The random forest classifier model was trained using lung cancer MRD data and tested on melanoma MRD data.
  • An illustrative implementation of a computer system 1100 that may be used in connection with any of the embodiments of the technology described herein (e.g., such as the processes of FIGS.2A-2B and 4) is shown in FIG.11.
  • The computer system 1100 includes one or more processors 1104 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 1110 and one or more non-volatile storage media 1106).
  • the processor 1104 may control writing data to and reading data from the memory 1110 and the non-volatile storage device 1106 in any suitable manner, as the aspects of the technology described herein are not limited to any particular techniques for writing or reading data.
  • The processor 1104 may execute one or more processor-executable instructions stored in one or more non-transitory computer-readable storage media (e.g., the memory 1110), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 1104.
  • Computer system 1100 may also include a network input/output (I/O) interface 1102 via which the computer system may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 1108, via which the computer system may provide output to and receive input from a user.
  • the user I/O interfaces 1108 may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.
  • the above-described embodiments can be implemented in any of numerous ways.
  • the embodiments may be implemented using hardware, software, or a combination thereof.
  • the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices.
  • One implementation of the embodiments described herein comprises at least one non-transitory computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage medium) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the above-described functions of one or more embodiments (e.g., part of or all of the processes described above with reference to FIG.2A, FIG.2B, and FIG.4).
  • the computer-readable medium may be transportable such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques described herein.
  • the reference to a computer program which, when executed, performs any of the above-described functions is not limited to an application program running on a host computer. Rather, the terms computer program and software are used herein in a generic sense to reference any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instruction) that can be employed to program one or more processors to implement aspects of the techniques described herein.
  • any combination of two or more features, systems, articles, materials, kits, and/or methods described herein, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.
  • One or more aspects and embodiments of the present disclosure involving the performance of processes or methods may utilize program instructions executable by a device (e.g., a computer, a processor, or other device) to perform, or control performance of, the processes or methods.
  • inventive concepts may be embodied as a computer readable storage medium (or multiple computer readable storage media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement one or more of the various embodiments described above.
  • the computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various ones of the aspects described above.
  • computer readable media may be non-transitory media.
  • The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects as described above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the present disclosure.
  • Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • data structures may be stored in computer-readable media in any suitable form.
  • Data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that convey a relationship between the fields.
  • any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish a relationship between data elements.
  • the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
  • a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer, as non-limiting examples.
  • a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smartphone, a tablet, or any other suitable portable or fixed electronic device.
  • a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible formats.
  • Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN), or the Internet.
  • networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
  • some aspects may be embodied as one or more methods. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments. All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
  • a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
  • the phrase “at least one,” in reference to a list of one or more elements should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • At least one of A and B can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
  • The terms “approximately,” “substantially,” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments.
  • the terms “approximately,” “substantially,” and “about” may include the target value.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Pathology (AREA)
  • Biotechnology (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Zoology (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Wood Science & Technology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Immunology (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Hospice & Palliative Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oncology (AREA)
  • Microbiology (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Techniques for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient. The techniques include: obtaining variant data indicative of a plurality of variants of the patient; identifying, using the variant data, a plurality of tumor-specific variants (TSVs) for the patient, the identifying comprising selecting variants from among the plurality of variants based on their allele frequencies in the tumor cells of the patient and in the non-tumor cells of the patient; and identifying a subset of the plurality of the TSVs for use in the patient-specific panel by: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain sets of features; processing the sets of features using a trained machine learning model to obtain corresponding scores, each of the scores indicative of the predicted detectability of a corresponding TSV in circulating-tumor DNA of the patient; and selecting, using the scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of the TSVs.

Description

TECHNIQUES FOR DESIGNING PATIENT-SPECIFIC PANELS AND METHODS OF USE THEREOF FOR DETECTING MINIMAL RESIDUAL DISEASE

BACKGROUND

A central challenge for monitoring cancer patients during remission is identifying minimal residual disease (MRD), which is often indicative of cancer relapse. One strategy for identifying MRD is monitoring biological samples from a patient for circulating tumor DNA (ctDNA), which can be shed by cancer cells.

SUMMARY

In some aspects, this disclosure describes a method for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient. The method may comprise: using at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants present in tumor cells of the patient, the variant data being derived from at least one biological sample obtained from the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; and identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel.
In some embodiments, the method further comprises identifying primers for use in detecting presence, in a biological sample, of at least some variants in the subset of the plurality of TSVs. In some embodiments, obtaining the variant data indicative of the plurality of variants of the patient comprises: obtaining at least one data structure encoding variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant
sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data. In some embodiments, the variant sequence context data comprises sequence context homopolymer data, sequence context splice site data, sequence context mutation data, and/or sequence context conservation data. In some embodiments, obtaining variant data indicative of a plurality of variants of the patient comprises obtaining the variant data previously generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient, optionally wherein obtaining variant data comprises sequencing the at least one biological sample obtained from the patient and analyzing sequencing data produced by the sequencing. In some embodiments, the variant data indicative of a plurality of variants present in tumor cells of the patient comprises data characterizing a variant derived from sequencing data from a sample comprising genomic material derived from tumor cells of the patient. In some embodiments, sequencing the at least one biological sample comprises sequencing using whole genome sequencing (WGS) or whole exome sequencing (WES). In some embodiments, obtaining variant data comprises obtaining sequence data of a tumor cell sample and a non-tumor cell sample of the patient. In some embodiments, the tumor cell sample comprises melanoma cells or lung cancer cells. In some embodiments, obtaining the variant data indicative of the plurality of variants of the patient comprises using at least one variant caller to identify the plurality of variants. In some embodiments, obtaining the variant data indicative of the plurality of variants of the patient comprises analyzing sequence data generated by sequencing the tumor cells obtained from the patient and using at least one variant caller to identify the plurality of variants.
In some embodiments, identifying the plurality of TSVs comprises: selecting variants from among the plurality of variants using at least one feature selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency. In some embodiments, identifying the plurality of TSVs comprises identifying the plurality of TSVs in a biological sample of a tumor comprising the tumor cells of the patient. In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least two features described herein.
In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least three features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least four features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using at least five features described herein. In some embodiments, identifying the plurality of TSVs comprises selecting variants using all the features in the group consisting of variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency. In some embodiments, identifying the plurality of TSVs comprises selecting variants using variant bi-directional support, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, the threshold number of times is between 2 and 15. In some embodiments, identifying the plurality of TSVs comprises selecting variants using the healthy population variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant has a variant allele frequency in a healthy population, as defined by at least one genomic database, of less than a threshold percentage. In some embodiments, the threshold percentage is 1%.
In some embodiments, identifying the plurality of TSVs comprises selecting variants using sequence context homopolymer size, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether a homopolymer sequence exceeding a threshold size is present between the variant and a binding site of a primer designed to detect presence of the variant. In some embodiments, selecting variants using sequence context homopolymer size comprises selecting variants using sequence data derived from a biological sample of a tumor comprising the tumor cells of the patient. In some embodiments, the threshold size is 6 nucleotides. In some embodiments, identifying the plurality of TSVs comprises selecting variants using sequence coverage in non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold. In some embodiments, the threshold is between 45X and 100X. In some embodiments, identifying
the plurality of TSVs comprises selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the ratio of the variant exceeds a threshold ratio. In some embodiments, identifying the plurality of TSVs comprises determining the ratio of variant allele frequency between sequence data of a biological sample of a tumor comprising the tumor cells of the patient and sequence data of non-tumor cells of the patient. In some embodiments, the threshold ratio is between a ratio of 20:1 and 10:1 of tumor cell variant allele frequency to non-tumor cell variant allele frequency. In some embodiments, identifying the plurality of TSVs comprises selecting variants using the tumor cell variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold. In some embodiments, selecting variants using the tumor cell variant allele frequency comprises selecting using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. In some embodiments, the threshold is a tumor cell variant allele frequency between 0.05 and 0.1. In some embodiments, generating the set of features comprises generating: at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
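The six selection criteria described above can be combined into a single pass/fail filter. The following is a minimal sketch, not the disclosed implementation: the `Variant` fields, the function name `is_tsv`, and the specific threshold constants are hypothetical, with each constant picked from within the ranges stated in the text (for example, 2 strand observations from the "between 2 and 15" range). Treating a long homopolymer as a reason to exclude a variant, and treating a zero non-tumor allele frequency as passing the ratio test, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All field names are hypothetical; they mirror the criteria in the text.
    name: str
    plus_alt_reads: int      # observations of the variant on plus-strand reads
    minus_alt_reads: int     # observations of the variant on minus-strand reads
    population_af: float     # allele frequency in a healthy-population database
    max_homopolymer: int     # longest homopolymer between variant and primer site
    normal_coverage: int     # sequencing coverage in non-tumor cells
    tumor_vaf: float         # variant allele frequency in tumor cells
    normal_vaf: float        # variant allele frequency in non-tumor cells


# Illustrative thresholds chosen from the ranges given in the text.
MIN_STRAND_READS = 2         # text: between 2 and 15
MAX_POPULATION_AF = 0.01     # text: less than 1%
MAX_HOMOPOLYMER = 6          # text: threshold size of 6 nucleotides
MIN_NORMAL_COVERAGE = 45     # text: between 45X and 100X
MIN_VAF_RATIO = 10.0         # text: between 10:1 and 20:1
MIN_TUMOR_VAF = 0.05         # text: between 0.05 and 0.1


def is_tsv(v: Variant) -> bool:
    """Return True when the variant passes every illustrative filter."""
    # Bi-directional support: seen on both strands at least the threshold number of times.
    if v.plus_alt_reads < MIN_STRAND_READS or v.minus_alt_reads < MIN_STRAND_READS:
        return False
    # Rare (or absent) in the healthy population.
    if v.population_af >= MAX_POPULATION_AF:
        return False
    # Assumption: a homopolymer exceeding the threshold size disqualifies the variant.
    if v.max_homopolymer > MAX_HOMOPOLYMER:
        return False
    # Non-tumor coverage must exceed the threshold.
    if v.normal_coverage <= MIN_NORMAL_COVERAGE:
        return False
    # Tumor allele frequency must exceed the threshold.
    if v.tumor_vaf <= MIN_TUMOR_VAF:
        return False
    # Tumor-to-normal ratio must exceed the threshold; a zero normal VAF
    # is treated here as an unbounded ratio (assumption).
    if v.normal_vaf > 0 and v.tumor_vaf / v.normal_vaf <= MIN_VAF_RATIO:
        return False
    return True
```

In this sketch a likely germline variant (similar allele frequency in tumor and non-tumor cells) fails the ratio test, while a variant seen almost exclusively in tumor cells passes.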
In some embodiments, the plurality of TSVs comprises a first TSV, wherein generating the respective set of features comprises generating a first set of features for the first TSV, and wherein generating the first set of features for the first TSV comprises generating at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature. In some embodiments, generating the first set of features for the first TSV comprises generating the at least one sequencing coverage feature for the first TSV, and wherein generating the at least one sequencing coverage feature comprises determining sequencing depth of coverage of plus strands and minus strands for the first TSV, and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV. In some embodiments, generating the at least one sequencing coverage feature for the first TSV
comprises generating the at least one sequencing coverage feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. In some embodiments, generating the first set of features for the first TSV comprises generating the at least one allele frequency feature, and wherein generating the at least one allele frequency feature comprises determining non-tumor cell depth coverage for the first TSV, a number of observations of the first TSV in tumor cells of the patient, and/or a tumor allele frequency of the first TSV. In some embodiments, generating the at least one allele frequency feature comprises generating the at least one allele frequency feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. In some embodiments, generating the first set of features for the first TSV comprises generating the at least one primer feature, and wherein generating the at least one primer feature comprises determining a distance between a first TSV and a binding site for a primer designed to detect the first TSV. In some embodiments, generating the at least one primer feature comprises determining a distance between a first TSV and a PCR primer designed to amplify a portion of a polynucleotide comprising the first TSV. In some embodiments, generating the at least one primer feature comprises determining a maximum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a maximum distance between the first TSV and a binding site for a second primer, different from the first primer, designed to detect the first TSV.
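As an illustration of the coverage and primer features described above, the sketch below computes a plus/minus strand depth ratio and the distance between a variant position and a primer binding site. The function names, the dictionary keys, and the choice to take the maximum and minimum over several candidate binding sites are assumptions made for this example; the text does not specify how the distances are aggregated.

```python
def strand_coverage_features(plus_depth: int, minus_depth: int) -> dict:
    """Per-strand depth of coverage plus the plus/minus depth ratio."""
    # A zero minus-strand depth is reported as an infinite ratio (assumption).
    ratio = plus_depth / minus_depth if minus_depth else float("inf")
    return {
        "plus_depth": plus_depth,
        "minus_depth": minus_depth,
        "strand_depth_ratio": ratio,
    }


def primer_distance(variant_pos: int, site_start: int, site_end: int) -> int:
    """Bases between the variant and the nearest edge of a binding site.

    Coordinates are inclusive; returns 0 when the variant lies inside the site.
    """
    if site_start <= variant_pos <= site_end:
        return 0
    return min(abs(variant_pos - site_start), abs(variant_pos - site_end))


def primer_distance_features(variant_pos: int, candidate_sites: list) -> dict:
    """Maximum and minimum distance over candidate (start, end) binding sites."""
    distances = [primer_distance(variant_pos, s, e) for s, e in candidate_sites]
    return {
        "max_primer_distance": max(distances),
        "min_primer_distance": min(distances),
    }
```

For example, `primer_distance_features(100, [(60, 80), (30, 50)])` measures the variant at position 100 against two hypothetical binding sites and reports both extremes.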
In some embodiments, generating the at least one primer feature comprises determining a minimum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a minimum distance between the first TSV and binding site for a second primer designed to detect the first TSV. In some embodiments, a first primer and/or a second primer are PCR primers designed to amplify a portion of a polynucleotide comprising the first TSV. In some embodiments, generating the first set of features for the first TSV comprises generating the at least one sequence context feature, and wherein generating the at least one sequence context feature comprises determining a conservation score of a polynucleotide of the patient comprising the first TSV, a distance between the first TSV and a nearest splice site on the polynucleotide, and/or a splice site score of the polynucleotide. In some embodiments, generating the conservation score comprises generating a phastCons conservation score and/or a phyloP conservation score. In some embodiments, generating the first set of features for the first TSV comprises determining: the sequencing depth of coverage of plus strands and minus strands for the first TSV, the non-tumor cell depth coverage for the first TSV, the number of observations of the first TSV in tumor cells of the
patient, and the trinucleotide context (TNC) error rate feature. In some embodiments, the method further comprises determining one or more of the maximum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV, the tumor allele frequency of the first TSV, the phastCons conservation score of the first TSV, the maximum distance between the first TSV and a binding site for the first primer designed to detect the first TSV, the distance between the first TSV and the nearest splice site on a polynucleotide of the patient comprising the first TSV, and a phyloP conservation score. In some embodiments, the method further comprises determining one or more of the C to A variant mutation feature, the minimum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the splice site score of the polynucleotide, the minimum distance between the first TSV and the binding site for the second primer designed to detect the first TSV. In some embodiments, processing the plurality of sets of features using the trained machine learning model to obtain a corresponding plurality of scores comprises processing the plurality of sets of features using a trained nonlinear classification model. In some embodiments, the trained nonlinear classification model comprises a random forest model. In some embodiments, the trained machine learning model comprises a plurality of parameters having respective values and wherein processing a set of features of the plurality of sets of features comprises computing a score using the set of features and the respective values of the plurality of parameters. In some embodiments, the score is the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
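To make "computing a score using the set of features and the respective values of the plurality of parameters" concrete, the sketch below shows how a tree ensemble such as a random forest can turn one feature set into a score: each tree routes the features to a leaf probability, and the forest averages the leaves. The tree encoding, the feature names, and every split threshold and leaf value in `TOY_FOREST` are invented for illustration; they are not trained values and do not come from the disclosure.

```python
from statistics import mean

# A leaf is a float (predicted probability of detection). An internal node is
# a tuple: (feature_name, threshold, left_subtree, right_subtree).
# The "parameters" of this toy model are the split features, thresholds, and leaves.


def tree_predict(node, features: dict) -> float:
    """Walk one decision tree down to a leaf probability."""
    while not isinstance(node, float):
        feature_name, threshold, left, right = node
        node = left if features[feature_name] <= threshold else right
    return node


def forest_score(trees: list, features: dict) -> float:
    """Average the per-tree probabilities, as a random forest classifier does."""
    return mean(tree_predict(tree, features) for tree in trees)


# Hand-written toy trees with hypothetical splits (not a trained model).
TOY_FOREST = [
    ("tumor_vaf", 0.10, 0.2, 0.8),
    ("tumor_alt_observations", 20, 0.3, ("strand_depth_ratio", 2.0, 0.9, 0.5)),
]

example_features = {
    "tumor_vaf": 0.25,
    "tumor_alt_observations": 40,
    "strand_depth_ratio": 1.1,
}
example_score = forest_score(TOY_FOREST, example_features)
```

In practice the trees would be learned from training data by a library implementation rather than written by hand; the averaging step is the part this sketch is meant to show.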
In some embodiments, selecting the TSVs for inclusion into the subset of the plurality of TSVs comprises selecting a threshold number of TSVs based on their respective scores. In some embodiments, selecting a threshold number of TSVs based on their respective scores comprises selecting TSVs with the highest scores. In some embodiments, selecting a threshold number of TSVs based on their respective scores comprises selecting 50 TSVs with the highest scores. In some embodiments, the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having a second cancer that is different from the first cancer. In some embodiments, the first cancer is lung cancer and the second cancer is melanoma.
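Selecting a threshold number of the highest-scoring TSVs is a top-N selection. A minimal sketch, assuming the scores are held in a dictionary keyed by a TSV identifier (the function name `select_panel` and the identifiers are hypothetical; the default of 50 follows the text):

```python
import heapq


def select_panel(scores: dict, panel_size: int = 50) -> list:
    """Return up to panel_size TSV identifiers, highest score first."""
    top = heapq.nlargest(panel_size, scores.items(), key=lambda item: item[1])
    return [tsv_id for tsv_id, _ in top]
```

If fewer TSVs are available than the panel size, this sketch simply returns all of them in descending score order.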
In some embodiments, the method further comprises: synthesizing primers corresponding to at least some of the TSVs in the subset of the plurality of TSVs. In some aspects, this disclosure describes a method of training a machine learning model to generate a score indicative of the predicted detectability of a tumor-specific variant (TSV) in a biological sample of a minimal residual disease (MRD) positive patient, the machine learning model comprising a plurality of parameters, the method comprising: obtaining training data, the training data derived from data collected during previously performed monitoring for presence of a plurality of TSVs in a plurality of biological samples collected from MRD positive patients, the training data comprising: for each TSV in the plurality of TSVs and each biological sample in which the TSV was previously monitored, (i) variant data associated with the TSV; and (ii) an indication of whether the TSV was present or absent in the biological sample; and training the machine learning model by using the training data to estimate values of the plurality of parameters to obtain a trained machine learning model. In some embodiments, obtaining training data comprises obtaining variant data associated with each TSV, the variant data comprising at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature. In some embodiments, obtaining training data comprises obtaining an indication of whether the TSV is present or absent in the biological sample, the indication determined based on the TSV being present in the biological sample at an allele frequency that exceeds a threshold.
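The training data described above pairs each monitored TSV's variant data with a present/absent indication derived from its observed allele frequency. The sketch below shows one way such rows could be assembled before being handed to a model-fitting routine. The function name, the input layout of `(features, observed_af)` pairs, and the threshold value used in the usage example are assumptions; the text does not give a numeric threshold.

```python
def build_training_rows(observations, af_threshold: float):
    """Split (features, observed_af) pairs into feature rows and 0/1 labels.

    A TSV is labeled present (1) when its observed allele frequency in the
    monitored sample exceeds af_threshold, and absent (0) otherwise.
    """
    feature_rows, labels = [], []
    for features, observed_af in observations:
        feature_rows.append(features)
        labels.append(1 if observed_af > af_threshold else 0)
    return feature_rows, labels
```

For example, `build_training_rows([({"tumor_vaf": 0.2}, 0.003), ({"tumor_vaf": 0.1}, 0.0)], af_threshold=0.001)` labels the first TSV as present and the second as absent; the 0.001 threshold is purely illustrative.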
In some embodiments, training a machine learning model to predict a score indicative of detectability of a TSV in a biological sample comprises training the machine learning model to predict a likelihood that the TSV will be observed in the biological sample of an MRD positive patient. In some embodiments, the MRD positive patients comprise patients that have been previously diagnosed with lung cancer and/or patients that have been previously diagnosed with melanoma. In some embodiments, the plurality of TSVs comprises at least 200 TSVs. In some embodiments, the MRD positive patients comprise at least 50 MRD positive patients. In some embodiments, the MRD positive patients comprise at least 500 MRD positive patients. In some embodiments, training the machine learning model comprises training a nonlinear machine learning model. In some embodiments, training the machine learning model comprises training a nonlinear regression machine learning model. In some embodiments, training the machine learning model comprises training a nonlinear
classification machine learning model. In some embodiments, training the machine learning model comprises training a random forest model. In some embodiments, training the machine learning model to estimate values of the plurality of parameters comprises estimating the values of 5 parameters. In some embodiments, training the machine learning model comprises training the trained machine learning model as described herein. In some aspects, this disclosure describes a method for determining whether patient-specific panel data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in detecting a subset of a plurality of TSVs using the method described herein; generating sequence data from the biological sample of the patient, the generating comprising contacting the biological sample with the primers; detecting TSVs using the sequence data; and determining, using the detected TSVs, whether the biological sample provides an indication of MRD. In some embodiments, the biological sample is a blood, serum or plasma sample of the patient. In some embodiments, detecting the TSVs using the sequence data comprises determining the allele frequency of the TSVs in the biological sample. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining whether the allele frequency of at least some of the TSVs exceeds an error rate of generating sequencing data of the biological sample. In some embodiments, the method further comprises administering a therapeutic when the patient has a positive indication of MRD or continuing to collect biological samples from the patient for use in monitoring the patient for MRD when the patient has a negative indication of MRD.
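The MRD determination described above compares each panel TSV's allele frequency in the monitored sample against the sequencing error rate. A minimal sketch follows, assuming read counts are supplied per TSV as `(alt_reads, total_reads)` and that a single error rate applies to all TSVs. The function names and the `min_positive_tsvs` parameter are hypothetical; the text says only that "at least some" TSVs must exceed the error rate.

```python
def tsv_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at the TSV position that carry the variant allele."""
    return alt_reads / total_reads if total_reads else 0.0


def mrd_indication(tsv_counts: dict, error_rate: float, min_positive_tsvs: int = 1):
    """Return (is_positive, detected_tsv_ids) for one monitored sample.

    tsv_counts maps a TSV identifier to (alt_reads, total_reads).
    A TSV counts as detected when its allele frequency exceeds error_rate.
    """
    detected = [
        tsv_id
        for tsv_id, (alt_reads, total_reads) in tsv_counts.items()
        if tsv_allele_frequency(alt_reads, total_reads) > error_rate
    ]
    return len(detected) >= min_positive_tsvs, detected
```

A production assay would more plausibly use a per-variant or per-context error model and a statistical test rather than a single cutoff; this sketch shows only the comparison stated in the text.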
In some embodiments, administering a therapeutic comprises administering a therapeutic to treat a cancer and/or tumor associated with the indication of MRD. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with sensitivity greater than a 0.85 probability of detecting MRD in a patient that has MRD. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a 0.98 probability of not detecting MRD in a patient that does not have MRD. Some embodiments describe selecting a patient for administration of a therapeutic, the selecting comprising: determining whether sequence data of a biological sample of the patient provides an indication that the patient has minimal residual disease (MRD) using the methods
described herein; and selecting the patient when the patient has a positive indication of MRD; or repeating the method with one or more further biological samples from the patient. In some aspects, this disclosure describes a system for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient. The system may comprise: at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel.
In some embodiments, the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, as described herein. In some aspects, this disclosure describes at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the
predicted detectability of a corresponding TSV in circulating-tumor DNA (ctDNA) of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel. In some embodiments, the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, as described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG.1 is a diagram depicting an illustrative technique 100 for using variant data from tumor cells and non-tumor cells of a patient to design a patient-specific panel for detecting MRD in the patient, according to some embodiments of the technology described herein.
FIG.2A is a flowchart of an illustrative process 200 for identifying a subset of a plurality of tumor specific variants (TSVs) for use in a patient-specific panel for identifying MRD, and optionally identifying and/or synthesizing one or more primers for inclusion in a patient-specific panel, according to some embodiments of the technology described herein. Steps enclosed with dashed lines are optional.
FIG.2B is a flowchart of an illustrative process 250 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein.
FIG.3 is a diagram depicting an illustrative technique 300 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein.
FIG.4 is a flowchart of an illustrative process 400 for identifying the subset of the plurality of TSVs for use in a patient-specific panel, according to some embodiments of the technology described herein.
FIG.5 is a diagram depicting an illustrative technique 500 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using variants identified by sequencing non-tumor cells and tumor cells of the patient to identify TSVs and exclude non-tumor-specific variants, scoring the TSVs using a trained machine learning model, and selecting TSVs for the patient-specific panel using the scores, according to some embodiments of the technology described herein.
FIG.6 is a scatter plot showing SHapley Additive exPlanations (SHAP) values of TSV features included when training and testing the machine learning model, according to some embodiments of the technology described herein.
FIG.7 is a table of TSV features selected for use in a trained machine learning model, according to some embodiments of the technology described herein.
FIG.8 is a beeswarm plot of the SHAP values of each feature of FIG.7, where a broader SHAP value distribution for a given feature indicates a greater impact of that feature on the scores of the trained machine learning model, according to some embodiments of the technology described herein.
FIG.9A is a scatter plot comparing variant max splice site score to SHAP values of the variant max splice site score where each point is colored by Random Forest (RF) Predicted Probability, according to some embodiments of the technology described herein.
FIG.9B is a scatter plot comparing minimum primer 1 distance to SHAP values of minimum primer 1 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9C is a scatter plot comparing minimum primer 2 distance to SHAP values of the minimum primer 2 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9D is a scatter plot comparing maximum primer 2 distance to SHAP values of maximum primer 2 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9E is a scatter plot comparing tumor cell alternate observations to SHAP values of tumor cell alternate observations where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9F is a scatter plot comparing maximum primer 1 distance to SHAP values of maximum primer 1 distance where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9G is a scatter plot comparing phyloP conservation score to SHAP values of phyloP conservation score where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9H is a scatter plot comparing phastCons conservation score to SHAP values of the phastCons conservation score where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9I is a scatter plot comparing tumor cell allele frequency (FAF) to SHAP values of FAF where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9J is a scatter plot comparing non-tumor cell depth coverage to SHAP values of non-tumor cell depth coverage where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9K is a scatter plot comparing strand bias to SHAP values of strand bias where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9L is a scatter plot comparing minimum strand coverage to SHAP values of minimum strand coverage where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9M is a scatter plot comparing error rate corrected error bins to SHAP values of the error rate corrected error bins where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9N is a scatter plot comparing C to A mutations to SHAP values of the C to A mutations where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.9O is a scatter plot comparing distance to nearest splice site to SHAP values of the distance to nearest splice site where each point is colored by RF Predicted Probability, according to some embodiments of the technology described herein.
FIG.10A shows box and whisker plots of the sensitivity and specificity of a rules-based algorithm and the trained machine learning model for predicting MRD in lung and melanoma cancer patients, according to some embodiments of the technology described herein.
FIG.10B shows bar charts of the sensitivity and specificity of a rules-based algorithm and the trained machine learning model for predicting MRD in melanoma cancer patients using an iteration of the model trained solely on lung cancer data, according to some embodiments of the technology described herein.
FIG.11 depicts an illustrative implementation of a computer system that may be used in connection with some embodiments of the technology described herein.
FIG.12 is a diagram depicting an illustrative technique 1200 for training the trained machine learning model to generate a score indicative of the predicted detectability of a TSV, according to some embodiments of the technology described herein.
FIG.13 is a flowchart of an illustrative process 1300 for training the trained machine learning model, according to some embodiments of the technology described herein.
DETAILED DESCRIPTION
Early detection of cancer relapse/recurrence is an important aspect of effective cancer treatment. One strategy for detecting cancer relapse comprises using patient-specific panels (e.g., a selection of tumor specific variants and/or one or more primers used to detect them) to detect minimal residual disease (MRD) using biological samples (e.g., samples containing circulating tumor DNA (ctDNA)) collected from a patient, often after administration of a cancer therapy. Circulating tumor DNA often comprises wild-type nucleic acid sequences (e.g., comprising somatic and/or germline mutations) as well as nucleic acid sequences comprising tumor-specific variants (TSVs), which are often indicative of MRD. This strategy may be implemented for a patient via a two-stage “panel design” process. The first stage may involve identifying tumor-specific variants for a patient, for example, by sequencing a biological sample (e.g., a sample comprising tumor cells and/or non-tumor cells) obtained from the patient and analyzing the sequencing results. The second stage may involve creating a customized panel (e.g., patient-specific panel) for that patient and may comprise a suitable technique for detecting the TSVs (e.g., untargeted sequencing, targeted sequencing, polynucleotide probes, polymerase chain reaction amplification of the TSVs, qPCR, hybrid/array capture, the like, or a combination thereof). In some embodiments, a patient-specific panel is used to select and/or detect TSVs in a biological sample of a patient. In some embodiments, TSVs are detected using a suitable amplification method.
The detection may be performed by contacting a biological sample (or polynucleotides from the sample) with primers or probes (depending on the technique) for detecting the TSVs of a patient-specific panel. For example, primers (e.g., sets of primers) from a patient-specific panel may be used to amplify polynucleotides of a biological sample which are expected to comprise detectable TSVs when MRD is present in a patient. Amplicons resulting from an amplification may be analyzed using a suitable method. In some embodiments, following amplification, amplicons of the selected polynucleotides may be sequenced (e.g., using next-generation sequencing) to detect TSVs. A positive indication of MRD may be found when the total number of TSVs detected exceeds a suitable threshold (e.g., the threshold may be an expected number of TSVs to be detected due to error associated with sample preparation and sequencing). A positive indication of MRD may also
be found when specific TSVs are detected. In another example, fluorescent polynucleotide probes may be used to detect polynucleotides comprising TSVs in the biological sample (e.g., polynucleotides extracted from the biological sample). A positive indication of MRD may be found when the fluorescent signal from the fluorescent polynucleotide probes exceeds a threshold (e.g., the threshold may be the expected fluorescent background signal). The degree to which a patient-specific panel is effective in detecting MRD may be quantified using measures such as panel sensitivity and specificity. Sensitivity refers to the true positive rate of detecting MRD in a patient, that is, the probability of detecting MRD in a patient who has MRD. Specificity refers to the true negative rate of detecting MRD in a patient, that is, the probability of not detecting MRD in a patient who does not have MRD. The consequences of incorrect results (providing a false indication of cancer relapse or providing a false indication of cancer remission) are highly undesirable. The inventors have appreciated that the sensitivity and specificity of a patient-specific panel designed using previous methods may be improved upon. For example, some panel design processes involve selecting tumor-specific variants using manually-designed rules (e.g., selecting TSVs for which features, such as allele frequency in tumor cells and/or non-tumor cells, sequencing coverage, and/or sequencing depth, exceed respective manually-set thresholds) and then designing a panel to detect the selected tumor-specific variants. Such TSV selection rules encode subjective assumptions about the importance of TSV features in selecting TSVs that will actually help to detect MRD.
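The sensitivity and specificity measures defined above can be made concrete with a short sketch. The function name, the boolean encoding of MRD status, and the example data below are illustrative assumptions and are not taken from this disclosure.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for paired boolean lists.

    truth[i] is True when patient i actually has MRD (hypothetical labels);
    calls[i] is True when the panel reported MRD for patient i.
    """
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)          # MRD present, detected
    fn = sum(1 for t, c in pairs if t and not c)      # MRD present, missed
    tn = sum(1 for t, c in pairs if not t and not c)  # MRD absent, not called
    fp = sum(1 for t, c in pairs if not t and c)      # MRD absent, called
    # Guard against empty classes rather than dividing by zero.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up example: three MRD-positive and two MRD-negative patients.
truth = [True, True, True, False, False]
calls = [True, True, False, False, True]
sens, spec = sensitivity_specificity(truth, calls)
```

In this made-up example, two of the three MRD-positive patients are detected and one of the two MRD-negative patients is correctly left uncalled.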
In particular, the rules may not accurately and faithfully represent the complex (e.g., non-linear and heterogeneous) relationship between various TSV characteristics and the likelihood that such TSVs can be subsequently detected in ctDNA of a patient with high sensitivity and specificity. Consequently, the sensitivity and specificity of panels designed using such a process are less desirable. The inventors have developed a new patient-specific panel design process that improves upon previous panel design techniques in that it produces patient-specific panels that have higher sensitivity and specificity as compared to patient-specific panels produced using previous panel design techniques. The precise improvement can be quantified and is described in greater detail herein including with reference to FIGs.10A-10B. Additionally, the inventors have used objective and data-driven criteria to select features for inclusion in the model that are predictive of the detectability of TSVs rather than rule-based criteria. Notably, the new panel design process involves using machine learning technology (e.g., instead of subjective rules) to select tumor-specific variants for inclusion in
a customized patient-specific panel designed for monitoring MRD in the patient. The machine learning technology involves a machine learning model that is trained to represent the (e.g., non-linear and heterogeneous) relationship between various features of a TSV (see e.g., the features shown in FIG.7) and the likelihood that such a TSV will be detected in the circulating nucleic acids (e.g., ctDNA) of the patient during subsequent monitoring. Such a data-driven approach removes the need to rely on manually-tuned and subjective rules of previous approaches and directly leads to the improved specificity and sensitivity of the resulting panel, as demonstrated herein. In some embodiments, the new panel design process involves three stages: (1) identifying variants by analyzing sequence data obtained by sequencing one or more biological samples obtained from a patient; (2) identifying, among the identified variants, a set of tumor-specific variants for the patient; and (3) evaluating the tumor-specific variants using a trained machine learning model (e.g., a random forest model, a non-linear mixed-effects model, a logistic regression model, a support vector machine model, etc.) to identify a subset of the plurality of tumor-specific variants to use for the patient-specific panel. In turn, primers corresponding to at least some (e.g., all) of the TSVs in the identified subset may be synthesized and used for analyzing (e.g., amplifying and/or detecting, e.g., sequencing) another biological sample obtained from the patient at a later time (e.g., in part by contacting nucleic acids in the biological sample with the synthesized primers). Subsequent sequencing results may be analyzed, for example, to detect MRD. In contrast to previous methods, this disclosure provides evidence that the relationships between TSV features (e.g., sequence context, allele frequency in tumor cells vs. healthy cells, etc.)
and a TSV being indicative of MRD are much more complex (e.g., non-linear and heterogeneous). In order to better design patient-specific panels, this disclosure describes using machine learning models that can capture these complex relationships and shows a corresponding improvement in patient-specific panel sensitivity and specificity in detecting MRD compared to previous panel-design methods. Accordingly, some embodiments provide for a computer-implemented method of designing a patient-specific panel (e.g., a panel for use in detecting tumor-specific variants of the patient) for use in detecting minimal residual disease (MRD) in a (e.g., a human) patient. In some embodiments, the method comprises: (A) obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient (e.g., the plurality of variants may include germline variants, somatic variants, and/or tumor-specific somatic variants); (B) identifying, using the variant data and from among the plurality of variants, a
plurality of tumor-specific variants (TSVs) for the patient (e.g., variants indicative of the tumor or ctDNA of the tumor); and (C) identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: (i) generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features (e.g., a set of features for each of the at least some of the TSVs of the plurality of TSVs); (ii) processing the plurality of sets of features using a trained machine learning model (e.g., a trained random forest model) to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides (e.g., circulating-tumor DNA) of the patient to be monitored using the patient-specific panel; and (iii) selecting, using the plurality of scores and from among the at least some of the TSVs (e.g., by selecting the highest-scoring TSVs), the TSVs for inclusion into the subset of the plurality of the TSVs for use in the patient-specific panel. In some embodiments, the method further comprises identifying (e.g., designing or accessing previously-designed) primers for use in amplifying and/or detecting presence (or absence), in a biological sample (e.g., another biological sample obtained at a later time), of at least some variants in a subset of the plurality of the TSVs. This may be done, for example, by designing and/or generating primers that are designed to amplify portions of a patient’s ctDNA which include TSVs in the subset of the plurality of TSVs. In some embodiments, primers may be identified after the plurality of TSVs for the patient is identified (e.g., by designing or accessing previously-designed primers for each of at least some of the TSVs in the plurality of TSVs).
In some such embodiments, information about the primer(s) identified for a TSV may be used to evaluate the TSV for inclusion into the subset of the plurality of TSVs for use in a patient-specific panel. In other embodiments, the primers may be identified for the TSVs in the subset of the plurality of TSVs after the subset of the plurality of TSVs is selected (e.g., in embodiments where information about the primers is not used to evaluate TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel). Further discussion of identifying primers can be found herein including with reference to FIG.2A. In some embodiments, obtaining variant data indicative of a plurality of variants of a patient comprises: obtaining one or more data structures encoding variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data. All these types
of data may be stored using any suitable data structure and using any suitable format, as aspects of the technology described herein are not limited in this respect. In some embodiments, obtaining variant data may include obtaining at least one data structure encoding variant sequence context data. Obtaining sequence context data may comprise obtaining one or more of sequence context homopolymer data (e.g., data indicative of the location and size of homopolymers within a threshold distance of a given variant), sequence context splice site data (e.g., data indicative of the location of any splice site within a threshold distance of a given variant), sequence context mutation data (e.g., data indicative of the location and type of mutations within a threshold distance of a given variant), and/or sequence context conservation data (e.g., data indicative of the degree of conservation of the ctDNA sequence within a threshold distance of a given variant). The variant data indicative of a plurality of variants present in tumor cells of the patient may comprise data characterizing a variant derived from sequencing data from a sample comprising genomic material derived from tumor cells of the patient. In some embodiments, obtaining variant data indicative of a plurality of variants of the patient comprises obtaining variant data previously-generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient (e.g., the tumor cells obtained from the patient). In some embodiments, obtaining variant data indicative of a plurality of variants of a patient comprises sequencing (e.g., using whole genome sequencing or whole exome sequencing) the at least one biological sample obtained from the patient (e.g., melanoma cells, lung cancer cells, or cells of any other type of cancer that the patient may have and/or may be monitored for) and analyzing sequencing data produced by the sequencing.
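As one illustration of a data structure that could encode the kinds of variant data listed above, the following minimal sketch uses a Python dataclass. The disclosure does not prescribe any particular structure or format; the class name and every field name here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    """Hypothetical container for per-variant data; all field names are illustrative."""
    chrom: str             # variant genomic location: chromosome
    pos: int               # variant genomic location: 1-based position
    ref: str               # reference allele (variant sequence data)
    alt: str               # alternate allele (variant sequence data)
    variant_type: str      # e.g., "SNV" or "indel" (variant type data)
    tumor_depth: int       # sequencing depth at the site in the tumor sample
    tumor_alt_obs: int     # reads supporting the alternate allele in the tumor sample
    normal_depth: int      # sequencing depth at the site in the non-tumor sample
    normal_alt_obs: int    # reads supporting the alternate allele in the non-tumor sample
    error_rate: Optional[float] = None  # sequencing error rate, when available

    @property
    def tumor_vaf(self) -> float:
        """Tumor variant allele frequency: alternate reads over total depth."""
        return self.tumor_alt_obs / self.tumor_depth if self.tumor_depth else 0.0


# Made-up record for illustration only.
record = VariantRecord("chr7", 55191822, "T", "G", "SNV", 200, 30, 80, 0)
```

Derived quantities such as allele frequency are computed from the stored counts, so a record cannot hold a frequency that disagrees with its read counts.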
Obtaining variant data may comprise obtaining sequence data of a tumor cell sample and/or a non-tumor cell sample of the patient. In some embodiments, obtaining variant data indicative of a plurality of variants of a patient comprises generating the variant data or accessing (e.g., importing, downloading) previously-generated variant data. Regardless of when generated, variant data may be generated, in some embodiments, using at least one suitable variant caller to identify a plurality of variants (e.g., as described in Koboldt, D. C. (2020) Genome Med 12:91) and generate various information about the variants (e.g., variant genomic location data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data and/or variant primer data). For example, generating the variant data may comprise obtaining sequence data corresponding to the at least one biological sample obtained from the patient
and inputting the sequence data into at least one variant caller whose output may identify the plurality of variants and information about the identified variants. Any suitable variant caller may be used in this respect, further examples of which are described herein including in the section “Variant”. As discussed above, in some embodiments, this method comprises identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient. Identifying a plurality of TSVs may comprise identifying the plurality of TSVs in a biological sample of a tumor comprising the tumor cells of the patient. Identifying the plurality of TSVs (e.g., the plurality of TSVs to be inputted into the trained machine learning model) may comprise: selecting variants from among the plurality of variants using at least one feature (e.g., at least two features, at least three features, at least four features, at least five features, or all the features) selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and/or non-tumor cells, and tumor cell variant allele frequency. Identifying the plurality of TSVs may comprise selecting variants using variant bi-directional support, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data (e.g., 2-15 times). Additional methods for selecting variants using variant bi-directional support are described herein including in the section “Variant Bi-directional Support”.
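The bi-directional support check described above can be sketched as follows. The function name, the dictionary keys, and the default threshold of 2 (the low end of the example 2-15 range) are assumptions made for illustration.

```python
def has_bidirectional_support(plus_alt_obs, minus_alt_obs, min_obs=2):
    """True when the variant is seen at least min_obs times on BOTH strands.

    min_obs=2 is an assumed default taken from the low end of the
    example range; any suitable threshold could be substituted.
    """
    return plus_alt_obs >= min_obs and minus_alt_obs >= min_obs


# Hypothetical per-variant strand counts (plus-strand and minus-strand reads).
variants = [
    {"id": "var_a", "plus": 6, "minus": 4},
    {"id": "var_b", "plus": 9, "minus": 0},   # seen on one strand only
    {"id": "var_c", "plus": 2, "minus": 2},
]
supported = [v["id"] for v in variants
             if has_bidirectional_support(v["plus"], v["minus"])]
```

A variant seen many times on only one strand (here `var_b`) is excluded, which is the point of requiring support in both read directions.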
Identifying the plurality of TSVs may comprise selecting variants using the healthy population variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant has a variant allele frequency in a healthy population, as defined by at least one genomic database, of less than a threshold percentage (e.g., 1%). Additional methods for selecting variants using healthy population variant allele frequency are described herein including the section “Healthy Population Variant Allele Frequency”. Identifying the plurality of TSVs may comprise selecting variants using sequence context homopolymer size, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether a homopolymer sequence exceeding a threshold size (e.g., nucleotides) is present between the variant and a binding site of a primer designed to detect presence of the variant (e.g., in the genome of the tumor cells of the
patient). Additional methods for selecting variants using sequence context homopolymer size are described herein including in the section “Sequence Context Homopolymer Size”. Identifying the plurality of TSVs may comprise selecting variants using sequence coverage in non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold (e.g., between 45X and 100X). Selecting variants using sequence coverage in non-tumor cells may comprise selecting variants using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. Additional methods for selecting variants using sequence coverage in non-tumor cells are described herein including in the section “Sequence Coverage in Non-tumor Cells”. Identifying a plurality of TSVs may comprise selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the ratio of the variant exceeds a threshold ratio (e.g., a ratio between 20:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency). The ratio of VAF may be determined using sequence data of the tumor cells and sequence data of the non-tumor cells. Additional methods for selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells are described herein including in the section “Ratio of Variant Allele Frequency between Tumor Cells and Non-tumor Cells”.
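The non-tumor coverage and allele-frequency-ratio checks described above might be sketched as follows. The function names and default thresholds (45X and 10:1, taken from the example ranges) are assumptions, as is the handling of a variant that is never observed in non-tumor cells, which the text does not specify.

```python
def passes_normal_coverage(normal_depth, min_depth=45):
    """True when non-tumor sequencing coverage exceeds the threshold (assumed 45X)."""
    return normal_depth > min_depth


def passes_vaf_ratio(tumor_vaf, normal_vaf, min_ratio=10.0):
    """True when tumor VAF : non-tumor VAF exceeds min_ratio (assumed 10:1).

    Assumption: when the variant is absent from non-tumor cells
    (normal_vaf == 0) the ratio is treated as unbounded, so the variant
    passes as long as it is observed in the tumor at all.
    """
    if normal_vaf == 0:
        return tumor_vaf > 0
    return tumor_vaf / normal_vaf > min_ratio


# Made-up values: 60X non-tumor coverage, tumor VAF 0.20, non-tumor VAF 0.01.
keep = passes_normal_coverage(60) and passes_vaf_ratio(0.20, 0.01)
```

Requiring adequate non-tumor coverage before trusting a low non-tumor allele frequency guards against calling a germline variant tumor-specific simply because the non-tumor sample was sequenced too shallowly to observe it.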
Identifying the plurality of TSVs may comprise selecting variants using the tumor cell variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold (e.g., between a 0.05 and a 0.1 tumor cell variant allele frequency). Selecting variants using the tumor cell variant allele frequency may comprise selecting the variants using sequence data derived from a biological sample of a tumor comprising the tumor cells of the patient. Additional methods for selecting variants using the tumor cell variant allele frequency are described herein including in the section “Tumor Cell Variant Allele Frequency”. As discussed above, in some embodiments, the method comprises identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient. In some embodiments, the plurality of TSVs comprises a first TSV, wherein generating the respective set of features (e.g., features to be provided as input into the trained
machine learning model) comprises generating a first set of features for the first TSV, and wherein generating the first set of features for the first TSV comprises generating at least one sequencing coverage feature (e.g., sequencing depth of coverage of plus strands and minus strands for the first TSV (e.g., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV (e.g., strand bias)), at least one allele frequency feature (e.g., non-tumor cell depth coverage for the first TSV, a number of observations of the first TSV in tumor cells of the patient (e.g., tumor cell alternate observations), and/or a tumor allele frequency of the first TSV), a trinucleotide context (TNC) error rate feature (e.g., error rate in error corrected bins), a C to A variant mutation feature (e.g., the variant comprises a C to A mutation), at least one primer feature (e.g., a distance between the first TSV and a binding site for a primer designed to detect the first TSV), and/or at least one sequence context feature (e.g., a conservation score of a polynucleotide of the patient comprising the first TSV, a distance between the first TSV and a nearest splice site on the polynucleotide, and/or a splice site score of the polynucleotide). Generating these features may comprise generating the features using sequence data of a biological sample of a tumor comprising the tumor cells of the patient. Additional methods for generating features are described herein including in the section “Subset of the Plurality of Tumor Specific Variants” and with reference to FIG.2B. As discussed above, in some embodiments, generating the first set of features for the first TSV comprises generating at least one primer feature.
Generating the at least one primer feature may comprise determining a maximum distance between a first TSV and a binding site for a first primer (e.g., a PCR primer) designed to detect the first TSV (e.g., max primer 1 distance) and/or a maximum distance between the first TSV and binding site for a second primer (e.g., max primer 2 distance), different from the first primer, designed to detect the first TSV. In other embodiments, generating the at least one primer feature may comprise determining a minimum distance between a first TSV and a binding site for a first primer designed to detect the first TSV (e.g., minimum primer 1 distance) and/or a minimum distance between the first TSV and binding site for a second primer (e.g., a PCR primer) designed to detect the first TSV (e.g., minimum primer 2 distance). As discussed above, in some embodiments, generating a first set of features for a first TSV comprises generating the at least one sequence context feature. Generating the at least one sequence context feature may include generating a conservation score (e.g., generating a phastCons conservation score and/or a phyloP conservation score). Generating at least one
sequence context feature may also comprise generating distance to nearest splice site and/or a variant max splice site score. In some aspects, this disclosure describes generating specific combinations of features for use in identifying a subset of the plurality of the TSVs for use in the patient-specific panel for use in detecting MRD in the patient. For example, generating the first set of features for the first TSV may comprise determining: the sequencing depth of coverage of plus strands and minus strands for the first TSV, the non-tumor cell depth coverage for the first TSV, the number of observations of the first TSV in tumor cells of the patient, and/or the trinucleotide context (TNC) error rate feature. In some embodiments, generating a first set of features for a first TSV further comprises determining one or more of a distance (e.g., a minimum and/or maximum distance) between the first TSV and a binding site for a primer (e.g., a first primer and/or a second primer) designed to detect the first TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV, the tumor allele frequency of the first TSV, the phastCons conservation score, the distance between the first TSV and the nearest splice site on the polynucleotide, and/or a phyloP conservation score. In some embodiments, generating the first set of features for the first TSV further comprises determining one or more of the C to A variant mutation feature, the minimum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the splice site score of the polynucleotide, the minimum distance between the first TSV and/or the binding site for the second primer designed to detect the first TSV.
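A few of the features named above can be sketched in code. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the reading of "minimum" and "maximum" primer distance as the distances from the variant to the nearer and farther edge of a primer binding site is one possible interpretation, not a definition given in this disclosure.

```python
def strand_features(plus_depth, minus_depth):
    """Minimum strand coverage and strand bias (plus:minus depth ratio)."""
    bias = plus_depth / minus_depth if minus_depth else float("inf")
    return {"min_strand_coverage": min(plus_depth, minus_depth),
            "strand_bias": bias}


def primer_distances(variant_pos, site_start, site_end):
    """(min, max) distance from a variant to a primer binding site.

    Assumed interpretation: the binding site is the interval
    [site_start, site_end], the variant lies outside it, and distances
    are measured to its nearer and farther edges.
    """
    d_start = abs(variant_pos - site_start)
    d_end = abs(variant_pos - site_end)
    return min(d_start, d_end), max(d_start, d_end)


def is_c_to_a(ref, alt):
    """Flag for the C to A variant mutation feature."""
    return ref == "C" and alt == "A"


# Assemble a hypothetical feature set for one TSV from made-up inputs.
features = strand_features(plus_depth=30, minus_depth=20)
features["min_primer1_dist"], features["max_primer1_dist"] = primer_distances(100, 40, 60)
features["c_to_a"] = is_c_to_a("C", "A")
```

The resulting dictionary stands in for one "set of features" for a single TSV; a real implementation would add the remaining features (conservation scores, splice site features, error rate, and so on) from their respective data sources.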
In some embodiments, processing the plurality of sets of features using the trained machine learning model to obtain a corresponding plurality of scores comprises processing the plurality of sets of features using a trained nonlinear ML model (e.g., a random forest, a support-vector machine, or a neural network). Non-linear ML models like these are expected to capture the nonlinear relationships between the variant features described herein and the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient. The non-linear model may be a non-linear regression model (e.g., a model configured to output an estimated value, such as a likelihood or probability, in the 0-1 range). Alternatively, the non-linear model may be a non-linear classification model (e.g., a model configured to output an indication of one or multiple discrete classes, for example, where each of the classes corresponds to a respective bin of likelihood or probability values in the 0-1 range). Additional methods for processing the plurality of sets of features using the trained machine learning model are described herein including with reference to FIG.2B.
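The relationship between the regression-style output (a value in the 0-1 range) and the classification-style output (a discrete class per bin of values) described above can be illustrated with a small sketch. The function name and the choice of four equal-width bins are assumptions for illustration only; the model itself is not shown.

```python
def score_to_class(score, n_bins=4):
    """Map a score in [0, 1] to the index of an equal-width bin.

    With n_bins=4 (an assumed value), class 0 covers [0, 0.25),
    class 1 covers [0.25, 0.5), class 2 covers [0.5, 0.75), and
    class 3 covers [0.75, 1.0], with 1.0 folded into the top bin.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in the 0-1 range")
    return min(int(score * n_bins), n_bins - 1)


# Made-up scores standing in for the output of a trained model.
classes = [score_to_class(s) for s in (0.0, 0.3, 0.5, 0.75, 1.0)]
```

A classification model of the kind described would output such a class index directly, whereas a regression model would output the underlying score.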
In some embodiments, the trained machine learning model comprises a plurality of parameters having respective values and wherein processing a set of features of the plurality of sets of features comprises computing a score using the set of features and the respective values of the plurality of parameters. The score may represent the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient. Selecting the TSVs for inclusion into the subset of the plurality of the TSVs may comprise selecting a threshold number of TSVs based on their respective scores (e.g., selecting a subset of the plurality of the TSVs with high scores (e.g., the top 50 high scores)). Additional methods for selecting the TSVs for inclusion into the subset of the plurality of the TSVs are described herein including with reference to FIG. 2B. In some embodiments, the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having a second cancer that is different from the first cancer. The first cancer may be lung cancer and the second cancer may be melanoma. In some embodiments, the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having the same cancer as the one or more types of cancer that the model was trained on. In some embodiments, the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having a different type of cancer than the one or more types of cancer that the model was trained on.
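As a non-limiting illustration, selecting a threshold number of TSVs based on their respective scores may be sketched as follows. The function name and the representation of scores as a dictionary keyed by a TSV identifier are assumptions made for this example; the default of 50 mirrors the "top 50" example above.

```python
def select_top_tsvs(scores: dict[str, float], panel_size: int = 50) -> list[str]:
    """Return the identifiers of the highest-scoring TSVs, best first.

    `scores` maps a hypothetical TSV identifier to the score produced by the
    trained model. Ties keep their original insertion order.
    """
    ranked = sorted(scores, key=lambda tsv_id: scores[tsv_id], reverse=True)
    return ranked[:panel_size]
```

If fewer TSVs are available than the requested panel size, all of them are returned in ranked order.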
In some embodiments, selecting the TSVs for inclusion into the subset of the plurality of the TSVs further comprises: synthesizing primers corresponding to at least some of the TSVs in the subset of the plurality of TSVs (e.g., using a suitable primer synthesis method). Some embodiments further provide a method of training a machine learning model (e.g., a nonlinear machine learning model described herein) to generate a score indicative of the predicted detectability of a tumor-specific variant (TSV) (e.g., a likelihood) in a biological sample (e.g., plasma) of a minimal residual disease (MRD) positive patient, the machine learning model comprising a plurality of parameters (e.g., 5 parameters), the method comprising: obtaining training data, the training data derived from data collected during previously performed monitoring for presence of a plurality of TSVs (e.g., at least 200 TSVs in each MRD positive patient) in a plurality of biological samples collected from MRD
positive patients (e.g., patients previously diagnosed with melanoma or lung cancer), the training data comprising: for each TSV in the plurality of TSVs and each biological sample in which the TSV was previously monitored, (i) variant data associated with the TSV (e.g., variant data comprising the features described herein); and (ii) an indication of whether the TSV was present or absent in the biological sample (e.g., allele frequency of the variant exceeds a threshold); and training the machine learning model by using the training data to estimate values of the plurality of parameters to obtain a trained machine learning model. In some embodiments, the machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer (e.g., lung cancer) and the machine learning model is predictive of the probability of detecting a TSV in a biological sample from an MRD positive patient having a second cancer (e.g., melanoma) that is different from the first cancer. In some embodiments, the machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the probability of detecting a TSV in a biological sample from an MRD positive patient having the same cancer as the one or more types of cancer that the model was trained on. In some embodiments, the machine learning model is trained using TSVs from a plurality of MRD positive patients having one or more types of cancer, and is predictive of the probability of detecting a TSV in a biological sample from an MRD positive patient having a different type of cancer than the one or more types of cancer that the model was trained on.
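As a non-limiting illustration, training data of the kind described above may be arranged as pairs of (i) features derived from variant data and (ii) a present/absent label. In this sketch the label is derived by comparing an observed allele frequency to a threshold; the class name, field names, and threshold value are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MonitoredTsv:
    """One TSV monitored in one biological sample (hypothetical record)."""
    tsv_id: str
    sample_id: str
    features: list[float]           # features derived from the variant data
    sample_allele_frequency: float  # allele frequency observed in the sample


def label_training_data(
    records: list[MonitoredTsv], af_threshold: float
) -> list[tuple[list[float], int]]:
    """Pair each feature vector with 1 (present) or 0 (absent)."""
    return [
        (rec.features, 1 if rec.sample_allele_frequency > af_threshold else 0)
        for rec in records
    ]
```

The resulting pairs may then be supplied to any suitable training routine to estimate values of the model parameters.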
Training a machine learning model to predict a score indicative of detectability of a TSV in a biological sample may comprise training the machine learning model to predict a likelihood that the TSV will be observed in the biological sample of an MRD positive patient. Some embodiments further provide for a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in amplifying and/or detecting a subset of a plurality of TSVs using the methods described herein; generating sequence data from the biological sample of the patient (e.g., a bodily fluid, like plasma), the generating comprising contacting (e.g., a polymerase chain reaction solution) the biological sample with primers; detecting TSVs using the sequence data (e.g., using any suitable method, for example Illumina® sequencing); and determining, using the detected TSVs, whether the biological sample provides an indication of MRD (e.g., based on an abundance of each TSV detected using the patient specific panel and the expected error in identifying TSVs).
Some embodiments provide a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in amplifying and/or detecting a selected subset of a plurality of patient-specific TSVs; amplifying polynucleotides of the patient sample using the identified primers; generating sequence data (e.g., using any suitable method, for example Illumina® sequencing) from the amplified polynucleotides; and determining whether the biological sample provides an indication of MRD in the patient according to the sequence data (e.g., based on a presence, absence and/or amount of one or more, or all of the selected subset of TSVs in the biological sample). Determining whether the biological sample provides an indication of MRD may comprise determining whether the allele frequency of at least some of the TSVs exceeds an error rate associated with generating sequencing data of the biological sample. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with a sensitivity greater than a 0.80, greater than a 0.85, greater than a 0.90, greater than a 0.95, or greater than a 0.98 probability of detecting MRD in a patient that has MRD. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a 0.98 probability of not detecting MRD in a patient that does not have MRD. Additional methods for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD) are described herein including in the section “Methods of Determining an Indication of MRD”.
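For reference, sensitivity and specificity as used above follow their standard definitions: sensitivity is the fraction of patients that have MRD for whom MRD is detected, and specificity is the fraction of patients that do not have MRD for whom MRD is not detected. The following minimal sketch computes both from counts; the function names and the example counts are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Probability of detecting MRD in a patient that has MRD."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Probability of not detecting MRD in a patient that does not have MRD."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts: 45 of 50 MRD patients detected, 99 of 100 non-MRD
# patients correctly called negative.
meets_thresholds = sensitivity(45, 5) > 0.85 and specificity(99, 1) > 0.98
```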
In some embodiments, the method further comprises administering a therapeutic when the patient has a positive indication of MRD (e.g., the patient may have or will have cancer relapse) or continuing to collect biological samples from the patient for use in monitoring the patient for MRD when the patient has a negative indication of MRD (e.g., no indication of cancer relapse). In some embodiments, administering a therapeutic comprises administering a therapeutic to treat a disease (e.g., cancer) associated with the MRD.
Patient
A “patient” refers to an animal (e.g., a human) that has or is suspected of having a disease (e.g., a cancer). The patient may be a mammal (e.g., a human, a non-human primate, a dog, a cat, a horse, a goat, a sheep, a mouse, or a rat), a bird, a reptile, an amphibian, a fish, or a laboratory model organism (e.g., mice and rats). The patient may be a human. For
example, the patient may be an adult human (e.g., older than 18 years of age), a human child, or a human infant. The patient may be a patient that has been treated for a disease. For example, the patient may have been treated for any type of cancer. For example, the patient may have been treated for lung cancer (e.g., non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), or lung adenocarcinoma), brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, melanoma, etc. The patient may be in remission from a disease. For example, the patient may be in remission from cancer. For example, the patient may be in remission from lung cancer (e.g., non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), or lung adenocarcinoma), brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, melanoma, etc. As another example, the patient may be in remission from a cancer selected from NSCLC, colorectal cancer (CRC), bladder cancer, pancreatic cancer, head and neck squamous cell carcinomas (HNSCC), breast cancer, and hematological cancers (e.g., leukemia, lymphoma, and multiple myeloma). These cancers may be particularly likely to release nucleic acids (e.g., RNA or DNA) in bodily fluids. In some embodiments, a patient has been previously treated for a disease (e.g., cancer). In some embodiments, a patient may have been previously treated using one or more therapeutics such as surgery, chemotherapy, radiation therapy, immunotherapy, and/or hormone therapy. In some embodiments, treating a patient comprises removing a tumor. In some embodiments, the patient may be in remission from cancer.
Patient-Specific Panel
A “patient-specific panel” may refer to a collection (e.g., a set or subset) of tumor specific variants (TSVs) selected for use in detecting MRD in a patient or to a technique for detecting the selected TSVs (e.g., untargeted sequencing, targeted sequencing, polynucleotide probes, polymerase chain reaction amplification of the TSVs, and/or qPCR), depending on the context. In some embodiments, a patient-specific panel comprises a selected subset of TSVs. In some embodiments, each selected TSV of a patient-specific panel is predicted to have a high likelihood or probability of being detected in ctDNA derived from a patient.
Minimal Residual Disease (MRD)
“Minimal residual disease” may refer to any remaining disease (e.g., diseased cell or ctDNA) that may be present in a patient after the patient has received and/or completed a treatment for the disease. For example, minimal residual disease associated with cancer may be detected when cancer cells, or tumor-derived polynucleotides (e.g., tumor RNA, cell free tumor DNA and/or circulating tumor DNA (ctDNA)) are present in a patient after treatment. In some embodiments, MRD may be detected based on ctDNA detection before cancer relapse is detected using standard surveillance imaging (e.g., computerized tomography (CT), magnetic resonance imaging (MRI), or Positron Emission Tomography (PET)). Some cancer types may shed DNA (e.g., ctDNA), which may end up in the bloodstream of a patient. Thus, minimal residual disease may be monitored based on sequencing of ctDNA from biological samples (e.g., plasma). The likelihood or probability of determining an indication of minimal residual disease may increase over time (e.g., cancer cells that survive treatment may continue to replicate and/or metastasize, which may result in additional ctDNA shedding). Determining an indication of MRD may be based on the number and/or frequency of TSVs of the subset of the plurality of TSVs detected using the patient-specific panel. An indication of MRD may be an estimate of the likelihood and/or probability that MRD is present in the ctDNA plasma sample of a patient. The estimate of the likelihood and/or probability may be based on a statistical test. The statistical test may be a Poisson test, a Binomial test, a T-Test, or any other suitable statistical test.
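As a non-limiting illustration of one such statistical test, a one-sided Poisson test may compare the total number of TSV observations to the number expected from error alone. The sketch below computes the upper-tail probability directly from the Poisson probability mass function; the function name, the illustrative counts, and the significance level of 0.05 are assumptions made for this example.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Sum the probability mass strictly below the observed count.
    below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


# Illustrative counts: 35 total TSV observations against 20 expected from
# sample preparation and/or detection error.
p_value = poisson_upper_tail(35, 20.0)
positive_indication = p_value < 0.05  # significance level is an assumption
```

A small upper-tail probability suggests that the observed TSV count is unlikely to be explained by error alone.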
In some embodiments, determining an indication of MRD comprises determining if the number of times each TSV is observed (e.g., the number of times each TSV of a patient specific panel is detected) in sequence data of a biological sample of the patient exceeds the expected number of TSVs to be detected due to error associated with sample preparation (e.g., DNA extraction and amplification with primers of the patient-specific panel) and/or detection (e.g., sequencing). For example, if in the sequencing data of a biological sample of a patient, a first TSV is observed 10 times, a second TSV is observed 20 times, and a third TSV is observed 5 times then the total number of times TSVs are observed would be 35 times. In this example, if the expected number of TSVs to be detected due to sample preparation and/or detection error were 20, then there would be a positive indication of MRD. A positive indication of MRD may indicate that MRD is present in a patient (i.e., an MRD positive patient). A positive indication of MRD may be determined when TSVs of a patient-specific panel are detected in a biological sample of the patient. A positive indication of MRD may be determined and/or confirmed using standard surveillance imaging (e.g.,
computerized tomography (CT), magnetic resonance imaging (MRI), or Positron Emission Tomography (PET)), as described herein. A positive indication of MRD may be determined when the number of TSVs identified in a patient exceeds the number of TSVs expected to be observed due to error associated with sample preparation and/or detection. In some embodiments, a positive indication of MRD is determined when at least 1 TSV (e.g., at least 5 TSVs, at least 10 TSVs, at least 15 TSVs, at least 20 TSVs, at least 25 TSVs, at least 30 TSVs, at least 35 TSVs, at least 40 TSVs, at least 45 TSVs, or at least 50 TSVs) of the patient specific panel is detected in a biological sample of the patient. A negative indication of MRD may indicate that MRD is not present in a patient (i.e., an MRD negative patient). A negative indication of MRD may be determined in the absence of a positive indication of MRD. Methods of determining an indication of MRD are further described herein including in the section titled “Methods of Determining an Indication of MRD”.
Sequence Data
“Sequence data” may refer to data generated by sequencing nucleic acids in a biological sample (e.g., by using next-generation sequencing (NGS), nanopore-based sequencing or sequencing by synthesis) or obtaining sequence data of a biological sample by other means (e.g., quantitative polymerase chain reaction or hybridization of oligonucleotide probes). Sequence data may be collected using a suitable sequencing method and/or suitable sequencing equipment, which includes but is not limited to equipment manufactured by Illumina®, SOLiD®, Ion Torrent®, PacBio®, nanopore-based, Sanger sequencing or 454™. In some embodiments, sequencing data is generated using an NGS method. Sequence data may be collected using fluorescent probes that are designed to bind to a target polynucleotide (e.g., a polynucleotide comprising a TSV).
Sequence data may comprise whole exome sequence data (WES) or whole genome sequence data (WGS). Sequence data may comprise sequence reads of polynucleotide sequences in a biological sample derived from a patient (e.g., reads covering the plus strand and the minus strand of the polynucleotide sequences). Sequence reads may be encoded in any suitable format. A sequence read may encode a polynucleotide sequence that the sequence read represents. A sequence read may encode a polynucleotide sequence in any suitable way (e.g., as a sequence of characters, with characters representing respective nucleotides in the polynucleotide sequence, as a sequence of numbers, with numbers representing respective nucleotides in the polynucleotide sequence, etc.), as aspects of the technology described herein are not limited in this respect.
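As a non-limiting illustration of the two encodings mentioned above, a sequence read stored as a sequence of characters may be converted to a sequence of numbers and back. The particular mapping of nucleotides to integers is an arbitrary assumption made for this sketch.

```python
# Arbitrary, illustrative mapping of nucleotides to integer codes.
NUCLEOTIDE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_NUCLEOTIDE = {code: base for base, code in NUCLEOTIDE_TO_CODE.items()}


def encode_read(read: str) -> list[int]:
    """Encode a read given as characters into a list of integer codes."""
    return [NUCLEOTIDE_TO_CODE[base] for base in read.upper()]


def decode_read(codes: list[int]) -> str:
    """Decode a list of integer codes back into a character sequence."""
    return "".join(CODE_TO_NUCLEOTIDE[code] for code in codes)
```

Ambiguity codes such as "N" are not handled in this sketch and would raise a `KeyError`.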
The sequence data may comprise sequence reads of any suitable polynucleotide of the biological sample. The sequence data may comprise sequence reads of tumor-derived polynucleotides of the biological sample. The sequence data may comprise sequence reads of RNA of the biological sample. The sequence data may comprise sequence reads of DNA of the biological sample. The sequence data may comprise sequence reads of tumor DNA or tumor RNA of the biological sample. The sequence data may comprise sequence reads of cell free DNA (e.g., from healthy cells and/or tumor cells). The sequence data may comprise sequence reads of circulating tumor DNA (ctDNA) of the biological sample. The sequence data may comprise sequence reads of whole exome sequencing of the biological sample. The sequence data may comprise sequence reads of whole genome sequencing of the biological sample. The sequence data may comprise sequence reads that cover TSVs (e.g., TSVs of the subset of the plurality of TSVs). The sequence data may comprise sequence reads that were obtained using a targeted gene sequencing panel. In some embodiments, sequence data may refer to data that is used when identifying TSVs for use in a patient-specific panel. In these embodiments, sequence data may be whole genome sequencing data or whole exome sequence data (e.g., untargeted sequencing). These types of sequence data may be advantageous when identifying TSVs for use in a patient-specific panel at least because these types of sequence data broadly cover sequences from across the genome or exome and thus are favorable in identification of unknown TSVs in a patient and selecting TSVs for use in a patient-specific panel. In some embodiments, sequence data may refer to sequence data obtained using a patient-specific panel (e.g., targeted sequencing).
For example, sequence data may be obtained by (1) amplifying polynucleotides of a biological sample of the patient using primers of a patient-specific panel to produce amplicons and (2) sequencing the amplicons (e.g., using next-generation sequencing). These sequence data may be advantageous in determining an indication of MRD because the sequencing is focused on detecting known TSVs in a biological sample of a patient using a targeted approach (e.g., the patient-specific panel may be used to amplify specific polynucleotides that are expected to contain TSVs when MRD is present), which may increase sequencing depth and in turn may increase the probability of observing a TSV that is at a low allele frequency. A sequence read does not include a physical molecule but data representing the same. Thus, a reference to a nucleotide in a sequence read is a reference to information about a nucleotide (e.g., information representing the type of nucleotide – for example “A”, or “G”,
or “C” or “T”). A sequencing read of the sequence data may comprise hundreds to thousands of nucleotides, depending on the sequencing technique used. Sequence data may comprise tens of thousands to billions of sequencing reads. For example, sequence data may comprise at least 50,000 reads (e.g., at least 100,000 sequencing reads, at least 250,000 sequencing reads, at least 500,000 sequencing reads, at least 1,000,000 sequencing reads, at least 2,000,000 sequencing reads, at least 4,000,000 sequencing reads, at least 8,000,000 sequencing reads, at least 16,000,000 sequencing reads, at least 50,000,000 sequencing reads, at least 100,000,000 sequencing reads, at least 500,000,000 sequencing reads, or at least 1,000,000,000 sequencing reads). The sequence data may comprise at least 50,000 reads. The sequence data may comprise 50,000-250,000 reads. Obtaining sequence data may involve accessing at least one data structure, in memory, storing the sequence reads that are part of the sequence data. Sequence data may comprise sequence data of polynucleotides amplified using a patient-specific panel. For example, a patient-specific panel may be used to specifically sequence only certain polynucleotides from the biological sample (e.g., polynucleotides comprising TSVs of the subset of the plurality of TSVs) by using primers of the patient-specific panel to amplify specific polynucleotides (e.g., polynucleotides associated with loci that may comprise TSVs of the subset of the plurality of TSVs) and then sequencing the amplified polynucleotides. Sequencing data may be generated by sequencing nucleic acids in at least one biological sample. In some embodiments, sequencing data is generated by sequencing nucleic acids derived from two biological samples.
Non-limiting examples of nucleic acids in a sample include tumor-derived polynucleotides, circulating nucleic acids (e.g., cellular or acellular nucleic acids), cellular nucleic acids, acellular or cell-free nucleic acids, circulating cell-free nucleic acids, RNA (e.g., mRNA), cell-free RNA (cfRNA), circulating cfRNA, cell-free DNA (cfDNA), circulating cfDNA, tumor RNA, cell-free tumor RNA, circulating cell-free tumor RNA, tumor DNA, cell-free tumor DNA, circulating cell-free tumor DNA, circulating tumor DNA (ctDNA), the like and combinations thereof. In some embodiments, sequencing data is generated by sequencing nucleic acids derived from one or more tumor cells, and/or nucleic acids derived from one or more normal cells. In some embodiments, sequence data is obtained by a suitable method comprising hybrid capture or array capture.
Biological Samples
“Biological sample(s)” may refer to one or more specimens collected from the patient. The biological sample(s) may comprise any cell, tissue, biological fluid, and/or bone from a patient, or any other suitable biological sample from the patient. The biological sample(s) may comprise tumor cells and/or non-tumor cells from the patient (e.g., a tumor cell sample and separate non-tumor cell sample). The tumor cells may be collected from any tumor or cancer of the patient including, but not limited to lung cancer, brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, or colon cancer. The tumor cells may be collected from a solid tumor. The tumor cells may be collected from a melanoma tumor. The tumor cells may be collected from a lung tumor. The non-tumor cells may be collected from healthy tissue of the same type as the tumor. For example, if the tumor cells collected are liver tumor cells then the healthy tissue is healthy liver. The non-tumor cells may be collected from a healthy tissue that is different from the tumor tissue collected. For example, if the tumor cells collected are liver tumor cells then the healthy tissue may be a healthy lung. The non-tumor cells may be collected from a blood sample (e.g., plasma). In some embodiments, the tumor cells may be collected from the patient and the non-tumor cells may be collected from a healthy subject (e.g., collecting non-tumor cells from a healthy subject by a third party). In some embodiments (e.g., embodiments related to identifying variants for use in a patient-specific panel), the biological sample may comprise tumor cells of the patient (e.g., comprise a portion of a tumor of the patient) and/or the biological sample may comprise non-tumor cells of the patient.
The tumor cells of the patient are expected to comprise TSVs, thus sequencing these cells is expected to identify TSVs for use in the patient-specific panel. The non-tumor cells of the patient are not expected to comprise TSVs, thus sequencing from non-tumor cells may be helpful in distinguishing between TSVs and non-tumor specific somatic variants. In some embodiments (e.g., embodiments related to determining an indication of MRD using a patient-specific panel), the biological sample may be a sample that is expected to comprise tumor-derived polynucleotides (e.g., as described herein). Less invasive methods of monitoring MRD may be advantageous to promote patient comfort and simplify biological sample collection. Thus, biological samples for use in methods of determining an indication of MRD may be collected from bodily fluids, as described herein. In some embodiments, biological samples for use in methods of determining an indication of MRD may be collected from blood (e.g., plasma).
In some embodiments, the biological sample may be stored using cryopreservation. Non-limiting examples of cryopreservation include step-down freezing, blast freezing, direct plunge freezing, snap freezing, slow freezing using a programmable freezer, and vitrification. In some embodiments, the biological sample may be stored using lyophilization. In some embodiments, a biological sample may be placed into a container that already contains a preservant (e.g., RNALater to preserve RNA) and then frozen (e.g., by snap-freezing), after the collection of the biological sample from the patient. In some embodiments, such storage in frozen state may be done immediately after collection of the biological sample. In some embodiments, a biological sample may be kept at either room temperature or 4 °C for some time (e.g., up to an hour, up to 8 h, or up to 1 day, or a few days) in a preservant or in a buffer without a preservant, before being frozen. Non-limiting examples of preservants include formalin solutions, formaldehyde solutions, RNALater or other equivalent solutions, TriZol or other equivalent solutions, DNA/RNA Shield or equivalent solutions, EDTA (e.g., Buffer AE (10 mM Tris·Cl; 0.5 mM EDTA, pH 9.0)) and other anticoagulants, and Acid-Citrate-Dextrose (e.g., for blood specimens). In some embodiments, special containers may be used for collecting and/or storing a biological sample. For example, a vacutainer may be used to store blood. In some embodiments, a vacutainer may comprise a preservant (e.g., a coagulant, or an anticoagulant). In some embodiments, a container in which a biological sample is preserved may be contained in a secondary container, for the purpose of better preservation, or for the purpose of avoiding contamination. Any of the biological samples from a patient described herein may be stored under any condition that preserves stability of the biological sample.
In some embodiments, the biological sample may be stored at a temperature that preserves stability of the biological sample. In some embodiments, the sample may be stored between 18 and 28 °C (e.g., 25 °C). In some embodiments, the sample may be stored under refrigeration (e.g., 4 °C). In some embodiments, the sample is stored under freezing conditions (e.g., -20 °C). In some embodiments, the sample may be stored under ultralow temperature conditions (e.g., -50 °C to -80 °C). In some embodiments, the sample may be stored under liquid nitrogen (e.g., -170 °C). In some embodiments, a biological sample may be stored at -60 °C to -80 °C (e.g., -70 °C) for up to 5 years. In some embodiments, a biological sample may be stored at -60 °C to -80 °C (e.g., -70 °C) for up to 1 month, up to 2 months, up to 3 months, up to 4 months, up to 5 months, up to 6 months, up to 7 months, up to 8 months, up to 9 months, up to 10 months, up
to 11 months, up to 1 year, up to 2 years, up to 3 years, up to 4 years, or up to 5 years. In some embodiments, a biological sample may be stored as described by any of the methods described herein for up to 5 years, up to 10 years, up to 15 years, or up to 20 years.
Variant
A “variant” may refer to a mutation or genetic variation present, or suspected to be present (e.g., suspected based on analysis of sequencing data) in a first genome compared to a second genome. In some embodiments, a variant is a mutation in a genome present in a patient (which genome may include nucleic acids derived from tumor cells and/or non-tumor cells) compared to a standard genome or reference genome (e.g., GRCh38 or hg19, or the like). In some embodiments a variant is a mutation in a genome of a tumor cell of a patient as compared to the genome of healthy cells or non-cancerous cells of the patient. In some embodiments, a variant is a tumor-specific variant. In some embodiments, a variant is not a tumor-specific variant. For example, in certain embodiments, variant data may falsely indicate a presence of a variant in a genome of a tumor, where the variant was introduced by a polymerase error or a sequencing read error. As another example, in certain embodiments, variant data may indicate a presence of a variant (e.g., a single nucleotide difference) in a genome of a tumor derived from a patient compared to a reference genome, where the variant is not tumor-specific because the same variant is also present in a non-tumor cell derived from the patient. Variants may be of different variant types. Non-limiting examples of variant types include single nucleotide mutations, two or more single nucleotide mutations (e.g., 2, 3, 4 or more single nucleotide mutations), insertions, deletions, translocations, inversions, duplications, or a mutation resulting from a combination thereof.
In some embodiments a single nucleotide mutation is a single nucleotide substitution, single nucleotide deletion or single nucleotide insertion. In some embodiments, a single nucleotide mutation is a somatic mutation. A variant may be a genetic variation or mutation having a length of less than 1000 base pairs (bp), less than 500 bp, less than 250 bp or less than 50 bp. In some embodiments, a variant is a genetic variation or mutation having a length in a range of 1 to 50 bp, 1 to 20 bp or 1 to 10 bp. In some embodiments, a variant comprises two or more mutations that are immediately adjacent, and/or that are separated by 1 or more intervening nucleotides. Variants may be identified from sequence data derived from a biological sample(s) of a patient using any suitable method including GNUmap, GATK, SOAPsnp, SAMTools, SNVer, TRELKA, EBcall, MuTect, ADIScan1, ADIScan2 and SomaticSniper e.g., as
described in Cho et al. (2018) Nucleic Acids Research 46(15):e92. In some embodiments, variant calling is performed according to best practices as described in Koboldt, D.C. (2020) Genome Med 12:91.
Variant Data
“Variant data” may refer to data (e.g., genetic data) indicating a presence or absence of variants in a biological sample of a patient and may comprise various types of data and/or information about the variants. In particular, variant data may include variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, variant primer data, and/or any other suitable type of data about the variants. Each of these types of data may include data for some or all of the variants indicated by the variant data. For example, variant type data may specify a variant type for some or all of the variants indicated by the variant data. In some embodiments, variant genomic location data includes, for each variant, data indicative of the location of the variant in a genome (e.g., the location in a standard genome or the genome of the patient). Variant genomic location data may include a chromosomal location of the variant or a locus of the variant. Variant genomic location data may be in any suitable format, as aspects of the technology described herein are not limited in this respect. In some embodiments, variant type data includes, for each variant, data indicating the type of the variant (e.g., a single nucleotide mutation, an insertion, a deletion, a translocation, an inversion, a duplication, or any other type of mutation resulting from a combination thereof). Variant type data may be in any suitable format, as aspects of the technology described herein are not limited in this respect.
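As a non-limiting illustration, variant data of the kinds listed above may be held in a simple per-variant record. The class name, field names, and optional fields below are assumptions made for this sketch; the disclosure does not limit variant data to any particular format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    """Hypothetical container for per-variant data; all names illustrative."""
    chromosome: str                 # variant genomic location data
    position: int                   # 1-based position on the chromosome
    variant_type: str               # e.g., "snv", "insertion", "deletion"
    reference_allele: str           # variant sequence data (wildtype)
    alternate_allele: str           # variant sequence data (mutant)
    sequence_context: Optional[str] = None    # e.g., trinucleotide context
    depth: Optional[int] = None               # sequencing depth at the locus
    allele_frequency: Optional[float] = None  # variant allele frequency
    error_rate: Optional[float] = None        # sequencing error rate


def variants_of_type(records: list[VariantRecord], variant_type: str) -> list[VariantRecord]:
    """Return the records whose variant type matches `variant_type`."""
    return [rec for rec in records if rec.variant_type == variant_type]
```

Fields left as `None` reflect that each type of data may be present for some or all of the variants.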
In some embodiments, variant sequence data includes, for each variant, data indicating a sequence of the variant. For example, a single nucleotide mutation variant sequence may be indicated by a wildtype trinucleotide context (AAA) and a mutant sequence (ATA), where the variant is an A>T mutation. Variant sequence data may be in any suitable format, as aspects of the technology described herein are not limited in this respect. In some embodiments, variant sequence context data includes, for each variant, data indicating the sequence context surrounding the variant (e.g., sequence context conservation and splice sites in the sequence context). Variant sequence context data may include sequences of the polynucleotides comprising the variants (e.g., the sequence contexts associated with each variant and/or the loci associated with each variant). Variant sequence
context data may be in any suitable format, as aspects of the technology described herein are not limited in this respect. In some embodiments, variant sequence depth includes, for each variant, data indicating the number of sequencing reads covering the locus comprising a given variant in sequence data obtained from a sample comprising tumor cells and/or non-tumor cells of the patient. In some embodiments, variant allele frequency includes, for each variant, data indicating the number of times a variant is observed in sequence data (e.g., sequence data of a biological sample comprising tumor cells and/or non-tumor cells) at a given locus divided by the number of times the locus is observed in the sequence data (e.g., the number of times any allele is observed at the locus). In some embodiments, variant sequencing error rate includes, for each variant, an error rate of the sequencing apparatus during generation of the sequence data. In some embodiments, variant primer data includes, for each variant, data about primers that are designed to amplify a polynucleotide comprising that variant. Variant primer data may include primer binding site genomic location; primer sequence; primer length; primer melting temperature (e.g., for a portion of a primer or entire length of a primer); primer propensity for secondary structure; a score from a primer design algorithm (e.g., Primer3 and Primer-BLAST); site for a primer designed to detect the TSV. Generating “Generating” as used in the context of generating a feature may refer to calculating or determining the feature (e.g., determining using variant data), obtaining the feature (e.g., from the variant data, a primer design algorithm or any other suitable source) or obtaining a previously determined and stored value. Sequence Context A “sequence context” may refer to the nucleotides on either side of the TSV in the primary sequence of the genome of the patient within a given range of nucleotides. 
For example, a sequence context may refer to the nucleotides in the same locus as the TSV. In another example, a sequence context may refer to the nucleotides within 1, 2, 3, 5, 10, 20, 50, 100, 150, 200, 250, 300, 350, 400, 450 or more nucleotides (upstream and/or downstream) of a variant. A sequence context may refer to the nucleotides within 50 nucleotides (upstream and/or downstream) of a variant.
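As a minimal sketch of the windowed definition above, the following Python function returns the nucleotides flanking a variant within a chosen number of bases. The function name `sequence_context`, the 0-based coordinate convention, and the toy reference string are illustrative assumptions and are not part of the described method.

```python
def sequence_context(reference: str, pos: int, flank: int = 50, var_len: int = 1):
    """Return (upstream, downstream) nucleotides within `flank` bases of a variant.

    `pos` is the 0-based index of the first variant base in `reference`
    (an assumed convention); `var_len` is the variant length in bases.
    """
    if not 0 <= pos < len(reference):
        raise ValueError("variant position outside reference")
    start = max(0, pos - flank)                       # clip at the sequence start
    end = min(len(reference), pos + var_len + flank)  # clip at the sequence end
    return reference[start:pos], reference[pos + var_len:end]


# Toy example: a 3-nucleotide context on either side of the base at index 7.
upstream, downstream = sequence_context("ACGTACGTTTGACCA", pos=7, flank=3)
```

The window is clipped at the ends of the reference, so a variant near a sequence boundary simply yields a shorter context on that side.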
Tumor Specific Variants (TSVs) and Features For Selecting TSVs “Tumor specific variants (TSVs)” may refer to variants present in tumor cells collected from a patient (e.g., as compared to non-tumor cells collected from the same patient). For example, a TSV may be a variant that is present in tumor cells and not present in non-tumor cells. As another example, a TSV may be a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in a biological sample of tumor cells of a patient as compared to a biological sample of non-tumor cells of the patient. In some embodiments, a TSV is a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in tumor cells derived from a patient as compared to non-tumor cells derived from the patient. In another example, a TSV may be a variant that is present in tumor cells and not present in a genomic database of healthy individuals (e.g., gnomAD, 1000 genomes, and ExAC populations). Therefore, in some embodiments, a TSV may not be a polymorphism found within a population of healthy individuals (e.g., a single nucleotide polymorphism (SNP)). In another example, a TSV may be a variant that is present at a higher allele frequency (e.g., the frequency is 2, 3, 4, 5, 10, etc. times as high) in tumor cells of a patient as compared to a healthy population allele frequency (e.g., as determined from a genomic database of healthy individuals, such as gnomAD, 1000 genomes, and ExAC populations). In some embodiments, TSVs for a patient may be identified by: (1) obtaining variant data; and (2) identifying TSVs from among the variants in the variant data using features. 
Such features may include one or more of variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence context indel, neighboring variants, static variants, primer flags, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency. In some embodiments, the values of the features may be used to select the TSVs. For example, the values may be compared to certain thresholds or otherwise used (e.g., as part of more complex logic, such as rules or even machine learning models) to select the TSVs. Variant Bi-directional Support Variant bi-directional support may refer to the number of times a variant is observed in plus strand sequencing reads and minus strand sequencing reads of the variant data of the tumor cells (e.g., a biological sample of the tumor cells of the patient). Variant bi-directional
support may be calculated by determining the minimum of (1) the number of plus strand reads that cover a variant in the tumor cell sample of the patient and (2) the number of minus strand reads that cover the variant in the tumor cell sample of the patient. For example, if there are 8 plus strand reads covering the variant and 10 minus strand reads covering the variant then variant bi-directional support would be 8. In some embodiments, the variant may be identified as a TSV when variant bi-directional support exceeds a threshold. Variant bi-directional support may be indicative of the detectability of the variant by sequencing, specifically detectability when using both plus strand reads and minus strand reads, which may increase confidence of detection. Additional description of variant bi-directional support can be found herein including with reference to FIG.2A. Healthy Population Variant Allele Frequency “Healthy population variant allele frequency” may refer to the allele frequency of a variant in a healthy population, as defined by at least one genomic database. Healthy population variant allele frequency may be determined by obtaining the allele frequency of a variant in a database of healthy individuals (e.g., gnomAD, 1000 genomes, and ExAC populations). In other words, healthy population allele frequency may be used to identify a variant(s) that is likely not a TSV because the variant is found in a healthy population above a threshold allele frequency (indicating that the variant is not tumor specific). Additional description of healthy population variant allele frequency can be found herein and with reference to FIG.2A. Sequence Context Homopolymer Size “Sequence context homopolymer size” may refer to the length and/or location of a homopolymer sequence (e.g., whether the homopolymer is located between a variant and a binding site of a primer designed to detect presence of the variant in the genome of the tumor cells). 
A homopolymer may refer to a series of consecutive nucleotides in a polynucleotide all of the same type (e.g., AAAAA represents a homopolymer of 5 nucleotides). Sequence context homopolymer size may be used to identify variants that are not TSVs based on location of the homopolymer and the length of the homopolymer exceeding a threshold. Homopolymers above a given length may interfere with amplification and/or sequencing of the polynucleotide comprising the variant and thus affect the detectability of a variant. Additional description of sequence context homopolymer size can be found herein and with reference to FIG.2A.
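The homopolymer check described above can be sketched as a scan for the longest run of identical nucleotides in a sequence context. This is only an illustration: the function names and the default threshold of 5 are hypothetical, and the source does not specify a particular threshold or implementation.

```python
from itertools import groupby


def longest_homopolymer(seq: str):
    """Return (base, length, start_index) of the longest single-base run in `seq`."""
    best = ("", 0, -1)
    idx = 0
    for base, run in groupby(seq.upper()):
        n = sum(1 for _ in run)   # length of this run of identical bases
        if n > best[1]:
            best = (base, n, idx)
        idx += n
    return best


def exceeds_homopolymer_threshold(seq: str, max_len: int = 5) -> bool:
    """True if the longest homopolymer is longer than `max_len` (assumed threshold)."""
    return longest_homopolymer(seq)[1] > max_len


run = longest_homopolymer("GCAAAAATC")  # the AAAAA run starts at index 2
```

Returning the start index alongside the length allows a caller to also test whether the run falls between a primer binding site and the variant, which the definition above treats as relevant.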
Sequence Context Indel A “Sequence Context Indel” may refer to the distance (e.g., in nucleotides) between (1) an insertion or deletion (indel) mutation that is located within the sequence context of the variant, and (2) the variant. A sequence context indel may be calculated by determining the distance between the 3’ or 5’ end of the indel and the variant (if the variant is a single nucleotide variant) or the 3’ or 5’ end of the variant (if the variant comprises more than one nucleotide). In some embodiments, the variant may be a TSV when either (1) no indel is located within the sequence context of the variant and/or (2) an indel is not located within a threshold distance of the variant. Additional description of sequence context indel can be found herein and with reference to FIG.2A. Neighboring Variants “Neighboring Variants” may refer to the number of neighboring variants (i.e., other variants that are not the variant currently being potentially identified as a TSV) that are within the sequence context of the variant or within a threshold distance (e.g., in nucleotides) of the variant. For example, a sequence context of a variant may comprise two neighboring variants within 50 nucleotides of the variant. Neighboring variants may be calculated by counting the number of variants within the sequence context of the variant or within a specified distance of the variant. In some embodiments, the variant may be a TSV when the number of neighboring variants is less than a threshold. Additional description of neighboring variants can be found herein and with reference to FIG.2A. Static Variants “Static variants” may refer to the number of normal samples (i.e., sequencing data of normal samples) that the variant is observed in. For example, if the variant is observed in sequence data collected from two normal samples, then the static variant would be two. The variant may be a TSV when the number of static variants is less than a threshold. 
Additional description of static variants can be found herein and with reference to FIG.2A. Primer Flags “Primer Flags” may refer to characteristics of a primer (e.g., a primer of a patient-specific panel) that may indicate the detectability of a variant (e.g., TSV) using the primer. Primer flags may include, but are not limited to:
a homopolymer sequence that exceeds a threshold length found in the primer sequence; a homopolymer sequence that exceeds a threshold length located between the binding site of the primer and the variant (e.g., TSV) (e.g., see Sequence Context Homopolymer Size); “TA” nucleotide repeats that exceed a threshold number of consecutive repeats present in the sequence expected to be amplified by the primer in a PCR reaction (e.g., a PCR reaction comprising a primer and a corresponding primer for use in amplifying a polynucleotide comprising the variant); and a percentage of guanine and cytosine nucleotides within a threshold distance of the variant (e.g., a threshold distance of 40 nucleotides upstream and/or downstream of the variant) being greater than a threshold percentage (e.g., 80%). A variant may be a TSV when the number of primer flags for a primer for use in detecting a variant is less than a threshold. Additional description of primer flags can be found herein and with reference to FIG.2A. Sequence Coverage in Non-tumor Cells “Sequence coverage in non-tumor cells” may refer to the number of sequencing reads covering a locus of a variant in the non-tumor cells of the patient (e.g., the number of sequencing reads covering the locus regardless of the allele in the locus). Sequence coverage in non-tumor cells may be calculated by determining the number of sequencing reads that cover a polynucleotide at a locus that has been observed to comprise a variant of the patient (e.g., in non-tumor cells of the patient). For example, 50X coverage refers to 50 sequencing reads covering a locus (e.g., a locus in a tumor cell of a patient observed to comprise a variant). In some embodiments, the variant may be a TSV when sequence coverage in non-tumor cells exceeds a threshold. 
Sequencing coverage in non-tumor cells may be indicative of the detectability of the variant because low coverage in non-tumor cells may indicate difficulty in amplifying and/or sequencing the variant. Additional description of sequence coverage in non-tumor cells can be found herein and with reference to FIG.2A. Ratio of Variant Allele Frequency between Tumor Cells and Non-tumor Cells The “ratio of variant allele frequency between tumor cells and non-tumor cells” may be calculated by (1) determining the allele frequency of the variant in the tumor cells (e.g., a biological sample comprising tumor cells of the patient), (2) determining the allele frequency
of the variant in non-tumor cells (e.g., a biological sample comprising non-tumor cells of the patient), and (3) calculating a ratio between (1) and (2). In other embodiments, allele frequency may be calculated by dividing the total number of times a specific allele is observed at a locus (e.g., an allele comprising the variant), by the total number of times that locus is observed in the sequence data (e.g., the number of observations of the allele comprising the variant plus the number of observations of all the other alleles at that locus). In some embodiments, the variant may be a TSV when the ratio of variant allele frequency between tumor cells and non-tumor cells exceeds a threshold. Additional description of ratio of variant allele frequency between tumor cells and non-tumor cells can be found herein and with reference to FIG.2A. Tumor Cell Variant Allele Frequency Tumor cell variant allele frequency may be calculated based on (1) the total number of times the variant allele is observed in sequence data of the tumor cells (e.g., a biological sample comprising tumor cells of the patient) and (2) the total number of times the locus comprising the variant allele is observed in sequence data of the tumor cells of the patient (e.g., the variant allele observations plus all the other allele observations at that locus). Tumor cell variant allele frequency may be calculated by dividing (1) by (2) above. In some embodiments, the variant may be a TSV when tumor cell variant allele frequency exceeds a threshold. Additional description of tumor cell variant allele frequency can be found herein and with reference to FIG.2A. Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset In some embodiments, designing a patient-specific panel comprises (1) obtaining variant data, (2) identifying a plurality of TSVs and (3) identifying a subset of the plurality of TSVs for inclusion in the patient-specific panel. 
TSVs of the subset of the plurality of TSVs may be identified using features associated with the TSVs. For example, the features may include at least one sequence coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature. As described herein including with reference to FIG.2A and FIG.3, features corresponding to at least some of the TSVs may be processed using a trained ML model to produce scores that are indicative of the detectability of TSVs, which in turn may be used to select the subset of the plurality of TSVs for use in the patient-specific panel.
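The threshold-based identification of the plurality of TSVs described in the preceding sections (the step that precedes selection of the subset) can be sketched as follows. This sketch covers only three of the listed features: variant bi-directional support, tumor cell variant allele frequency, and the tumor to non-tumor allele frequency ratio, together with non-tumor coverage. The class name, field names, and every threshold value are hypothetical placeholders; the source does not state specific threshold values here.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantCounts:
    tumor_alt_plus: int    # plus-strand tumor reads covering the variant
    tumor_alt_minus: int   # minus-strand tumor reads covering the variant
    tumor_depth: int       # all tumor reads at the locus, any allele
    normal_alt: int        # non-tumor reads carrying the variant
    normal_depth: int      # all non-tumor reads at the locus, any allele


def bidirectional_support(v: VariantCounts) -> int:
    """Minimum of plus-strand and minus-strand variant read counts."""
    return min(v.tumor_alt_plus, v.tumor_alt_minus)


def allele_frequency(alt: int, depth: int) -> float:
    """Variant observations divided by all observations at the locus."""
    return alt / depth if depth else 0.0


def vaf_ratio(v: VariantCounts) -> float:
    """Tumor allele frequency divided by non-tumor allele frequency."""
    tumor = allele_frequency(v.tumor_alt_plus + v.tumor_alt_minus, v.tumor_depth)
    normal = allele_frequency(v.normal_alt, v.normal_depth)
    if normal == 0.0:
        return math.inf if tumor > 0.0 else 0.0  # variant absent from non-tumor reads
    return tumor / normal


def passes_thresholds(v, min_support=2, min_tumor_vaf=0.05,
                      min_ratio=5.0, min_normal_depth=20) -> bool:
    """Apply illustrative (assumed) thresholds to decide whether to keep a variant."""
    tumor_vaf = allele_frequency(v.tumor_alt_plus + v.tumor_alt_minus, v.tumor_depth)
    return (bidirectional_support(v) > min_support
            and tumor_vaf > min_tumor_vaf
            and vaf_ratio(v) > min_ratio
            and v.normal_depth > min_normal_depth)


# The 8 plus-strand / 10 minus-strand example from the text gives a support of 8.
candidate = VariantCounts(8, 10, 100, 0, 60)
shared = VariantCounts(8, 10, 100, 9, 60)  # also frequent in non-tumor reads
```

Treating a zero non-tumor allele frequency as an infinite ratio is one possible convention; a pseudocount would be an equally reasonable choice.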
Sequencing Coverage Features “Sequencing coverage features” may refer to features that are based on the number of sequencing reads (e.g., raw Illumina® sequencing reads) covering a variant and/or a locus of a variant, such as a TSV (e.g., in a biological sample of normal cells or tumor cells of the patient). Sequencing coverage features may include, but are not limited to, sequencing depth of coverage of plus strands and minus strands for a TSV (i.e., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for a TSV (i.e., strand bias). Additional description of sequencing coverage features can be found herein and with reference to the section entitled “Generating Sequence Coverage Features”. Allele Frequency Features “Allele frequency features” may refer to features that are based on or derivative of the allele frequency of a variant or TSV in tumor cells of the patient, non-tumor cells of the patient, or a database comprising genome sequences from healthy individuals and/or individuals having a disease (e.g., cancer). Allele frequency may be calculated by dividing the number of times a variant allele is present at a locus in the variant data by the number of times the locus (with any variant) is observed in variant data. Allele frequency features may include, but are not limited to, non-tumor cell depth coverage of a TSV, number of observations of a TSV in tumor cells of the patient (i.e., tumor cell alternate observations), and/or a tumor allele frequency of the TSV. Additional description of allele frequency features can be found herein and in the section “Generating Allele Frequency Features”. 
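The two strand-based sequencing coverage features named above (minimum strand coverage and strand bias) can be illustrated with a short sketch. The function name, the dictionary keys, the plus-over-minus direction of the ratio, and the handling of zero depth are all assumptions made for illustration.

```python
import math


def strand_coverage_features(plus_depth: int, minus_depth: int) -> dict:
    """Compute minimum strand coverage and strand bias for one TSV.

    Strand bias is taken here as plus-strand depth over minus-strand depth;
    the direction of the ratio is an assumption.
    """
    if minus_depth:
        bias = plus_depth / minus_depth
    elif plus_depth:
        bias = math.inf   # reads on the plus strand only
    else:
        bias = math.nan   # no coverage on either strand
    return {
        "min_strand_coverage": min(plus_depth, minus_depth),
        "strand_bias": bias,
    }


features = strand_coverage_features(plus_depth=120, minus_depth=80)
```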
Nucleotide Context (NC) Error Rate Feature In some embodiments, methods herein comprise determining a sequencing error rate (e.g., a value representing the rate of an incorrect nucleotide being identified at a position; incorrect nucleotides may be identified at a position due to events that take place during sample collection, preparation, sequencing, post-sequence analysis or any other occasion in which the sample or data is manipulated) by monitoring error rates in nucleotides or groups of nucleotides (i.e. nucleotide context (NC)). In some embodiments generating a set of features (e.g., a first set of features) for a first TSV comprises determining for each TSV of a subset of the plurality of TSVs a nucleotide context error rate. A nucleotide context (NC) refers to a series of sequential nucleic acids with specific bases in a nucleic acid sequence or a sequence read. In some embodiments error rates in single nucleotides (single nucleotide
context) are monitored. In some embodiments error rates in groups of two nucleotides (dinucleotide context), three nucleotides (trinucleotide context), four nucleotides (four nucleotide context), five nucleotides (five nucleotide context), six nucleotides (six nucleotide context) or more are monitored. In some embodiments, error rates in groups of trinucleotide context are monitored as described herein. In turn, the estimated sequencing error rate may be compared to the actual number of mutations observed in the positions being monitored for mutations to determine an indication of MRD. In some embodiments, this technique involves estimating sequencing error from sequencing results at positions not being monitored for cancer-associated mutations (the collection of such sequence read positions may be termed “background regions” herein). In some embodiments, coverage and/or resolution play a significant role in determining an optimal context size (e.g., NC) for determining error rates, for example where coverage refers to a maximum number of observations for an error rate context, on average, given a depth of sequencing for the sample, and where resolution refers to a total number of error rate contexts of a given size. Sometimes, a larger context size yields more contexts, following the formula (N = 3 * 4^k) where "k" is the context size, for example. More contexts (i.e., higher resolution) often allow for more accurate estimation of an error rate that is driven by the bias of the sequence surrounding a variant. This sometimes comes at a direct and proportional cost of potential coverage (Depth / N). For example, at an example minimum depth of 10,000 reads for a sample, a trinucleotide context has a theoretical potential to detect error rates down to 1/52 (1.9%) on average while still increasing the overall resolution vs. di- or mono-nucleotide contexts. 
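The trade-off between resolution and coverage stated above can be checked with a few lines of arithmetic. The sketch below evaluates N = 3 * 4^k and Depth / N for context sizes 1 through 3 at the example depth of 10,000 reads; the function names are illustrative only.

```python
def num_contexts(k: int) -> int:
    """Number of error-rate contexts for context size k, per N = 3 * 4^k."""
    return 3 * 4 ** k


def mean_coverage_per_context(depth: int, k: int) -> float:
    """Average number of observations available per context (Depth / N)."""
    return depth / num_contexts(k)


depth = 10_000
for k in (1, 2, 3):
    n = num_contexts(k)
    cov = mean_coverage_per_context(depth, k)
    # The lowest error rate resolvable on average is about one event per `cov` reads.
    print(f"k={k}: N={n}, coverage per context={cov:.1f}, min rate={1 / cov:.4f}")
```

For k = 3 this gives N = 192 contexts and roughly 52 observations per context, which matches the 1/52 (1.9%) figure quoted above; the mono- and dinucleotide cases give 12 and 48 contexts respectively.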
While any suitable NC length can be used for a method herein, the inventors herein have found that a trinucleotide context is often an optimal context size that yields acceptable detectable error rates across many sequencing depths. Although embodiments, examples, claims and drawings herein often refer to trinucleotide context (TNC) error rate, it is understood that the methods described in the embodiments, examples, claims and drawings herein can be performed using other suitable nucleotide contexts (e.g., single nucleotide context (SNC), dinucleotide context (DNC), trinucleotide context (TNC), four-nucleotide context, five-nucleotide context, six-nucleotide context, and the like). Trinucleotide Context (TNC) Error Rate Feature
A “TNC error rate feature” or “Error rate in error corrected bins” may refer to the estimated probability of observing a TSV in the middle position of the TNC due to errors introduced during sample preparation and/or sequencing of a biological sample (e.g., the biological sample of the patient). As used herein, “a TNC” may refer to a series of three sequential nucleotides in a sequence read (e.g., AAA, TAT, GTA, etc.). The TNC may comprise a variant (e.g., a TSV) in the middle position of the TNC. Additional description of the trinucleotide context (TNC) error rate feature can be found herein and with reference to the section entitled “Generating Trinucleotide Context (TNC) Error Rate.” C to A Variant Mutation Feature As used herein, “a C to A variant mutation feature” may refer to an indicator (e.g., a binary indicator) for whether the variant is a C to A mutation. For example, a variant mutation from C to A may be indicated with a “1” and any other mutation (e.g., C to T) may be indicated by a “0”. Additional description of the C to A variant mutation feature can be found herein and with reference to the section entitled “Generating a C to A Variant Mutation Feature”. Primer Features “Primer features” may refer to one or more features associated with a set of primers designed to detect (e.g., amplify for detection) a variant or TSV. 
Primer features may include, but are not limited to, primer genome location; primer sequence; primer melting temperature; primer propensity for secondary structure; a score from a primer design algorithm; a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a first primer designed to detect the TSV (e.g., the distance between the TSV and the 3’ end or 5’ end of the binding site for a first primer); and distance (e.g., measured in number of nucleotides) between the TSV and binding site for a second primer (e.g., the distance between the TSV and the 3’ end or 5’ end of the binding site for a second primer), different from the first primer, designed to detect the TSV. In some embodiments, one or more primer features are determined for all or a portion of a primer (e.g., a portion of a primer that initially anneals to a target or gene sequence). Additional description of the primer features can be found herein and with reference to the section entitled “Generating Primer Features”. A primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for, a method herein (e.g., a multiplex amplification and/or sequencing reaction) may comprise a suitable length. A suitable length may be determined by a method described
herein. In some embodiments, a portion of a primer configured to initially anneal to a target sequence (e.g., adaptor primer sequence, target-specific sequence, gene-specific sequence) comprises a length in a range of 8 to 60 nucleotides, 10 to 50 nucleotides, 15 to 45 nucleotides, or 18 to 41 nucleotides. In some embodiments, a primer comprises a 5' tail or one or more additional 5' sequences (e.g., barcode, identifier sequences, random sequences, adaptor sequences, common primer sites, sequencing primer sites, and/or the like). In some embodiments, an additional sequence of a primer comprises a length in a range of 1 to 60 nucleotides. In some embodiments, an entire length of a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for a method herein is in a range of 10 to 150 nucleotides, 20 to 100 nucleotides, or 30 to 75 nucleotides. A primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for, a method herein (e.g., a multiplex amplification and/or sequencing reaction) may comprise a suitable Tm. A suitable Tm may be determined by a method described herein. In some embodiments, a portion of a primer configured to initially anneal to a target sequence (e.g., adaptor primer sequence, target-specific sequence, gene-specific sequence) comprises a Tm in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C. In some embodiments, a primer comprises a 5' tail or one or more additional 5' sequences (e.g., barcode, identifier sequences, random sequences, adaptor sequences, common primer sites, sequencing primer sites, and/or the like). In some embodiments, an additional sequence of a primer comprises a Tm in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C. 
In some embodiments, an entire length of a primer, each primer of a primer pair, or each primer of a set of primers identified by, or used for a method herein comprises a Tm in a range of 30 to 85 °C, 60 to 80 °C, or 65 to 75 °C. In some embodiments, a subset of primers or all primers of a set of primers identified by, or used for, a method herein (e.g., a multiplex amplification and/or sequencing reaction) may comprise the same Tm or similar melting temperatures as determined for the entire length of, or target-specific portions of the primers. For example, a subset of primers or all primers of a set of primers identified by, or used for, a method herein may have an average Tm with a standard deviation of no more than 20°C, 10°C, 5°C, or 2°C. In some embodiments, any one primer of a subset or set of primers identified by, or used for, a method herein comprises a Tm that differs by no more than 10°C, 5°C, or 2°C from any other primer in the subset or set of primers. Sequence Context Features
“Sequence context features” may refer to sequence features that are within the sequence context of a TSV. Sequence context features may include, but are not limited to, a conservation score of the sequence context comprising the TSV, a distance between the TSV and a nearest splice site in the sequence context, and/or a splice site score of the sequence context (e.g., a score indicating that a splice site is located within the sequence context). Sequence context features may indicate an ability to amplify the TSV for detection. For example, a sequence context comprising a splice site indicates that different size amplicons comprising different sequences may be produced using the same set of primers due to alternative splicing. Additional description of the sequence context features can be found herein and with reference to the section entitled “Generating Sequence Context Features”. Tumor-derived Polynucleotide A “Tumor-derived polynucleotide” may refer to a polynucleotide that was or is part of a tumor cell (e.g., a tumor cell of the patient). A tumor-derived polynucleotide may include, but is not limited to, tumor RNA, cell-free tumor RNA, circulating cell-free tumor RNA, tumor DNA, cell-free tumor DNA, circulating cell-free tumor DNA, and circulating tumor DNA (ctDNA). A tumor-derived polynucleotide may be present in any tissue and/or fluid of the patient. For example, a tumor-derived polynucleotide may be present in blood and/or blood-derived products of the patient (e.g., serum and plasma). A tumor-derived polynucleotide may also be present in saliva, semen, vaginal secretions, urine, feces, nasal mucus, sweat, ear wax, and spinal fluid. A tumor-derived polynucleotide may be identified based on the presence of the one or more TSVs in the tumor-derived polynucleotide. The presence of a tumor-derived polynucleotide may be indicative of MRD. 
Circulating Tumor DNA (ctDNA) Circulating tumor DNA may refer to DNA or DNA fragments derived from tumor cells that have escaped the tumor and are present in the circulatory system. For example, ctDNA may be present in blood (e.g., serum and plasma). ctDNA may be identified based on the presence of the one or more TSVs in the ctDNA. The presence of ctDNA may be indicative of MRD. Locus A “locus” may refer to a set of consecutive nucleotides in a genome (e.g., the genome of a patient) within a threshold distance of a TSV (e.g., within 50, 100, 150, 200, 250, 300,
350, 400, 450, 500 or more nucleotides of the TSV). A locus may refer to the nucleotides encoding a gene (e.g., a gene comprising a TSV), however, a locus may also refer to a non-coding locus (e.g., a locus that does not encode a gene). Additional Description Additional detailed disclosures of the various concepts and embodiments related to methods and compositions of designing a patient-specific panel are provided below. FIG.1 is a diagram depicting an illustrative technique 100 for using variant data from tumor cells and non-tumor cells of a patient to design a patient-specific panel for detecting MRD in the patient, according to some embodiments of the technology described herein. Technique 100 involves collecting a tumor sample 102 (e.g., tumor cells) and a non-tumor cell sample 104 (e.g., non-tumor cells) from a patient. After collection, DNA is extracted 106 from the samples (e.g., using any suitable method). Extracted DNA may then be sequenced (e.g., using Illumina® sequencing). Sequencing may be whole genome sequencing or whole exome sequencing. Sequencing produces sequence data, which is used in variant calling to identify variants 111, 112, 114 and 115 found in the tumor cell DNA 118 and/or the non-tumor cell DNA 120. Following variant calling 108, identifying a plurality of tumor specific variants (TSVs) 111, 114 and 115 for the patient-specific panel 109 is performed using any suitable method including the methods described herein. Step 109 may include both identifying the plurality of TSVs (111, 114 and 115) and identifying 110 a subset of the plurality of TSVs, 111 and 115. Following identification of the TSVs for the patient-specific panel 125, the patient-specific panel is designed 122 for use in detecting the identified TSVs (TSV 111 and TSV 115). 
As described herein, step 122, designing a patient-specific panel, may comprise designing a pair of primers for each of TSV 111 and TSV 115, each pair of primers being capable of amplifying a polynucleotide comprising the TSV. Optionally, after patient-specific panel design 122, the patient-specific panel 125 is contacted with a plasma sample 126 (e.g., DNA extracted from the plasma that may comprise ctDNA). Contacting the patient-specific panel 125 and the plasma 126 (e.g., the DNA of the plasma) may further comprise performing a PCR reaction and sequencing the amplicons from the PCR reaction. In some embodiments, following contacting, an indication of MRD 124 in the patient may be provided. Illustrative technique 100 involves obtaining a tumor sample 102 (e.g., tumor cells) and a non-tumor sample 104 (e.g., non-tumor cells) from a patient. The tumor sample 102 may be collected from any tumor and/or cancer including, but not limited to a lung cancer,
brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, or colon cancer. In some embodiments, the tumor sample 102 is collected from a solid tumor. In some embodiments, the tumor sample 102 is collected from a melanoma tumor. In some embodiments, the tumor sample 102 is collected from a lung tumor. In some embodiments, the non-tumor sample 104 is collected from healthy tissue of the same type as the tumor sample 102. For example, if the tumor sample 102 is collected from a liver tumor then the non-tumor sample 104 is healthy liver tissue. In some embodiments, the non-tumor sample 104 is collected from healthy tissue that is different from the type of tissue that tumor sample 102 is collected from. For example, if the tumor sample 102 is collected from a liver tumor then the non-tumor sample 104 is healthy lung tissue. In some embodiments, the non-tumor sample 104 is a blood sample (e.g., plasma). Illustrative technique 100 next involves DNA extraction 106 from the tumor sample 102 and non-tumor sample 104. Methods for extracting DNA from biological samples (e.g., the tumor sample and the non-tumor sample) are well known in the art. Following DNA extraction, DNA may be sequenced using any suitable sequencing technique to produce sequence data that may be used in variant calling 108. Variant calling 108 may refer to identifying DNA variants, from sequence data of the tumor sample 102 and the non-tumor sample 104, that differ from a standard genome (e.g., GRCh38 or hg19). Thus, variant calling may be expected to identify tumor cell variants (e.g., 111, 114, and 115), non-tumor cell variants (e.g., variants that appear after patient conception and are not part of the germline and are also not tumor-specific, such as 112), and germline variants that are not in the standard genome.
Any suitable method can be used for variant calling 108 (e.g., as described herein). Variant calling 108 may be performed separately on sequence data from tumor samples (e.g., tumor sample 102) and sequence data from non-tumor samples (e.g., non-tumor sample 104). Obtaining variant data is further described herein including with reference to FIG.2A. After variant calling 108, step 109, identification of tumor specific variants for the patient-specific panel 125, may be performed. In step 109, identification of tumor specific variants for the patient-specific panel may refer to identifying variants that are specific to the tumor sample 102 over the non-tumor sample 104. Step 109 may involve (1) identifying a plurality of TSVs based on features (e.g., as described herein) and corresponding thresholds (e.g., as described herein) and (2) identifying a subset of the plurality of TSVs for use in the patient-specific panel (e.g., as described herein). Identifying tumor specific variants may involve identifying variants that
are found in the sequence data of tumor sample 102, but not at the same locus in the sequence data of the non-tumor sample 104 (e.g., 111, 114 and 115). Selecting a subset of the plurality of TSVs may involve using a trained machine learning model (not shown) to score TSVs based on one or more features described herein, and selecting TSVs for the subset of the plurality of TSVs based on the scores. Identifying a plurality of TSVs and identifying the subset of the plurality of TSVs are further described herein, and in reference to the sections “Tumor Specific Variants” and “Subset of the Plurality of Tumor Specific Variants” and in reference to FIG.2A. After step 109, identification of tumor specific variants for the patient-specific panel, patient specific panel design 122 may be performed. In patient specific panel design 122, a patient-specific panel 125 may be designed to detect one or more TSVs of a subset of the plurality of TSVs (e.g., as described herein). In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect the subset of the plurality of TSVs, as described herein. In other embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect at least 1 (e.g., at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 160, at least 170, at least 180, at least 190, or at least 200) TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect at least 10 TSVs, at least 25 TSVs, at least 50 TSVs, at least 75 TSVs, at least 100 TSVs, at least 125 TSVs, at least 150 TSVs, at least 175 TSVs, at least 200 TSVs, at least 250 TSVs, or at least 300 TSVs.
In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 10-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 25-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 50-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 75-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 100-200 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 10-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 25-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 50-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 75-150 TSVs. In some
embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 100-150 TSVs. In some embodiments, in patient specific panel design 122, the patient-specific panel 125 may be designed to detect 18-200 TSVs. The patient-specific panel 125 may comprise a set of primers (e.g., a pair of primers or nested primers) that are designed to amplify a region of a polynucleotide (e.g., ctDNA of the patient) comprising a TSV (e.g., of the subset of the plurality of TSVs). As such, the patient specific panel may comprise a plurality of sets of primers, wherein at least some of the plurality of sets of primers may be designed to detect different TSVs of the subset of the plurality of TSVs. The patient specific panel may comprise a plurality of sets of primers, wherein at least some of the plurality of sets of primers are each designed to detect a region of a polynucleotide comprising a TSV of the subset of the plurality of TSVs. In some embodiments, the patient specific panel 125 (e.g., the primers of the patient-specific panel 125) may be contacted with ctDNA (e.g., plasma ctDNA 126) as one step in making an MRD call (e.g., determining an indication of MRD). In some embodiments, a patient-specific panel may comprise a plurality of sets of primers, wherein at least 2, at least 10, at least 20, at least 50 or at least 100 of the plurality of sets of primers are each designed to detect a region of a polynucleotide comprising a TSV. In some embodiments, a patient-specific panel comprises one or more primers, or a plurality of sets of primers that target non-tumor specific loci. For example, a patient-specific panel may comprise one or more, or a plurality of sets of primers that are not designed to detect a region of a polynucleotide comprising a TSV.
In some embodiments, a patient-specific panel may comprise a plurality of primers designed to detect a region of a polynucleotide comprising a TSV, and a plurality of primers targeting non-tumor specific loci. Non-tumor specific loci may comprise specific nucleotide variants (e.g., SNPs) or random control loci found in both tumor and healthy tissue of a patient or population. Primers targeting non-tumor specific loci may be included in a patient-specific panel for any suitable purpose (e.g., for sample tracking, control or normalization). In some embodiments, primers targeting non-tumor specific loci are included in a patient-specific panel to normalize the amount of total primers used in any one method or assay. For example, the inventors herein have determined that normalizing the total number of primers used in a multiplex assay described herein can sometimes help correct for any amplification bias associated with differences in the number of TSV specific primers used among patients or assays. In some embodiments, primers targeting non-tumor specific loci comprise the same or similar features as other primers in a panel designed to detect a region of a polynucleotide comprising a TSV.
In some embodiments, the plasma 126 may be collected from the blood of the same patient from which tumor sample 102 and non-tumor sample 104 were collected. In other embodiments, a biological sample instead of plasma 126 may be collected from any location of the patient that may comprise ctDNA. For example, a biological sample may be collected from saliva, semen, vaginal secretions, urine, feces, nasal mucus, sweat, ear wax, spinal fluid, blood, serum, or plasma from a patient. In some embodiments, the biological sample may not comprise detectable ctDNA (e.g., when the patient does not have detectable MRD). In some embodiments, multiple biological samples (e.g., plasma 126) may be collected from a patient and sequenced to obtain sequence data. The multiple biological samples may be sequentially collected from a patient over a specified period of time then sequenced to obtain sequence data. The specified period of time may begin after cancer treatment ends and may continue for the remainder of the patient’s life. The frequency with which biological samples are collected from a patient may be any suitable frequency for monitoring MRD. In some embodiments, biological samples may be collected from a patient weekly. In some embodiments, biological samples may be collected from a patient about twice a month. In some embodiments, biological samples may be collected from a patient about once a month. In some embodiments, biological samples may be collected from a patient about once every three months. In some embodiments, biological samples may be collected from a patient about once every six months. In some embodiments, biological samples may be collected from a patient at least twice a month. In some embodiments, biological samples may be collected from a patient at least once a month. In some embodiments, biological samples may be collected from a patient at least once every three months.
In some embodiments, biological samples may be collected from a patient at least once every six months. In some embodiments, the frequency with which biological samples may be collected from the patient may be based on the type of disease the patient is being monitored for (e.g., type of cancer), the expected likelihood of recurrence, and the rate of disease progression after recurrence. Technique 100 next optionally proceeds to an MRD call 124 (i.e., determining an indication of MRD in the patient). Determining an indication of MRD may comprise contacting the plasma ctDNA 126 with the patient-specific panel 125. Determining an indication of MRD may further comprise amplifying ctDNA using primers of the patient-specific panel to produce amplicons. Determining an indication of MRD may further comprise sequencing one or more amplicons produced by amplifying the ctDNA to generate sequence data. Determining an indication of MRD may further comprise analyzing sequence
data to determine an indication of MRD using any suitable method, including as described herein in the sections entitled “Minimal Residual Disease” and “Methods of Determining an Indication of Minimal Residual Disease”. FIG.2A is a flowchart of an illustrative process 200 for identifying a subset of a plurality of tumor specific variants (TSVs) for use in a patient-specific panel for identifying MRD, and optionally identifying and/or synthesizing primers for inclusion in the patient-specific panel, according to some embodiments of the technology described herein. Process 200 involves obtaining variant data 202 (e.g., variants called using a tumor cell sample and a non-tumor cell sample of the patient); identifying a plurality of TSVs 204 using the variant data 202; optionally identifying primers 206 (e.g., primer sequences) for use in detecting the TSVs; identifying a subset of the plurality of TSVs for use in the patient specific panel 208 (e.g., TSVs that are more likely to be detectable using the patient-specific panel, and/or provide an indication of MRD); and optionally synthesizing primers for detecting the subset of the plurality of TSVs 210 (e.g., the synthesized primers for using the patient-specific panel). Process 200 begins at act 202 where variant data is obtained. Obtaining variant data may refer to obtaining DNA variants associated with the tumor cells (e.g., biological sample of tumor cells) and/or non-tumor cells (e.g., biological sample of non-tumor cells) of the patient. Obtaining variant data may comprise obtaining variant data using a variant caller as described herein. Obtaining variant data may comprise obtaining sequence data of the tumor sample and the non-tumor sample and using a variant caller to identify variants. In some embodiments, obtaining variant data comprises generating sequence data of the tumor sample and the non-tumor sample.
In other embodiments, obtaining variant data comprises obtaining the variants and additional data indicative of the variants (e.g., variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data) as described herein including in the section “Variant Data.” In some embodiments, obtaining variant genomic location data comprises obtaining the location in the genome where a variant is located (e.g., the genomic locus). In some embodiments, obtaining variant type data comprises obtaining the type of mutation that generated a variant (e.g., single nucleotide change (e.g., C to A, A to G, etc.), insertion or deletion). In some embodiments, obtaining variant sequence data comprises obtaining the sequence of a variant.
In some embodiments, obtaining variant sequence context data comprises obtaining data describing the polynucleotide sequence surrounding a variant (e.g., within 10, 50, 100, 150, 200, 250, 300, 350, 400, 450 or more nucleotides of a variant). In some embodiments, obtaining variant sequence context data comprises obtaining sequence context homopolymer data (e.g., the location of a homopolymer relative to the variant and the homopolymer length), sequence context splice site data (e.g., the location and type of any predicted splice sites in the sequence context), sequence context mutation data (e.g., variants identified in the sequence context), and/or sequence context conservation data (e.g., a score describing the degree of conservation of the sequence context (e.g., a conservation score generated by PhyloP or phastCons)). In some embodiments, obtaining variant sequencing coverage data comprises obtaining the sequencing coverage (e.g., Illumina® sequencing coverage) of a variant in tumor cells (e.g., a biological sample of tumor cells) and non-tumor cells (e.g., a biological sample of non-tumor cells) of a patient. In some embodiments, obtaining variant allele frequency data comprises obtaining the frequency of a variant in the tumor sample and/or the frequency in the non-tumor sample. In some embodiments, obtaining variant allele frequency data further comprises obtaining the allele frequency of the variant in healthy individuals of a genomic database (e.g., gnomAD, 1000 genomes, and ExAC populations). Obtaining variant sequencing error rate data may comprise obtaining data concerning the sequencing error rate associated with preparing a sample for sequencing and sequencing the sample as part of obtaining variant data. In some embodiments, obtaining variant primer data comprises obtaining primer sequences designed to amplify tumor cell variants or tumor specific variants.
Obtaining variant primer data may further comprise obtaining primer length, binding location, melting temperature, distance from TSV data and/or primer score from a primer design algorithm. As shown in FIG.2A, process 200 may include identifying a plurality of tumor specific variants 204 using the data obtained in step 202. In some embodiments, identifying the plurality of TSVs can be performed using any suitable method (e.g., GATK, FreeBayes, DeepVariant, SpeedSeq). In some embodiments, identifying TSVs is based on a plurality of TSV features and corresponding thresholds. In some embodiments, identifying a plurality of tumor specific variants comprises using at least one (e.g., at least 2, at least 3, at least 4, at least 5) feature(s) selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency, as described herein. In some embodiments, identifying a plurality of tumor specific variants comprises using features selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence context indel, neighboring variants, static variants, primer flags, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency. In some embodiments, identifying a plurality of TSVs 204 may comprise selecting variants using variant bi-directional support, the selecting determining, for each variant of at least some of a plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, the threshold number of times is 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 times in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, the threshold number of times is 2, 8, or 15 times in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, the threshold number of times is between 2 and 15 in plus strand sequencing reads and minus strand sequencing reads of the variant data. In some embodiments, a variant may not be selected as a TSV when variant bi-directional support exceeds a threshold number. In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using healthy population variant allele frequency.
The selecting may determine, for each variant of at least some of a plurality of variants, whether the variant has a variant allele frequency of less than a threshold percentage in a healthy population, as defined by at least one genomic database (e.g., gnomAD, 1000 genomes, and/or ExAC populations). The threshold percentage may be 0.1%, 0.5%, 1%, 1.5%, 2%, or 3% variant allele frequency in a healthy population, as defined by at least one genomic database. The threshold percentage may also be between 0.1%-3%, 0.5%-2%, or 0.75%-1.5% variant allele frequency in a healthy population, as defined by at least one genomic database. The threshold percentage may be between 0.5% and 2% variant allele frequency in a healthy population, as defined by at least one genomic database. In some embodiments, the threshold percentage is 1% variant allele frequency in a healthy population, as defined by at least one genomic database. In some embodiments, a variant may not be selected as a TSV when the healthy population variant allele frequency exceeds the threshold percentage.
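By way of non-limiting illustration, the bi-directional support and healthy population allele frequency checks described above may be sketched as follows. The `Variant` record, its field names, and the function names are hypothetical and are introduced only for this example; the default thresholds (2 reads per strand, 1% population allele frequency) are taken from the values listed above, and the sketch assumes that a variant is retained only when it meets the strand-support threshold on both strands.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record; field names are illustrative, not from the source."""
    name: str
    plus_strand_reads: int    # reads supporting the variant on the plus strand
    minus_strand_reads: int   # reads supporting the variant on the minus strand
    population_af: float      # healthy-population allele frequency (0.01 == 1%)


def has_bidirectional_support(v: Variant, min_reads: int = 2) -> bool:
    """True if the variant is seen at least min_reads times on each strand."""
    return v.plus_strand_reads >= min_reads and v.minus_strand_reads >= min_reads


def is_rare_in_healthy_population(v: Variant, max_af: float = 0.01) -> bool:
    """True if the population allele frequency is below the threshold."""
    return v.population_af < max_af


def candidate_tsvs(variants, min_reads: int = 2, max_af: float = 0.01):
    """Keep variants that pass both checks (assumed to be combined with AND)."""
    return [
        v for v in variants
        if has_bidirectional_support(v, min_reads)
        and is_rare_in_healthy_population(v, max_af)
    ]


variants = [
    Variant("v1", plus_strand_reads=5, minus_strand_reads=4, population_af=0.0001),
    Variant("v2", plus_strand_reads=6, minus_strand_reads=0, population_af=0.0001),
    Variant("v3", plus_strand_reads=5, minus_strand_reads=5, population_af=0.05),
]
kept = [v.name for v in candidate_tsvs(variants)]
```

In this example only `v1` is retained: `v2` has no minus strand support, and `v3` is too common in the healthy population.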
In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using sequence context homopolymer size, the selecting determining, for each variant of at least some of a plurality of variants, whether a homopolymer sequence exceeding a threshold size is present between the variant and a binding site of a primer designed to detect the presence of the variant (e.g., in the genome of the tumor cells of the patient). A homopolymer refers to a series of consecutive nucleotides in a polynucleotide all of the same type (e.g., AAAAA represents a homopolymer of 5 nucleotides). The threshold size may be a homopolymer of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 or more nucleotides present between the variant and a binding site of a primer designed to detect the presence of the variant. The threshold size may be a homopolymer between 4 nucleotides and 8 nucleotides in length present between the variant and a binding site of a primer designed to detect presence of the variant. In some embodiments, the threshold size is a homopolymer of 6 nucleotides present between the variant and a binding site of a primer designed to detect presence of the variant. In some embodiments, a variant may not be selected as a TSV when sequence context homopolymer size exceeds the threshold size. In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using sequence context indel, the selecting determining, for each variant of at least some of a plurality of variants, whether (1) an indel is located within the sequence context of the variant, (2) the indel is located between a primer designed to detect the variant and the variant, and/or (3) an indel is located within a threshold distance (e.g., nucleotides) from the variant. The threshold distance may be 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more nucleotides from the variant.
The threshold distance may be 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 or 40 nucleotides from the variant. The threshold distance may be between 10-100, 10-75, 10-50, 15-75, or 15-50 nucleotides from the variant. The threshold distance may be 25 nucleotides from the variant. In some embodiments, a variant may not be selected as a TSV when the sequence context indel meets one or more of criteria 1-3. In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using neighboring variants, the selecting determining for each variant of at least some of a plurality of variants, whether a threshold number of other variants are located within the sequence context of the variant. The threshold number of other variants may be 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 variants. The threshold number of variants may be 1-5 or 1-10 variants. The threshold number of variants may be 2 variants. The sequence context may comprise nucleotides within 25, 50, 75 or 100 nucleotides of the variant (e.g., 25, 50, 75 or
100 nucleotides upstream and/or downstream of the variant). The sequence context may be nucleotides within 50 nucleotides of the variant. In some embodiments, a variant may not be selected as a TSV when neighboring variants exceed the threshold number of other variants. In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using static variants, the selecting determining for each variant of at least some of a plurality of variants, whether the variant is observed in a threshold number of normal samples (e.g., sequencing data of normal samples). The threshold number of normal sample observations may be 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 observations. The threshold number of normal sample observations may be 2 observations. In some embodiments, a variant may be a TSV when the number of static variants is less than a threshold number of normal sample observations. In some embodiments, identifying the plurality of TSVs 204 may comprise selecting variants using primer flags, the selecting determining for each variant of at least some of a plurality of variants, whether a primer associated with the variant is identified as having more than a threshold number of primer flags. Primer flags may include, but are not limited to: (1) a homopolymer sequence that exceeds a threshold length found in the primer sequence. The threshold length of the homopolymer sequence may be 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides in length. The threshold length of the homopolymer sequence may be 3, 4 or 5 nucleotides in length. The threshold length of the homopolymer sequence may be 4 nucleotides in length; (2) a homopolymer sequence that exceeds a threshold length located between the binding site of the primer and the variant (e.g., TSV) (e.g., see Sequence Context Homopolymer Size, as described herein). The threshold length of the homopolymer sequence may be 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 nucleotides.
The threshold length of the homopolymer sequence may be 5, 6, or 7 nucleotides. The threshold length of the homopolymer sequence may be 6 nucleotides. (3) “TA” nucleotide repeats that exceed a threshold number of consecutive repeats present between a primer binding sequence (which may include the primer binding sequence) and a corresponding variant the primer is designed to detect. The threshold number of “TA” nucleotide repeats may be 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 consecutive “TA” repeats. The threshold number of “TA” nucleotide repeats may be 6, 7, or 8 consecutive “TA” repeats. The threshold number of “TA” nucleotide repeats may be 7 consecutive “TA” repeats.
(4) a percentage of guanine and cytosine nucleotides within a threshold distance of the variant (e.g., upstream and/or downstream) that exceeds a threshold percentage. The threshold distance may be between 20-60, 30-50, or 35-45 nucleotides. The threshold distance may be 30, 35, 40, 45, or 50 nucleotides. The threshold distance may be at least 40 nucleotides. The threshold distance may be 40 nucleotides. Within the threshold distance of the variant, the threshold percentage of guanine and cytosine nucleotides may be 70%, 75%, 80%, 85%, 90%, or 95%. Within the threshold distance of the variant, the threshold percentage of guanine and cytosine nucleotides may be at least 80%. Within the threshold distance of the variant, the threshold percentage of guanine and cytosine nucleotides may be 80%. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 1 primer flag. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 2 primer flags. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 3 primer flags. In some embodiments, a variant may not be selected as a TSV when a primer designed for use in detecting the TSV has 4 primer flags. Identifying the plurality of TSVs 204 may comprise selecting variants using sequence coverage in non-tumor cells, the selecting determining, for each variant of at least some of a plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold. In some embodiments, the threshold is between 10X and 150X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is between 50X and 100X sequencing coverage of the variant in the non-tumor cells of the patient.
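By way of non-limiting illustration, primer flags (1)-(4) described above may be counted as in the following sketch. All function and parameter names are hypothetical. The sketch assumes the caller supplies three strings: the primer sequence, the sequence between the primer binding site and the variant, and the sequence within the threshold distance (e.g., 40 nucleotides) of the variant. The default thresholds (4, 6, 7 and 80%) are among the values listed above, and "exceeds" is interpreted as strictly greater than.

```python
import re
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated nucleotide."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def longest_ta_repeat(seq: str) -> int:
    """Largest number of consecutive "TA" dinucleotide repeats."""
    runs = re.findall(r"(?:TA)+", seq.upper())
    return max((len(run) // 2 for run in runs), default=0)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C nucleotides (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def count_primer_flags(primer: str, between: str, context: str, *,
                       max_primer_homopolymer: int = 4,
                       max_between_homopolymer: int = 6,
                       max_ta_repeats: int = 7,
                       max_gc: float = 0.80) -> int:
    """Count how many of the four illustrative primer flags are raised."""
    flags = [
        # (1) homopolymer within the primer sequence
        longest_homopolymer(primer) > max_primer_homopolymer,
        # (2) homopolymer between the primer binding site and the variant
        longest_homopolymer(between) > max_between_homopolymer,
        # (3) "TA" repeats between (and including) the binding site and variant
        longest_ta_repeat(primer + between) > max_ta_repeats,
        # (4) GC content within the threshold distance of the variant
        gc_fraction(context) > max_gc,
    ]
    return sum(flags)


clean = count_primer_flags("ACGTACGTACGTACGTACGT", "ACGTACGT", "ACGT" * 10)
flagged = count_primer_flags(
    "ACGTACGTACGTACGTACGT",
    "ACG" + "T" * 7 + "ACG",   # 7-nucleotide homopolymer raises flag (2)
    "GC" * 20,                 # 100% GC raises flag (4)
)
```

A variant could then be excluded when the returned count meets whichever flag limit (1, 2, 3 or 4) a given embodiment uses.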
In some embodiments, the threshold is between 45X and 100X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is 45X, 50X, 75X or 100X sequencing coverage of the variant in the non-tumor cells of the patient. In some embodiments, the threshold is 20X, 30X, 40X, 50X, 60X, 70X, 80X, 90X or 100X. In some embodiments, a variant may not be selected as a TSV when sequencing coverage of the variant does not exceed a threshold coverage. Identifying the plurality of TSVs 204 may comprise selecting variants using a ratio of variant allele frequency between tumor cells and non-tumor cells, the selecting determining, for each variant of at least some of a plurality of variants, whether the ratio of the variant exceeds a threshold ratio. In some embodiments, the threshold ratio may be between a ratio of 100:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 30:1 and 10:1 of
tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 40:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 50:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is between a ratio of 75:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, the threshold ratio is 5:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, or 100:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency. In some embodiments, a variant may not be selected as a TSV when the ratio of variant allele frequency of the variant does not exceed a threshold ratio. Identifying the plurality of TSVs 204 may comprise selecting variants using the tumor cell variant allele frequency, the selecting determining, for each variant of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold. In some embodiments, the threshold is between a 0.05 and a 0.1 tumor cell variant allele frequency. In some embodiments, the threshold is between a 0.025 and a 0.2 tumor cell variant allele frequency. In some embodiments, the threshold is between a 0.025 and a 0.5 tumor cell variant allele frequency. In some embodiments, the threshold is 0.025, 0.5, 0.75, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, or 0.9 tumor cell variant allele frequency. In some embodiments, a variant may not be selected as a TSV when tumor cell variant allele frequency of the variant does not exceed a threshold allele frequency.
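By way of non-limiting illustration, the three checks described above (sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency) may be combined as in the following sketch. The function and parameter names are hypothetical, and the default thresholds (50X, 10:1 and 0.05) are example values drawn from the ranges listed above. How to treat a variant that is never observed in the non-tumor sample is not specified above; the sketch assumes that a non-tumor allele frequency of zero satisfies any ratio threshold.

```python
def passes_tumor_specificity(tumor_vaf: float,
                             normal_vaf: float,
                             normal_coverage: int,
                             *,
                             min_coverage: int = 50,
                             min_ratio: float = 10.0,
                             min_tumor_vaf: float = 0.05) -> bool:
    """Return True if a variant passes all three illustrative checks."""
    # Coverage in non-tumor cells must exceed the threshold, so that absence
    # of the variant in the non-tumor sample is supported by enough reads.
    if normal_coverage <= min_coverage:
        return False
    # Tumor cell variant allele frequency must exceed the threshold.
    if tumor_vaf <= min_tumor_vaf:
        return False
    # Assumption: a variant never seen in non-tumor cells has an unbounded
    # tumor:non-tumor ratio and therefore passes the ratio check.
    if normal_vaf == 0:
        return True
    return tumor_vaf / normal_vaf > min_ratio


results = {
    "tumor_only": passes_tumor_specificity(0.30, 0.0, 80),
    "high_ratio": passes_tumor_specificity(0.30, 0.01, 80),   # ratio about 30:1
    "low_ratio": passes_tumor_specificity(0.30, 0.05, 80),    # ratio about 6:1
    "low_coverage": passes_tumor_specificity(0.30, 0.0, 30),
    "low_tumor_vaf": passes_tumor_specificity(0.02, 0.0, 80),
}
```

Only the first two example variants pass; each of the remaining three fails exactly one of the checks.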
In some embodiments, identifying the plurality of TSVs 204 comprises assigning variants to tiers using bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and/or tumor cell variant allele frequency according to the thresholds in Table 1.
[Table 1 is reproduced as images in the published application: Figure imgf000058_0001 and Figure imgf000059_0001.]
Table 1: Tiers for selecting TSVs.
* C>T and G>A only
In some embodiments, identifying the plurality of TSVs 204 comprises assigning variants to tiers using the thresholds of Table 1, wherein a variant that meets all the thresholds of tier 1 is assigned to tier 1; a variant that does not meet the thresholds of tier 1, but meets all the thresholds of tier 2 is assigned to tier 2; a variant that does not meet the thresholds of tiers 1 or 2, but meets all the thresholds of tier 3 is assigned to tier 3; a variant that does not meet the thresholds of tiers 1, 2, or 3 but meets all the thresholds of tier 4 is assigned to tier 4; a variant that does not meet the thresholds of tiers 1, 2, 3 or 4 but meets all the thresholds of tier 5 is assigned to tier 5; and a variant that does not meet any of the thresholds of tiers 1, 2, 3, 4 or 5 is not assigned to any tier. In this embodiment, after assigning variants to a tier, a plurality of variants may be selected as the plurality of TSVs according to the tiers. For example, tier 1 variants may be selected as TSVs first, followed by consecutive selection in tiers 2, 3, 4, and 5 until the total number of TSVs of the plurality of TSVs is obtained. For instance, if the plurality of TSVs has 15 TSVs and tier 1 has 12 variants, tier 2 has 6 variants and tier 3 has 3 variants, then 12 TSVs may be selected from tier 1 and 3 TSVs may be selected from tier 2. The number of TSVs being selected may be a number of (e.g., 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5) times the number of TSVs needed for the patient-specific panels. For example, the TSVs being selected may be 2 times the number of TSVs needed for the patient-specific panel. In some embodiments, the number of TSVs being selected may be between 50 and 200 variants. The number of TSVs being selected may be at least 50 (e.g., at least 75, at least 100, at least 150, or at least 200) variants.
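By way of non-limiting illustration, the tier-ordered selection described above may be sketched as follows, using the worked example of 15 TSVs drawn from tiers containing 12, 6 and 3 variants. The function name and the variant labels are hypothetical. The sketch assumes variants have already been assigned to tiers (the thresholds of Table 1 are not reproduced here) and that variants within a tier are taken in list order, which is an assumption because no within-tier ordering is specified above.

```python
from collections import Counter


def select_tsvs_by_tier(tiers: dict, n_needed: int) -> list:
    """Take variants from tier 1 first, then tier 2, and so on, until
    n_needed variants have been selected or the tiers are exhausted.

    tiers maps a tier number to a list of variant labels.
    Returns a list of (tier, label) pairs.
    """
    selected = []
    for tier in sorted(tiers):
        remaining = n_needed - len(selected)
        if remaining <= 0:
            break
        # Within a tier, variants are taken in list order (an assumption).
        selected.extend((tier, label) for label in tiers[tier][:remaining])
    return selected


tiers = {
    1: [f"tier1_variant_{i}" for i in range(12)],
    2: [f"tier2_variant_{i}" for i in range(6)],
    3: [f"tier3_variant_{i}" for i in range(3)],
}
picked = select_tsvs_by_tier(tiers, n_needed=15)
per_tier = dict(Counter(tier for tier, _ in picked))
```

Here `per_tier` shows 12 variants taken from tier 1 and 3 from tier 2, with none taken from tier 3. Where the number selected is a multiple of the panel size, `n_needed` could simply be set to that multiple (e.g., 2 times the number of TSVs needed for the panel).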
In some embodiments, identifying the plurality of TSVs comprises identifying variants from tier 1 or tier 2, but not tier 3, tier 4 or tier 5.

As shown in FIG.2A, process 200 optionally comprises identifying primers for use in detecting TSVs 206. Identifying primers may comprise identifying primer sequences using any suitable method (e.g., Primer3 and Primer-BLAST) including the methods described herein. For example, identifying primers may comprise identifying primers using one or more of the following criteria: primer length, binding location, melting temperature (Tm), distance from TSV, and primer score from a primer design algorithm. Primers may be completely or partially complementary to a target polynucleotide (e.g., a target polynucleotide comprising a TSV). In some embodiments, a primer comprises a portion complementary to a target polynucleotide sequence, and in some embodiments a primer comprises a 5' tail that is not complementary to a target polynucleotide sequence. In some embodiments, a primer or primer pair is identified according to a length or Tm of all or a portion of a primer sequence. In some embodiments, a primer or primer pair is identified according to a length or Tm of a portion of a primer that is complementary to a target sequence.

Identifying primers may comprise identifying primers for use in polymerase chain reaction (PCR). Identifying primers may comprise identifying primers for use in nested PCR or hemi-nested PCR. Identifying primers may comprise identifying primers (e.g., a first and/or a second primer) for use in quantitative polymerase chain reaction (qPCR). If the primers are used in PCR or qPCR, the first and/or second primer may be designed to amplify a region of a polynucleotide comprising a variant (e.g., a TSV). The region may be of a suitable size for sequencing (e.g., Illumina® sequencing) or qPCR detection. In some embodiments, a first and second primer are selected for amplification of one strand of a locus comprising a TSV, where the second primer is nested relative to the first primer, and a third and fourth primer are selected for amplification of an opposite strand of a locus comprising a TSV of interest, where the fourth primer is nested relative to the third primer. In some embodiments, identifying primers comprises identifying primers for use in Anchored Multiplex PCR (AMP).
In AMP, a first primer (e.g., a first target-specific primer, e.g., a first primer targeting a specific TSV of interest) may be paired with an anchored primer specific for a ligated adapter sequence, which produces a first amplicon derived from one strand of a locus comprising the TSV of interest. Optionally, a second PCR reaction can be conducted using a second target-specific primer (often nested relative to the first primer) that is paired with the same anchored primer, or a nested anchored primer, to produce a second amplicon comprising the TSV of interest. Accordingly, in some embodiments, 1 or 2 primers are designed or identified to amplify a single strand of a nucleic acid comprising a TSV of interest using an AMP method. In such embodiments, both the first and an optional second primer are configured to anneal to the same template strand, and 3’ of a TSV of interest. Exemplary AMP based methods are disclosed in US Patent No. 9487828, which is incorporated herein by reference. In certain embodiments, an AMP based method is used to amplify a complementary strand comprising the TSV of interest. In such embodiments, one or two additional primers are designed or identified for target-specific amplification of a complementary strand comprising the TSV of interest.

In some embodiments, AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a first gene specific primer (e.g., a primer that is designed to amplify a polynucleotide comprising a TSV) in a first PCR reaction; and amplifying the products of the first PCR reaction with the universal primer, a second gene specific primer and a P7 primer, which binds to at least some of the binding site of the second gene specific primer. In some embodiments, AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a gene specific primer (e.g., a primer that is designed to amplify a polynucleotide comprising a TSV) in a first PCR reaction; and amplifying the products of the first PCR reaction with the universal primer and a P7 primer in a second PCR reaction, the P7 primer binding to at least some of the binding site of the gene specific primer. In some embodiments, AMP or an amplification method comprises ligating an adaptor comprising a universal primer binding site to nucleotide fragments of a sample (e.g., ctDNA) and amplifying the ligated fragments with (i) a universal primer configured to bind to a complement of the adaptor sequence, and (ii) one or more target-specific primers (e.g., TSV specific primers, e.g., one or more primers of a patient-specific panel) where each of the target-specific primers is configured to amplify a polynucleotide comprising a TSV when used with the universal primer.
Amplicons derived from an amplification reaction using a universal primer and one or more target-specific primers may be used to obtain sequencing data. In some embodiments, a target-specific primer comprises a 5’-tail. A 5’-tail may comprise a universal priming site, a molecular barcode and/or index sequence, and/or a sequencing primer site (e.g., a P7 primer site). In some embodiments, amplicons derived from an amplification reaction using a universal primer and one or more 5’-tailed target-specific primers are further amplified using universal primers to provide amplicons used to obtain sequencing data. In some embodiments, AMP comprises ligating a molecular barcode adaptor (MBC) comprising a universal primer binding site to fragments of target DNA or RNA (e.g., ctDNA), amplifying the ligated fragments of ctDNA with a universal primer and a gene specific primer (target-specific primer) and a P7 primer in a PCR reaction, the P7 primer binding to at least some of the binding site of the gene specific primer. In some embodiments, AMP produces amplified DNA that comprises adaptors for sequencing (e.g., Illumina® sequencing).

Obtaining primer sequences may occur at one of multiple different steps in process 200. For example, obtaining variant data 202 may comprise obtaining primer sequences. In some embodiments, identifying primers for use in detecting TSVs may occur after obtaining variant data 202 and before identifying a plurality of TSVs 204. In some embodiments, identifying primers for use in detecting TSVs may occur after identifying a plurality of TSVs and before identifying a subset of the plurality of TSVs for use in the patient-specific panel. In some embodiments, identifying primers for use in detecting TSVs may occur after identifying a subset of the plurality of TSVs for use in the patient-specific panel.

Process 200 next includes identifying a subset of the plurality of TSVs for use in the patient-specific panel 208. Identifying a subset of the plurality of TSVs for use in the patient-specific panel 208 is described herein including with reference to the section entitled “Subset of the plurality of Tumor Specific Variants”, and with reference to FIG.2B and FIG.3.

Process 200 optionally includes synthesizing the primers 210 associated with the subset of the plurality of TSVs. Synthesizing the primers may comprise synthesizing at least some of the primers for detecting the subset of the plurality of TSVs. In some embodiments, synthesizing the primers comprises synthesizing the primers for detecting each TSV of the subset of the plurality of TSVs. Primer synthesis may be performed using any suitable method.
FIG.2B is a flowchart of an illustrative process 250 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, in accordance with some embodiments of the technology described herein. Process 250 involves generating respective sets of features 252 corresponding to the at least some TSVs of the plurality of TSVs (generated in step 204 of process 200); processing the respective sets of features using a trained machine learning (ML) model to obtain a score for the at least some of the TSVs of the plurality of TSVs 254; and selecting TSVs for inclusion into the subset of the plurality of TSVs using the scores obtained with the trained machine learning model 258 according to process 300 (FIG.3).

Process 250 includes generating, for each of at least some of the plurality of TSVs, a respective set of features to obtain sets of features 252. In some embodiments, generating the respective set of features 252 comprises generating at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.

Generating sequencing coverage features may refer to generating features that are based on sequence data (e.g., Illumina® sequencing reads) covering a TSV. Generating sequencing coverage features may include, but is not limited to, generating sequencing depth of coverage of plus strands and minus strands for a TSV (i.e., minimum strand coverage), and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for a TSV (i.e., strand bias).

Minimum strand coverage may be generated based on the minimum sequencing depth of coverage of plus strands and minus strands of a TSV of the subset of the plurality of TSVs in the tumor cell sample (e.g., tumor cells) of the patient. For example, minimum strand coverage may be determined by calculating the minimum sequencing depth of coverage of plus strands and minus strands of a TSV of the subset of the plurality of TSVs in the tumor sample. For example, if the number of plus strands covering the TSV is 10 and the number of minus strands covering the TSV is 11, then the minimum strand coverage would be 10.

Strand bias may be generated based on the relative number of sequencing reads of the plus strand and the minus strand that cover the locus comprising a TSV (e.g., reads of the locus with and without the TSV in the tumor cells or normal cells of the patient). For example, strand bias may be determined by dividing the plus strand read depth of the locus comprising the TSV by the minus strand read depth of the locus comprising the TSV in tumor cells of the patient (e.g., a biological sample of tumor cells). In some embodiments, strand bias may be determined by calculating the log2 ratio between plus strand read depth and minus strand read depth.
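A minimal sketch of the two strand-level coverage features follows. The function and parameter names are illustrative assumptions; the `pseudocount` and `absolute` options correspond to variants of the log2 ratio used in some embodiments (adding 1 to each depth, and taking the absolute value), and their default values here are a choice made for the example, not a stated preference of the method.

```python
import math


def min_strand_coverage(plus_depth: int, minus_depth: int) -> int:
    """Smaller of the plus-strand and minus-strand depths covering the TSV."""
    return min(plus_depth, minus_depth)


def strand_bias(plus_depth: int,
                minus_depth: int,
                pseudocount: int = 1,
                absolute: bool = True) -> float:
    """log2 of (plus_depth + pseudocount) / (minus_depth + pseudocount).

    With pseudocount=0 this is the plain log2 ratio, which is undefined when
    either depth is zero; the pseudocount avoids that case.
    With absolute=True the sign (which strand dominates) is discarded.
    """
    ratio = (plus_depth + pseudocount) / (minus_depth + pseudocount)
    value = math.log2(ratio)
    return abs(value) if absolute else value


# Example from the text: 10 plus-strand and 11 minus-strand reads.
coverage = min_strand_coverage(10, 11)
```

A value of 0 from `strand_bias` indicates balanced coverage on both strands; larger values indicate that one strand dominates.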
In some embodiments, strand bias may be determined by calculating the absolute value of the log2 of the ratio between plus strand read depth and minus strand read depth. In some embodiments, strand bias may be determined by calculating the absolute value of the log2 of the ratio between plus strand read depth + 1 and minus strand read depth + 1.

Generating allele frequency features may refer to generating features that are based on the allele frequency of a variant or TSV in tumor cells of a patient, non-tumor cells of a patient, or a database comprising genome sequences from healthy individuals and/or individuals having a disease (e.g., cancer). Generating allele frequency features may include, but is not limited to, generating non-tumor cell depth coverage of a TSV, generating a number of observations of a TSV in tumor cells of the patient (i.e., tumor cell alternate observations), and/or generating a tumor allele frequency of the TSV.

Non-tumor cell depth coverage of a TSV may be determined based on the number of sequencing reads covering the locus comprising the TSV (e.g., including alleles with the TSV and alleles without the TSV) in non-tumor cells (e.g., a biological sample of the non-tumor cells) of the patient. In other embodiments, non-tumor cell depth coverage of a TSV may be the number of sequencing reads covering the locus containing the TSV of the patient (e.g., a 100X coverage of the locus comprising the TSV). Non-tumor cell depth coverage may be indicative of the detectability of the TSV because coverage in non-tumor cells may indicate the relative ease of amplifying or sequencing the locus containing the variant.

Tumor cell alternate observations may be determined based on the number of sequencing reads (e.g., reads from WES) containing the TSV in the tumor cells (e.g., a biological sample of tumor cells) of the patient. For example, tumor cell alternate observations may be the number of sequencing reads (e.g., reads from WES) containing the TSV of the tumor cells (e.g., a biological sample of tumor cells) of the patient. Tumor cell alternate observations may be indicative of the detectability of the TSV in ctDNA of the patient.

Tumor allele frequency may be calculated based on (1) the total number of sequencing reads covering the TSV in tumor cells (e.g., a biological sample of tumor cells) of the patient and (2) the total number of sequencing reads covering the locus comprising the variant allele (e.g., all the sequencing reads covering the locus including the sequencing reads covering the TSV). Tumor cell variant allele frequency may be calculated by dividing (1) by (2) above.

A TNC may be a series of three sequential nucleotides in a sequence read (e.g., AAA, TAT, GTA, etc.).
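The tumor allele frequency calculation and the notion of a trinucleotide context can be illustrated with a short sketch. The function names, the 0-based indexing, and the input checks are assumptions made for this example; centring the trinucleotide on a chosen position is also an illustrative choice.

```python
def tumor_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Reads containing the TSV divided by all reads covering the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must be between 0 and total_reads")
    return variant_reads / total_reads


def trinucleotide_context(sequence: str, index: int) -> str:
    """Three sequential nucleotides centred on the 0-based position `index`."""
    if index < 1 or index > len(sequence) - 2:
        raise ValueError("position needs one flanking base on each side")
    return sequence[index - 1:index + 2].upper()


# 25 reads carry the TSV out of 100 reads covering the locus.
vaf = tumor_allele_frequency(25, 100)

# Context around the base at index 2 of a toy sequence.
tnc = trinucleotide_context("ACGTA", 2)
```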
The TNC may comprise a variant (e.g., a TSV) in the middle position of the TNC. Generating the TNC error rate feature (i.e., the error rate in error corrected bins) may comprise generating the estimated probability of observing a TSV in the middle position of the TNC due to errors introduced during sample preparation and/or sequencing of one or more biological samples (e.g., biological samples collected and/or sequenced previously). Generating the TNC error rate feature may comprise obtaining data associated with the TNC error rate observed during sample preparation and/or sequencing (e.g., see Table 2). In some embodiments, generating the TNC error rate feature comprises generating an error rate feature comprising error rates that are within 50% of the error rates described in Table 2.
Figure imgf000065_0001
Figure imgf000066_0001
Table 2: TNC error rates based on previous sequencing experiments.
Generating a C to A variant mutation feature may refer to generating an indicator (e.g., binary indicator) for whether the variant is a C to A mutation. For example, generating a C to A variant mutation feature may comprise assigning a “1” to a variant mutation from C to A and a “0” to any other mutation type. In another example, generating a C to A variant mutation feature may comprise assigning a “0” to a variant mutation from C to A and a “1” to any other mutation type.

Generating primer features may refer to generating a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a primer (e.g., a first primer, a target-specific primer) designed to detect the TSV. Generating primer features may refer to generating a distance (e.g., measured in number of nucleotides) between a TSV and a binding site for a primer (e.g., a first primer), optionally different from a second primer, designed to detect the TSV. Generating a primer feature may comprise determining the minimum distance of (1) the binding site of a primer (e.g., a first primer) and the TSV and optionally (2) the binding site of a second primer and the TSV. In some embodiments, generating a primer feature may comprise determining the maximum distance of (1) the binding site of a primer (e.g., a first primer) and the TSV and optionally (2) the binding site of a second primer and the TSV.

In some embodiments, the primers designed for a TSV comprise two sets of primers (e.g., nested primers): gene specific primer 1 forward (GSP1-F) and gene specific primer 1 reverse (GSP1-R); and gene specific primer 2 forward (GSP2-F) and gene specific primer 2 reverse (GSP2-R). GSP1-F and GSP1-R may be used in a first PCR reaction to amplify a polynucleotide comprising the TSV.
GSP2-F and GSP2-R may be used in a subsequent PCR reaction to amplify a region within the polynucleotide amplified in the first reaction (e.g., to amplify a region of the amplicons generated in the first PCR reaction). Thus, generating primer features may comprise generating features using the minimum of (1) the distance between a primer binding site of GSP1-F and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site) and (2) the distance between a primer binding site of GSP1-R and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site). In another example, generating primer features may comprise generating features using the minimum of (1) the distance between a primer binding site of GSP2-F and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-F primer binding site) and (2) the distance between a primer binding site of GSP2-R and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-R primer binding site). In another example, generating primer features may comprise generating features using the maximum of (1) the distance between a primer binding site of GSP1-F and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site) and (2) the distance between a primer binding site of GSP1-R and the TSV (e.g., measured from the 3’ or 5’ end of the primer binding site). In another example, generating primer features may comprise generating features using the maximum of (1) the distance between a primer binding site of GSP2-F and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-F primer binding site) and (2) the distance between a primer binding site of GSP2-R and the TSV (e.g., measured from the 3’ or 5’ end of the GSP2-R primer binding site).

In some embodiments, an absolute, average, mean, minimum, or maximum distance between a primer binding site of one or more primers and a TSV is in a range of 0 to 250, 0 to 150 or 0 to 50 nucleotides. In some embodiments, an absolute, average, mean, minimum or maximum distance between a primer binding site of one or more primers and a TSV is 20 to 40 nucleotides, or about 30 nucleotides.

Generating sequence context features may include, but is not limited to, generating a conservation score of the sequence context comprising the TSV, a distance between the TSV and a nearest splice site in the sequence context, and/or a splice site score of the sequence context (e.g., a score indicating a splice site is located within the sequence context). A sequence context may refer to the nucleotides on either side of the TSV in the primary sequence of the genome of the patient as is further described herein including with reference to the section “Sequence Context Features”. In some embodiments, the nucleotides within the sequence context may be the nucleotides between and including a first primer binding site and a second primer binding site.
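The C to A indicator and the minimum and maximum primer-distance features described above can be sketched as follows. The names are illustrative, and several points are assumptions: positions are integer genomic coordinates on one reference strand, each binding site is represented by a single coordinate (the text allows measuring from either the 3’ or the 5’ end), the "1 for C to A" encoding is used rather than the inverted one, and the reverse-complement change (G to T) is not treated as C to A.

```python
from typing import Dict


def c_to_a_indicator(ref: str, alt: str) -> int:
    """Binary indicator: 1 when the variant is C to A, 0 for any other change."""
    return 1 if (ref.upper(), alt.upper()) == ("C", "A") else 0


def primer_distance_features(tsv_pos: int,
                             forward_site_pos: int,
                             reverse_site_pos: int) -> Dict[str, int]:
    """Min and max distance (in nucleotides) from the TSV to a primer pair.

    forward_site_pos / reverse_site_pos are the chosen reference ends of the
    forward and reverse primer binding sites (e.g., GSP1-F and GSP1-R, or
    GSP2-F and GSP2-R for the nested pair).
    """
    forward_distance = abs(tsv_pos - forward_site_pos)
    reverse_distance = abs(reverse_site_pos - tsv_pos)
    return {
        "min_distance": min(forward_distance, reverse_distance),
        "max_distance": max(forward_distance, reverse_distance),
    }


# Toy coordinates: TSV at 1000, forward site ends at 970, reverse site at 1045.
features = primer_distance_features(1000, 970, 1045)
is_c_to_a = c_to_a_indicator("C", "A")
```

In this toy case the minimum distance is 30 nucleotides, which falls within the 20 to 40 nucleotide range mentioned above.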
When using nested primers, the nucleotides within the sequence context may be the nucleotides between the primers of the outer set of primers (GSP1 primers) including the primer binding sites. In some embodiments, the nucleotides within the sequence context may be the nucleotides of the locus comprising the TSV.

Generating a context feature may comprise determining a conservation score of a polynucleotide of the patient comprising a TSV, a distance between the TSV and a nearest splice site on the polynucleotide (e.g., using SpliceSiteFinder), and/or a splice site score of the polynucleotide (e.g., using MaxEntScan). In some embodiments, generating a sequence conservation score comprises determining the conservation of the sequence (e.g., % conservation between species), e.g., using standard methods including, but not limited to, BLAST, HMMER, OrthologR, and Infernal. In some embodiments, generating the conservation score comprises generating a phastCons conservation score and/or a phyloP conservation score for a TSV of the subset of the plurality of TSVs. Methods for calculating phastCons conservation score and phyloP conservation score are known, e.g., as described in Ramani et al. (2019) Bioinformatics 35(13):2320-2322. Additionally, a conservation score of a polynucleotide comprising a TSV may be determined using any suitable algorithm that determines conservation.

In some embodiments, generating the first set of features for the first TSV comprises determining for each TSV of the subset of the plurality of TSVs: the sequencing depth of coverage of plus strands and minus strands for the TSV, the non-tumor cell depth coverage for the TSV, the number of observations of the TSV in tumor cells of the patient, and the trinucleotide context (TNC) error rate feature. In some embodiments, the method further comprises determining for each TSV of the subset of the plurality of TSVs, the maximum distance between the TSV and a binding site for the second primer designed to detect the TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the TSV, the tumor allele frequency of the TSV, the phastCons conservation score, the distance between the TSV and the binding site for the first primer designed to detect the TSV, the distance between the TSV and the nearest splice site on the polynucleotide, and a phyloP conservation score. In some embodiments, the method further comprises for each TSV, determining the C to A variant mutation feature, the minimum distance between the TSV and a binding site for the second primer designed to detect the TSV, the splice site score of the polynucleotide, the minimum distance between the TSV and the binding site for the second primer designed to detect the TSV.

Process 250 step 254 involves processing the plurality of sets of features using the trained ML model to obtain a corresponding plurality of scores.
Processing the plurality of sets of features may comprise processing using a trained machine learning (ML) model. In some embodiments, the trained ML model may be a classification model. In some embodiments, the trained ML model may be a regression model. In some embodiments, the trained ML model may be a linear model. In some embodiments, the trained ML model may be a nonlinear model. In some embodiments, the trained ML model may be any suitable ML model including, but not limited to, a linear mixed effects model with a linked logistic function, a non-linear mixed effect model, a neural network, a support-vector machine, or a random forest. In some embodiments, the trained ML model may be a random forest model. In some embodiments, the trained ML model may be a random forest classifier. The random forest classification of each TSV may be indicative of the score of the TSV. The random forest classification of each TSV (e.g., the bin assigned to the TSV) may be the score of the TSV.

FIG.12 is a diagram depicting an illustrative technique 1200 for training the trained machine learning model to generate a score indicative of the predicted detectability of a TSV 1218, according to some embodiments of the technology described herein. MRD positive patients 1202 were monitored with previously designed patient-specific panels 1208 (e.g., panels comprising primers that were designed using a different method). The previously designed patient-specific panels provided the TSV detectability 1206 in the corresponding patient 1202 with 1214 indicative of the presence of a TSV and 1212 indicative of the absence of a TSV. Analyzing previously designed patient-specific panels 1208 (e.g., by sequencing amplicons produced by the patient-specific panel) produced variant data 1210 associated with each previously designed patient-specific panel 1208. The training data 1204 comprises the TSV Detectability 1206 and Variant Data 1210. The training data 1204 is used in training the machine learning model 1216 with the objective of generating scores indicative of the predicted detectability of a TSV in a MRD positive patient 1218 (e.g., in sequencing data of a biological sample of an MRD positive patient).

In some embodiments, training data 1204 may have been collected from biological samples of a plurality of MRD positive patients 1202 (e.g., as described herein). Patients of the plurality of MRD positive patients 1202 may have been previously diagnosed with cancer as described herein. The plurality of patients may consist of MRD positive patients. The plurality of MRD positive patients 1202 may comprise patients that have been previously diagnosed with a specific cancer type. The plurality of MRD positive patients 1202 may consist essentially of patients that have been previously diagnosed with a specific cancer type.
The plurality of MRD positive patients 1202 may consist of patients that have been previously diagnosed with a specific cancer type (e.g., lung cancer or melanoma). Patients previously treated for lung cancer may comprise patients previously treated for non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), and lung adenocarcinoma.

In some embodiments, the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for the same type of cancer. In some embodiments, the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for melanoma. In other embodiments, the machine learning model may be trained using data from a plurality of MRD positive patients 1202 who have been treated for lung cancer. In these embodiments, the trained machine learning model may reflect cancer type-specific biases in TSV detectability, and achieve a higher level of sensitivity for additional samples derived from that cancer type. However, a model trained using data from a plurality of MRD positive patients previously treated for a first cancer type (e.g., lung cancer) may also be predictive of the detectability of TSVs in a different cancer type (e.g., melanoma) (e.g., as described in Example 3).

In some embodiments, the plurality of MRD positive patients 1202 may comprise patients previously diagnosed with different cancer types (e.g., a first patient may be previously diagnosed with a first cancer type, a second patient may be previously diagnosed with a second cancer type, a third patient may be previously diagnosed with a third cancer type, etc.). For example, the plurality of patients 1202 may comprise patients previously treated for one or more of brain cancer, liver cancer, kidney cancer, immune cancer, breast cancer, skin cancer, bone cancer, uterine cancer, prostate cancer, testicular cancer, colon cancer, squamous cell carcinoma, etc. For example, the plurality of patients 1202 may comprise patients that have been previously diagnosed with lung cancer and patients that have been previously diagnosed with melanoma (e.g., as described in the Examples). Thus, in some embodiments, a machine learning model may be trained using data from different patients previously diagnosed with different cancer types. These machine learning models may be more generalized, as the features explaining TSV detectability that are common across different cancer types may be prioritized. In these embodiments, the fitted model may be beneficial for prioritizing variants in new cancer types for which there is not yet pre-existing data to train on, as well as rare cancer types that are limited in availability.
Each TSV (e.g., 1212 and 1214) previously monitored in each biological sample of a plurality of biological samples collected from a plurality of MRD positive patients 1202 may have been monitored using one or more of previously designed patient-specific panels 1208. Thus, the number of TSVs of the plurality of TSVs used for training the model may be dependent on the number of TSVs being targeted by the previously designed patient-specific panel 1208. The previously designed patient-specific panel 1208 may target at least 50, at least 75, at least 100, at least 150, at least 200, or at least 250 TSVs. In some embodiments, 100-300 TSVs are targeted by the previously designed patient-specific panels 1208. In some embodiments, 150-250 TSVs are targeted by the previously designed patient-specific panels 1208. In some embodiments, 200 TSVs are targeted by the previously designed patient-specific panels 1208.

The plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels 1208 may comprise at least 25 MRD positive patients, at least 50 MRD positive patients, at least 75 MRD positive patients, at least 100 MRD positive patients, at least 150 MRD positive patients, at least 200 MRD positive patients, at least 300 MRD positive patients, at least 400 MRD positive patients, at least 500 MRD positive patients, or at least 1000 MRD positive patients. The plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 25-500 MRD positive patients. The plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 25-75 MRD positive patients. The plurality of MRD positive patients 1202 monitored using the previously designed patient-specific panels may comprise 50 MRD positive patients. The plurality of MRD positive patients may comprise 499 patients previously treated for melanoma and/or 57 patients previously treated for lung cancer.

Generating the trained machine learning model 1216 may comprise training a machine learning model to generate a score indicative of the predicted detectability of a TSV 1218 in a biological sample (e.g., plasma) of a MRD positive patient. Training the trained machine learning model may comprise: obtaining, for a plurality of previously monitored TSVs 1212 and 1214 in each biological sample of a plurality of biological samples collected from a plurality of MRD positive patients 1202, sets of training data, each set of training data comprising: (i) an indication (e.g., a binary indication) of whether the TSV is present or absent 1206 in the biological sample; and (ii) variant data 1210 associated with the TSV (e.g., features derived from the variant data). Training the trained machine learning model may further comprise using the sets of training data to estimate a score indicative of detectability of a TSV in a biological sample from a MRD positive patient.

The previously designed patient-specific panels 1208 may be used to monitor at least 50, at least 75, at least 100, at least 150, at least 200, or at least 250 TSVs.
In some embodiments, the previously patient-specific panels 1208 may be used to monitor 100-300 TSVs. In some embodiments, the previously patient-specific panels 1208 may be used to monitor 150-250 TSVs. In some embodiments, the previously designed patient-specific panels 1208 may be used to monitor 200 TSVs. Because the previously designed patient- specific panels are patient-specific, the previously designed patient-specific panels may not monitor the same TSVs. The indication of whether the TSV is present or absent 1206 in the biological sample may be any suitable indication. For example, the indication may be a negative indication indicating the TSV was not detected (e.g., a value of 0) or a positive indication indicating the TSV was detected (e.g., a value of 1). Detection may be based on a threshold (e.g., the lower limit of detection of the sequencing instrument) thus a detected TSV may be a TSV with an
allele frequency in the biological sample of the patient that exceeds a threshold and an undetected TSV may be a TSV with an allele frequency in the biological sample of the patient that does not exceed the threshold. The indication may be the allele frequency at which the TSV was detected. The indication of whether the TSV is present or absent may be the TSV being present in the biological sample at an allele frequency that exceeds a threshold. The threshold may be the limit of detection of the patient-specific panel. The threshold may be an allele frequency of at least 0.0001, 0.001, 0.01, 0.1 or higher. The threshold may exceed the expected experimental noise associated with preparation and/or sequencing of the biological sample. The variant data 1210 associated with each TSV may comprise one or more of the features described herein. For example, the variant data associated with each TSV may comprise one or more of at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature. The variant data associated with each TSV may comprise at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature. The objective of training the model 1218 may be generating a score indicative of the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient. Training the machine learning model 1216 is further described herein and with reference to FIG.13. FIG.13 is a flowchart of an illustrative process 1300 for training the trained machine learning model (e.g., a random forest), according to some embodiments of the technology described herein.
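The binary present/absent indication described above can be illustrated with a minimal sketch. The class name `Observation`, the function `label_detection`, the TSV identifiers, and the default threshold of 0.001 are all hypothetical choices made for illustration; the specification permits other thresholds (e.g., the limit of detection of the panel).

```python
from dataclasses import dataclass


@dataclass
class Observation:
    tsv_id: str              # hypothetical TSV identifier
    allele_frequency: float  # allele frequency observed in the biological sample


def label_detection(obs: Observation, threshold: float = 0.001) -> int:
    """Return 1 (detected) if the allele frequency exceeds the threshold, else 0."""
    return 1 if obs.allele_frequency > threshold else 0


# Illustrative values only; these are not data from the specification.
observations = [
    Observation("tsv_a", 0.004),
    Observation("tsv_b", 0.0),
    Observation("tsv_c", 0.001),  # equal to the threshold, so it does not exceed it
]
labels = {o.tsv_id: label_detection(o) for o in observations}
print(labels)
```

Each label would then be paired with the features derived from the variant data of the same TSV to form one set of training data.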
Process 1300 begins with obtaining training data 1302. Obtaining training data 1302 may comprise obtaining training data in any suitable way. Obtaining training data 1302 may comprise obtaining data from previously performed monitoring of TSVs in MRD positive patients (as described herein). Obtaining training data 1302 may comprise obtaining, for each previously monitored MRD positive patient, variant data associated with the TSVs of the MRD positive patient, and a corresponding indication of whether each TSV is present or absent in a biological sample of the patient. Training data may be obtained from a plurality of MRD positive patients, as described herein. The training data obtained may be used to estimate the trained machine learning model 1304 by estimating the model parameters 1306 based on estimated model hyperparameters 1308. This may be an iterative process where the
hyperparameters are estimated (e.g., using a grid search as described herein), then the model parameters are estimated using the estimated hyperparameters and the training data, and the fit of the model to validation data is determined. This may be repeated with adjusted hyperparameters based on the grid search. To avoid overfitting of the model during selection of hyperparameters, cross validation may be used. Cross validation may refer to training using a first portion of the training data and then using the remaining training data to assess the accuracy of the model (e.g., in determining an indication of the detectability of a TSV). Estimating model parameters 1306 and estimating model hyperparameters 1308 may be performed using known methods (e.g., using scikit-learn). Estimating model hyperparameters 1308 may comprise performing a grid search of the hyperparameters. A grid search may refer to a method where, for each hyperparameter, a set of values encompassing the range of potential values is predefined, and combinations of these values for each hyperparameter are considered (e.g., the model’s parameters are estimated using a given set of hyperparameters and the predictive performance of the model is determined). In an exhaustive grid search, all possible combinations of the predefined values for all hyperparameters may be considered. In a randomized grid search, a subset of the possible combinations may be sampled and considered. A randomized grid search may be used to identify a near optimal set of hyperparameters in potentially less time than an exhaustive grid search (due to the combinatorics of the problem, the grid can potentially be quite large, and take a long time to compute). In some embodiments, the grid search may be an exhaustive grid search. In some embodiments, the grid search may be a randomized grid search. In some embodiments, the hyperparameters may be determined one at a time.
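As an illustration of the exhaustive and randomized grid searches with cross validation described above, the following is a minimal scikit-learn sketch. The synthetic data, the particular grid values, the three-fold cross validation, and the ROC AUC scoring metric are assumptions made for the example and are not taken from the specification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for (features, detected/not-detected) training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Predefined values for each hyperparameter (illustrative choices).
param_grid = {
    "n_estimators": [25, 50],    # number of trees in the forest
    "max_depth": [3, None],      # maximum depth of each tree
    "min_samples_leaf": [1, 5],  # minimum samples in a newly created leaf
}

# Exhaustive search: every combination (2 * 2 * 2 = 8) is cross-validated.
exhaustive = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="roc_auc"
)
exhaustive.fit(X, y)

# Randomized search: only a sampled subset (here 4 combinations) is cross-validated.
randomized = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
randomized.fit(X, y)

print(exhaustive.best_params_)
print(randomized.best_params_)
```

The randomized search evaluates half as many combinations here; the saving grows quickly as hyperparameters or candidate values are added to the grid.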
In some embodiments, the grid search may comprise adjusting the value of one hyperparameter at a time. In some embodiments, the grid search may comprise adjusting the values of at least one hyperparameter at a time. Estimating model hyperparameters 1308 may comprise estimating hyperparameters of a random forest model. For example, for a random forest model, the hyperparameters may be one or more of the number of trees in the forest, the function to measure the quality of a split, the number of features to consider when looking for the best split, the maximum depth of the tree, the minimum number of samples required to split an internal node, the minimum number of samples in newly created leaves, and/or the maximum number of leaves to grow in the tree. Training the trained machine learning model may also comprise selecting features for use in the trained machine learning model. Selecting features may comprise: training a
machine learning model using one or more features described herein, determining values indicative of the predictive contribution of each feature in the trained machine learning model (e.g., partial dependence, Shapley Additive exPlanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), or Individual Conditional Expectation (ICE) values), and training a second machine learning model using features with a predictive contribution above a threshold (e.g., see Example 2). For example, SHAP values may be used to identify features that perform better than a randomly generated feature. These features may be selected for use in training the model whereas features that perform worse than a randomly generated feature may be excluded from use in the model. FIG.3 is a diagram depicting an illustrative technique 300 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using a trained machine learning model, according to some embodiments of the technology described herein. Diagram 302 represents the TSVs identified previously using methods described herein, with 304 representing specific TSVs, 306 representing the sequence of the tumor sample, and 308 representing the sequence of the non-tumor sample. Diagram 310 represents features associated with each TSV of diagram 302 (e.g., as described above). The TSVs in diagram 302 and features in diagram 310 are processed in trained machine learning model 312 to generate a score indicative of the predicted detectability of each of the TSVs. The TSVs are ranked 317 by score 316 and the top X TSVs (e.g., TSVs exceeding a threshold 318, where X may be the number of TSVs desired or needed for inclusion in the patient-specific panel) are selected as the subset of the plurality of TSVs. Technique 300 begins with obtaining data indicative of tumor specific variants (TSVs) 302 and data indicative of features for each TSV 310.
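The feature-selection step described above (keeping features that outperform a randomly generated feature, then training a second model) can be sketched as follows. This sketch substitutes scikit-learn's permutation importance for SHAP values, which is a different measure of predictive contribution; the feature names and synthetic data are hypothetical, and importance is computed on the training data only to keep the example short (held-out data would normally be preferable).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
names = ["coverage", "allele_frequency", "uninformative"]  # hypothetical feature names
X = rng.normal(size=(n, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only the first two features carry signal

# Append a randomly generated feature to serve as the baseline.
X_aug = np.hstack([X, rng.normal(size=(n, 1))])

first_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_aug, y)
result = permutation_importance(first_model, X_aug, y, n_repeats=10, random_state=0)
importances = result.importances_mean
baseline = importances[-1]  # contribution of the random feature

# Keep only features whose contribution exceeds that of the random feature.
selected = [name for name, imp in zip(names, importances[:-1]) if imp > baseline]

# Train the second model using the selected features only.
columns = [names.index(name) for name in selected]
second_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, columns], y)
print(selected)
```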
TSVs 304 may refer to variants that are present in the tumor sequence 306 (e.g., the genome of the tumor) and absent in the non-tumor sequence 308 (e.g., the genome of non-tumor tissue). However, this need not always be the case. TSVs may also refer to variants that have a greater allele frequency in the tumor sequence 306 than the non-tumor sequence 308. Additionally, TSVs 302 may refer to TSVs that are identified using any suitable method including, but not limited to the methods described herein including in the section “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset.” Data indicative of TSVs 302 may be stored in any suitable file format. Features for each TSV 310 may refer to any suitable features that are indicative of the detectability of a TSV. For example, these features may include features related to the sequence coverage, allele frequency, sequencing error rate, primer design (e.g., primers
designed for use in detection) and variant sequence context, including as described herein including with reference to FIG.2B and the section “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset.” Data indicative of these features may be generated by any suitable method, including but not limited to as described herein including with reference to FIG.2B. Data indicative of features for each TSV 310 may be stored in any suitable file format. Technique 300 may continue with, for one or more of the TSVs of 302, inputting the data indicative of a TSV of the one or more TSVs and the features indicative of the TSV into the trained ML model 312, and outputting a score indicative of the predicted detectability of the TSV. This may be repeated for each TSV of the TSVs 302 (e.g., each TSV of the subset of the plurality of TSVs). The trained ML model 312 may be an ML model of any suitable type. For example, the trained ML model 312 may be a trained ML model as described herein including in reference to FIG.2B. The trained ML model 312 may be a nonlinear model (e.g., a random forest). The trained ML model 312 may be trained using any suitable method including, but not limited to, the methods described herein including with reference to FIG.2B. The trained machine learning model 312 may be trained with data comprising (1) an indication of the detectability of a TSV (e.g., in a patient with or without MRD) and (2) features associated with the TSV, including, but not limited to, as described herein including with reference to FIG.2B. An indication of MRD in the patient may be used as a proxy for the detectability of a TSV. The trained ML model 312 may output scores 316 associated with each TSV (and corresponding features) that are inputted into the model. The score 316 may be indicative of the detectability of a corresponding TSV in a biological sample of the patient.
The score 316 may be the predicted likelihood of detecting the TSV in a biological sample (e.g., plasma) of an MRD positive patient (e.g., detecting using a patient-specific panel). The score 316 may be the predicted likelihood of detecting the TSV in a biological sample of an MRD positive patient. When using a random forest classifier model, a likelihood of detecting the TSV in a biological sample of an MRD positive patient may be determined as follows: Each tree of the random forest may provide a probability defined as the proportion of samples in the terminal leaf that belong to class "1" (i.e., the TSV is detected). These probabilities may be aggregated across all of the trees to determine a likelihood that, given the input features, the TSV is detected. After scoring using the trained ML model, the scores 316 may be ranked 317 and the top TSVs may be selected for the panel (e.g., the patient-specific panel) in step 314. The top TSVs may be selected based on a threshold 318. In some embodiments, the threshold may be
determined based on the total number of TSVs that are needed for the panel. For example, if 50 TSVs are needed for the panel then the threshold may be at the 50th TSV. In some embodiments, the threshold may be selected by any suitable method including, but not limited to, the methods described herein, including the methods described in reference to FIG.2B. In some embodiments, selecting TSVs for inclusion in the panel comprises selecting the TSVs with the highest scores. In some embodiments, selecting TSVs for inclusion in the panel comprises selecting TSVs with scores above a threshold. In some embodiments, at least 10 (e.g., at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, or at least 100) TSVs may be selected from the subset of the plurality of TSVs. In some embodiments, the number of TSVs (e.g., top X TSVs) selected for the subset of the plurality of TSVs may be determined based on the number of TSVs that need to be monitored to indicate MRD (e.g., to have confidence in an indication of MRD). FIG.4 is a flowchart of an illustrative process 400 for identifying the subset of the plurality of TSVs for use in a patient-specific panel, according to some embodiments of the technology described herein.
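The scoring and selection steps described above can be sketched with scikit-learn. In this minimal example, the per-tree class "1" probabilities are aggregated by averaging (which is how scikit-learn's `RandomForestClassifier.predict_proba` combines trees; the specification does not name a particular aggregation), and the top X TSVs are then selected by rank. The synthetic data, the TSV identifiers, and the panel size of 5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic training data: features per TSV and a detected (1) / not detected (0) label.
X_train = rng.normal(size=(300, 4))
y_train = (X_train[:, 0] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Candidate TSVs for a new patient (hypothetical identifiers and features).
X_candidates = rng.normal(size=(20, 4))
tsv_ids = [f"tsv_{i}" for i in range(20)]

# Each tree reports the proportion of class "1" samples in the terminal leaf;
# averaging across trees gives the score for each candidate TSV.
per_tree = np.stack(
    [tree.predict_proba(X_candidates)[:, 1] for tree in forest.estimators_]
)
scores = per_tree.mean(axis=0)

# Rank by score and keep the top X TSVs, where X is the required panel size.
panel_size = 5
ranked = sorted(zip(tsv_ids, scores), key=lambda pair: pair[1], reverse=True)
panel = [tsv_id for tsv_id, _ in ranked[:panel_size]]
score_threshold = ranked[panel_size - 1][1]  # score of the last TSV admitted to the panel
print(panel, score_threshold)
```

Setting the threshold at the score of the X-th ranked TSV corresponds to the case described above in which the threshold is placed at, for example, the 50th TSV.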
Process 400 has the following acts: act 402, obtain the variant data indicative of a plurality of variants of the patient; act 404, identify a plurality of TSVs for the patient based on allele frequencies of the variants; act 406, identify a subset of the plurality of the TSVs for use in the patient-specific panel (which comprises acts 408, 410 and 412); act 408, generate a set of features corresponding to at least some of the TSVs of the subset of the plurality of TSVs; act 410, process the features using a trained ML model to obtain scores for each TSV that are indicative of the detectability of the TSV; and act 412, select, using the scores, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel. Illustrative process 400 begins with obtaining variant data indicative of a plurality of variants of the patient, the variant data previously-generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient 402. Obtaining variant data may comprise obtaining variant data comprising data indicative of the variants identified in at least one biological sample (e.g., in the genome of a tumor cell sample and/or a non-tumor cell sample). Data indicative of the variants may be any suitable data and may include, but is not limited to, the variant data described herein including in the section “Variant Data”. The variant data may comprise the data needed to calculate the features described herein. In some embodiments, the variant data may comprise data associated with primer design (e.g., primers designed to detect variants), as described herein.
The variant data in act 402 is previously generated by analyzing sequence data of at least one biological sample obtained from the patient. Sequence data may be any suitable sequence data of at least one biological sample for the patient. For example, the sequence data may be sequence data as described herein including in the section “Sequence data”. Act 402 may also comprise generating the sequence data by sequencing the DNA and/or RNA of the biological sample using any suitable method including, but not limited to, the methods described herein. Additionally, sequencing of the samples and generation of the variant data for one or more of the biological samples may be performed by a third party. The one or more biological samples obtained from the patient may comprise a tumor cell sample (e.g., tumor cells). The one or more biological samples obtained from the patient may comprise a non-tumor cell sample (e.g., non-tumor cells). The biological samples may be any suitable biological samples (e.g., tumor cells or non-tumor cells) including as described herein including in the section “Biological Samples”. The patient may be any patient having diseased tissue (e.g., cancer or a tumor) including but not limited to as described herein including with reference to the section “Patient”. The patient may be a patient having lung cancer or melanoma. Process 400 continues with act 404, identifying a plurality of tumor-specific variants (TSVs) for the patient. A plurality of TSVs may comprise any suitable number of TSVs including as described herein including with reference to FIG.1. For example, the plurality of TSVs may contain at least twice the number of TSVs as needed in the patient-specific panel (e.g., needed to determine an indication of MRD).
Having a plurality of TSVs comprising twice the number of TSVs required for the panel may increase the chances that at least the minimum number of suitable TSVs are identified for use in the patient-specific panel. A plurality of TSVs may comprise at least 50, at least 100, at least 150, or at least 200 TSVs. The plurality of TSVs may be selected using any suitable methods including but not limited to as described herein and in reference to the section “Tumor Specific Variants (TSVs) and Features For Selecting TSVs.” For example, the plurality of TSVs may be selected using one or more features that are indicative of the detectability of the TSV (e.g., the detectability using primer amplification and/or next generation sequencing). Identifying TSVs for the plurality of TSVs may comprise applying thresholds based on the features described herein. For example, the healthy population allele frequency features may be used to identify a variant as a TSV when the allele frequency of the variant in a healthy population does not exceed a threshold, as described herein and in the section “Healthy Population Variant Allele Frequency” (a variant exceeding the threshold being indicated as not tumor associated).
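The identification of candidate TSVs described above can be sketched as a simple filter. The class `Variant`, the function `is_tsv`, the variant names, the allele frequencies, and the healthy-population threshold of 0.01 are hypothetical; the sketch only illustrates combining the tumor versus non-tumor comparison with a healthy-population allele frequency threshold, and checking that at least twice the required number of TSVs is available.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str             # hypothetical variant label
    tumor_af: float       # allele frequency in the tumor sequence
    normal_af: float      # allele frequency in the non-tumor sequence
    population_af: float  # allele frequency in a healthy population


def is_tsv(v: Variant, max_population_af: float = 0.01) -> bool:
    """A variant is treated as a TSV if it is enriched in the tumor relative to the
    non-tumor sample (which includes being absent from the non-tumor sample) and its
    healthy-population allele frequency does not exceed the threshold."""
    enriched_in_tumor = v.tumor_af > v.normal_af
    rare_in_population = v.population_af <= max_population_af
    return enriched_in_tumor and rare_in_population


# Illustrative values only.
variants = [
    Variant("var_1", tumor_af=0.30, normal_af=0.00, population_af=0.0001),
    Variant("var_2", tumor_af=0.25, normal_af=0.02, population_af=0.0),
    Variant("var_3", tumor_af=0.50, normal_af=0.50, population_af=0.20),   # germline-like
    Variant("var_4", tumor_af=0.40, normal_af=0.00, population_af=0.05),   # common in healthy population
]
tsvs = [v.name for v in variants if is_tsv(v)]

panel_size = 1  # number of TSVs needed in the panel (illustrative)
has_enough_candidates = len(tsvs) >= 2 * panel_size
print(tsvs, has_enough_candidates)
```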
Process 400 next continues with identifying a subset of the plurality of TSVs for use in the patient-specific panel for detecting MRD in the patient 406. A subset of the plurality of TSVs may refer to TSVs whose presence in a biological sample (e.g., plasma) is indicative of MRD. A subset of the plurality of TSVs is described herein including in the section, “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset.” A patient-specific panel may be any suitable patient-specific panel including as described herein in the section “Patient-specific panel” and with reference to FIG.1. The patient-specific panel may comprise techniques (e.g., PCR primers) for detecting the subset of the plurality of TSVs in a biological sample (e.g., plasma from a patient). Any suitable method for detecting MRD may be used including but not limited to methods described herein including in the section “Minimal Residual Disease (MRD)”. Act 406 of process 400 comprises three acts: 408, 410 and 412. Act 408 comprises generating for each of at least some of the plurality of TSVs a respective set of features to obtain a plurality of sets of features. Features, for use in the respective set of features, may comprise any suitable features. Suitable features may be features that are indicative of the detectability of a TSV in a patient (e.g., in ctDNA of a patient). The respective set of features may include, but is not limited to, the features described herein including in the section “Subset of the Plurality of Tumor Specific Variants and Features For Selecting the Subset”. Generating for each of at least some of the plurality of TSVs a respective set of features to obtain a plurality of sets of features may comprise generating the respective set of features using any suitable method including but not limited to the methods described herein, including with reference to FIG.2B.
Generating the plurality of sets of features may comprise generating using the variant data described herein. Generating the plurality of sets of features may produce a corresponding input for the trained machine learning algorithm comprising features associated with at least some of the TSVs (e.g., each of the TSVs of the plurality of TSVs). Act 410 of process 400 comprises processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores indicative of the predicted detectability of a corresponding TSV. Processing the plurality of sets of features may comprise: (1) inputting a set of features of the plurality of sets of features into the trained ML algorithm and (2) outputting a score indicative of the predicted detectability of a TSV associated with the set of features. Processing the plurality of sets of features may be performed using any suitable trained machine learning model, including but not limited to, the machine learning models described herein including with reference to FIG.2B. For
example, processing may be performed using a trained random forest model. The trained machine learning model may be trained using any suitable method including but not limited to the methods described herein including with reference to FIG.2B. The training may be performed using previous MRD data collected from patients using patient-specific panels and sequencing (e.g., patients previously treated for cancer). Act 412 of process 400 comprises selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel. Selecting the TSVs may be performed using any suitable method including but not limited to the methods described herein including with reference to FIG.2B and FIG.3. For example, selecting may include ranking the TSVs according to score and then selecting the highest-ranking TSVs. The “top” may be determined based on the number of TSVs needed in the patient-specific panel. For example, in a specific patient with a specific cancer/tumor type, 50 TSVs may be needed in the patient-specific panel to produce an accurate indication of MRD. In a different patient, 100 TSVs may be needed. In some embodiments, the number of TSVs may be selected based on the type of cancer. Selecting TSVs may also include selecting TSVs with scores that exceed a threshold as described herein. TSVs above the threshold may all be sufficiently detectable to include in the patient-specific panel. For example, 50 TSVs may be needed for the patient-specific panel, but there may be 100 TSVs with scores that exceed the threshold; thus, any 50 of the TSVs that exceed the threshold may be selected. In some embodiments, the TSVs selected from the TSVs that exceed the threshold may be selected based on the cancer type. In some embodiments, the TSVs selected from the TSVs that exceed the threshold may be selected based on patient-specific attributes.
FIG.5 is a diagram depicting an illustrative technique 500 for identifying the subset of the plurality of TSVs for use in a patient-specific panel using variants identified by sequencing non-tumor cells and tumor cells of the patient to identify TSVs and exclude non-tumor-specific variants, scoring the TSVs using a trained machine learning model, and selecting TSVs for the patient-specific panel using the scores, according to some embodiments of the technology described herein. At 502 and 504 variants from whole exome sequencing of non-tumor cells (502) and whole exome sequencing of tumor cells (504) are obtained. Obtaining variants in steps 502 and 504 may comprise obtaining variant data using any suitable method including methods described herein including in the section “Variant Data” and in reference to FIG.1. For example, obtaining variant data may comprise
obtaining the sequence of at least some of the variants and corresponding data about the sequence context of at least some of the variants. Following steps 502 and 504, non-tumor specific variants 506 are excluded from consideration 518 and tumor specific variants 510 are identified. Tumor-specific variants 510 may be identified using any suitable methods, including but not limited to methods described herein including in the section “Tumor Specific Variants (TSVs) and Features For Selecting TSVs.” In some embodiments, non-tumor specific variants 506 may be variants that are not tumor specific variants 510. However, this is not always the case. Lower tier variants 508 may refer to variants that were placed into lower tiers when identifying tumor-specific variants as described herein, and were subsequently excluded from consideration with the non-tumor specific variants. For example, variants 502 may be tiered according to detectability of each variant. If 50 TSVs are needed at step 510, then 50 variants may be selected from among the top tiers of variants (e.g., the top 50 variants), whereas the lower tier variants are excluded from consideration 518. Methods for tiering variants and identifying lower-tier variants are described herein including with reference to FIG.2A. After selecting TSVs, technique 500 involves obtaining scored TSVs 512. TSV scoring may be performed using any suitable method including but not limited to any method described herein including with reference to FIG.2B. Scoring variants may comprise two steps: (1) generating sets of features associated with at least some of the TSVs and (2) processing each set of features using a trained machine learning model that outputs a score indicative of the predicted detectability of the TSV. Generating features may comprise generating features associated with each TSV (e.g., TSV mutation type, allele frequency etc.)
using any suitable method including but not limited to generating features as described herein including as described in reference to FIG.2B. Processing each set of features using a trained machine learning model may comprise processing using a nonlinear trained machine learning model (e.g., a random forest model). The trained machine learning model may be trained using any suitable method including but not limited to the methods described herein, including with reference to FIG.2B. For example, the trained machine learning model may be trained with data comprising patient-specific panel data previously collected when monitoring patients for MRD and data indicating the MRD status of the patient (e.g., positive for MRD or negative for MRD). Processing the set of features using the trained machine learning model may produce scores that are indicative of the detectability of the TSVs in a biological sample (e.g., plasma) of a patient.
The scored TSVs may be selected 514 for a panel 516 using any suitable method including, but not limited to, the methods described herein including with reference to FIG.3. For example, the scored TSVs 512 may be ranked according to score and then a TSV may be selected 514 for a panel 516 if the TSV score exceeds a threshold. Alternatively, in some embodiments, the TSVs with the top scores may be selected for the patient-specific panel 516.
Methods of Determining an Indication of MRD
In some embodiments, monitoring a patient for MRD using a patient-specific panel as disclosed herein comprises sequencing nucleic acids obtained from a suitable sample. In some embodiments, monitoring a patient for MRD using a patient-specific panel as disclosed herein comprises sequencing nucleic acids derived from circulating cells. In some embodiments, monitoring a patient for MRD using a patient-specific panel as disclosed herein comprises sequencing cfDNA and/or cfRNA. In some embodiments, monitoring a patient for MRD using a patient-specific panel as disclosed herein comprises sequencing tumor DNA and/or tumor RNA. In some embodiments, monitoring a patient for MRD using a patient-specific panel as disclosed herein comprises sequencing ctDNA. In some aspects, this disclosure describes a method for determining whether sequence data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: generating sequence data from the biological sample of the patient, the generating comprising contacting the biological sample with primers identified for the subset of the plurality of TSVs identified using a method of designing a patient-specific panel as described herein; detecting TSVs using the sequence data; and determining, using the detected TSVs, whether the biological sample provides an indication of MRD (e.g., as described herein).
In some embodiments, the biological sample is a biological sample from the patient that is expected to contain ctDNA when MRD is present. In some embodiments, the biological sample is a fluid, secretion, or mucosae of the patient (e.g., as described herein). In some embodiments, the biological sample is a blood, serum or plasma sample of the patient. In some embodiments, determining whether the biological sample provides an indication of MRD comprises using a statistical test to compare the sequencing error rate and the allele frequency of the TSV in tumor-derived polynucleotides. In some embodiments, determining whether the biological sample provides an indication of MRD comprises using a statistical test to compare the sequencing error rate and the allele frequency of the TSV in circulating tumor DNA (ctDNA) (e.g., as described herein and in reference to the section entitled “Minimal Residual Disease (MRD)”). In some embodiments,
determining an indication of MRD comprises determining if the total number of times all of the TSVs are observed in sequence data of a biological sample of the patient exceeds the expected number of TSVs to be observed due to error associated with sample preparation (e.g., DNA extraction and amplification with primers of the patient-specific panel) and detection (e.g., sequencing). In some embodiments, the method further comprises administering a therapeutic (e.g., a cancer therapeutic) to a patient with a positive indication of MRD or continuing MRD monitoring (e.g., as described herein) in a patient with a negative indication of MRD. In some embodiments, the method further comprises administering a therapeutic (e.g., a cancer therapeutic) to a patient with a positive indication of MRD or collecting one or more additional samples (e.g., blood) from the patient with a negative indication of MRD (e.g., for use in determining an indication of MRD). The one or more additional biological samples may be collected from the patient over a specified time interval. For example, when a first biological sample from a patient does not have an indication of MRD then a second biological sample may be collected 6 months after the first biological sample is collected. This may continue until determined to be unnecessary (e.g., the patient has a positive indication of MRD, the patient dies, a medical professional determines MRD monitoring is no longer necessary, or the patient decides to no longer monitor for MRD). The time interval between collecting biological samples may be any suitable time interval. Suitable time intervals may be determined based on the type of MRD being monitored. For example, MRD associated with faster growing cancers/tumors may be monitored in shorter time intervals than MRD associated with slower growing cancers/tumors. In some embodiments, the time interval is determined by the skilled person.
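The comparison described above, between the total number of TSV observations and the number expected from preparation and sequencing error alone, can be sketched with a one-sided Poisson test. The choice of a Poisson model, the significance level of 0.01, the function names, and the read counts, depths, and per-site error rates are all assumptions made for illustration; the specification refers only to a statistical test and does not prescribe this one.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def mrd_indication(alt_counts, depths, error_rates, alpha=0.01):
    """Compare total TSV-supporting reads against the count expected from error alone."""
    observed = sum(alt_counts)
    # Expected error-driven observations: sum over TSV sites of depth * error rate.
    expected = sum(d * e for d, e in zip(depths, error_rates))
    p_value = poisson_tail(observed, expected)
    return p_value < alpha, observed, expected, p_value


# Four monitored TSV sites, each sequenced to depth 10,000 with a 1e-4 error rate,
# so about 4 variant-supporting reads are expected from error alone.
depths = [10_000] * 4
error_rates = [1e-4] * 4

positive_call = mrd_indication([5, 3, 6, 4], depths, error_rates)  # 18 reads observed
negative_call = mrd_indication([1, 0, 2, 1], depths, error_rates)  # 4 reads observed
print(positive_call)
print(negative_call)
```

Pooling counts across all monitored TSVs, rather than testing each site separately, reflects the text's reference to the total number of times all of the TSVs are observed.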
In some embodiments, the time interval between collecting biological samples may be 1 month, 2 months, 3 months, 6 months, 1 year or more. The time interval need not be consistent over time. For example, a biological sample may be collected every month for the first six months after cancer treatment and then every 6 months thereafter. In some embodiments, the method further comprises treating cancer (e.g., by administering an anti-cancer therapeutic that is expected to treat the cancer of the patient) in a patient with a positive indication of MRD (e.g., the method indicates the patient has, may have, or possibly will have disease relapse (e.g., cancer relapse)) or continuing MRD monitoring (e.g., as described herein) in a patient with a negative indication of MRD (e.g., the method indicates that the patient does not have cancer relapse). In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with sensitivity greater than
a threshold probability of detecting MRD in a patient who has MRD. The threshold may be between 0.8 and 1. The threshold may be between 0.85 and 1. The threshold may be between 0.85 and 0.97. The threshold may be 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, or 0.99. In some embodiments, the probability of detecting MRD in a patient who has MRD is a probability of detecting MRD 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 weeks prior to detection of a new tumor by surveillance imaging. In some embodiments, the probability of detecting MRD in a patient who has MRD is a probability of detecting MRD at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9 or at least 10 weeks prior to detection of a new tumor by surveillance imaging. In some embodiments, the probability of detecting MRD in a patient who has MRD is a probability of detecting MRD at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9 or at least 10 months prior to detection of a new tumor by surveillance imaging. In some embodiments, determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a threshold probability of not detecting MRD in a patient that does not have MRD. The threshold may be between 0.95 and 1. The threshold may be between 0.98 and 1. The threshold may be 0.95, 0.96, 0.97, 0.98, 0.99 or 1.

Methods of Administering Therapeutics

In some embodiments, this disclosure provides a method of administering a therapeutic to a patient having an indication of MRD (e.g., determined using a method described herein) or continuing MRD monitoring (e.g., using a patient-specific panel) in a patient that does not have an indication of MRD.
In some embodiments, administering the therapeutic comprises administering a therapeutic designed to treat the disease of the patient (e.g., a therapeutic designed, known or expected to treat the cancer of the patient). Anti-cancer therapeutics are well known in the art. For example, Pantziarka et al. describe an open access database of licensed cancer drugs. Pantziarka et al. (2021) Frontiers in Pharmacology 12:627574. In another example, the National Cancer Institute maintains a list of approved cancer drugs for treating a variety of different cancers (A to Z List of Cancer Drugs [online] [retrieved on Dec. 5, 2022]; retrieved from the internet <URL:https://www.cancer.gov/about-cancer/treatment/drugs>). In some embodiments, administering a therapeutic comprises administering one or more of a chemotherapeutic, an immunotherapeutic (e.g., an antibody), a cellular therapeutic (e.g., a CAR-T cell), a pain
relieving therapeutic, a hormone therapy or radiation therapy. In some embodiments, administering a therapeutic comprises performing surgery on the patient (e.g., surgery to remove a tumor). Administering the therapeutic may be performed by any suitable means. In some embodiments, the method comprises selecting a patient for administration of a therapeutic when the patient has a positive indication of MRD (e.g., as determined using a patient-specific panel as described herein) and repeating the method with one or more further biological samples from the patient for use in monitoring when the patient has a negative indication of MRD. In some embodiments, this disclosure provides a method comprising: designing a patient-specific panel as described herein; using the patient-specific panel to determine whether a biological sample of a patient (e.g., plasma) is indicative of MRD; and either (1) administering a therapeutic (e.g., a therapeutic for use in treating the cancer/tumor of the patient) to the patient if the biological sample is indicative of MRD or (2) continuing to monitor the patient for MRD (e.g., using the patient-specific panel). In some embodiments, the method comprises treating the patient using a therapeutic.

EXAMPLES

Example 1: Training the Machine Learning Model

Data Used for Training: The random forest classifier model was trained using patient ctDNA data (data from sequencing circulating DNA comprising ctDNA) from previously generated patient-specific panels (e.g., the patient-specific panels were generated using a different model) targeting up to 200 tumor-specific variants each. Only samples in which MRD was detected were used for training the model. Data included data from 57 patients previously diagnosed with lung cancer and data from 499 patients previously diagnosed with melanoma.
Summary of Training Criteria: Given that MRD was present in the patient ctDNA data, the model was trained to predict whether each variant was detected above a baseline level, using the set of input features in FIG.6. To reduce the chance of overfitting, cross validation was used where, for each iteration, a portion of the panels were selected to be used for training the model, and then the remaining panels were used to assess either the accuracy of the model in predicting
which variants would be observed in the ctDNA, or whether a subpanel would give the same result as the full panel (as described below in “Evaluation of the Fitted Model”).

Tuning of the Hyperparameters: Hyperparameters were initially tuned by varying one hyperparameter at a time across a range of values while holding the other hyperparameters constant at their default values. A grid search was used to explore the space around these initial values, maximizing the balanced accuracy of the model. The hyperparameters tuned in this fashion were: the number of estimators included in the random forest, the minimum number of samples in a node that was required to further split that node, the maximum depth of each tree in the forest, the maximum number of features to consider at each bifurcation in the tree, and the minimum number of samples required to be in a terminal leaf of the tree.

Feature Selection: An important aspect of training a machine learning model to predict the detectability of a TSV in a biological sample (e.g., plasma) of a patient is identifying informative features for use in training the machine learning model. One way to accomplish this is training a machine learning model using a large number of possible features and then testing the model to determine which features are indicative of TSV detectability using SHapley Additive exPlanations (SHAP) values. SHAP values show a given feature’s effect on the predicted outcome for a given sample. To accomplish this, the random forest classifier model was trained using a plurality of features that may be predictive of the detectability of a TSV, including features that are used by various rules-based algorithms in determining the detectability of a TSV, as well as features with a biological basis for impacting the detectability of a TSV over time (FIG.6). Training was performed using variant data from cancer patients (e.g., lung cancer patients and melanoma patients).
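The panel-level cross validation and the grid search over balanced accuracy described above can be sketched as follows. This is an illustrative outline only: the fold count, the candidate values in `SPACE`, and the helper names are assumptions, and the parameter names follow the scikit-learn `RandomForestClassifier` convention even though the source does not name a library. Balanced accuracy is computed as the mean of sensitivity and specificity.

```python
import itertools
import random


def panel_folds(panel_ids, n_folds=5, seed=0):
    """Yield (train_panels, test_panels); whole panels are held out together."""
    panels = sorted(set(panel_ids))
    random.Random(seed).shuffle(panels)
    for i in range(n_folds):
        test = set(panels[i::n_folds])
        yield set(panels) - test, test


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = detected)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(t == 1 for t in y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are needed to compute balanced accuracy")
    return (tp / pos + tn / neg) / 2


def hyperparameter_grid(space):
    """Enumerate every combination of the candidate values in `space`."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


# Hypothetical candidate values; the source does not list the ranges searched.
SPACE = {
    "n_estimators": [100, 300],
    "min_samples_split": [2, 10],
    "max_depth": [5, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 5],
}
```

In use, each combination from `hyperparameter_grid(SPACE)` would be fitted on the training panels of every fold and scored with `balanced_accuracy` on the held-out panels, keeping the combination with the highest mean score. Splitting by panel rather than by variant keeps all variants of one patient on the same side of the split.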
Corresponding SHAP importance scores associated with each feature in the random forest model were determined (FIG.6). To provide a baseline determination of importance, a vector of randomly generated values sampled from a uniform distribution was included as an additional feature in the model for the calculation of the SHAP values. Any feature with a relative importance lower than the randomized vector was excluded from the model. Surprisingly, there were features that were expected to be predictive of TSV detectability that were ultimately excluded from the model. In particular, features aggregated from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database were surprisingly
uninformative, given that variants in that database have previously annotated associations with cancer, and might therefore be useful in generating a tumor signature. Similarly, functional annotations of amino acid characteristics (such as amino acid polarity and volume) were less predictive than expected. These features were expected to be informative, since it is likely that changes in amino acid characteristics would be strongly selected against as they might be expected to disrupt the structure and/or function of the corresponding proteins. However, not all variants are in coding regions, so it is understandable that these features were excluded given the number of missing values that would be present in the plurality of TSV features. Characterizations of the various mutation types (such as indicators for A > C or C > T mutations) were also expected to be predictive for sequence biases that might be present and interfere with predictions, but were generally found to be less informative for training the machine learning model. Primer quality results from primer design also were less predictive than expected, though primer quality was expected to be a highly predictive feature because primer quality is related to the ability to amplify and detect the TSV. This may be a result of the primer quality requirements established elsewhere in the pipeline. The wildtype max Ent score (i.e., a score indicative of a splice site near the TSV in the genomes of wildtype (non-tumor) cells) was expected to give context to the “variant max Ent score” feature that was also included in the model, but ultimately the wildtype max Ent score did not enhance the performance of the model. Based on the results in FIG.6, 15 features were selected for training the machine learning model (FIG.7). The “FDP” feature was found to be highly correlated with “NormalFDP” and therefore was not included in the model.
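The two exclusion rules described above, dropping any feature less important than the random baseline and dropping one of a pair of highly correlated features, can be sketched as follows. This is an illustration under stated assumptions: the importance values are taken as already computed (for example as mean absolute SHAP values, a common convention that the source does not confirm), the baseline name `random_uniform`, the 0.9 correlation cutoff, and the rule of keeping the more important feature of a correlated pair are all hypothetical.

```python
import math


def select_features(importance, baseline="random_uniform"):
    """Keep features at least as important as the random baseline, best first."""
    floor = importance[baseline]
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if name != baseline and score >= floor]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, ranked, threshold=0.9):
    """Walk features best-first; skip any that track an already kept feature."""
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with made-up importances `{"NormalFDP": 0.30, "FDP": 0.28, "phastCons": 0.10, "random_uniform": 0.05, "cosmic_count": 0.01}`, `select_features` would drop only `cosmic_count`, and `drop_correlated` would then remove `FDP` if its values closely track those of `NormalFDP`.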
FIG.8 further shows the overall impact of each of the selected features on the SHAP value, where a broader distribution of SHAP values indicates a stronger effect on predictability.

Evaluation of the Fitted Model: The model was explicitly trained to predict whether a TSV would be detected in a sample comprising ctDNA of an MRD positive patient. However, an additional objective was to rank TSVs for inclusion in a patient-specific panel such that monitoring the TSVs of the patient-specific panel in a biological sample of a patient gives an accurate indication of whether MRD is present or absent in the patient. As such, the final evaluation of model accuracy used a cohort of previously analyzed samples based on panels targeting up to 200 tumor-specific variants. For each of the panels, the trained machine learning model was used to assign a probability to each variant. This probability is the predicted likelihood that the variant will be observed in a biological
sample given that the source patient has MRD. The variants are then ranked by this probability, and the subset of most likely variants is selected as a “subpanel”. In this case the top 50 variants were used, but other subpanel sizes (for example 16, 100, etc.) could be used. Once the subpanel was generated, the original sample comprising ctDNA was reanalyzed in silico to determine the MRD status based on this subpanel. This MRD result was then compared to the original result. If both the original result from the full panel and the new subpanel result were positive for MRD, the result was considered a true positive. Conversely, if both the original and subpanel results were negative for MRD, the result was considered a true negative. If the full panel yielded a positive result and the subpanel was negative, that result was considered a false negative. In the situation when the full panel yielded a negative result but the subpanel was positive, that result was a false positive. These determinations of true positives, true negatives, false positives and false negatives may be used to calculate the sensitivity and specificity of the trained machine learning model (e.g., as done in Example 3).

Example 2: Measuring the relationship between feature values and TSV detectability

Another important consideration when selecting a machine learning model is selecting a model that can capture the relationship between a feature and the desired prediction (e.g., scoring the detectability of a TSV). For example, the random forest classifier model used in Example 1 is capable of capturing complicated nonlinear relationships. One way to observe the relationships between a feature and a desired prediction is using SHAP plots, which compare the feature values to corresponding SHAP values (FIGs.9A-9O). A positive SHAP value indicates the corresponding feature value contributes to a higher prediction that the TSV will be detectable.
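The subpanel evaluation of Example 1 (rank variants by predicted probability, keep the top N, and compare the subpanel MRD call with the full-panel call) can be sketched as follows. The function names and data layout are assumptions made for illustration; the in silico reanalysis that produces each subpanel MRD call is not shown, and the full-panel result is treated as the reference, as in the text.

```python
from collections import Counter


def top_n_subpanel(scores, n=50):
    """Return the n variants with the highest predicted detectability."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [variant for variant, _ in ranked[:n]]


def classify(full_positive, sub_positive):
    """Label one sample, taking the full-panel MRD call as the reference."""
    if full_positive and sub_positive:
        return "TP"
    if not full_positive and not sub_positive:
        return "TN"
    return "FN" if full_positive else "FP"


def sensitivity_specificity(calls):
    """`calls` is an iterable of (full_panel_positive, subpanel_positive)."""
    counts = Counter(classify(full, sub) for full, sub in calls)
    positives = counts["TP"] + counts["FN"]
    negatives = counts["TN"] + counts["FP"]
    sensitivity = counts["TP"] / positives if positives else float("nan")
    specificity = counts["TN"] / negatives if negatives else float("nan")
    return sensitivity, specificity
```

Here sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), so both measure agreement of the subpanel with the full panel and not agreement with clinical outcome.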
A negative SHAP value indicates the corresponding feature value leads the model to predict the TSV is not detectable. Results in FIGs.9A-9O show a variety of relationships between the features and the SHAP values, ranging from a relatively simple monotonically decreasing relationship like that of the phastCons conservation score (FIG.9H) to relatively more complex non-linear and non-monotonic relationships like those of the non-tumor cell depth coverage (FIG.9J), strand bias (FIG.9K) and minimum strand coverage (FIG.9L). These complicated relationships indicate that simple rules-based methods are unlikely to be able to perform as well as non-linear machine learning-based models.

Example 3: Comparing random forest-based and rules-based patient-specific panel design

The sensitivity and specificity of the random forest-based patient-specific panel and a rules-based patient-specific panel were compared. Sensitivity and specificity were calculated
as explained in Example 1. The rules-based patient-specific panel was produced using similar features as the random forest patient-specific panel (e.g., at least some of the features of FIG.6). Results show that the random forest-based patient-specific panel had significantly higher sensitivity (an average increase of about 6%) and specificity (an average increase of about 1%) than the rules-based patient-specific panel in detecting MRD (FIG.10A). Next it was determined whether a model trained using data from a first type of cancer could be used to predict MRD in samples from a second type of cancer. The random forest classifier model was trained using lung cancer MRD data and tested on melanoma MRD data. Sensitivity and specificity on the tested melanoma MRD data were similar to those on the lung cancer MRD data and to those in FIG.10A, despite using a random forest model trained on lung cancer data (FIG.10B). This indicates that this random forest classifier model has pan-cancer applicability.

Computer Implementation

An illustrative implementation of a computer system 1100 that may be used in connection with any of the embodiments of the technology described herein (e.g., such as the processes of FIGS.2A-2B and 4) is shown in FIG.11. The computer system 1100 includes one or more processors 1104 and one or more articles of manufacture that comprise non-transitory computer-readable storage media (e.g., memory 1110 and one or more non-volatile storage media 1106). The processor 1104 may control writing data to and reading data from the memory 1110 and the non-volatile storage device 1106 in any suitable manner, as the aspects of the technology described herein are not limited to any particular techniques for writing or reading data.
To perform any of the functionality described herein, the processor 1104 may execute one or more processor-executable instructions stored in one or more non-transitory computer-readable storage media (e.g., the memory 1110), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 1104. The computer system 1100 may also include a network input/output (I/O) interface 1102 via which the computer system may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 1108, via which the computer system may provide output to and receive input from a user. The user I/O interfaces 1108 may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.
The above-described embodiments can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices. In this respect, it should be appreciated that one implementation of the embodiments described herein comprises at least one non-transitory computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage medium) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the above-described functions of one or more embodiments (e.g., part of or all of the processes described above with reference to FIG.2A, FIG.2B, and FIG.4). The computer-readable medium may be transportable such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques described herein. In addition, it should be appreciated that the reference to a computer program which, when executed, performs any of the above-described functions, is not limited to an application program running on a host computer. Rather, the terms computer program and software are used herein in a generic sense to reference any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instruction) that can be employed to program one or more processors to implement aspects of the techniques described herein.
Having thus described several aspects and embodiments of the technology set forth in the disclosure, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be within the spirit and scope of the technology described herein. For example, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the embodiments described herein. Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation many equivalents to the specific embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be
practiced otherwise than as specifically described. In addition, any combination of two or more features, systems, articles, materials, kits, and/or methods described herein, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure. The above-described embodiments can be implemented in any of numerous ways. One or more aspects and embodiments of the present disclosure involving the performance of processes or methods may utilize program instructions executable by a device (e.g., a computer, a processor, or other device) to perform, or control performance of, the processes or methods. In this respect, various inventive concepts may be embodied as a computer readable storage medium (or multiple computer readable storage media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement one or more of the various embodiments described above. The computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various ones of the aspects described above. In some embodiments, computer readable media may be non-transitory media. The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects as described above.
Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor, but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the present disclosure. Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments. Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by
assigning storage for the fields with locations in a computer-readable medium that convey a relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish a relationship between data elements. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer, as non-limiting examples. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smartphone, a tablet, or any other suitable portable or fixed electronic device. Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible formats. Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN) or the Internet.
Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks. Also, as described, some aspects may be embodied as one or more methods. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments. All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.” The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc. As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc. In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively.
The terms “approximately,” “substantially,” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms “approximately,” “substantially,” and “about” may include the target value.

Claims

CLAIMS 1. A method for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, the method comprising: using at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants present in tumor cells of the patient, the variant data being derived from at least one biological sample obtained from the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; and identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient specific panel. 2. The method of claim 1, further comprising: identifying primers for use in detecting presence, in a biological sample, of at least some variants in the subset of the plurality of TSVs. 3. 
The method of claim 1 or claim 2, wherein obtaining the variant data indicative of the plurality of variants of the patient comprises: obtaining at least one data structure encoding variant genomic location data, variant type data, variant sequence data, variant sequence context data, variant sequencing coverage data, variant sequencing depth data, variant allele frequency data, variant sequencing error rate data, and/or variant primer data.
4. The method of claim 3, wherein the variant sequence context data comprises sequence context homopolymer data, sequence context splice site data, sequence context mutation data, and/or sequence context conservation data. 5. The method of any one of claims 1-4, wherein obtaining variant data indicative of a plurality of variants of the patient comprises obtaining the variant data previously-generated by analyzing sequence data generated by sequencing at least one biological sample obtained from the patient, optionally wherein obtaining variant data comprises sequencing the at least one biological sample obtained from the patient and analyzing sequencing data produced by the sequencing. 6. The method of any one of claims 1-5, wherein the variant data indicative of a plurality of variants present in tumor cells of the patient comprises data characterizing a variant derived from sequencing data from a sample comprising genomic material derived from tumor cells of the patient. 7. The method of any one of claims 1-6, wherein sequencing the at least one biological sample comprises sequencing using whole genome sequencing (WGS) or whole exome sequencing (WES). 8. The method of any one of claims 1-7, wherein obtaining variant data comprises obtaining sequence data of a tumor cell sample and a non-tumor cell sample of the patient. 9. The method of claim 8, wherein the tumor cell sample comprises melanoma cells or lung cancer cells. 10. The method of any one of claims 1-9, wherein obtaining the variant data indicative of the plurality of variants of the patient comprises using at least one variant caller to identify the plurality of variants. 11. The method of any one of claims 1-9, wherein obtaining the variant data indicative of the plurality of variants of the patient comprises analyzing sequence data generated by
sequencing the tumor cells obtained from the patient and using at least one variant caller to identify the plurality of variants.
12. The method of any one of claims 1-11, wherein identifying the plurality of TSVs comprises: selecting variants from among the plurality of variants using at least one feature selected from the group consisting of: variant bi-directional support, healthy population variant allele frequency, sequence context homopolymer size, sequence coverage in non-tumor cells, ratio of variant allele frequency between tumor cells and non-tumor cells, and tumor cell variant allele frequency.
13. The method of claim 12, wherein identifying the plurality of TSVs comprises identifying the plurality of TSVs in a biological sample of a tumor comprising the tumor cells of the patient.
14. The method of claim 12 or claim 13, wherein identifying the plurality of TSVs comprises selecting variants using at least two features selected from the group of claim 10.
15. The method of any one of claims 12-14, wherein identifying the plurality of TSVs comprises selecting variants using at least three features selected from the group of claim 10.
16. The method of any one of claims 12-15, wherein identifying the plurality of TSVs comprises selecting variants using at least four features selected from the group of claim 10.
17. The method of any one of claims 12-16, wherein identifying the plurality of TSVs comprises selecting variants using at least five features selected from the group of claim 10.
18. The method of any one of claims 12-17, wherein identifying the plurality of TSVs comprises selecting variants using all the features in the group of claim 10.
19. The method of any one of claims 12-18, wherein identifying the plurality of TSVs comprises selecting variants using variant bi-directional support, and
wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant is observed at least a threshold number of times in plus strand sequencing reads and minus strand sequencing reads of the variant data.
20. The method of claim 19, wherein the threshold number of times is between 2 and 15.
21. The method of any one of claims 12-20, wherein identifying the plurality of TSVs comprises selecting variants using the healthy population variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the variant has a variant allele frequency in a healthy population, as defined by at least one genomic database, of less than a threshold percentage.
22. The method of claim 21, wherein the threshold percentage is 1%.
23. The method of any one of claims 12-22, wherein identifying the plurality of TSVs comprises selecting variants using sequence context homopolymer size, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether a homopolymer sequence exceeding a threshold size is present between the variant and a binding site of a primer designed to detect presence of the variant.
24. The method of claim 23, wherein selecting variants using sequence context homopolymer size comprises selecting variants using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
25. The method of claim 23 or claim 24, wherein the threshold size is 6 nucleotides.
26. The method of any one of claims 12-25, wherein identifying the plurality of TSVs comprises selecting variants using sequence coverage in non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether sequencing coverage of the variant in the non-tumor cells of the patient exceeds a threshold.
27. The method of claim 26, wherein the threshold is between 45X and 100X.
28. The method of any one of claims 12-27, wherein identifying the plurality of TSVs comprises selecting variants using the ratio of variant allele frequency between tumor cells and non-tumor cells, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the ratio of the variant exceeds a threshold ratio.
29. The method of claim 28, wherein identifying the plurality of TSVs comprises determining the ratio of variant allele frequency between sequence data of a biological sample of a tumor comprising the tumor cells of the patient and sequence data of non-tumor cells of the patient.
30. The method of claim 28 or claim 29, wherein the threshold ratio is between a ratio of 20:1 and 10:1 of tumor cell variant allele frequency and non-tumor cell variant allele frequency.
31. The method of any one of claims 12-30, wherein identifying the plurality of TSVs comprises selecting variants using the tumor cell variant allele frequency, and wherein the selecting comprises determining, for each variant of at least some of the plurality of variants, whether the tumor cell variant allele frequency exceeds a threshold.
32. The method of claim 31, wherein selecting variants using the tumor cell variant allele frequency comprises selecting using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
33. The method of claim 31 or claim 32, wherein the threshold is between a 0.05 and a 0.1 tumor cell variant allele frequency.
34. The method of any one of claims 1-33, wherein generating the set of features comprises generating: at least one sequencing coverage feature, at least one allele frequency feature, a
trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
35. The method of any one of claims 1-34, wherein the plurality of TSVs comprises a first TSV, wherein generating the respective set of features comprises generating a first set of features for the first TSV, and wherein generating the first set of features for the first TSV comprises generating at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
36. The method of claim 35, wherein generating the first set of features for the first TSV comprises generating the at least one sequencing coverage feature for the first TSV, and wherein generating the at least one sequencing coverage feature comprises determining sequencing depth of coverage of plus strands and minus strands for the first TSV, and/or a ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV.
37. The method of claim 36, wherein generating the at least one sequencing coverage feature for the first TSV comprises generating the at least one sequencing coverage feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
38. The method of any one of claims 35-37, wherein generating the first set of features for the first TSV comprises generating the at least one allele frequency feature, and wherein generating the at least one allele frequency feature comprises determining non-tumor cell depth coverage for the first TSV, a number of observations of the first TSV in tumor cells of the patient, and/or a tumor allele frequency of the first TSV.
39. The method of claim 38, wherein generating the at least one allele frequency feature comprises generating the at least one allele frequency feature using sequence data of a biological sample of a tumor comprising the tumor cells of the patient.
40. The method of any one of claims 35-39, wherein generating the first set of features for the first TSV comprises generating the at least one primer feature, and wherein generating the at least one primer feature comprises determining a distance between the first TSV and a binding site for a primer designed to detect the first TSV.
41. The method of claim 40, wherein generating the at least one primer feature comprises determining a distance between a first TSV and a PCR primer designed to amplify a portion of a polynucleotide comprising the first TSV.
42. The method of claim 40 or claim 41, wherein generating the at least one primer feature comprises determining a maximum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a maximum distance between the first TSV and a binding site for a second primer, different from the first primer, designed to detect the first TSV.
43. The method of claim 40 or claim 41, wherein generating the at least one primer feature comprises determining a minimum distance between the first TSV and a binding site for a first primer designed to detect the first TSV and/or a minimum distance between the first TSV and a binding site for a second primer designed to detect the first TSV.
44. The method of any one of claims 41-43, wherein the first primer and the second primer are PCR primers designed to amplify a portion of a polynucleotide comprising the first TSV.
45. 
The method of any one of claims 35-44, wherein generating the first set of features for the first TSV comprises generating the at least one sequence context feature, and wherein generating the at least one sequence context feature comprises determining a conservation score of a polynucleotide of the patient comprising the first TSV, a distance
between the first TSV and a nearest splice site on the polynucleotide, and/or a splice site score of the polynucleotide.
46. The method of claim 45, wherein generating the conservation score comprises generating a phastCons conservation score and/or a phyloP conservation score.
47. The method of any one of claims 35-46, wherein generating the first set of features for the first TSV comprises determining: the sequencing depth of coverage of plus strands and minus strands for the first TSV, the non-tumor cell depth coverage for the first TSV, the number of observations of the first TSV in tumor cells of the patient, and the trinucleotide context (TNC) error rate feature.
48. The method of claim 47, further comprising determining one or more of the maximum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the ratio of depth of coverage between plus strands and minus strands of the variant data for the first TSV, the tumor allele frequency of the first TSV, the phastCons conservation score of the first TSV, the maximum distance between the first TSV and a binding site for the first primer designed to detect the first TSV, the distance between the first TSV and the nearest splice site on a polynucleotide of the patient comprising the first TSV, and a phyloP conservation score.
49. The method of claim 47 or claim 48, further comprising determining one or more of the C to A variant mutation feature, the minimum distance between the first TSV and a binding site for the second primer designed to detect the first TSV, the splice site score of the polynucleotide, the minimum distance between the first TSV and the binding site for the second primer designed to detect the first TSV.
50. 
The method of any one of claims 1-49, wherein processing the plurality of sets of features using the trained machine learning model to obtain a corresponding plurality of scores comprises processing the plurality of sets of features using a trained nonlinear classification model.
51. The method of claim 50, wherein the trained nonlinear classification model comprises a random forest model.
52. The method of any one of claims 1-51, wherein the trained machine learning model comprises a plurality of parameters having respective values and wherein processing a set of features of the plurality of sets of features comprises computing a score using the set of features and the respective values of the plurality of parameters.
53. The method of claim 52, wherein the score is the predicted likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
54. The method of any one of claims 1-53, wherein selecting the TSVs for inclusion into the subset of the plurality of TSVs comprises selecting a threshold number of TSVs based on their respective scores.
55. The method of claim 54, wherein selecting a threshold number of TSVs based on their respective scores comprises selecting TSVs with the highest scores.
56. The method of claim 54 or claim 55, wherein selecting a threshold number of TSVs based on their respective scores comprises selecting 50 TSVs with the highest scores.
57. The method of any one of claims 1-56, wherein the trained machine learning model is trained using TSVs from a plurality of MRD positive patients having a first cancer and is predictive of the likelihood of detecting a TSV in a biological sample from an MRD positive patient having a second cancer that is different from the first cancer.
58. The method of claim 57, wherein the first cancer is lung cancer and the second cancer is melanoma.
59. The method of any one of claims 2-58, further comprising: synthesizing primers corresponding to at least some of the TSVs in the subset of the plurality of TSVs.
60. A method of training a machine learning model to generate a score indicative of the predicted detectability of a tumor-specific variant (TSV) in a biological sample of a minimal
residual disease (MRD) positive patient, the machine learning model comprising a plurality of parameters, the method comprising: obtaining training data, the training data derived from data collected during previously performed monitoring for presence of a plurality of TSVs in a plurality of biological samples collected from MRD positive patients, the training data comprising: for each TSV in the plurality of TSVs and each biological sample in which the TSV was previously monitored, (i) variant data associated with the TSV; and (ii) an indication of whether the TSV was present or absent in the biological sample; and training the machine learning model by using the training data to estimate values of the plurality of parameters to obtain a trained machine learning model.
61. The method of claim 60, wherein obtaining training data comprises obtaining variant data associated with each TSV, the variant data comprising at least one sequencing coverage feature, at least one allele frequency feature, a trinucleotide context (TNC) error rate feature, a C to A variant mutation feature, at least one primer feature, and/or at least one sequence context feature.
62. The method of claim 60 or claim 61, wherein obtaining training data comprises obtaining an indication of whether the TSV is present or absent in the biological sample, the indication determined based on the TSV being present in the biological sample at an allele frequency that exceeds a threshold.
63. The method of any one of claims 60-62, wherein training a machine learning model to predict a score indicative of detectability of a TSV in a biological sample comprises training the machine learning model to predict a likelihood that the TSV will be observed in the biological sample of an MRD positive patient.
64. 
The method of any one of claims 60-63, wherein the MRD positive patients comprise patients that have been previously diagnosed with lung cancer and/or patients that have been previously diagnosed with melanoma.
65. The method of claim 64, wherein the plurality of TSVs comprises at least 200 TSVs.
66. The method of any one of claims 60-65, wherein the MRD positive patients comprise at least 50 MRD positive patients.
67. The method of any one of claims 60-65, wherein the MRD positive patients comprise at least 500 MRD positive patients.
68. The method of any one of claims 60-67, wherein training the machine learning model comprises training a nonlinear machine learning model.
69. The method of any one of claims 60-68, wherein training the machine learning model comprises training a nonlinear regression machine learning model.
70. The method of any one of claims 60-68, wherein training the machine learning model comprises training a nonlinear classification machine learning model.
71. The method of any one of claims 60-70, wherein training the machine learning model comprises training a random forest model.
72. The method of any one of claims 60-71, wherein training the machine learning model to estimate values of the plurality of parameters comprises estimating the values of 5 parameters.
73. The method of any one of claims 60-72, wherein training the machine learning model comprises training the trained machine learning model of any one of claims 1-59.
74. A method for determining whether patient-specific panel data of a biological sample of a patient provides an indication that the patient has minimal residual disease (MRD), the method comprising: identifying primers for use in detecting a subset of a plurality of TSVs using the method of any one of claims 1-59; generating sequence data from the biological sample of the patient, the generating comprising contacting the biological sample with the primers; detecting TSVs using the sequence data; and
determining, using the detected TSVs, whether the biological sample provides an indication of MRD.
75. The method of claim 74, wherein the biological sample is a blood, serum or plasma sample of the patient.
76. The method of claim 74 or claim 75, wherein detecting the TSVs using the sequence data comprises determining the allele frequency of the TSVs in the biological sample.
77. The method of claim 76, wherein determining whether the biological sample provides an indication of MRD comprises determining whether the allele frequency of at least some of the TSVs exceeds an error rate of generating sequencing data of the biological sample.
78. The method of any one of claims 74-77, further comprising administering a therapeutic when the patient has a positive indication of MRD or continuing to collect biological samples from the patient for use in monitoring the patient for MRD when the patient has a negative indication of MRD.
79. The method of claim 78, wherein administering the therapeutic comprises administering a therapeutic to treat a cancer and/or tumor associated with the indication of MRD.
80. The method of any one of claims 74-79, wherein determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with sensitivity greater than a 0.85 probability of detecting MRD in a patient that has MRD.
81. The method of any one of claims 74-80, wherein determining whether the biological sample provides an indication of MRD comprises determining an indication of MRD with specificity greater than a 0.98 probability of not detecting MRD in a patient that does not have MRD.
82. A method, comprising: selecting a patient for administration of a therapeutic, the selecting comprising:
determining whether sequence data of a biological sample of the patient provides an indication that the patient has minimal residual disease (MRD) using the method of any one of claims 74-81; and selecting the patient when the patient has a positive indication of MRD; or repeating the method with one or more further biological samples from the patient.
83. A system for designing a patient-specific panel for use in detecting minimal residual disease (MRD) in a patient, the system comprising: at least one computer hardware processor; and at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in tumor-derived polynucleotides of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel.
84. 
The system of claim 52, wherein the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of any of claims 2-59.
85. At least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining variant data indicative of a plurality of variants of the patient present in tumor cells of the patient; identifying, using the variant data and from among the plurality of variants, a plurality of tumor-specific variants (TSVs) for the patient; identifying a subset of the plurality of TSVs for use in the patient-specific panel for use in detecting MRD in the patient, the identifying comprising: generating, for each of at least some of the plurality of TSVs and using the variant data, a respective set of features to obtain a plurality of sets of features; processing the plurality of sets of features using a trained machine learning model to obtain a corresponding plurality of scores, each of the plurality of scores indicative of the predicted detectability of a corresponding TSV in circulating-tumor DNA (ctDNA) of the patient to be monitored using the patient-specific panel; and selecting, using the plurality of scores and from among the at least some of the TSVs, the TSVs for inclusion into the subset of the plurality of TSVs for use in the patient-specific panel.
86. The at least one non-transitory computer readable storage medium storing processor executable instructions of claim 54, wherein the at least one computer hardware processor stores processor executable instructions that cause the at least one computer hardware processor to perform the method of any of claims 2-59.
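
Illustrative sketch (not part of the claims). The selection flow recited in claims 12-33 and 54-56 can be pictured as a hard filter followed by a score-based ranking. The Python below is a minimal sketch under stated assumptions: the `Variant` record, its field names, and the functions `passes_filters` and `select_panel` are hypothetical; the threshold values are single points picked from the ranges recited in claims 20, 22, 25, 27, 30 and 33; treating an over-threshold homopolymer as a reason to exclude a variant is an assumption; and `score_fn` is a stand-in for the trained machine learning model, which the claims do not specify beyond, for example, a random forest.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Variant:
    """Hypothetical per-variant record; field names are assumptions."""
    name: str
    plus_reads: int        # variant observations in plus-strand reads
    minus_reads: int       # variant observations in minus-strand reads
    population_af: float   # allele frequency in a healthy population (fraction)
    homopolymer_len: int   # longest homopolymer between variant and primer site
    normal_coverage: int   # sequencing coverage in non-tumor cells
    tumor_vaf: float       # variant allele frequency in tumor cells
    normal_vaf: float      # variant allele frequency in non-tumor cells


# Illustrative thresholds, each chosen from a range recited in the claims.
MIN_STRAND_READS = 2       # claim 20: between 2 and 15
MAX_POPULATION_AF = 0.01   # claim 22: 1%
MAX_HOMOPOLYMER = 6        # claim 25: 6 nucleotides
MIN_NORMAL_COVERAGE = 45   # claim 27: between 45X and 100X
MIN_VAF_RATIO = 10.0       # claim 30: between 20:1 and 10:1
MIN_TUMOR_VAF = 0.05       # claim 33: between 0.05 and 0.1


def passes_filters(v: Variant) -> bool:
    """Return True if the variant passes every hard filter."""
    # A variant never seen in non-tumor cells has an unbounded tumor/normal ratio.
    ratio = v.tumor_vaf / v.normal_vaf if v.normal_vaf > 0 else float("inf")
    return (
        v.plus_reads >= MIN_STRAND_READS
        and v.minus_reads >= MIN_STRAND_READS
        and v.population_af < MAX_POPULATION_AF
        and v.homopolymer_len <= MAX_HOMOPOLYMER  # assumed to be an exclusion rule
        and v.normal_coverage > MIN_NORMAL_COVERAGE
        and ratio > MIN_VAF_RATIO
        and v.tumor_vaf > MIN_TUMOR_VAF
    )


def select_panel(
    variants: List[Variant],
    score_fn: Callable[[Variant], float],
    panel_size: int = 50,  # claim 56 recites selecting 50 TSVs
) -> List[Variant]:
    """Filter variants to TSVs, rank by score, and keep the top panel_size."""
    tsvs = [v for v in variants if passes_filters(v)]
    ranked = sorted(tsvs, key=score_fn, reverse=True)
    return ranked[:panel_size]
```

In practice `score_fn` would wrap the trained model's predicted likelihood of observing the TSV in a sample from an MRD positive patient (claim 53); any callable returning a number works for the sketch.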
PCT/US2023/083809 2022-12-14 2023-12-13 Techniques for designing patient-specific panels and methods of use thereof for detecting minimal residual disease WO2024129844A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263432639P 2022-12-14 2022-12-14
US63/432,639 2022-12-14

Publications (1)

Publication Number Publication Date
WO2024129844A1 (en) 2024-06-20

Family

ID=91486294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/083809 WO2024129844A1 (en) 2022-12-14 2023-12-13 Techniques for designing patient-specific panels and methods of use thereof for detecting minimal residual disease

Country Status (1)

Country Link
WO (1) WO2024129844A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200402613A1 (en) * 2018-03-06 2020-12-24 Cancer Research Technology Limited Improvements in variant detection
US20210125683A1 (en) * 2017-09-15 2021-04-29 The Regents Of The University Of California Detecting somatic single nucleotide variants from cell-free nucleic acid with application to minimal residual disease monitoring
WO2021237105A1 (en) * 2020-05-22 2021-11-25 Invitae Corporation Methods for determining a genetic variation
US20220025469A1 (en) * 2016-04-14 2022-01-27 Guardant Health, Inc. Methods for computer processing sequence reads to detect molecular residual disease
US20220380852A1 (en) * 2019-08-27 2022-12-01 Fundación Para La Investigación Biomédica Del Hospital Universitario 12 De Octubre Method for determining the presence or absence of minimal residual disease (mrd) in a subject who has been treated for a disease

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23904502

Country of ref document: EP

Kind code of ref document: A1