
CN113450871B - Method for identifying sample identity based on low-depth sequencing - Google Patents

Info

Publication number
CN113450871B
CN113450871B CN202110723066.3A CN202110723066A CN113450871B CN 113450871 B CN113450871 B CN 113450871B CN 202110723066 A CN202110723066 A CN 202110723066A CN 113450871 B CN113450871 B CN 113450871B
Authority
CN
China
Prior art keywords
sample
samples
snp
sequencing
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110723066.3A
Other languages
Chinese (zh)
Other versions
CN113450871A (en)
Inventor
陈样宜
刘燕霞
黄楷胜
刘远如
焦伟刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Boao Medical Laboratory Co ltd
Original Assignee
Guangdong Boao Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Boao Medical Laboratory Co ltd
Priority to CN202110723066.3A
Publication of CN113450871A
Application granted
Publication of CN113450871B
Current legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for identifying sample identity based on low-depth sequencing. It exploits the fact that, between two samples, a certain number of population high-frequency heterozygous SNP sites are each covered by exactly one read in both samples. The method comprises the following steps: S1, sequence alignment; S2, sequence filtering; S3, selecting a data set of population high-frequency heterozygous SNP sites; S4, obtaining the cdSNP site list; S5, calculating CR values between samples; S6, performing sample identity analysis and drawing a conclusion. The invention can identify sample identity by analyzing raw data files from low-depth whole-genome sequencing, without changing the experimental protocol or increasing the sequencing volume. It offers low detection cost and short analysis time, and it enables noninvasive fetal paternity testing using the peripheral blood of the pregnant woman.

Description

Method for identifying sample identity based on low-depth sequencing
Technical Field
The invention relates to the technical field of prenatal diagnosis molecular genetics detection, in particular to a method for identifying sample identity based on low-depth sequencing.
Background
Current methods for identifying the identity of human DNA samples in forensic science, for example in individual identification and parentage testing, mainly rely on comparative analysis of specific short tandem repeats (STRs) as biomarkers. With the development of gene chip technology and next-generation high-throughput sequencing, comparative analysis using single nucleotide polymorphisms (SNPs) as biomarkers has also come into use in this field.
STRs, also known as microsatellite DNA, are a class of polymorphic DNA loci found widely in the human genome. They generally consist of a core sequence of 2-6 bases arranged in tandem repeats, and length polymorphism arises from variation in the number of core-sequence repeats. For a given individual, the number of repeats at a particular chromosomal location is fixed, but it may differ between individuals at the same location, which makes these repeats polymorphic in the population. Because the human genome contains a large number of such repeats, individuals can be clearly distinguished by detecting these polymorphisms. STR analysis is highly sensitive and highly discriminating, and it is easy to standardize and automate, so it is widely used in forensic individual identification and parentage testing.
For paternity testing, this method requires separate samples from the child, the father and the mother, and it judges whether a parent-child relationship exists according to whether the STR results of the three conform to the rules of inheritance. Because the child must be an independent individual to be sampled accurately, the method has limitations for noninvasive fetal paternity testing.
SNPs are DNA sequence polymorphisms at the genomic level caused by single-nucleotide variation. They are among the most common heritable variants in humans, accounting for over 90% of all known polymorphisms. SNPs occur widely in the human genome, on average once every 500-1000 base pairs, and their total number is estimated at 3 million or more. Methods that select specific SNP loci as markers and analyze them by chip or high-depth sequencing can be applied stably and accurately to individual identification and paternity testing. They can even analyze contamination in low-proportion mixed samples, and they can be used for noninvasive fetal paternity testing with maternal peripheral blood. However, these techniques rely on accurate genotyping of specific SNP loci for comparison, and they suffer from high detection cost and long analysis time.
Identity identification by STR and SNP techniques can generally be performed only by genotyping retained samples and newly collected samples at specific sites and comparing the results, in order to determine whether a sample backed up in the laboratory is identical to the sample under review. For fragmented high-throughput sequencing DNA libraries and their backed-up data, however, accurate genotypes cannot be obtained because coverage of the specific sites or the library capacity is insufficient, so an effective means of analysis is lacking. In prenatal screening and diagnosis, NIPT and CNV-seq are widely used tests based on low-depth whole-genome sequencing. Establishing the identity of prenatal diagnosis samples generally requires STR or SNP analysis, which adds detection steps and operating cost and cannot be used for quality control of the detection workflow or for tracing back to the original results.
As clinical laboratories develop more and more molecular tests based on high-throughput sequencing, retrospective analysis of test data faces three major difficulties: 1. The experimental process cannot be traced: contamination or sample mix-ups often arise from unnoticed operating errors. 2. Samples are mixed or contaminated: contamination occurs at the source, or samples are mixed up during testing; the results are later found to be problematic, the laboratory cannot repeat the sample, and a serious quality incident results. 3. Insufficient retained sample prevents tracing back: for example, the sample is degraded or the plasma volume is insufficient; DNA degrades because of overly long storage or storage problems; or, because plasma cell-free DNA fragments are too short, the effective library capacity is insufficient and too little information is available for STR and SNP analysis.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for identifying sample identity based on low-depth sequencing, which can identify sample identity by analyzing a low-depth whole genome sequencing original data file without changing an experimental scheme or increasing sequencing quantity, and has the advantages of low detection cost, short analysis time and capability of carrying out noninvasive fetal paternity test by utilizing maternal peripheral blood.
In order to solve the technical problems, the technical scheme of the invention is as follows: a method for identifying sample identity based on low depth sequencing, comprising the steps of:
S1, sequence alignment;
S2, sequence filtering;
S3, selecting a data set of population high-frequency heterozygous SNP sites;
S4, obtaining the cdSNP site list;
S5, calculating CR values between samples;
S6, performing sample identity analysis and drawing a conclusion.
As further refinements of the invention:
Preferably, the sequence alignment in step S1 is the basis for low-depth sequencing tests such as NIPT and CNV-seq that use high-throughput sequencing data. It comprises using BWA alignment software (BWA-0.7.17, BWA-MEM) to align the raw sequence data (FASTQ files) produced by a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.
Preferably, the sequence filtering in step S2 comprises filtering the aligned SAM files to remove sequences that may cause erroneous base identification, namely unmapped reads, reads with low alignment quality (MAPQ < 40) and multiply aligned reads, to obtain valid sequencing data.
Preferably, selecting the data set of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data files (version 151) from the database ftp:// ftp. and selecting, as the data set, SNP sites that have only two allele types and a minor allele frequency (MAF) of not lower than 0.3.
Preferably, obtaining the cdSNP site list in step S4 comprises counting, in the alignment result of each file, the bases that hit the data set of population high-frequency heterozygous SNP sites, and then obtaining, for every pair of files, the base information of the sites covered by exactly one sequence in both files (co-detected SNPs, cdSNPs).
Preferably, calculating CR values between samples in step S5 comprises calculating the concordance rate (CR) of the base information at the cdSNP sites.
Preferably, the sample identity analysis in step S6 yields the following classification according to the CR value:
1) When CR < 0.616, the two samples are judged to have no significant relationship;
2) When CR > 0.672 and CR < 0.725, the two samples are judged to be related;
3) When CR > 0.753, the two samples are judged to come from the same individual;
4) When the CR value falls in none of the above ranges, the sample may be contaminated in the laboratory or, for a noninvasive test sample, may have a high fetal DNA concentration.
Preferably, the method exploits the fact that, between two samples, a certain number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.
The beneficial effects of the invention are as follows:
1. The invention can analyze raw sequencing data files backed up by the testing institution and can determine whether different samples are identical without changing the experimental protocol or increasing the sequencing volume and detection cost. This makes it convenient for a laboratory to control the quality of the detection workflow, and it helps in clinically investigating sample contamination and in tracing samples when a mix-up is suspected or a serious quality incident (false positive/false negative) occurs.
2. The invention judges sample identity solely by comparing the concordance of SNP sites that are covered at a depth of 1 in both sets of sequencing data. Both the binomial distribution model and practice show that at extremely low sequencing depth, even at a coverage depth as low as 0.05X, many SNP sites are still covered by exactly one read in both samples (co-detected SNPs, cdSNPs), and that cdSNP concordance analysis can be carried out accurately by selecting population high-frequency heterozygous SNP sites.
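The claim that many sites remain jointly covered at 0.05X can be illustrated with a rough calculation. The sketch below is not taken from the patent: it swaps the binomial model mentioned above for its Poisson approximation (reads per site treated as Poisson with mean equal to the coverage depth), and the candidate-site count `n_sites` is a purely hypothetical figure chosen for illustration.

```python
import math


def prob_exactly_one_read(depth: float) -> float:
    """Poisson probability that a site is covered by exactly one read.

    Assumption: reads per site ~ Poisson(depth), the usual approximation
    to the binomial model when the genome is much longer than a read.
    """
    return depth * math.exp(-depth)


def expected_cdsnp_count(n_sites: int, depth_a: float, depth_b: float) -> float:
    """Expected number of sites covered by exactly one read in BOTH samples."""
    return n_sites * prob_exactly_one_read(depth_a) * prob_exactly_one_read(depth_b)


if __name__ == "__main__":
    n_sites = 1_000_000  # hypothetical size of the high-MAF SNP set
    print(expected_cdsnp_count(n_sites, 0.05, 0.05))
```

Under these assumptions a site is singly covered in one sample with probability of roughly 4.8%, so a candidate set of a million sites would still be expected to yield a couple of thousand cdSNPs per pair of samples.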
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following specific embodiments.
The invention discloses a method for identifying sample identity based on low-depth sequencing, which comprises the following steps:
S1, sequence alignment;
S2, sequence filtering;
S3, selecting a data set of population high-frequency heterozygous SNP sites;
S4, obtaining the cdSNP site list;
S5, calculating CR values between samples;
S6, performing sample identity analysis and drawing a conclusion.
Further, the sequence alignment in step S1 is the basis for low-depth sequencing tests such as NIPT and CNV-seq that use high-throughput sequencing data. It comprises using BWA alignment software (BWA-0.7.17, BWA-MEM) to align the raw sequence data (FASTQ files) produced by a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.
Further, the sequence filtering in step S2 comprises filtering the aligned SAM files to remove sequences that may cause erroneous base identification, namely unmapped reads, reads with low alignment quality (MAPQ < 40) and multiply aligned reads, to obtain valid sequencing data.
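The filtering rule in step S2 can be sketched as a predicate over SAM records. This is a minimal illustration, not the patent's implementation: the function name `keep_alignment` is hypothetical, and treating the secondary/supplementary flag bits and BWA's `XA:Z:` alternative-hit tag as the markers of multiple alignment is an assumption, since the text does not say how multiply aligned reads are recognized.

```python
def keep_alignment(sam_line: str, min_mapq: int = 40) -> bool:
    """Return True if a SAM record passes the step S2 filters.

    Assumed interpretation of "multiple alignment": secondary (0x100) or
    supplementary (0x800) records, or records carrying an XA:Z: tag.
    """
    if sam_line.startswith("@"):          # header line, not an alignment
        return False
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])                 # column 2: bitwise FLAG
    mapq = int(fields[4])                 # column 5: mapping quality
    if flag & 0x4:                        # read is unmapped
        return False
    if flag & (0x100 | 0x800):            # secondary or supplementary alignment
        return False
    if mapq < min_mapq:                   # low alignment quality (MAPQ < 40)
        return False
    if any(tag.startswith("XA:Z:") for tag in fields[11:]):
        return False                      # alternative hits reported
    return True
```

A caller would stream the aligned SAM file line by line and keep only the records for which the predicate returns True.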
Further, selecting the data set of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data files (version 151) from the database ftp:// ftp. and selecting, as the data set, SNP sites that have only two allele types and a minor allele frequency (MAF) of not lower than 0.3.
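The site selection in step S3 amounts to a two-condition filter. The sketch below assumes the downloaded SNP file has already been parsed into simple records; the `SnpRecord` layout and the function name are hypothetical, and parsing of the actual database file is deliberately left out.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SnpRecord(NamedTuple):
    """Hypothetical parsed record for one SNP site."""
    chrom: str
    pos: int
    alleles: Tuple[str, ...]   # all alleles reported at the site
    freqs: Tuple[float, ...]   # population frequency of each allele, same order


def select_high_maf_snps(records: Iterable[SnpRecord],
                         min_maf: float = 0.3) -> List[Tuple[str, int]]:
    """Keep sites with exactly two single-base alleles and MAF >= min_maf."""
    selected = []
    for rec in records:
        if len(rec.alleles) != 2:
            continue                      # more (or fewer) than two allele types
        if any(len(a) != 1 for a in rec.alleles):
            continue                      # not a single-nucleotide change
        if min(rec.freqs) >= min_maf:     # minor allele frequency threshold
            selected.append((rec.chrom, rec.pos))
    return selected
```

With two alleles whose frequencies sum to 1, a MAF of at least 0.3 means the expected heterozygote frequency 2pq is at least 0.42, which is why such sites are the most informative for the concordance comparison.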
Further, obtaining the cdSNP site list in step S4 comprises counting, in the alignment result of each file, the bases that hit the data set of population high-frequency heterozygous SNP sites, and then obtaining, for every pair of files, the base information of the sites covered by exactly one sequence in both files (co-detected SNPs, cdSNPs).
Further, calculating CR values between samples in step S5 comprises calculating the concordance rate (CR) of the base information at the cdSNP sites.
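Steps S4 and S5 can be sketched together. The example assumes each sample has already been reduced to a mapping from SNP site to the list of bases observed in the reads covering it; that data layout and the function names are illustrative choices, not something the patent specifies.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]                 # (chromosome, position)
Pileup = Dict[Site, List[str]]         # bases of all reads covering each SNP site


def find_cdsnps(sample_a: Pileup, sample_b: Pileup) -> List[Site]:
    """Sites covered by exactly one read in both samples (co-detected SNPs)."""
    shared = sample_a.keys() & sample_b.keys()
    return sorted(s for s in shared
                  if len(sample_a[s]) == 1 and len(sample_b[s]) == 1)


def concordance_rate(sample_a: Pileup, sample_b: Pileup) -> Tuple[int, float]:
    """Return (number of cdSNPs, fraction of cdSNPs with identical bases)."""
    sites = find_cdsnps(sample_a, sample_b)
    if not sites:
        raise ValueError("no co-detected SNP sites between the two samples")
    matches = sum(sample_a[s][0] == sample_b[s][0] for s in sites)
    return len(sites), matches / len(sites)
```

Returning the cdSNP count alongside CR lets a caller judge whether enough sites were available for the rate to be meaningful.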
Preferably, in the sample identity analysis of step S6, the classification to which the CR value of the current pair of sequencing data belongs is determined from known reference ranges: the CR reference range calculated from 50 unrelated samples, the CR reference range of 30 samples each tested twice, and the CR reference range of 20 relatives.
Sample relationship type    CR value   Standard deviation of CR   CR-1.96*SD   CR+1.96*SD
Unrelated samples           0.603      0.006                      0.591        0.616
Same sample                 0.778      0.013                      0.753        0.803
Parent-child relationship   0.698      0.014                      0.672        0.725
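The reference ranges in the table are the CR value plus or minus 1.96 standard deviations. The short check below recomputes them from the rounded table entries; it is only a sanity check under that assumption, and because the inputs are already rounded to three decimals the recomputed bounds agree with the tabulated ones to within about 0.002 rather than exactly.

```python
def reference_range(cr: float, sd: float, z: float = 1.96):
    """Lower and upper bounds of the CR reference range (CR -/+ z * SD)."""
    return cr - z * sd, cr + z * sd


# (CR value, SD, tabulated lower bound, tabulated upper bound)
TABLE = {
    "unrelated":    (0.603, 0.006, 0.591, 0.616),
    "same sample":  (0.778, 0.013, 0.753, 0.803),
    "parent-child": (0.698, 0.014, 0.672, 0.725),
}

if __name__ == "__main__":
    for name, (cr, sd, low, high) in TABLE.items():
        lo_calc, hi_calc = reference_range(cr, sd)
        print(f"{name}: {lo_calc:.3f} to {hi_calc:.3f} (table: {low} to {high})")
```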
Based on the CR values in the table above, the following classification is derived:
1) When CR < 0.616, the two samples are judged to have no significant relationship;
2) When CR > 0.672 and CR < 0.725, the two samples are judged to be related;
3) When CR > 0.753, the two samples are judged to come from the same individual;
4) When the CR value falls in none of the above ranges, the sample may be contaminated in the laboratory or, for a noninvasive test sample, may have a high fetal DNA concentration.
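The four decision rules above translate directly into a small function. The thresholds are the ones stated in the text; the function name and the returned labels are illustrative choices.

```python
def classify_cr(cr: float) -> str:
    """Map a concordance rate to the relationship category of step S6."""
    if cr < 0.616:
        return "unrelated"
    if 0.672 < cr < 0.725:
        return "related"
    if cr > 0.753:
        return "same individual"
    # Gaps between the ranges: possible laboratory contamination, or a
    # noninvasive sample with a high fetal DNA concentration.
    return "indeterminate"
```

Values that fall in the gaps between the ranges are reported separately instead of being forced into the nearest category, which mirrors rule 4.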
Furthermore, the method exploits the fact that, between two samples, a certain number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.
The theoretical basis of the invention is as follows: the base concordance at cdSNP sites between two low-depth sequencing runs of the same sample differs markedly from that between two unrelated samples, so whether two samples derive from the same sample can be determined simply by calculating the concordance rate between the sample under analysis and a set of samples. At very low sequencing depth the genotype at each SNP site is unknown, but theoretical derivation shows that even at a sequencing depth of 1 the base concordance of high-frequency heterozygous SNPs observed across the whole genome differs between pairs from the same sample and pairs from different samples.
Suppose a cdSNP is obtained from the raw data of two low-depth whole-genome sequencing runs, that its two alleles are A and B, and that their population frequencies are p and q. If the two runs come from different samples, the probability that the bases covered by the single read in each run are identical is 1-2pq; if they come from the same sample, the probability is 1-pq. For example, when p = q = 0.5, so that the population heterozygote frequency of the SNP site is 0.5, the expected CR for different samples is 0.5, whereas the expected CR for the same sample is 0.75.
The expected cdSNP concordance for two unrelated samples is calculated as follows:
Sample 1 genotype   Sample 2 genotype   Expected cdSNP concordance at depth 1X
AA                  AA                  $E = p^4$
BB                  BB                  $E = q^4$
AB                  AB                  $E = 2 \cdot 2 \cdot 0.5 \cdot p^2q^2 = 2p^2q^2$
AA                  AB                  $E = p^2 \cdot 2pq \cdot 0.5 = p^3q$
BB                  AB                  $E = q^2 \cdot 2pq \cdot 0.5 = pq^3$
For a SNP site whose two alleles have population frequencies p and q, with p + q = 1, the concordance rate (CR) is obtained by summing the expected concordance values of the genotype combinations above. The AA/AB and BB/AB terms are each counted twice because either sample may be the heterozygote:
$$CR = p^4 + q^4 + 2p^2q^2 + 2p^3q + 2pq^3 = p^3(p+q) + q^3(p+q) + p^2q(p+q) + pq^2(p+q) = p^2(p+q)^2 + q^2(p+q)^2 = p^2 + q^2 = 1 - 2pq$$
When $p = 0.5$, $CR = 1 - 2 \times 0.5 \times 0.5 = 0.5$.
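The closed form 1 - 2pq can be checked numerically by enumerating every genotype pair. The sketch below assumes Hardy-Weinberg genotype frequencies and that the single read from a heterozygote shows either allele with probability 0.5, which are the same assumptions the table uses; the function names are illustrative.

```python
from itertools import product


def genotype_probs(p: float) -> dict:
    """Hardy-Weinberg genotype frequencies for allele frequencies p and q = 1 - p."""
    q = 1.0 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}


# Probability that a single read from each genotype shows base A or base B.
READ_BASE_PROBS = {
    "AA": {"A": 1.0},
    "AB": {"A": 0.5, "B": 0.5},
    "BB": {"B": 1.0},
}


def expected_cr_unrelated(p: float) -> float:
    """Expected concordance of one read from each of two unrelated samples."""
    g = genotype_probs(p)
    total = 0.0
    for g1, g2 in product(g, repeat=2):        # all ordered genotype pairs
        r1, r2 = READ_BASE_PROBS[g1], READ_BASE_PROBS[g2]
        match = sum(r1[base] * r2.get(base, 0.0) for base in r1)
        total += g[g1] * g[g2] * match
    return total
```

Because the enumeration runs over ordered pairs, the AA/AB and BB/AB combinations are automatically counted in both orders, matching the doubled terms in the sum.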
The expected cdSNP concordance when the same individual is tested twice is calculated as follows:
Detection 1   Detection 2   Expected cdSNP concordance at depth 1X
AA            AA            $E = p^2$ (E = 0.25 when p = 0.5)
BB            BB            $E = q^2$ (E = 0.25 when p = 0.5)
AB            AB            $E = 2pq \times 0.5 = pq$ (E = 0.25 when p = 0.5)
For a SNP site whose alleles have population frequencies p and q, the CR value is then obtained by summing the probabilities for the genotypes above:
$$CR = p^2 + q^2 + pq = 1 - pq$$
When $p = 0.5$, $CR = 1 - 0.5 \times 0.5 = 0.75$.
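The same-individual case can be verified in the same way. This sketch again assumes Hardy-Weinberg genotype frequencies and that two independent single reads from a heterozygote agree with probability 0.5; the function name is illustrative.

```python
def expected_cr_same_individual(p: float) -> float:
    """Expected concordance of two independent single reads from one individual."""
    q = 1.0 - p
    genotype_prob = {"AA": p * p, "AB": 2 * p * q, "BB": q * q}
    # Homozygotes always agree; a heterozygote agrees when both reads show
    # A (0.25) or both show B (0.25), i.e. with probability 0.5.
    match_prob = {"AA": 1.0, "AB": 0.5, "BB": 1.0}
    return sum(genotype_prob[g] * match_prob[g] for g in genotype_prob)
```

Compared with the unrelated case, the result is larger by exactly pq, which is the gap the method relies on.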
At a given site, the expected concordance for the same sample exceeds that for unrelated samples by pq. Therefore, for a selected set of high-frequency SNPs, when the number of cdSNPs is sufficiently large, the CR values of the two types of sequencing-data pairs differ significantly, and this value can be used to distinguish them.
The following are specific examples of the application of the present invention.
Example 1
The NIPT result shows an abnormally high fetal concentration, and contamination or a sample mix-up is suspected, so sample identity identification is performed:
A laboratory found a sample whose first test showed a very high fetal concentration, exceeding 85%. The sample was tested three times in total, and the second and third tests both indicated a female fetus. Technical support requested a sample uniqueness analysis.
The three NIPT results, each from its own run, are as follows:
First NIPT result: abnormally high fetal concentration, male fetal signal
Second NIPT result: normal fetal concentration, female fetal signal
Third NIPT result: normal fetal concentration, female fetal signal
cdSNP concordance analysis was performed on the raw data of the three results:
1.1 Obtain the raw files of the three sequencer outputs, 2702-IonXpress_042, 2702-IonXpress_040 and 2702-IonXpress_032, and use the raw data of the unrelated sample EJ042423 as an external reference;
1.2 Align the above raw files to the human reference genome hg19;
1.3 Filter out unaligned, low-quality and multiply aligned sequences;
1.4 Form the set of uniquely aligned sequences;
1.5 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
1.6 Calculate the expected concordance (CR value) between samples;
1.7 Judge the identity result between samples according to the CR value.
The analysis results were as follows:
Conclusion:
Analysis of the 3 sets of raw data for sample 2702 found that IonXpress_040 and IonXpress_032 are data from the same sample source, whereas IonXpress_042 was severely contaminated with a male genome, suggesting that the laboratory should pay attention to sample cross-contamination.
Example 2
Identity identification when a positive-sample review is inconsistent:
Follow-up feedback showed that NIPT of a certain sample detected trisomy 21 with a male fetus, whereas the amniotic fluid diagnosis was negative with a female fetus. Because the NIPT result disagreed with the amniotic fluid result and the sexes also disagreed, the customer suspected a sample mix-up. The auditor asked whether the sample carried a clinical note of embryo reduction and whether the discrepancy was due to embryo reduction or to a sample error, and requested identity analysis of the noninvasive and amniotic fluid data.
First NIPT result: fetal concentration 7.4%; T21, male fetus;
Second NIPT result: fetal concentration 8.1%; T21, male fetus;
Amniotic fluid CNV-seq result: negative, female fetus;
cdSNP concordance analysis was performed on the raw data of the above NIPT and CNV-seq results:
2.1 Obtain the raw files EP100342 and EM100872D of the NIPT and CNV-seq sequencer outputs; use the raw data of the unrelated sample EJ042423 as an external reference; and use the results of two libraries built from the same specimen (EP100190_IonXpress_016, EP100190_IonXpress_025) and the raw data of monozygotic twin samples (EM004201F, EM004202F) as controls;
2.2 Align the above raw files to the human reference genome hg19;
2.3 Filter out unaligned, low-quality and multiply aligned sequences;
2.4 Form the set of uniquely aligned sequences;
2.5 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
2.6 Calculate the expected concordance (CR value) between samples;
2.7 Judge the identity result between samples according to the CR value.
The analysis results were as follows:
Conclusion:
Analysis of the two samples EM100872D and EP100342 shows that their concordance rate of 0.68 meets the expected value for two samples that are close relatives, so the samples were not mixed up in the noninvasive test, and the Y signal and the trisomy 21 signal may have come from the miscarriage.
To verify the accuracy of the invention, maternal blood leukocytes were recalled for further STR verification (maternal sample of EM100872D: GEM100872B; maternal blood leukocyte sample of the noninvasive sample EP100342: ES100003B). The STR genotypes of GEM100872B and ES100003B were identical, and both showed a first-degree relationship with EM100872D, confirming that the samples were not mixed up and that the Y signal and the trisomy 21 signal came from the miscarriage.
Example 3
Identification of false negative sample identity:
A certain sample, EP007057, was a twin pregnancy, and the noninvasive test result showed no abnormality (46,XY). Pregnancy outcome: intrauterine death of both twins (induction of labor, two male infants); CMA result of the embryonic tissue: 47,XXY. It was uncertain whether the samples had been mixed up, and a sample uniqueness analysis of the maternal leukocyte sample EP007057R and the noninvasive sample EP007057 was requested;
The procedure was the same as in Example 2;
The analysis results were as follows:
File1       File2      cdSNP expected concordance (CR value)   Identity result
EP007057R   EJ042423   0.591                                   Two independent samples (external reference)
EP007057    EJ042423   0.593                                   Two independent samples (external reference)
EP007057R   EP007057   0.759                                   Same sample
Conclusion:
Analysis of the two samples EP007057 and EP007057R shows that their concordance rate of 0.759 meets the expected value for the same sample, so the samples were not mixed up in the noninvasive test, and the inconsistency with the karyotype obtained by puncture may be caused by placental mosaicism.
Example 4
Routine laboratory quality control: because the method requires no additional experimental steps or cost and is convenient and fast, the laboratory added a step that performs sample identity analysis on the results of reworked (repeated) samples to check for possible sample mix-ups. Identity analysis is also performed regularly on samples with the same label across different runs to check for possible cross-contamination and to intercept quality incidents effectively.
Identity analysis steps for reworked samples:
4.1 Align the raw files to the human reference genome hg19;
4.2 Filter out unaligned, low-quality and multiply aligned sequences;
4.3 Form the set of uniquely aligned sequences;
4.4 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
4.5 Calculate the expected concordance (CR value) between samples;
The above steps have been packaged into an automated analysis workflow, the cdSNP analysis plug-in;
4.6 When the system recognizes that a sample has a rework result, it obtains the raw data files and starts the cdSNP analysis plug-in;
4.7 When CR < 0.616, the system automatically prompts: the two samples are unrelated; please check for a possible sample mix-up;
When CR >= 0.672 and CR < 0.725, the system automatically prompts: a certain kinship exists; please investigate possible contamination;
When CR >= 0.753, the system prompts: same sample, quality control passed;
The two results of the reworked samples were analyzed as follows:
Conclusion:
For the reworked sample EP005415 in this batch, CR = 0.707, which satisfies CR >= 0.672 and CR < 0.725, so the system automatically prompted: a certain kinship exists; please investigate possible contamination. The laboratory was required to investigate and correct the cause of contamination. The CR values of the other samples were > 0.753, so the system prompted: same sample, quality control passed.
Without changing the existing NIPT and CNV-seq experimental protocols or sequencing volume, the invention exploits the fact that, between two samples, a certain number of population high-frequency heterozygous sites are each covered by exactly one read in both samples, and on that basis develops a complete method for identifying sample identity based on low-depth sequencing.
The invention can analyze raw sequencing data files backed up by the testing institution and can determine whether different samples are identical without changing the experimental protocol or increasing the sequencing volume and detection cost. This makes it convenient for a laboratory to control the quality of the detection workflow, and it helps in clinically investigating sample contamination and in tracing samples when a mix-up is suspected or a serious quality incident (false positive/false negative) occurs.
The invention can identify sample uniqueness quickly and economically and can effectively check for sample mix-ups and contamination. It works directly on the raw sequence data of low-depth whole-genome sequencing tests based on next-generation sequencing, such as the BAM and FASTQ files of NIPT and CNV-seq sequencing, and identifies whether two test samples are the same sample by calculating the concordance rate of the population polymorphic sites (SNPs) that are covered by exactly one sequence in each of the two samples. For tests based on low-depth whole-genome sequencing, it allows the laboratory to perform routine quality control, check for sample contamination, investigate possible sample mix-ups, and trace samples in false-positive/false-negative analyses.
The foregoing describes only preferred embodiments of the invention. Any minor modification, equivalent change or adaptation made to the above embodiments in accordance with the technical solution of the invention falls within the scope of that technical solution.

Claims (2)

1. A method for identifying sample identity based on low-depth sequencing, comprising the following steps:
S1, sequence alignment: high-throughput sequencing data are the basis of low-depth sequencing assays such as NIPT and CNV-seq; BWA alignment software is selected, and the raw sequence data (FASTQ files) obtained from a semiconductor sequencer are aligned to the human genome reference sequence, version GRCh37/hg19, to obtain aligned SAM files, wherein the BWA alignment software comprises BWA-0.7.17 and BWA-MEM;
S2, sequence filtering: the aligned SAM files are filtered to remove sequences that could produce false base calls, namely unaligned sequences, sequences with low alignment quality (MAPQ < 40), and multiply aligned sequences, to obtain valid sequencing data;
S3, selecting a population high-frequency heterozygous SNP site dataset: the human SNP data file, version 151, is downloaded from the database at ftp://ftp.ncbi.nih.gov/snp, and sites at which the SNP has only two genotypes and the minor allele frequency (MAF) is not lower than 0.3 are selected as the population high-frequency heterozygous SNP site dataset;
S4, obtaining the cdSNP site list: for each SAM file, the bases observed in the alignment results at the sites of the population high-frequency heterozygous SNP site dataset are counted; then, for every pair of files, the base information of the sites covered by exactly one sequence in each file is obtained, these sites being the co-detected SNPs, abbreviated cdSNPs;
S5, calculating the CR value between samples: the concordance rate of the base information at the cdSNP sites is calculated, the concordance rate being abbreviated CR;
S6, sample identity analysis: according to the CR value, the following classification is made: 1) when CR < 0.616, the two samples are judged to have no significant relationship; 2) when CR > 0.672 and CR < 0.725, the two samples are judged to be related; 3) when CR > 0.753, the two samples are judged to be from the same individual; 4) when the CR value meets none of the above conditions, the result indicates laboratory contamination of the sample or, for a noninvasive prenatal test sample, a high fetal DNA concentration.
2. The method for identifying sample identity based on low-depth sequencing according to claim 1, wherein the method exploits the property that, between two samples, certain population high-frequency heterozygous sites are each covered by exactly one read.
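The numeric criteria in steps S2, S3, and S6 can be expressed as simple predicates. The sketch below is illustrative only. The thresholds are taken from the claim text, while the function names, the category labels, and the reading of "only two genotypes" as "two alleles at the site" are assumptions made for the example.

```python
MIN_MAPQ = 40  # S2: sequences with MAPQ < 40 are removed
MIN_MAF = 0.3  # S3: minor allele frequency must not be lower than 0.3


def keep_read(is_mapped: bool, mapq: int, is_multi_mapped: bool) -> bool:
    """S2: keep a read only if it is aligned, uniquely placed, and MAPQ >= 40."""
    return is_mapped and mapq >= MIN_MAPQ and not is_multi_mapped


def keep_snp(n_alleles: int, maf: float) -> bool:
    """S3: keep a site with exactly two alleles (assumed reading) and MAF >= 0.3."""
    return n_alleles == 2 and maf >= MIN_MAF


def classify_cr(cr: float) -> str:
    """S6: map a CR value to one of the four outcomes (labels are hypothetical)."""
    if cr < 0.616:
        return "no significant relationship"
    if 0.672 < cr < 0.725:
        return "related"
    if cr > 0.753:
        return "same individual"
    # Values in the gaps between the ranges above: possible laboratory
    # contamination, or high fetal DNA concentration in a noninvasive sample.
    return "indeterminate"
```

Note that the ranges in S6 deliberately leave gaps (0.616 to 0.672, 0.725 to 0.753), and any CR value falling in a gap or on a boundary is routed to the fourth outcome rather than forced into a neighboring class.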
CN202110723066.3A 2021-06-28 2021-06-28 Method for identifying sample identity based on low-depth sequencing Active CN113450871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723066.3A CN113450871B (en) 2021-06-28 2021-06-28 Method for identifying sample identity based on low-depth sequencing

Publications (2)

Publication Number Publication Date
CN113450871A (en) 2021-09-28
CN113450871B (en) 2024-06-11

Family

ID=77813557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110723066.3A Active CN113450871B (en) 2021-06-28 2021-06-28 Method for identifying sample identity based on low-depth sequencing

Country Status (1)

Country Link
CN (1) CN113450871B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113999900B (en) * 2021-10-14 2024-02-20 武汉蓝沙医学检验实验室有限公司 Method for evaluating fetal DNA concentration by using free DNA of pregnant woman and application
CN113969310B (en) * 2021-10-14 2024-02-20 武汉蓝沙医学检验实验室有限公司 Fetal DNA concentration evaluation method and application
CN114530200B (en) * 2022-03-18 2022-09-23 北京阅微基因技术股份有限公司 Mixed sample identification method based on calculation of SNP entropy
CN115810393B (en) * 2022-12-22 2023-08-25 南京普恩瑞生物科技有限公司 Sequencing sample homology detection method and system based on SNPs library of construction crowd

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104946773A (en) * 2015-07-06 2015-09-30 厦门万基生物科技有限公司 Method for judging antenatal parental right relation with SNP
CN109461473A (en) * 2018-09-30 2019-03-12 北京优迅医疗器械有限公司 Fetus dissociative DNA concentration acquisition methods and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3987525A1 (en) * 2019-06-21 2022-04-27 CooperSurgical, Inc. System and method for determining genetic relationships between a sperm provider, oocyte provider, and the respective conceptus
CN112885408B (en) * 2021-02-22 2024-10-01 中国农业大学 Method and device for detecting SNP marker loci based on low-depth sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant