CN113450871B - Method for identifying sample identity based on low-depth sequencing - Google Patents
- Publication number
- CN113450871B (CN202110723066.3A)
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- snp
- sequencing
- low
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Abstract
The invention discloses a method for identifying sample identity based on low-depth sequencing. It exploits the fact that, between two samples, a number of population high-frequency heterozygous SNP sites are each covered by exactly one read in both samples. The method comprises the following steps: S1, sequence alignment; S2, sequence filtering; S3, selecting a dataset of population high-frequency heterozygous SNP sites; S4, obtaining the cdSNP site list; S5, calculating CR values between samples; S6, performing sample identity analysis and drawing a conclusion. The invention can identify sample identity by analyzing the raw data files of low-depth whole-genome sequencing without changing the experimental protocol or increasing the sequencing volume. It has the advantages of low detection cost and short analysis time, and it enables noninvasive fetal paternity testing using the peripheral blood of the pregnant woman.
Description
Technical Field
The invention relates to the technical field of molecular genetic testing for prenatal diagnosis, and in particular to a method for identifying sample identity based on low-depth sequencing.
Background
At present, the identity of human DNA samples in forensic science, for example in individual identification and paternity testing, is mainly established by comparative analysis of specific short tandem repeats (STRs) as biomarkers. With the development of gene chip technology and next-generation high-throughput sequencing, comparative analysis using single nucleotide polymorphisms (SNPs) as biomarkers has also come into use.
STRs, also known as microsatellite DNA, are a class of polymorphic DNA loci that are widely distributed in the human genome. They generally consist of a core sequence of 2-6 bases arranged in tandem repeats, and variation in the number of repeats of the core sequence produces length polymorphism. The number of repeats at a particular chromosomal position is fixed for a given individual but may differ between individuals at the same position, which gives rise to the polymorphism of these repeat sequences in the population. Because the human genome contains a large number of such repeats, individuals can be clearly distinguished by detecting these polymorphisms. Because STR typing is highly sensitive, highly discriminating, and easy to standardize and automate, it is widely used in fields such as forensic individual identification and paternity testing.
For paternity testing, this method requires separate samples from the child, the father, and the mother, and judges whether a parent-child relationship exists according to whether the STR results of the three conform to the rules of inheritance. Because the child must be an independent individual for accurate sampling, the method has clear limitations for noninvasive fetal paternity testing.
SNPs are DNA sequence polymorphisms at the genomic level caused by variation of a single nucleotide. They are among the most common heritable variants in humans, accounting for over 90% of all known polymorphisms. SNPs are widely distributed in the human genome, occurring on average once every 500-1000 base pairs, with an estimated total of 3 million or more. Methods that select specific SNP sites as markers for chip or high-depth sequencing can be applied stably and accurately to individual identification and paternity testing. They can even analyze contaminated samples containing a low-proportion admixture, and can be used for noninvasive fetal paternity testing with maternal peripheral blood. However, this technology relies mainly on accurate genotyping of specific SNP sites for comparison, and suffers from high detection cost and long analysis time.
Identity identification by STR and SNP techniques can generally only be performed by genotyping specific sites in a retained sample and a newly collected sample and comparing the results, in order to determine whether the sample backed up in the laboratory at the time is identical to the sample to be checked. For fragmented high-throughput sequencing DNA libraries and their backed-up data, however, accurate genotypes cannot be obtained because coverage of the specific sites is insufficient or the library capacity is insufficient, so an effective means of analysis is lacking. In the field of prenatal screening and diagnosis, NIPT and CNV-seq are widely used tests based on low-depth whole-genome sequencing. Establishing the identity of prenatal diagnosis samples generally requires additional STR or SNP analysis, which adds detection steps and operating cost, and such methods cannot be used for quality control of the detection workflow or for tracing back to the original results.
As more and more molecular tests based on high-throughput sequencing are introduced in clinical laboratories, retrospective analysis of laboratory test data faces three major difficulties: 1. The experimental process cannot be traced back; contamination or mix-ups often arise from unnoticed operating errors during the experiment. 2. Samples are mixed up or contaminated; contamination may occur at the source, or samples may be mixed up during testing, and when a problem with the result is discovered later the laboratory cannot repeat the sample, which causes a serious quality incident. 3. Insufficient sample retention makes tracing impossible, for example when the sample has degraded or too little plasma remains, when DNA has degraded after prolonged storage or storage problems, or when the plasma cell-free DNA fragments are too short, so that the effective library capacity is insufficient and too little information is available for STR and SNP analysis.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a method for identifying sample identity based on low-depth sequencing. The method can identify sample identity by analyzing the raw data files of low-depth whole-genome sequencing without changing the experimental protocol or increasing the sequencing volume. It has the advantages of low detection cost and short analysis time, and it enables noninvasive fetal paternity testing using maternal peripheral blood.
To solve the above technical problems, the technical solution of the invention is a method for identifying sample identity based on low-depth sequencing, comprising the following steps:
S1, sequence alignment;
S2, sequence filtering;
S3, selecting a dataset of population high-frequency heterozygous SNP sites;
S4, obtaining the cdSNP site list;
S5, calculating CR values between samples;
S6, performing sample identity analysis and drawing a conclusion.
The invention is further described as follows.
Preferably, the sequence alignment in step S1 is the basis for low-depth sequencing tests such as NIPT and CNV-seq that use high-throughput sequencing data. It comprises using the BWA alignment software (bwa-0.7.17, BWA-MEM) to align the raw sequence data (FASTQ files) obtained from a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.
Preferably, the sequence filtering in step S2 comprises filtering the aligned SAM files to remove reads that may cause base misidentification, namely unmapped reads, reads with low mapping quality (MAPQ < 40), and reads that align to multiple positions, thereby obtaining valid reads.
Preferably, selecting the dataset of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data file (version 151) from the database ftp:// ftp. and selecting sites that have only two alleles and a minor allele frequency (MAF) of not less than 0.3 as the dataset of population high-frequency heterozygous SNP sites.
Preferably, obtaining the cdSNP site list in step S4 comprises recording, in the alignment result of each file, the bases observed at the sites of the population high-frequency heterozygous SNP dataset, and then, for each pair of files, obtaining the base information of the sites that are covered by exactly one read in both files (co-detected SNPs, cdSNPs).
Preferably, calculating CR values between samples in step S5 comprises calculating the consistency value (CR) of the base information at the cdSNP sites (co-detected SNPs, cdSNPs).
Preferably, the sample identity analysis in step S6 yields the following classification decisions based on the CR value:
1) When CR <0.616, the two samples are judged to have no significant relationship;
2) When CR >0.672 and CR <0.725, the two samples are judged to be related;
3) When CR >0.753, the two samples are judged to come from the same individual;
4) When the CR value falls outside the above ranges, the cause may be laboratory contamination or a high fetal DNA concentration in the noninvasive test sample.
Preferably, the invention exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.
The beneficial effects of the invention are as follows:
1. The invention can analyze the raw sequencing data files backed up by the testing institution and can determine whether different samples are identical without changing the experimental protocol and without increasing the sequencing volume or the detection cost. This makes it convenient for a laboratory to perform quality control of the detection workflow, and it helps in checking for sample contamination in clinical practice and in tracing samples when a sample mix-up may have occurred or a serious quality incident (false positive/false negative) has occurred.
2. The invention provides a method that judges sample identity by comparing only the base consistency of SNP sites covered at depth 1 in both sets of sequencing data. Both the binomial distribution model and practice show that at extremely low sequencing depth, even at a coverage depth as low as 0.05X, many SNP sites are still covered by exactly one read in each of the two samples (co-detected SNPs, cdSNPs). By selecting population high-frequency heterozygous SNP sites, cdSNP consistency analysis can be performed accurately.
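The binomial argument in point 2 can be illustrated with a small sketch. It is not taken from the patent: the read length (150 bp), the genome size (3.1 Gb), the number of candidate SNP sites (1,000,000), and the function names are assumptions chosen only for illustration, and reads are assumed to fall uniformly and independently across the genome.

```python
def prob_exactly_one_read(depth, read_len=150, genome_size=3_100_000_000):
    """Binomial probability that a fixed site is covered by exactly one read.

    read_len and genome_size are illustrative assumptions, not patent values.
    """
    n_reads = int(depth * genome_size / read_len)  # reads implied by the depth
    p_cover = read_len / genome_size               # chance one read covers the site
    return n_reads * p_cover * (1.0 - p_cover) ** (n_reads - 1)


def expected_cdsnp_count(n_sites, depth_a, depth_b):
    """Expected number of sites with single-read coverage in both samples."""
    return n_sites * prob_exactly_one_read(depth_a) * prob_exactly_one_read(depth_b)


if __name__ == "__main__":
    p1 = prob_exactly_one_read(0.05)
    n_cd = expected_cdsnp_count(1_000_000, 0.05, 0.05)
    print(f"P(exactly one read at 0.05X) = {p1:.4f}")
    print(f"Expected cdSNPs among 1,000,000 candidate sites = {n_cd:.0f}")
```

Under these assumptions a site has a chance of a little under 5% of being covered by exactly one read at 0.05X, so a candidate set of a million sites would still be expected to yield on the order of two thousand cdSNPs per sample pair.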
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following specific embodiments.
The invention discloses a method for identifying sample identity based on low-depth sequencing, comprising the following steps:
S1, sequence alignment;
S2, sequence filtering;
S3, selecting a dataset of population high-frequency heterozygous SNP sites;
S4, obtaining the cdSNP site list;
S5, calculating CR values between samples;
S6, performing sample identity analysis and drawing a conclusion.
Further, the sequence alignment in step S1 is the basis for low-depth sequencing tests such as NIPT and CNV-seq that use high-throughput sequencing data. It comprises using the BWA alignment software (bwa-0.7.17, BWA-MEM) to align the raw sequence data (FASTQ files) obtained from a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.
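As a minimal sketch of step S1, the alignment could be driven from Python as shown here. The file names, the thread count, and the helper names are hypothetical; the sketch assumes that the `bwa` executable is on the PATH and that the reference FASTA has already been indexed with `bwa index`.

```python
import subprocess


def bwa_mem_command(reference_fasta, fastq_path, threads=4):
    """Build the argument list for a single-end `bwa mem` run."""
    return ["bwa", "mem", "-t", str(threads), reference_fasta, fastq_path]


def align_to_sam(reference_fasta, fastq_path, sam_path, threads=4):
    """Run bwa mem and write its standard output (SAM text) to sam_path."""
    cmd = bwa_mem_command(reference_fasta, fastq_path, threads)
    with open(sam_path, "w") as sam_out:
        # check=True raises CalledProcessError if bwa exits with an error
        subprocess.run(cmd, stdout=sam_out, check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(bwa_mem_command("hg19.fa", "sample.fastq")))
```

Writing standard output to a file mirrors the usual shell redirection `bwa mem ref.fa reads.fastq > aln.sam`.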
Further, the sequence filtering in step S2 comprises filtering the aligned SAM files to remove reads that may cause base misidentification, namely unmapped reads, reads with low mapping quality (MAPQ < 40), and reads that align to multiple positions, thereby obtaining valid reads.
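The three filters of step S2 can be sketched on plain SAM text as follows. The patent does not say how multi-mapped reads are recognised; this sketch assumes they are marked either by the secondary/supplementary flag bits or by the `XA:Z` tag that BWA uses to list alternative hits. The function names are hypothetical.

```python
UNMAPPED = 0x4          # SAM flag bit: read is unmapped
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def keep_alignment(sam_line, min_mapq=40):
    """Return True if one SAM alignment record passes the three filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & UNMAPPED:
        return False
    if mapq < min_mapq:
        return False
    # Assumption: multi-mapped reads carry secondary/supplementary flags
    # or an XA:Z tag with alternative hits.
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    if any(tag.startswith("XA:Z:") for tag in fields[11:]):
        return False
    return True


def filter_sam(lines):
    """Keep header lines and the alignment records that pass keep_alignment."""
    return [ln for ln in lines if ln.startswith("@") or keep_alignment(ln)]
```

The same thresholds could equally be applied with a dedicated SAM/BAM toolkit; plain text parsing is used here only to keep the sketch self-contained.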
Further, selecting the dataset of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data file (version 151) from the database ftp:// ftp. and selecting sites that have only two alleles and a minor allele frequency (MAF) of not less than 0.3 as the dataset of population high-frequency heterozygous SNP sites.
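The selection rule of step S3 (two alleles only, MAF not lower than 0.3) can be sketched as a simple filter. The record layout used here, a dictionary with `chrom`, `pos`, and `allele_freqs` keys, is a hypothetical simplification and not the format of the downloaded SNP data file.

```python
def select_high_het_snps(records, min_maf=0.3):
    """Return (chrom, pos) for biallelic sites whose minor allele frequency >= min_maf.

    Each record is a dict with hypothetical keys:
      "chrom": str, "pos": int, "allele_freqs": {allele: population frequency}
    """
    selected = []
    for rec in records:
        freqs = rec["allele_freqs"]
        if len(freqs) != 2:               # keep sites with exactly two alleles
            continue
        if min(freqs.values()) >= min_maf:  # the smaller frequency is the MAF
            selected.append((rec["chrom"], rec["pos"]))
    return selected


if __name__ == "__main__":
    demo = [
        {"chrom": "chr1", "pos": 1000, "allele_freqs": {"A": 0.55, "G": 0.45}},
        {"chrom": "chr1", "pos": 2000, "allele_freqs": {"C": 0.95, "T": 0.05}},
        {"chrom": "chr2", "pos": 3000, "allele_freqs": {"A": 0.4, "C": 0.3, "T": 0.3}},
    ]
    print(select_high_het_snps(demo))
```

In the demo data only the first site is kept: the second has a MAF below 0.3 and the third has three alleles.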
Further, obtaining the cdSNP site list in step S4 comprises recording, in the alignment result of each file, the bases observed at the sites of the population high-frequency heterozygous SNP dataset, and then, for each pair of files, obtaining the base information of the sites that are covered by exactly one read in both files (co-detected SNPs, cdSNPs).
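A minimal sketch of the pairing in step S4 is given here. It assumes that the bases observed at each selected SNP site have already been collected per file into a dictionary mapping `(chrom, pos)` to the list of bases seen in valid reads; that data structure and the function name are illustrative assumptions.

```python
def find_cdsnps(bases_a, bases_b):
    """Return sites covered by exactly one read in both files.

    bases_a, bases_b: dict mapping (chrom, pos) -> list of observed bases.
    The result maps each co-detected site to (base in file A, base in file B).
    """
    cdsnps = {}
    for site, observed_a in bases_a.items():
        observed_b = bases_b.get(site)
        if observed_b is None:
            continue  # site not covered in the second file
        if len(observed_a) == 1 and len(observed_b) == 1:
            cdsnps[site] = (observed_a[0], observed_b[0])
    return cdsnps


if __name__ == "__main__":
    file_a = {("chr1", 1000): ["A"], ("chr1", 5000): ["G", "G"], ("chr2", 42): ["T"]}
    file_b = {("chr1", 1000): ["G"], ("chr1", 5000): ["G"], ("chr3", 7): ["C"]}
    print(find_cdsnps(file_a, file_b))
```

Sites covered by more than one read in either file are skipped, in keeping with the depth-1 requirement described in the text.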
Further, calculating CR values between samples in step S5 comprises calculating the consistency value (CR) of the base information at the cdSNP sites (co-detected SNPs, cdSNPs).
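The patent does not spell out the CR formula at this point. The sketch below assumes, in line with the theoretical derivation later in the description, that CR is the fraction of cdSNP sites at which the single base observed in one file equals the single base observed in the other. The input layout and function name are hypothetical.

```python
def consistency_rate(cdsnps):
    """Fraction of cdSNP sites whose two observed bases are identical.

    cdsnps: dict mapping (chrom, pos) -> (base in file A, base in file B).
    Assumption: CR = matching sites / all cdSNP sites.
    """
    if not cdsnps:
        raise ValueError("no cdSNP sites available; CR is undefined")
    matches = sum(1 for base_a, base_b in cdsnps.values() if base_a == base_b)
    return matches / len(cdsnps)


if __name__ == "__main__":
    demo = {
        ("chr1", 100): ("A", "A"),
        ("chr1", 200): ("C", "T"),
        ("chr2", 300): ("G", "G"),
        ("chr3", 400): ("T", "T"),
    }
    print(consistency_rate(demo))
```

Raising an error for an empty site list avoids reporting a CR value when the two files share no single-read sites.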
Preferably, in the sample identity analysis of step S6, the category to which the CR value of the two current sets of sequencing data belongs can be determined from known reference ranges: the CR reference range calculated from 50 unrelated samples, the CR reference range from 30 samples each tested twice, and the CR reference range from 20 relatives.
Sample relationship type | Mean CR value | Standard deviation (SD) of CR | CR - 1.96*SD | CR + 1.96*SD |
Unrelated samples | 0.603 | 0.006 | 0.591 | 0.616 |
Identical sample | 0.778 | 0.013 | 0.753 | 0.803 |
Parent-child relationship | 0.698 | 0.014 | 0.672 | 0.725 |
Based on the CR values in the above table, the following classification decisions are derived:
1) When CR <0.616, the two samples are judged to have no significant relationship;
2) When CR >0.672 and CR <0.725, the two samples are judged to be related;
3) When CR >0.753, the two samples are judged to come from the same individual;
4) When the CR value falls outside the above ranges, the cause may be laboratory contamination or a high fetal DNA concentration in the noninvasive test sample.
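The four decision rules can be written as a small function. The thresholds are the ones stated in the rules above; the function name and the returned labels are hypothetical.

```python
def classify_cr(cr):
    """Map a CR value to one of the four categories of the decision rules."""
    if cr < 0.616:
        return "no significant relationship"
    if 0.672 < cr < 0.725:
        return "related"
    if cr > 0.753:
        return "same individual"
    # Any value outside the three ranges above falls through to rule 4.
    return "indeterminate: possible contamination or high fetal DNA concentration"


if __name__ == "__main__":
    for value in (0.591, 0.68, 0.759, 0.65):
        print(value, "->", classify_cr(value))
```

Values in the gaps between the ranges, such as 0.65 or 0.74, fall through to the fourth rule rather than being forced into a neighbouring category.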
Furthermore, the invention exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.
The theoretical basis of the invention is as follows: the base consistency at cdSNP sites between two low-depth sequencing runs of the same sample differs markedly from that between two unrelated samples, so whether two samples derive from the same sample can be determined simply by calculating the consistency value between the sample to be analyzed and a group of samples. At very low sequencing depth the genotype of each SNP site is unknown, but theoretical derivation shows that even at a sequencing depth of 1, the base consistency of high-frequency heterozygous SNPs observed across the whole genome differs between the same sample and different samples.
Suppose a cdSNP is obtained from the raw data of two low-depth whole-genome sequencing runs, and that its two alleles A and B have population frequencies p and q. If the two runs come from different samples, the probability that the bases at this site, each covered by only one read, are identical is 1-2pq; if they come from the same sample, the probability is 1-pq. For example, if the population heterozygosity of the SNP site is 0.5 (p = q = 0.5), the expected CR for different samples is 0.5, whereas for the same sample the expected CR is 0.75.
The expected cdSNP consistency for two unrelated samples is calculated as follows:
Sample 1 genotype | Sample 2 genotype | Expected cdSNP consistency at depth 1X |
AA | AA | $E = p^4$ |
BB | BB | $E = q^4$ |
AB | AB | $E = 2 \cdot 2 \cdot 0.5 \cdot p^2q^2 = 2p^2q^2$ |
AA | AB | $E = p^2 \cdot 2pq \cdot 0.5 = p^3q$ |
BB | AB | $E = q^2 \cdot 2pq \cdot 0.5 = pq^3$ |
For an SNP site whose alleles have population frequencies p and q, with p + q = 1, the consistency probability (CR) is obtained by summing the expected values of the above genotype combinations. The AA/AB and BB/AB combinations are counted twice, because either sample may be the heterozygous one:
$$CR = p^4 + q^4 + 2p^2q^2 + 2p^3q + 2pq^3 = p^3(p+q) + q^3(p+q) + p^2q(p+q) + pq^2(p+q) = p^2(p+q)^2 + q^2(p+q)^2 = p^2 + q^2 = 1 - 2pq$$
When $p = 0.5$, $CR = 1 - 2 \times 0.5 \times 0.5 = 0.5$.
The expected cdSNP consistency for two tests of the same individual is calculated as follows:
Test 1 | Test 2 | Expected cdSNP consistency at depth 1X |
AA | AA | $E = p^2$ ($E = 0.25$ when $p = 0.5$) |
BB | BB | $E = q^2$ ($E = 0.25$ when $p = 0.5$) |
AB | AB | $E = 2pq \times 0.5 = pq$ ($E = 0.25$ when $p = 0.5$) |
For an SNP site whose alleles have population frequencies p and q, the CR value is then obtained as the sum of the probabilities for the above genotypes:
$$CR = p^2 + q^2 + pq = 1 - pq$$
When $p = 0.5$, $CR = 1 - 0.5 \times 0.5 = 0.75$.
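The two closed forms, 1-2pq for unrelated samples and 1-pq for the same individual, can be checked numerically by enumerating genotypes. The sketch assumes Hardy-Weinberg genotype frequencies and that a single read samples either allele of a genotype with equal probability, as the tables in the derivation do; the function names are illustrative.

```python
from itertools import product


def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for alleles A (p) and B (q = 1 - p)."""
    q = 1.0 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}


def match_prob(g1, g2):
    """Probability that one randomly read allele from each genotype is identical."""
    return sum(0.25 for a, b in product(g1, g2) if a == b)


def expected_cr_unrelated(p):
    """Expected CR when the two reads come from independent individuals."""
    f = genotype_freqs(p)
    return sum(f[g1] * f[g2] * match_prob(g1, g2) for g1, g2 in product(f, f))


def expected_cr_same(p):
    """Expected CR when both reads come from the same individual."""
    f = genotype_freqs(p)
    return sum(f[g] * match_prob(g, g) for g in f)


if __name__ == "__main__":
    for p in (0.5, 0.3):
        q = 1.0 - p
        print(p, expected_cr_unrelated(p), 1 - 2 * p * q, expected_cr_same(p), 1 - p * q)
```

For any p the enumerated values agree with the closed forms, and their difference equals pq.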
At any given site, the expected consistency for the same sample exceeds that for unrelated samples by pq. Therefore, for a selected set of high-frequency SNPs, when the number of cdSNPs is sufficiently large, the CR values of the two kinds of sequencing data pairs differ significantly, and this value can be used to distinguish them.
The following are specific examples of the application of the present invention.
Example 1
The fetal concentration in an NIPT result is abnormally high, contamination or a sample mix-up is suspected, and sample identity identification is performed:
The laboratory found a sample whose first test showed a very high fetal concentration, exceeding 85%. The test was repeated, giving three results in total, of which the latter two indicated a female fetus, and technical support requested a sample uniqueness analysis.
The three NIPT results, each from its own run, are as follows:
First NIPT result: abnormally high fetal concentration, male fetus signal
Second NIPT result: normal fetal concentration, female fetus signal
Third NIPT result: normal fetal concentration, female fetus signal
cdSNP consistency analysis was performed on the raw data of the three results:
1.1 Obtain the raw files of the three sets of off-machine data, 2702-IonXpress_042, 2702-IonXpress_040 and 2702-IonXpress_032, and use the raw data of the unrelated sample EJ042423 as an external reference;
1.2 Align the above raw files to the human reference genome hg19;
1.3 Filter out unaligned reads, low-quality reads, and reads aligned to multiple positions;
1.4 Retain the uniquely aligned reads;
1.5 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
1.6 Calculate the consistency values (CR values) between samples;
1.7 Judge the identity result between samples according to the CR value.
The analysis results were as follows:
Conclusion:
Analysis of the three raw data sets of sample 2702 shows that IonXpress_040 and IonXpress_032 are data from the same sample source, whereas IonXpress_042 was severely contaminated with a male genome, indicating that the laboratory should pay attention to sample cross-contamination.
Example 2
Identity identification when the follow-up of a positive sample gives inconsistent results:
Follow-up feedback showed that NIPT of a certain sample detected trisomy 21 and a male fetus, whereas the amniotic fluid diagnosis was negative with a female fetus. Because the NIPT result disagreed with the amniotic fluid result, including on fetal sex, the customer suspected a sample error. On review, the auditor found that the sample carried a clinical note of fetal reduction. To determine whether the discrepancy was an effect of the fetal reduction or a sample error, identity analysis of the noninvasive and amniotic fluid data was requested.
First NIPT result: fetal concentration 7.4%; T21, male fetus;
Second NIPT result: fetal concentration 8.1%; T21, male fetus;
Amniotic fluid CNVseq result: negative, female fetus;
cdSNP consistency analysis was performed on the raw data of the above NIPT and CNVseq results:
2.1 Obtain the raw files EP100342 and EM100872D of the NIPT and CNVseq off-machine data; use the raw data of the unrelated sample EJ042423 as an external reference; and use the results of two libraries built from the same specimen (EP100190_IonXpress_016, EP100190_IonXpress_025) and the raw data of a monozygotic twin sample pair (EM004201F, EM004202F) as controls;
2.2 Align the above raw files to the human reference genome hg19;
2.3 Filter out unaligned reads, low-quality reads, and reads aligned to multiple positions;
2.4 Retain the uniquely aligned reads;
2.5 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
2.6 Calculate the consistency values (CR values) between samples;
2.7 Judge the identity result between samples according to the CR value.
The analysis results were as follows:
Conclusion:
Analysis of the two samples EM100872D and EP100342 gives a consistency value of 0.68, which matches the expected value for two samples from first-degree relatives. The noninvasive test therefore did not involve a sample mix-up, and the Y signal and the trisomy 21 signal may have come from the miscarriage.
To verify the accuracy of the invention, the maternal blood leukocytes were recalled for STR verification. The maternal sample of EM100872D is GEM100872B, and the maternal blood leukocyte sample of the noninvasive sample EP100342 is ES100003B. The STR typing of GEM100872B and ES100003B was consistent, and both showed a first-degree relationship with EM100872D. This confirms that the samples were not confused and that the Y signal and the trisomy 21 signal came from the miscarriage.
Example 3
Identity identification of a false-negative sample:
Sample EP007057 was a twin pregnancy, and the noninvasive test result showed no abnormality, 46,XY. Pregnancy outcome: intrauterine death of both twins (labor induced, two male infants). Fetal tissue CMA result: 47,XXY. It was uncertain whether the sample had been confused, and a sample uniqueness analysis of the maternal leukocyte sample EP007057R and the noninvasive sample EP007057 was requested;
The procedure was the same as in Example 2;
The analysis results were as follows:
File1 | File2 | cdSNP consistency value (CR value) | Identity result |
EP007057R | EJ042423 | 0.591 | Two independent samples, external reference |
EP007057 | EJ042423 | 0.593 | Two independent samples, external reference |
EP007057R | EP007057 | 0.759 | Identical sample |
Conclusion:
Analysis of the two samples EP007057 and EP007057R gives a consistency value of 0.759, which matches the expected value for the same sample. The noninvasive test therefore did not involve a sample mix-up, and the karyotype discrepancy may be caused by placental mosaicism.
Example 4
Periodic quality control in the laboratory: because the method requires no additional experimental steps or cost and is convenient and fast, the laboratory has added a step that performs sample identity analysis on the results of samples tested twice, to check for possible sample mix-ups. Identity analysis is also performed regularly on samples carrying the same label in different runs, to check for possible cross-contamination and to intercept quality incidents effectively.
Identity analysis steps for retested samples:
4.1 Align the raw files to the human reference genome hg19;
4.2 Filter out unaligned reads, low-quality reads, and reads aligned to multiple positions;
4.3 Retain the uniquely aligned reads;
4.4 Obtain the cdSNP site list in combination with the human high-frequency heterozygous SNP sites;
4.5 Calculate the consistency values (CR values) between samples;
The above steps have been packaged into an automated analysis workflow, the cdSNP analysis plug-in;
4.6 When the system recognizes that a retest result exists, it obtains the raw data files and starts the cdSNP analysis plug-in;
4.7 When the CR value is <0.616, the system automatically prompts: the two samples are unrelated; check for a possible sample mix-up;
When the CR value is >=0.672 and <0.725, the system automatically prompts: a certain kinship exists; check for possible contamination;
When the CR value is >=0.753, the system prompts: same sample; quality control passed;
The two results of the reworked samples were analyzed as follows:
Conclusion:
In this batch, the retested sample EP005415 had a CR value of 0.707 (>=0.672 and <0.725), so the system automatically prompted: a certain kinship exists; check for possible contamination. The laboratory was required to investigate and correct the cause of the contamination. The CR values of the other samples were >0.753, so the system prompted: same sample; quality control passed.
Without changing the existing NIPT and CNV-seq experimental protocols or sequencing volume, the invention exploits the fact that some population high-frequency heterozygous sites are covered by one read in each of two samples at the same time, and on that basis develops a complete method for identifying sample identity based on low-depth sequencing.
The invention can analyze the raw sequencing data files already backed up by the testing institution and can determine whether different samples share an identity without changing the experimental protocol or increasing the sequencing volume or testing cost. This makes it easier for a laboratory to control the quality of its testing workflow, helps investigate sample contamination in clinical practice, and supports sample tracing when a mix-up is suspected or a serious quality incident (false positive/false negative) occurs.
The invention can identify sample uniqueness quickly and economically and can effectively check for sample mix-ups and contamination. It works directly on the raw sequence data of low-depth whole-genome sequencing tests based on next-generation sequencing, such as the BAM and FASTQ files from NIPT and CNV-seq sequencing: the data are aligned and analyzed, and the concordance value of the population polymorphic sites (SNPs) covered by one sequence in each of any two different samples is calculated to determine whether the two tested samples are the same sample. For tests based on low-depth whole-genome sequencing, the method is suited to periodic laboratory quality control, checking for sample contamination, investigating suspected sample mix-ups, and tracing samples in false-positive/false-negative analyses.
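The concordance calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: each sample is assumed to have already been reduced to a mapping from an SNP site, written as a `(chromosome, position)` tuple, to the base observed on the single read covering that site, and CR is assumed to be the fraction of co-detected sites (cdSNPs) at which the two bases agree. The function name `concordance_rate` is hypothetical.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position); illustrative representation


def concordance_rate(calls_a: Dict[Site, str],
                     calls_b: Dict[Site, str]) -> Tuple[float, int]:
    """Return (CR, number of cdSNPs) for two samples.

    calls_a and calls_b map each panel SNP site covered by one read
    to the base observed on that read (an assumed input format).
    """
    # cdSNPs: sites detected in both samples.
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("no co-detected SNP sites between the two samples")
    # Count the cdSNPs at which both samples show the same base.
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return matches / len(shared), len(shared)
```

Returning the cdSNP count alongside CR lets a caller judge how much evidence supports the rate, since at low depth the number of co-detected sites can vary considerably between sample pairs.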
The foregoing describes only preferred embodiments of the invention; any minor modification, equivalent change or refinement made to the above embodiments in accordance with the technical solution of the invention falls within the scope of that technical solution.
Claims (2)
1. A method for identifying sample identity based on low depth sequencing, comprising the steps of:
S1, sequence alignment: the use of high-throughput sequencing data is the basis of the NIPT and CNV-seq low-depth sequencing tests; the step comprises selecting BWA alignment software and aligning the raw sequence data FASTQ files obtained from a semiconductor sequencer to the human genome reference sequence, version GRCh37/hg19, to obtain aligned SAM files, wherein the BWA alignment software comprises BWA-0.7.17 and BWA-MEM;
S2, sequence filtering: the aligned SAM files are filtered to remove sequences that would produce erroneous base calls, namely unaligned sequences, sequences with low alignment quality (MAPQ < 40) and multi-mapped sequences, to obtain valid sequencing data;
S3, selecting a population high-frequency heterozygous SNP site dataset: the human SNP data files, version 151, are downloaded from the database ftp://ftp.ncbi.nih.gov/snp, and sites at which the SNP has only two genotypes and the minor allele frequency (MAF) is not lower than 0.3 are selected as the population high-frequency heterozygous SNP site dataset;
S4, obtaining the cdSNP site list: for the alignment results in each SAM file, the bases at sites hitting the population high-frequency heterozygous SNP site dataset are tallied, and then, for every pair of files, the base information of the sites covered by one sequence in both files is obtained, these sites being co-detected SNPs, abbreviated cdSNPs;
S5, calculating the CR value between samples: the concordance rate of the base information at the cdSNPs is calculated, the concordance rate being abbreviated CR;
S6, sample identity analysis: according to the CR value, the following classification is made: 1) when CR < 0.616, the two samples are judged to have no significant relationship; 2) when CR > 0.672 and CR < 0.725, the two samples are judged to be related by kinship; 3) when CR > 0.753, the two samples are judged to come from the same individual; 4) when the CR value meets none of the above conditions, the cause is laboratory contamination or a high fetal DNA concentration in a noninvasive prenatal test sample.
2. The method for identifying sample identity based on low-depth sequencing according to claim 1, wherein the method is obtained by exploiting the fact that some population high-frequency heterozygous sites are covered by one read in each of two samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110723066.3A CN113450871B (en) | 2021-06-28 | 2021-06-28 | Method for identifying sample identity based on low-depth sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113450871A CN113450871A (en) | 2021-09-28 |
CN113450871B CN113450871B (en) | 2024-06-11 |
Family
ID=77813557
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110723066.3A Active CN113450871B (en) | 2021-06-28 | 2021-06-28 | Method for identifying sample identity based on low-depth sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113450871B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113999900B (en) * | 2021-10-14 | 2024-02-20 | 武汉蓝沙医学检验实验室有限公司 | Method for evaluating fetal DNA concentration by using free DNA of pregnant woman and application |
CN113969310B (en) * | 2021-10-14 | 2024-02-20 | 武汉蓝沙医学检验实验室有限公司 | Fetal DNA concentration evaluation method and application |
CN114530200B (en) * | 2022-03-18 | 2022-09-23 | 北京阅微基因技术股份有限公司 | Mixed sample identification method based on calculation of SNP entropy |
CN115810393B (en) * | 2022-12-22 | 2023-08-25 | 南京普恩瑞生物科技有限公司 | Sequencing sample homology detection method and system based on SNPs library of construction crowd |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104946773A (en) * | 2015-07-06 | 2015-09-30 | 厦门万基生物科技有限公司 | Method for judging antenatal parental right relation with SNP |
CN109461473A (en) * | 2018-09-30 | 2019-03-12 | 北京优迅医疗器械有限公司 | Fetus dissociative DNA concentration acquisition methods and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3987525A1 (en) * | 2019-06-21 | 2022-04-27 | CooperSurgical, Inc. | System and method for determining genetic relationships between a sperm provider, oocyte provider, and the respective conceptus |
CN112885408B (en) * | 2021-02-22 | 2024-10-01 | 中国农业大学 | Method and device for detecting SNP marker loci based on low-depth sequencing |
Also Published As
Publication number | Publication date |
---|---|
CN113450871A (en) | 2021-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113450871B (en) | Method for identifying sample identity based on low-depth sequencing | |
CN109887548B (en) | ctDNA ratio detection method and detection device based on capture sequencing | |
CN108604258B (en) | Chromosome abnormality determination method | |
CN106778073B (en) | A kind of method and system of assessment tumor load variation | |
EA017966B1 (en) | Diagnosing fetal chromosomal aneuploidy using genomic sequencing | |
CN111091868B (en) | Method and system for analyzing chromosome aneuploidy | |
CN108920899A (en) | A kind of single exon copy number variation prediction technique based on target area sequencing | |
CN110021346B (en) | Gene fusion and mutation detection method and system based on RNAseq data | |
CN107949845A (en) | The new method of sex of foetus and fetus sex chromosomal abnormality can be distinguished on multiple platforms | |
US20210090687A1 (en) | Methods of quality control using single-nucleotide polymorphisms in pre-implantation genetic screening | |
CN104846089A (en) | Quantitative method for free fetal DNA (deoxyribonucleic acid) proportion in maternal peripheral blood | |
CN110592208B (en) | Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof | |
CN113593644A (en) | Method for detecting chromosome uniparental disomy by low-depth sequencing based on family | |
US20230111097A1 (en) | Array-based methods for analysing mixed samples using different allele-specific labels, in particular for detection of fetal aneuploidies | |
CN106778069B (en) | Method and apparatus for determining microdeletion microreplication in fetal chromosomes | |
CN109461473B (en) | Method and device for acquiring concentration of free DNA of fetus | |
US7912652B2 (en) | System and method for mutation detection and identification using mixed-base frequencies | |
CN111944807B (en) | Human sequencing sample tracking marker, and monitoring method and monitoring device for human sequencing sample cross contamination | |
CN116994649A (en) | Intelligent judging method and intelligent judging system for gene detection data | |
CN114171116A (en) | Method for evaluating fetal DNA concentration by free and self DNA of pregnant woman and application | |
CN114093428B (en) | System and method for detecting low-abundance mutation under ctDNA ultrahigh sequencing depth | |
KR102519739B1 (en) | Non-invasive prenatal testing method and devices based on double Z-score | |
CN114093417B (en) | Method and device for identifying chromosomal arm heterozygosity loss | |
CN117980504A (en) | Genetic analysis method capable of performing two or more tests | |
CN113969310A (en) | Fetal DNA concentration evaluation method and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||