CN113450871B

CN113450871B - Method for identifying sample identity based on low-depth sequencing

Info

Publication number: CN113450871B
Application number: CN202110723066.3A
Authority: CN
Inventors: 陈样宜; 刘燕霞; 黄楷胜; 刘远如; 焦伟刚
Original assignee: Guangdong Boao Medical Laboratory Co ltd
Current assignee: Guangdong Boao Medical Laboratory Co ltd
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2024-06-11
Anticipated expiration: 2041-06-28
Also published as: CN113450871A

Abstract

The invention discloses a method for identifying sample identity based on low-depth sequencing. It exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples. The method comprises the following steps: S1, sequence alignment; S2, sequence filtering; S3, selecting a data set of population high-frequency heterozygous SNP sites; S4, obtaining the cdSNP site list; S5, computing the CR value between samples; S6, analyzing sample identity and drawing a conclusion. The invention identifies sample identity by analyzing the raw data files of low-depth whole-genome sequencing, without changing the experimental protocol or increasing the sequencing amount. It offers low detection cost and short analysis time, and can be used for noninvasive fetal paternity testing with maternal peripheral blood.

Description

Method for identifying sample identity based on low-depth sequencing

Technical Field

The invention relates to the technical field of prenatal diagnosis molecular genetics detection, in particular to a method for identifying sample identity based on low-depth sequencing.

Background

At present, identification of human DNA sample identity in forensic science, for example in individual identification and paternity testing, relies mainly on comparative analysis of specific short tandem repeats (STR) as biomarkers. With the development of gene chip technology and next-generation high-throughput sequencing, comparative analysis using single nucleotide polymorphisms (SNP) as biomarkers has also come into use.

STRs, also known as microsatellite DNA, are a class of polymorphic DNA loci that are widespread in the human genome. They generally consist of a core sequence of 2-6 bases arranged in tandem repeats, and variation in the number of core sequence repeats produces length polymorphism. The number of repeats at a particular chromosomal location is fixed for a given individual but may differ between individuals at the same location, which constitutes the polymorphism of these repeat sequences in the population. Because the human genome contains a large number of such repeats, individuals can be clearly distinguished by detecting these polymorphisms. Owing to their high sensitivity, high discriminating power, and ease of standardization and automated typing, STRs are widely used in forensic individual identification and paternity testing.

For paternity testing, the STR method requires separate samples from the child, the father and the mother, and judges whether a parent-child relationship exists according to whether their STR results follow the rules of inheritance. Because accurate sampling requires the child to be an independent individual, the method has limitations for noninvasive fetal paternity testing.

SNPs are DNA sequence polymorphisms at the genomic level caused by single nucleotide variations. They are among the most common heritable variants in humans, accounting for over 90% of all known polymorphisms. SNPs are widespread in the human genome, occurring on average once every 500-1000 base pairs, with an estimated total of 3 million or more. Chip assays or high-depth sequencing that use selected SNP sites as markers can identify individuals and perform paternity testing stably and accurately, can even analyze contaminated samples with a low proportion of admixture, and can be used for noninvasive fetal paternity testing with maternal peripheral blood. However, this technology relies on accurate typing of specific SNP sites for comparison, and suffers from high detection cost and long analysis time.

Identity identification by STR and SNP techniques can generally only compare the typing at specific sites between a retained sample and a newly collected sample, in order to determine whether the sample backed up in the laboratory at the time is identical to the sample under review. For fragmented high-throughput sequencing DNA libraries and their backed-up data, however, accurate typing cannot be obtained because coverage of the specific sites or library capacity is insufficient, so an effective means of analysis is lacking. In prenatal screening and diagnosis, NIPT and CNV-seq are widely used tests based on low-depth whole-genome sequencing. Identifying the identity of prenatal diagnosis samples generally requires separate STR or SNP analysis, which adds testing steps and operating cost and cannot be used for quality control of the testing workflow or for tracing back to the original results.

As more and more molecular tests based on high-throughput sequencing are introduced in clinical laboratories, retrospective analysis of laboratory test data faces three major difficulties:

1. The experimental process cannot be traced back; contamination or sample mix-ups often arise from unnoticed operating errors during the experiment.

2. Sample mix-ups and contamination: contamination at the source or mix-ups during testing may only be discovered later when results prove problematic, and if the laboratory cannot repeat the test on the sample, a serious quality incident results.

3. Insufficient retained sample prevents tracing, for example when the sample has degraded or too little plasma remains; long storage or storage problems degrade the DNA; and because plasma cell-free DNA fragments are too short, the effective library capacity is insufficient and STR and SNP methods obtain too little information.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a method for identifying sample identity based on low-depth sequencing, which can identify sample identity by analyzing a low-depth whole genome sequencing original data file without changing an experimental scheme or increasing sequencing quantity, and has the advantages of low detection cost, short analysis time and capability of carrying out noninvasive fetal paternity test by utilizing maternal peripheral blood.

In order to solve the technical problems, the technical scheme of the invention is as follows: a method for identifying sample identity based on low depth sequencing, comprising the steps of:

S1, sequence alignment;

S2, sequence filtering;

S3, selecting a data set of population high-frequency heterozygous SNP sites;

S4, obtaining the cdSNP site list;

S5, computing the CR value between samples;

S6, analyzing sample identity and drawing a conclusion.

As a further illustration of the present invention,

Preferably, the sequence alignment in step S1 uses the high-throughput sequencing data generated by low-depth sequencing tests such as NIPT and CNV-seq. BWA alignment software (BWA-0.7.17, BWA-MEM) is used to align the raw sequence data (FASTQ files) produced by a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.

Preferably, the sequence filtering in step S2 filters the aligned SAM files to remove sequences that may cause erroneous base identification, namely unaligned reads (unmapped), reads with low alignment quality (MAPQ < 40) and reads with multiple alignments, yielding valid sequencing data.

Preferably, selecting the data set of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data files (version 151) from the database ftp:// ftp. and selecting sites that have only two allele types and a minor allele frequency (MAF) of not less than 0.3 as the data set of population high-frequency heterozygous SNP sites.

Preferably, obtaining the cdSNP site list in step S4 comprises counting, in the alignment result of each file, the bases observed at sites of the population high-frequency heterozygous SNP data set, and then, for each pair of files, extracting the base information of the sites covered by exactly one read in both files (co-detected SNPs, cdSNPs).

Preferably, computing the CR value between samples in step S5 comprises calculating the concordance rate (CR) of the base information at the cdSNP sites.

Preferably, the sample identity analysis in step S6, based on the CR value, yields the following classification decision:

1) When CR <0.616, it is determined that there is no significant relationship;

2) When CR >0.672 and CR <0.725, the two samples are judged to be related;

3) When CR >0.753, the same individual relationship is determined;

4) When the CR value falls in none of the above ranges, the sample may have been contaminated in the laboratory, or it may be a noninvasive test sample with a high fetal DNA concentration.

Preferably, the invention exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.

The beneficial effects of the invention are as follows:

1. The invention analyzes the raw sequencing data files already backed up by the testing institution and determines whether different samples share identity without changing the experimental protocol or increasing the sequencing amount or detection cost. This makes it convenient for a laboratory to control the quality of the testing workflow, and helps to investigate sample contamination clinically and to trace samples when a sample mix-up is suspected or a serious quality incident (false positive/false negative) occurs.

2. The invention judges sample identity by comparing only the concordance of SNP sites that are detected in both sequencing data sets at a depth of 1. Both the binomial distribution model and practice show that, even at an extremely low coverage depth of 0.05X, a number of SNP sites are still covered by exactly one read in both samples (co-detected SNPs, cdSNPs), and that cdSNP concordance analysis can be performed accurately by selecting population high-frequency heterozygous SNP sites.
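
The binomial argument can be illustrated with a minimal sketch. The read length of 150 bases, the genome length of 3.1 billion bases and the count of one million candidate sites are illustrative assumptions rather than values from the patent, and reads are assumed to fall uniformly and independently on the genome.

```python
def prob_exactly_one_read(depth: float, read_len: int = 150,
                          genome_len: int = 3_100_000_000) -> float:
    """Binomial probability that a given site is covered by exactly one read.

    read_len and genome_len are illustrative assumptions, not patent values.
    """
    n_reads = round(depth * genome_len / read_len)  # reads needed to reach this depth
    p_cover = read_len / genome_len                 # chance that one read covers the site
    # Binomial P(k = 1) = n * p * (1 - p)^(n - 1)
    return n_reads * p_cover * (1.0 - p_cover) ** (n_reads - 1)


def expected_cdsnp_count(depth_a: float, depth_b: float, n_sites: int) -> float:
    """Expected number of sites covered by exactly one read in both samples."""
    return n_sites * prob_exactly_one_read(depth_a) * prob_exactly_one_read(depth_b)


if __name__ == "__main__":
    p1 = prob_exactly_one_read(0.05)
    print(f"P(exactly one read at 0.05X) = {p1:.4f}")
    # n_sites = 1_000_000 is a hypothetical size for the SNP data set
    print(f"Expected cdSNPs = {expected_cdsnp_count(0.05, 0.05, 1_000_000):.0f}")
```

Under these assumptions only a small fraction of candidate sites becomes a cdSNP for any pair of samples, which is why a large candidate set of high-frequency SNPs matters.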

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following specific embodiments.

The invention discloses a method for identifying sample identity based on low-depth sequencing, which comprises the following steps:

S1, sequence alignment;

S2, sequence filtering;

S3, selecting a data set of population high-frequency heterozygous SNP sites;

S4, obtaining the cdSNP site list;

S5, computing the CR value between samples;

S6, analyzing sample identity and drawing a conclusion.

Further, the sequence alignment in step S1 uses the high-throughput sequencing data generated by low-depth sequencing tests such as NIPT and CNV-seq. BWA alignment software (BWA-0.7.17, BWA-MEM) is used to align the raw sequence data (FASTQ files) produced by a semiconductor sequencer against a human genome reference sequence (such as GRCh37/hg19) to obtain aligned SAM files.
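
As a hedged illustration of step S1, the sketch below only assembles a `bwa mem` command line. The file names and the thread count are hypothetical, and the patent does not state which options were used.

```python
import shlex


def bwa_mem_command(reference_fasta: str, fastq_path: str, threads: int = 4) -> list:
    """Build the argument list for `bwa mem`; BWA writes SAM records to stdout."""
    return ["bwa", "mem", "-t", str(threads), reference_fasta, fastq_path]


if __name__ == "__main__":
    # hg19.fa and sample.fastq are placeholder file names
    cmd = bwa_mem_command("hg19.fa", "sample.fastq")
    print(shlex.join(cmd))
    # To execute (requires bwa and a reference indexed with `bwa index hg19.fa`):
    # import subprocess
    # with open("sample.sam", "w") as out:
    #     subprocess.run(cmd, stdout=out, check=True)
```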

Further, the sequence filtering in step S2 filters the aligned SAM files to remove sequences that may cause erroneous base identification, namely unaligned reads (unmapped), reads with low alignment quality (MAPQ < 40) and reads with multiple alignments, yielding valid sequencing data.
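
The filtering rule can be sketched on raw SAM text lines as follows. Treating secondary and supplementary records, and reads carrying BWA's `XA:Z` alternative-hit tag, as "multiple alignments" is an assumption of this sketch; the patent does not say how multiple alignments were detected.

```python
MIN_MAPQ = 40
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def keep_alignment(sam_line: str, min_mapq: int = MIN_MAPQ) -> bool:
    """Return True for header lines and for uniquely aligned, high-quality reads."""
    if sam_line.startswith("@"):
        return True  # header lines pass through unchanged
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # SAM column 2: bitwise FLAG
    mapq = int(fields[4])   # SAM column 5: mapping quality
    if flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
        return False
    if mapq < min_mapq:
        return False
    # Assumption: an XA:Z tag (alternative hits) marks a multiply aligned read
    if any(tag.startswith("XA:Z:") for tag in fields[11:]):
        return False
    return True


def filter_sam(lines):
    """Keep only the lines accepted by keep_alignment."""
    return [line for line in lines if keep_alignment(line)]
```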

Further, selecting the data set of population high-frequency heterozygous SNP sites in step S3 comprises downloading the human SNP data files (version 151) from the database ftp:// ftp. and selecting sites that have only two allele types and a minor allele frequency (MAF) of not less than 0.3 as the data set of population high-frequency heterozygous SNP sites.
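
A minimal sketch of the site selection follows. `SnpRecord` is a hypothetical, simplified record type and not the layout of the downloaded data files; the frequencies are assumed to have been parsed already.

```python
from typing import List, NamedTuple, Tuple


class SnpRecord(NamedTuple):
    """Hypothetical simplified SNP record (not the database file format)."""
    chrom: str
    pos: int
    alleles: Tuple[str, ...]   # all observed alleles
    freqs: Tuple[float, ...]   # population frequency of each allele


def is_high_frequency_biallelic(rec: SnpRecord, min_maf: float = 0.3) -> bool:
    """True when the site has exactly two alleles and MAF >= min_maf."""
    return len(rec.alleles) == 2 and min(rec.freqs) >= min_maf


def select_high_frequency_snps(records: List[SnpRecord],
                               min_maf: float = 0.3) -> List[SnpRecord]:
    """Keep the sites that form the high-frequency heterozygous SNP data set."""
    return [r for r in records if is_high_frequency_biallelic(r, min_maf)]
```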

Further, obtaining the cdSNP site list in step S4 comprises counting, in the alignment result of each file, the bases observed at sites of the population high-frequency heterozygous SNP data set, and then, for each pair of files, extracting the base information of the sites covered by exactly one read in both files (co-detected SNPs, cdSNPs).
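
The cdSNP extraction can be sketched as a set intersection. The input layout, a dictionary from `(chromosome, position)` to the list of bases observed at that site, is an assumption of this sketch.

```python
def single_read_sites(pileup: dict) -> dict:
    """Keep sites covered by exactly one read; map each site to its observed base."""
    return {site: bases[0] for site, bases in pileup.items() if len(bases) == 1}


def co_detected_snps(pileup_a: dict, pileup_b: dict) -> dict:
    """Return {site: (base_in_a, base_in_b)} for sites with depth 1 in both samples."""
    one_a = single_read_sites(pileup_a)
    one_b = single_read_sites(pileup_b)
    shared = sorted(one_a.keys() & one_b.keys())
    return {site: (one_a[site], one_b[site]) for site in shared}
```

Sites with two or more reads in either sample are dropped rather than down-sampled, which mirrors the one-read requirement in the text.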

Further, computing the CR value between samples in step S5 comprises calculating the concordance rate (CR) of the base information at the cdSNP sites.

Preferably, the sample identity analysis in step S6 uses known CR reference ranges, calculated from 50 unrelated samples, 30 samples each tested twice, and 20 relatives, to determine which class the CR value of the two sequencing data sets falls into.
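
Read as a plain proportion, the concordance rate can be sketched as below. The input layout, a dictionary from site to the pair of bases seen in the two samples, is an assumption.

```python
def concordance_rate(cdsnps: dict) -> float:
    """Fraction of cdSNP sites at which the two samples show the same base."""
    if not cdsnps:
        raise ValueError("no cdSNP sites to compare")
    matches = sum(1 for base_a, base_b in cdsnps.values() if base_a == base_b)
    return matches / len(cdsnps)
```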

Sample relationship type	CR value	Standard deviation (SD) of CR	CR - 1.96 × SD	CR + 1.96 × SD
Unrelated samples	0.603	0.006	0.591	0.616
Identical sample	0.778	0.013	0.753	0.803
Parent-child relationship	0.698	0.014	0.672	0.725

Based on the CR values in the above table, the following classification decisions are derived:

1) When CR <0.616, it is determined that there is no significant relationship;

2) When CR >0.672 and CR <0.725, the two samples are judged to be related;

3) When CR >0.753, the same individual relationship is determined;
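
The decision rules above can be sketched as a small function. The label `"indeterminate"` for values outside all three ranges is a name chosen for this sketch; elsewhere the document attributes such values to possible laboratory contamination or a high fetal DNA concentration.

```python
def classify_cr(cr: float) -> str:
    """Map a CR value to a relationship class using the thresholds listed above."""
    if cr < 0.616:
        return "unrelated"
    if 0.672 < cr < 0.725:
        return "related"
    if cr > 0.753:
        return "same individual"
    # Hypothetical label for CR values that fall between the reference ranges
    return "indeterminate"
```

For instance, the CR values 0.591, 0.68 and 0.759 reported in the examples fall into the unrelated, related and same-individual classes respectively.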

Furthermore, the invention exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.

The theoretical basis of the invention is as follows: the base concordance at cdSNP sites between two low-depth sequencing runs of the same sample differs markedly from that between two unrelated samples, so whether two data sets derive from the same sample can be determined simply by calculating the concordance values between the sample to be analyzed and a group of samples. At very low sequencing depth, the genotype at each SNP site is unknown; however, theoretical derivation shows that, even at a sequencing depth of 1, the base concordance of high-frequency heterozygous SNPs observed across the whole genome differs between sample relationships.

Suppose a cdSNP is obtained from the raw data of two low-depth whole-genome sequencing runs, and that its two alleles, A and B, have population frequencies p and q, respectively. If the two data sets come from different samples, the probability that the bases at this site, each covered by only one read, are identical is 1 - 2pq; if they come from the same sample, the probability is 1 - pq. For example, when the population heterozygosity frequency of a SNP site is 0.5 (that is, p = q = 0.5), the expected CR for different samples is 0.5, whereas the expected CR for the same sample is 0.75.

The expected cdSNP concordance for two unrelated samples is calculated as follows:

Sample 1 genotype	Sample 2 genotype	Expected cdSNP concordance at depth 1X
AA	AA	E = p⁴
BB	BB	E = q⁴
AB	AB	E = 2pq × 2pq × 0.5 = 2p²q²
AA	AB	E = p² × 2pq × 0.5 = p³q
BB	AB	E = q² × 2pq × 0.5 = pq³

For a SNP site with allele frequencies p and q, where p + q = 1, the concordance probability (CR) is obtained by summing the concordance values of the genotype combinations above; the AA/AB and BB/AB terms are counted twice because either sample may carry the heterozygous genotype:

$$CR = p^4 + q^4 + 2p^2q^2 + 2p^3q + 2pq^3 = p^3(p+q) + q^3(p+q) + p^2q(p+q) + pq^2(p+q) = p^2(p+q)^2 + q^2(p+q)^2 = p^2 + q^2 = 1 - 2pq$$

When p = 0.5, CR = 1 - 2 × 0.5 × 0.5 = 0.5.
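
The closed form 1 - 2pq can be checked numerically by enumerating every genotype pair. The sketch assumes Hardy-Weinberg genotype frequencies and that a single read from a heterozygote shows either allele with probability 0.5, as the table does.

```python
from itertools import product


def genotype_probs(p: float) -> dict:
    """Hardy-Weinberg genotype frequencies for allele frequencies p and q = 1 - p."""
    q = 1.0 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}


def prob_read_is_a(genotype: str) -> float:
    """Probability that a single read from this genotype shows allele A."""
    return genotype.count("A") / 2


def expected_cr_unrelated(p: float) -> float:
    """Expected concordance of one read from each of two unrelated individuals."""
    geno = genotype_probs(p)
    total = 0.0
    for (g1, w1), (g2, w2) in product(geno.items(), repeat=2):
        a1, a2 = prob_read_is_a(g1), prob_read_is_a(g2)
        match = a1 * a2 + (1 - a1) * (1 - a2)  # both reads A, or both reads B
        total += w1 * w2 * match
    return total
```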

The expected cdSNP concordance for two tests of the same individual is calculated as follows:

Detection 1	Detection 2	Expected cdSNP concordance at depth 1X
AA	AA	E = p² (E = 0.25 when p = 0.5)
BB	BB	E = q² (E = 0.25 when p = 0.5)
AB	AB	E = 2pq × 0.5 = pq (E = 0.25 when p = 0.5)

For a SNP site with population allele frequencies p and q, the CR value is then the sum of the probabilities for the genotypes above:

$$CR = p^2 + q^2 + pq = 1 - pq$$

When p = 0.5, CR = 1 - 0.5 × 0.5 = 0.75.
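
The same-individual case can be checked in the same way. The sketch assumes Hardy-Weinberg genotype frequencies and that the two reads are drawn independently from one genotype.

```python
def expected_cr_same_individual(p: float) -> float:
    """Expected concordance of two single reads taken from the same individual."""
    q = 1.0 - p
    genotypes = {"AA": p * p, "AB": 2 * p * q, "BB": q * q}
    total = 0.0
    for genotype, weight in genotypes.items():
        frac_a = genotype.count("A") / 2          # chance a read shows allele A
        match = frac_a ** 2 + (1 - frac_a) ** 2   # both reads A, or both reads B
        total += weight * match
    return total
```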

The expected concordance at a site for the same sample exceeds that for unrelated samples by pq. Therefore, for a selected set of high-frequency SNPs, when the number of cdSNPs is sufficiently large, the CR values of the two types of sequencing data differ significantly, and the CR value can be used to distinguish them.
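
How large "sufficiently large" needs to be can be gauged with the standard error of a proportion. This sketch assumes that cdSNP sites behave as independent trials with a common match probability, which is a simplification and not a claim made in the patent.

```python
import math


def cr_standard_error(cr: float, n_cdsnps: int) -> float:
    """Binomial standard error of a concordance rate measured over n_cdsnps sites."""
    return math.sqrt(cr * (1.0 - cr) / n_cdsnps)


def separation_in_se(cr_same: float, cr_unrelated: float, n_cdsnps: int) -> float:
    """Gap between two expected CR values, in units of the larger standard error."""
    se = max(cr_standard_error(cr_same, n_cdsnps),
             cr_standard_error(cr_unrelated, n_cdsnps))
    return (cr_same - cr_unrelated) / se
```

The standard error shrinks with the square root of the number of cdSNPs, so the fixed gap of pq between the two expected values becomes easier to resolve as more sites are co-detected.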

The following are specific examples of the application of the present invention.

Example 1

Sample identity identification when the fetal concentration in an NIPT result is abnormally high and contamination or a sample mix-up is suspected:

An on-site laboratory found a sample whose first test showed a very high fetal concentration, exceeding 85%. The sample was tested three times in total, and the latter two tests both indicated a female fetus; technical support requested a sample uniqueness analysis.

The three NIPT results, by run, were as follows:

First NIPT result: abnormally high fetal concentration, male fetal signal

Second NIPT result: normal fetal concentration, female fetal signal

Third NIPT result: normal fetal concentration, female fetal signal

cdSNP concordance analysis was performed on the raw data of the three results:

1.1 obtaining the raw files 2702-IonXpress_042, 2702-IonXpress_040 and 2702-IonXpress_032 of the three sequencing runs, and using the raw data of the unrelated sample EJ042423 as an external reference;

1.2 aligning the above raw files to the human reference genome hg19;

1.3 filtering out unaligned, low-quality and multiply aligned sequences;

1.4 obtaining the uniquely aligned sequences;

1.5 obtaining the cdSNP site list using the human high-frequency heterozygous SNP sites;

1.6 computing the expected concordance value (CR value) between samples;

1.7 judging the identity result between samples according to the CR value.

The analysis results were as follows:

Conclusion:

Analysis of the three raw data sets of sample 2702 showed that IonXpress_040 and IonXpress_032 derive from the same sample, whereas IonXpress_042 was severely contaminated with a male genome, indicating that the laboratory should attend to sample cross-contamination.

Example 2

Identity identification when the review of a positive sample gives inconsistent results:

Follow-up feedback on a sample showed that NIPT had indicated trisomy 21 with a male fetus, whereas the amniotic fluid diagnosis was negative with a female fetus. Because the NIPT result disagreed with the amniotic fluid result in both finding and sex, the customer suspected a sample mix-up. On review, the auditor found that the clinical notes for the sample recorded a fetal reduction. To determine whether the discrepancy was due to the fetal reduction or to a sample mix-up, identity analysis of the noninvasive and amniotic fluid data was requested.

First NIPT result: fetal concentration 7.4%; T21, male fetus;

Second NIPT result: fetal concentration 8.1%; T21, male fetus;

Amniotic fluid CNV-seq result: negative, female fetus;

cdSNP concordance analysis was performed on the raw data of the above NIPT and CNV-seq results:

2.1 obtaining the raw files EP100342 and EM100872D of the NIPT and CNV-seq sequencing data; using the raw data of the unrelated sample EJ042423 as an external reference; and using the results of two libraries built from the same specimen (EP100190_IonXpress_016, EP100190_IonXpress_025) and the raw data of monozygotic twin samples (EM004201F, EM004202F) as controls;

2.2 aligning the above raw files to the human reference genome hg19;

2.3 filtering out unaligned, low-quality and multiply aligned sequences;

2.4 obtaining the uniquely aligned sequences;

2.5 obtaining the cdSNP site list using the human high-frequency heterozygous SNP sites;

2.6 computing the expected concordance value (CR value) between samples;

2.7 judging the identity result between samples according to the CR value.

The analysis results were as follows:

Conclusion:

Analysis of EM100872D and EP100342 gives a concordance rate of 0.68 between the two samples, which matches the expected value for a parent-child relationship. The noninvasive test sample was therefore not mixed up, and the Y signal and the trisomy 21 signal may originate from the miscarried (reduced) fetus.

To verify the accuracy of the invention, the maternal blood leukocyte samples were recalled for further STR verification (maternal sample of EM100872D: GEM100872B; maternal blood leukocyte sample of noninvasive sample EP100342: ES100003B). The STR typing of GEM100872B and ES100003B was consistent, and both showed a first-degree parent-child relationship with EM100872D. This confirms that the samples were not mixed up and that the Y signal and the trisomy 21 signal originate from the miscarried (reduced) fetus.

Example 3

Identity identification of a false-negative sample:

Sample EP007057 was a twin pregnancy whose noninvasive test result showed no abnormality (46,XY). Pregnancy outcome: intrauterine death of both twins (two male infants delivered by induced labor), with an embryonic tissue CMA result of 47,XXY. Because it was uncertain whether the samples had been mixed up, a sample uniqueness analysis was requested for the maternal leukocyte sample EP007057R and the noninvasive sample EP007057.

The procedure was the same as in Example 2.

The analysis results were as follows:

File1	File2	Expected cdSNP concordance (CR value)	Identity result
EP007057R	EJ042423	0.591	Two independent samples (external reference)
EP007057	EJ042423	0.593	Two independent samples (external reference)
EP007057R	EP007057	0.759	Identical sample

Conclusion:

Analysis of EP007057 and EP007057R gives a concordance rate of 0.759, which matches the expected value for the same sample. The noninvasive test sample was therefore not mixed up, and the inconsistency with the karyotype result may be caused by placental mosaicism.

Example 4

Periodic laboratory quality control: because the method requires no additional experimental steps or cost and is convenient and fast, the laboratory added a step that performs sample identity analysis on the results of retested samples to check for possible sample mix-ups. Identity analysis is also performed regularly on samples carrying the same label in different runs to check for possible cross-contamination, effectively intercepting quality incidents.

Identity analysis steps for retested samples:

4.1 aligning the raw files to the human reference genome hg19;

4.2 filtering out unaligned, low-quality and multiply aligned sequences;

4.3 obtaining the uniquely aligned sequences;

4.4 obtaining the cdSNP site list using the human high-frequency heterozygous SNP sites;

4.5 computing the expected concordance value (CR value) between samples;

The above steps have been packaged into an automated analysis pipeline, the cdSNP analysis plug-in;

4.6 when the system recognizes that a retest result exists, it obtains the raw data files and starts the cdSNP analysis plug-in;

4.7 when the CR value is < 0.616, the system automatically prompts: the two samples are unrelated; check for a possible sample mix-up;

when the CR value is >= 0.672 and < 0.725, the system automatically prompts: a parent-child relationship exists; check for possible contamination;

when the CR value is >= 0.753, the system prompts: same sample; quality control passed;

The two results of the reworked samples were analyzed as follows:

Conclusion:

For the retested sample EP005415 in this batch, the CR value was 0.707 (>= 0.672 and < 0.725), so the system automatically prompted: a parent-child relationship exists; check for possible contamination. The laboratory was required to investigate and correct the cause of the contamination. The CR values of the other samples were > 0.753, so the system prompted: same sample; quality control passed.

Without changing the existing NIPT and CNV-seq experimental protocols or sequencing amount, the invention exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples, extends the use of these sites, and develops a complete method for identifying sample identity based on low-depth sequencing.

The invention analyzes the raw sequencing data files already backed up by the testing institution and determines whether different samples share identity without changing the experimental protocol or increasing the sequencing amount or detection cost. This makes it convenient for a laboratory to control the quality of the testing workflow, and helps to investigate sample contamination clinically and to trace samples when a sample mix-up is suspected or a serious quality incident (false positive/false negative) occurs.

The invention identifies sample uniqueness rapidly and economically and effectively detects sample mix-ups and contamination. It uses the raw sequence data of low-depth whole-genome sequencing tests based on next-generation sequencing, such as the BAM and FASTQ files of NIPT and CNV-seq sequencing, which can be aligned and analyzed directly. Whether two tested samples are the same sample is determined by calculating the concordance value of population polymorphic sites (SNPs) covered by one read in each of any two samples. For tests based on low-depth whole-genome sequencing, the method supports periodic laboratory quality control, investigation of sample contamination, checks for possible sample mix-ups, and tracing of samples in false-positive/false-negative analyses.

In the foregoing, only the preferred embodiment of the present invention is described, and any minor modifications, equivalent changes and modifications made to the above embodiments according to the technical solutions of the present invention fall within the scope of the technical solutions of the present invention.

Claims

1. A method for identifying sample identity based on low depth sequencing, comprising the steps of:

S1, sequence alignment, which uses the high-throughput sequencing data generated by NIPT and CNV-seq low-depth sequencing tests, and comprises using BWA alignment software to align the raw sequence data (FASTQ files) obtained from a semiconductor sequencer against the human genome reference sequence GRCh37/hg19 to obtain aligned SAM files, wherein the BWA alignment software comprises BWA-0.7.17 and BWA-MEM;

S2, sequence filtering, namely filtering the aligned SAM files to remove sequences that cause erroneous base identification, that is, removing unaligned reads, reads with low alignment quality (MAPQ < 40) and reads with multiple alignments, to obtain valid sequencing data;

S3, selecting a data set of population high-frequency heterozygous SNP sites, namely downloading the human SNP data files, version 151, from the database ftp://ftp.ncbi.nih.gov/snp, and selecting sites that have only two allele types and a minor allele frequency (MAF) of not less than 0.3 as the data set of population high-frequency heterozygous SNP sites;

S4, obtaining the cdSNP site list, namely counting, in the alignment result of each SAM file, the bases observed at sites of the population high-frequency heterozygous SNP data set, and then, for each pair of files, obtaining the base information of the sites covered by exactly one read in both files, namely the co-detected SNPs, abbreviated cdSNPs;

S5, computing the CR value between samples, comprising calculating the concordance rate of the base information at the cdSNP sites, the concordance rate being abbreviated CR;

S6, sample identity analysis, in which the following classification decision is obtained according to the CR value: 1) When CR <0.616, it is determined that there is no significant relationship; 2) When CR >0.672 and CR <0.725, the two samples are judged to be related; 3) When CR >0.753, the same individual relationship is determined; 4) When the CR value falls in none of the above ranges, the sample has been contaminated in the laboratory or is a noninvasive test sample with a high fetal DNA concentration.

2. The method for identifying sample identity based on low-depth sequencing according to claim 1, wherein: the method exploits the fact that, between two samples, a number of population high-frequency heterozygous sites are each covered by exactly one read in both samples.