CN110504005A

CN110504005A - Data processing method

Info

Publication number: CN110504005A
Application number: CN201910795698.3A
Authority: CN
Inventors: 杨圳; 王文山
Original assignee: SHANGHAI GENMINIX INFORMATICS CO Ltd
Current assignee: SHANGHAI GENMINIX INFORMATICS CO Ltd
Priority date: 2019-08-27
Filing date: 2019-08-27
Publication date: 2019-11-26

Abstract

The present invention discloses a data processing method, and in particular a method for processing the raw sequencer output (off-machine data) produced by the ICB-scSeq technique. With this data processing method, cell sequencing can be carried out at low cost, simply, and rapidly, which brings considerable economic and safety benefits.

Description

Data processing method

Technical field

The present invention relates to a data processing method, and in particular to a method for processing the raw sequencer output (off-machine data) in the ICB-scSeq (Intelligent Combinatorial Barcoding-single cell Sequencing) technique.

Background art

Over the past ten years, the rapid development of next-generation sequencing (NGS) and third-generation sequencing (TGS) technologies has brought about enormous changes in life science. Earlier studies needed to obtain enough nucleic acid from a large number of cells for sequencing, so the sequencing results usually characterized the cell population, and the characteristics unique to individual cells were often overlooked. Single-cell sequencing technologies emerged to address this limitation.

Single-cell sequencing has produced rich results in fields such as oncology, developmental biology, and neuroscience. Single-cell research could in principle expand scientific output even faster, but single-cell sequencing technologies still have many problems: they require fresh cells, sample utilization is low, and the equipment and related reagents are expensive. These problems hinder the research and adoption of single-cell sequencing technologies and hold back the broad development of single-cell life science. Developing new, optimized single-cell sequencing technologies is therefore urgent.

ICB-scSeq (Intelligent Combinatorial Barcoding-single cell Sequencing) is a single-cell sequencing technology developed by the present inventors. It is based on the SPLIT (split-pool ligation-based transcriptome sequencing) technique and labels the cellular origin of RNA by combinatorial barcoding.

Therefore, given the above technical deficiencies, the ICB-scSeq sequencing approach also needs a better data processing method that leads from the raw sequencer output to downstream analysis, so that cell sequencing can be carried out at low cost, simply, and rapidly.

Summary of the invention

An object of the present invention is to overcome the deficiencies of the prior art and to provide a data processing method with which cell sequencing can be carried out at low cost, simply, and rapidly.

To achieve the above object, the present invention proposes the following technical solution: a data processing method, characterized by comprising:

A raw data acquisition step, in which paired-end sequencing is performed and raw data for intelligent combinatorial barcoding single-cell sequencing are acquired, the first end being the cDNA part and the second end being the unique molecular identifier (UMI) and cell barcode part;

A quality control and filtering step, in which the acquired raw data are filtered to obtain filtered data;

An alignment step, in which the filtered data are aligned to a reference genome sequence to obtain aligned data;

A UMI deduplication step, in which reads with duplicate unique molecular identifiers in the aligned data are removed to obtain deduplicated data;

A gene quantification step, in which gene quantification is performed on the deduplicated data to obtain quantified data;

An expression matrix construction step, in which an expression matrix is constructed from the quantified data, the expression matrix containing the raw count of each gene in each cell;

A cell screening step, in which the expression matrix is screened by mitochondrial content and number of expressed genes to obtain a screened matrix;

A normalization step, in which the raw counts of the screened matrix are normalized to obtain a normalized matrix;

An analysis step, in which the normalized matrix is analyzed.

The data processing method provided by the present invention allows cell sequencing to be carried out at low cost, simply, and rapidly, and brings considerable economic and safety benefits.

Brief description of the drawings

Fig. 1 is the schematic diagram of the data processing method of first embodiment of the invention.

Fig. 2 is the schematic diagram of gene order used in the data processing method of Fig. 1.

Fig. 3 is the schematic diagram of the duplicate removal process in the data processing method of Fig. 1.

Fig. 4 shows the result of the cluster analysis in the data processing method of Fig. 1.

Fig. 5 shows the result of the pathway enrichment analysis in the data processing method of Fig. 1.

Fig. 6 shows another result of the pathway enrichment analysis in the data processing method of Fig. 1.

Detailed description of the embodiments

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.

The first embodiment of the present invention is a data processing method.

Fig. 1 is a schematic diagram of the data processing method according to the first embodiment of the present invention. As shown in Fig. 1, the data processing method comprises: a raw data acquisition step, a quality control and filtering step, an alignment step, a UMI deduplication step, a gene quantification step, an expression matrix construction step, a cell screening step, a normalization step, and an analysis step.

In the raw data acquisition step, paired-end sequencing is performed and raw data for intelligent combinatorial barcoding single-cell sequencing (ICB-scSeq) are acquired. The first end, read1, is the cDNA part; the second end, read2, is the unique molecular identifier and cell barcode part (UMI + cell barcode). cDNA refers to DNA whose base sequence is complementary to that of an RNA strand. UMI stands for unique molecular identifier.

In the quality control and filtering step, the acquired raw data are filtered to obtain filtered data. In the present embodiment, by way of example, the quality control and filtering step comprises the following sub-steps: correcting the cell barcode part of the second end of the acquired raw data; constructing a white list of cell barcodes; extracting the sequences of the first end according to the white list; and screening the extracted first-end sequences to obtain the filtered data. The present invention is not limited to this, however, and the quality control and filtering step may also comprise other sub-steps.

Specifically, each read2 contains three cell barcode segments, BC1, BC2, and BC3, each 8 bp long (as shown in Fig. 2). The set of possible sequences for each barcode segment is fixed in advance. For example, if barcode1 uses 96 combinations, then barcode1 has only 96 possible sequences in total, each 8 bp long. Each read is therefore corrected on the principle of allowing a Hamming distance of 1.

In each read, the three 8 bp sequences at the BC1, BC2, and BC3 positions are extracted as candidate barcode sequences (labeled barcode1-new, barcode2-new, and barcode3-new). barcode1-new is then compared in turn with every sequence in the predetermined list of barcode1 sequences, and the Hamming distance, denoted hd, is calculated. If hd equals 0, no change is needed; if hd equals 1, the sequence of barcode1-new is changed to the corresponding barcode1 sequence. The same comparison applies to barcode2-new and barcode3-new, which completes the correction of the barcode sequences.
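The correction rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names and the example barcodes are hypothetical, and rejecting a candidate that matches no known barcode, or more than one, at distance 1 is an assumption, since the text does not say how such cases are handled.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(candidate: str, known: List[str]) -> Optional[str]:
    """Return the known barcode for `candidate`, or None if it cannot be corrected."""
    if candidate in known:  # hd == 0: keep the sequence unchanged
        return candidate
    # hd == 1: replace the candidate with the matching known barcode.
    hits = [k for k in known if hamming(candidate, k) == 1]
    # Assumption: unmatched or ambiguous candidates are rejected.
    return hits[0] if len(hits) == 1 else None


# Hypothetical 8 bp barcode list (a real list would hold e.g. 96 entries).
known_barcode1 = ["AACGTGAT", "AAACATCG", "ATGCCTAA"]
corrected = correct_barcode("AACGTGAA", known_barcode1)
```

A lookup table of all distance-1 neighbours would be faster for millions of reads; the linear scan is kept here only because it mirrors the description directly.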

After the cell barcodes have been corrected, the cell barcode sequences are merged, according to the estimated number of cells, into a unique identifier for each cell (cell UID), and a white list of cell barcodes is constructed. This white list contains the UIDs of all cells that can be identified.

Using the cell barcode white list built in the previous step, the cDNA sequences at the read1 end are extracted. For any read1 sequence, if the cell UID in its corresponding read2 is in the cell barcode white list, that read1 is extracted. The white list construction and sequence extraction can be handled with the open-source tool umi-tools.
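A minimal sketch of the white list and extraction logic, assuming the white list is simply the most frequent cell UIDs up to the estimated cell number. The text attributes this step to umi-tools; the code below does not call that tool, and the function names and the frequency-based selection are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def build_whitelist(cell_uids: Iterable[str], expected_cells: int) -> Set[str]:
    """Keep the `expected_cells` most frequently observed cell UIDs (assumed rule)."""
    counts = Counter(cell_uids)
    return {uid for uid, _ in counts.most_common(expected_cells)}


def extract_read1(pairs: Iterable[Tuple[str, str]], whitelist: Set[str]) -> List[Tuple[str, str]]:
    """Keep (read1_sequence, cell_uid) pairs whose cell UID is on the white list."""
    return [(read1, uid) for read1, uid in pairs if uid in whitelist]


# Hypothetical input: read1 sequence paired with the cell UID taken from read2.
pairs = [("ACGT", "c1"), ("GGTA", "c1"), ("TTTT", "c2"), ("CCCC", "c3"), ("AAAA", "c2")]
whitelist = build_whitelist((uid for _, uid in pairs), expected_cells=2)
kept = extract_read1(pairs, whitelist)
```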

After the read1 sequences have been extracted, they need to be screened further, mainly to remove the polyA structure at the end of the read (shown below) and the low-quality bases at both ends of the sequence. In the illustration below, the upper row is the original sequence and the lower row is the sequence after the terminal polyA structure and the low-quality bases at both ends have been removed.
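The screening of read1 can be sketched as below. The text gives no thresholds, so the minimum Phred quality of 20 and the minimum polyA run of 10 bases are hypothetical defaults, and the order of the two trimming passes is likewise an assumption.

```python
from typing import List, Tuple


def trim_read(seq: str, quals: List[int], min_q: int = 20, min_poly_a: int = 10) -> Tuple[str, List[int]]:
    """Trim low-quality bases from both ends, then a trailing polyA run."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:    # low-quality 5' end
        start += 1
    while end > start and quals[end - 1] < min_q:  # low-quality 3' end
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    stripped = seq.rstrip("A")
    # Only treat the trailing A run as polyA when it is long enough.
    if len(seq) - len(stripped) >= min_poly_a:
        seq, quals = stripped, quals[:len(stripped)]
    return seq, quals


# Hypothetical read: one bad base at each end and a 12-base polyA tail.
raw_seq = "N" + "ACGTACGT" + "A" * 12 + "C"
raw_quals = [2] + [30] * 20 + [5]
trimmed_seq, trimmed_quals = trim_read(raw_seq, raw_quals)
```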

In the alignment step, the read1 sequences obtained from the above screening are aligned to the reference genome sequence. The alignment can be performed with the alignment software STAR.

The alignment yields a sorted BAM file. Each aligned sequence is annotated according to the GTF file of the reference genome, that is, it is assigned to a gene. The purpose is to determine which gene each aligned sequence belongs to according to the GTF annotation. This assignment can be done with the open-source tool featureCounts.
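The idea of assigning an aligned read to a gene can be illustrated with a simple interval lookup. This sketch does not reproduce featureCounts: it assigns by a single alignment position, the gene records are hypothetical, and leaving reads that overlap several genes unassigned is an assumption made for the example.

```python
from typing import List, Optional, Tuple

# (gene_id, chromosome, start, end); coordinates are 1-based and inclusive, as in GTF.
Gene = Tuple[str, str, int, int]


def assign_gene(chrom: str, pos: int, genes: List[Gene]) -> Optional[str]:
    """Return the gene containing the position, or None if none or several match."""
    hits = [gid for gid, c, start, end in genes if c == chrom and start <= pos <= end]
    return hits[0] if len(hits) == 1 else None


# Hypothetical annotation with two overlapping genes on chr1.
genes = [
    ("GeneA", "chr1", 100, 500),
    ("GeneB", "chr1", 400, 900),
    ("GeneC", "chr2", 100, 500),
]
assigned = assign_gene("chr1", 150, genes)
```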

In the UMI deduplication step, the result of the previous step shows which gene each aligned read belongs to. To eliminate the PCR bias introduced during library construction, ICB-scSeq introduces a 10 bp UMI sequence into every sequence. If two identical sequences fall within the same gene and their 10 bp UMIs are also identical, the two reads are considered to come from the same cDNA molecule and must be deduplicated. As shown in Fig. 3, five reads are shown on the left; the upper three are duplicates of one another and the lower two are also duplicates of one another, so after deduplication only two reads remain on the right.
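The deduplication rule can be sketched as follows, using an example shaped like Fig. 3 (five reads collapsing to two). The record layout and names are hypothetical, and including the cell UID in the duplicate key is an assumption; the text itself only names the gene, the sequence, and the UMI.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, str, str]  # (cell_uid, gene_id, umi, sequence)


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read for each (cell, gene, UMI, sequence) combination."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Hypothetical reads: three copies of one molecule and two copies of another.
reads = [
    ("c1", "GeneA", "AAACCCGGGT", "ACGTACGT"),
    ("c1", "GeneA", "AAACCCGGGT", "ACGTACGT"),
    ("c1", "GeneA", "AAACCCGGGT", "ACGTACGT"),
    ("c1", "GeneA", "TTTGGGCCCA", "GGCATTCA"),
    ("c1", "GeneA", "TTTGGGCCCA", "GGCATTCA"),
]
unique_reads = deduplicate(reads)
```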

In the gene quantification step, gene quantification is performed on the deduplicated data to obtain quantified data.

In the expression matrix construction step, an expression matrix is constructed from the quantified data. The expression matrix contains the raw count of each gene in each cell. In this matrix, each column represents a cell UID and each row represents a gene ID, as shown in the table.
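A minimal sketch of building such a matrix from deduplicated reads, with genes as rows and cell UIDs as columns. The input layout and the function name are hypothetical; a real pipeline would typically use a sparse matrix rather than nested lists.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_expression_matrix(records: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[str], List[List[int]]]:
    """records holds one (cell_uid, gene_id) pair per deduplicated read.

    Returns (gene_ids, cell_uids, matrix), where matrix[i][j] is the raw count
    of gene i in cell j.
    """
    counts = Counter((gene, cell) for cell, gene in records)
    genes = sorted({gene for gene, _ in counts})
    cells = sorted({cell for _, cell in counts})
    # Counter returns 0 for (gene, cell) pairs that were never observed.
    matrix = [[counts[(gene, cell)] for cell in cells] for gene in genes]
    return genes, cells, matrix


records = [("c1", "G1"), ("c1", "G1"), ("c2", "G1"), ("c2", "G2")]
gene_ids, cell_uids, raw_counts = build_expression_matrix(records)
```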

In the cell screening step, the expression matrix is screened by mitochondrial content and by number of expressed genes to obtain a screened matrix. Specifically, for each cell in the expression matrix, the proportion of the total expression value contributed by mitochondrial genes is calculated; if this proportion exceeds a set threshold, the cell is screened out. The threshold is, for example, 5%, but it is not limited to this and may be set to another value. The number of expressed genes in each cell of the expression matrix also needs to be screened. A typical criterion is, for example, a minimum of 200 and a maximum of 2500 expressed genes, but it is not limited to this and may be set to another range. The screening can be performed with Seurat. After these two rounds of screening, a screened expression matrix is obtained and can be passed to the next step.
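The two screening criteria can be sketched in plain Python as below, using the example thresholds from the text (5%, 200 to 2500). This is not Seurat code. Identifying mitochondrial genes by an "MT-" name prefix and treating the bounds as inclusive are assumptions made for the illustration.

```python
from typing import Dict


def filter_cells(
    cells: Dict[str, Dict[str, int]],
    max_mito_fraction: float = 0.05,
    min_genes: int = 200,
    max_genes: int = 2500,
    mito_prefix: str = "MT-",  # assumed naming convention for mitochondrial genes
) -> Dict[str, Dict[str, int]]:
    """Keep cells that pass both the mitochondrial and the gene-number criteria."""
    kept = {}
    for uid, counts in cells.items():
        total = sum(counts.values())
        if total == 0:
            continue  # a cell with no counts cannot be evaluated
        mito = sum(n for gene, n in counts.items() if gene.startswith(mito_prefix))
        n_genes = sum(1 for n in counts.values() if n > 0)
        if mito / total <= max_mito_fraction and min_genes <= n_genes <= max_genes:
            kept[uid] = counts
    return kept


# Tiny hypothetical cells; the gene-number bounds are lowered to fit the toy data.
cells = {
    "ok": {"G1": 50, "G2": 48, "MT-CO1": 2},
    "high_mito": {"G1": 50, "G2": 30, "MT-CO1": 20},
    "few_genes": {"G1": 100},
}
screened = filter_cells(cells, min_genes=2, max_genes=3)
```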

In the normalization step, the raw counts of the screened expression matrix are normalized to obtain a normalized matrix. In single-cell sequencing, the number of reads measured per cell is not uniform, so the raw counts need to be normalized to eliminate the quantification error caused by differences in sequencing depth. The normalization can be performed with Seurat, and the normalization formula is as follows:

Here, CountOfGene is the raw count of each gene in each cell, and AllCount is the sum of the raw counts of all genes in each cell.
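The formula itself is not reproduced in this text. As a hedged sketch only, the code below assumes the default log-normalization used by Seurat, ln(1 + CountOfGene / AllCount × 10000); both the scale factor of 10000 and the use of this particular formula are assumptions, not statements about the patent's formula.

```python
import math
from typing import Dict


def log_normalize(counts: Dict[str, int], scale_factor: float = 10_000) -> Dict[str, float]:
    """Normalize one cell: ln(1 + count / total * scale_factor) for each gene."""
    all_count = sum(counts.values())  # AllCount: sum of raw counts in the cell
    if all_count == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: math.log1p(count_of_gene / all_count * scale_factor)
        for gene, count_of_gene in counts.items()
    }


normalized = log_normalize({"G1": 1, "G2": 3})
```

Dividing by AllCount is what removes the dependence on sequencing depth: two cells with the same relative expression but different total reads get the same normalized values.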

In the analysis step, the normalized matrix is analyzed. In the present embodiment, the analysis step is a cluster analysis step at the cell level, and a differential analysis step and a pathway enrichment analysis step at the gene level; however, it is not limited to these and may be any other suitable analysis step.

The cluster analysis method is as follows.

Feature extraction is carried out first, so that all measured single cells can be clustered. Highly variable features are calculated from the normalized expression matrix, and these features are extracted for the subsequent analysis.

The matrix data are then scaled. To eliminate as many sources of error as possible (including technical error, batch error, and some error of biological origin), regression is applied to the matrix data to exclude these errors and thereby improve the subsequent dimensionality reduction and clustering.

Linear dimensionality reduction is then carried out: PCA (principal component analysis) is applied to the scaled data.

Cluster grouping is then carried out. Based on the significant PCs (principal components) identified in the previous step, a graph-based clustering method is used. This method clusters iteratively using the constructed KNN (K-nearest neighbor) graph and the Louvain algorithm, and finally groups all cells into different clusters. The above analysis can be performed with Seurat.
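The graph construction that precedes Louvain clustering can be sketched as below. The sketch covers only the KNN graph, not the Louvain algorithm, and it is not Seurat code. Using Euclidean distance on principal-component coordinates and a brute-force neighbour search are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(points: Sequence[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """Map each cell index to the indices of its k nearest neighbours."""
    graph = {}
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Hypothetical PC coordinates for four cells forming two obvious groups.
pc_coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
neighbours = knn_graph(pc_coords, k=1)
```

The brute-force search is quadratic in the number of cells, which is fine for a toy example but would be replaced by an approximate nearest-neighbour index at realistic scale.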

Finally, a two-dimensional UMAP display is produced. As shown in Fig. 4, the clustering result of the previous step is displayed in two dimensions using the UMAP (uniform manifold approximation and projection) method. This display can be produced with Seurat.

The differential analysis method is as follows.

Based on the clustering result, differential gene screening is carried out for all clusters using the Wilcoxon rank sum test, which yields a list of differentially expressed genes for all clusters. The result is shown in the table.
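For one gene, the Wilcoxon rank sum test compares its expression in the cells of one cluster against the remaining cells. The sketch below is a plain-Python version using the normal approximation without tie or continuity correction; the function name is hypothetical, and the implementation in Seurat may differ in these details.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank sum test; returns (z statistic, p value)."""
    pooled = sorted((value, index) for index, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share the average 1-based rank
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum = sum(ranks[:n1])  # sum of ranks belonging to the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p


# Hypothetical expression values of one gene inside and outside a cluster.
z_stat, p_value = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```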

The pathway enrichment analysis method is as follows.

The first pathway enrichment analysis is GO (Gene Ontology) enrichment analysis. As shown in Fig. 5, GO enrichment analysis is performed on the differential genes of each cluster, based on the differential gene result of the previous step.

In addition to GO enrichment analysis, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analysis can also be performed. As shown in Fig. 6, it identifies the pathways in which the differential genes of each cluster are significantly enriched, and the result is displayed as a bubble chart.
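The text does not state which statistical test underlies the GO and KEGG enrichment. A common choice for this kind of over-representation analysis is the hypergeometric test, sketched below as an assumption; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p(n_universe: int, n_pathway: int, n_selected: int, n_overlap: int) -> float:
    """P(X >= n_overlap) when n_selected genes are drawn from n_universe genes,
    of which n_pathway belong to the pathway (hypergeometric upper tail)."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5 of 10 background genes are in the pathway, and all
# 5 differential genes of a cluster fall inside it.
p_enriched = enrichment_p(n_universe=10, n_pathway=5, n_selected=5, n_overlap=5)
```

In practice the p values for all tested terms or pathways would also be corrected for multiple testing before significance is reported.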

As described above, the data processing method of the first embodiment allows cell sequencing to be carried out at low cost, simply, and rapidly, and brings considerable economic and safety benefits.

It should be noted that each unit mentioned in the device embodiments of the present invention is a logical unit. Physically, a logical unit may be one physical unit, part of a physical unit, or a combination of several physical units. The physical implementation of these logical units is not what matters most; the combination of functions they implement is the key to solving the technical problem addressed by the present invention. In addition, to highlight the innovative part of the present invention, the above device embodiments do not introduce units that are not closely related to solving that technical problem, which does not mean that no other units exist in those embodiments.

It should be noted that in the claims and description of this patent, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

Although the present invention has been shown and described with reference to certain preferred embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention.

Claims

1. A data processing method, characterized by comprising:

a raw data acquisition step, in which paired-end sequencing is performed and raw data for intelligent combinatorial barcoding single-cell sequencing are acquired, the first end being the cDNA part and the second end being the unique molecular identifier (UMI) and cell barcode part;

a UMI deduplication step, in which reads with duplicate unique molecular identifiers in the aligned data are removed to obtain deduplicated data;

an expression matrix construction step, in which an expression matrix is constructed from the quantified data, the expression matrix containing the raw count of each gene in each cell;

a cell screening step, in which the expression matrix is screened by mitochondrial content and number of expressed genes to obtain a screened matrix;

an analysis step, in which the normalized matrix is analyzed.

2. The data analysis method according to claim 1, characterized in that

the quality control and filtering step comprises the following sub-steps:

correcting the cell barcode part of the second end of the acquired raw data;

constructing a white list of cell barcodes;

extracting the sequences of the first end according to the white list;

screening the extracted first-end sequences to obtain the filtered data.

3. The data analysis method according to claim 1, characterized in that

in the cell screening step, the threshold for screening by mitochondrial content is 5%.

4. The data analysis method according to claim 1, characterized in that

in the cell screening step, the range for screening by number of expressed genes is 200-2500.

5. The data analysis method according to claim 1, characterized in that

in the normalization step, normalization is performed using the following formula, where CountOfGene is the raw count of each gene in each cell and AllCount is the sum of the raw counts of all genes in each cell,

6. The data analysis method according to claim 1, characterized in that

the analysis step is a cluster analysis step at the cell level.

7. The data analysis method according to claim 1, characterized in that

the analysis step is a differential analysis step and a pathway enrichment analysis step at the gene level.