
CN115064209A - Malignant cell identification method and system - Google Patents


Info

Publication number
CN115064209A
Authority
CN
China
Prior art keywords
gene
cell
malignant cells
copy number
malignant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210988485.4A
Other languages
Chinese (zh)
Other versions
CN115064209B (en)
Inventor
季序我
彭鑫鑫
赵义
李哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pukang Ruiren Medical Laboratory Co ltd
Predatum Biomedicine Suzhou Co ltd
Precision Scientific Technology Beijing Co ltd
Original Assignee
Beijing Pukang Ruiren Medical Laboratory Co ltd
Predatum Biomedicine Suzhou Co ltd
Precision Scientific Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pukang Ruiren Medical Laboratory Co ltd, Predatum Biomedicine Suzhou Co ltd, Precision Scientific Technology Beijing Co ltd
Priority to CN202210988485.4A
Publication of CN115064209A
Application granted
Publication of CN115064209B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a system for identifying malignant cells. The method comprises: obtaining single-cell transcriptome sequencing data; computing, from the single-cell transcriptome sequencing data, the data required to identify malignant cells by gene copy number variation, and determining from these data the direction and degree of gene copy number variation and the degree of allelic imbalance; performing a first cell clustering based on the direction and degree of gene copy number variation and the degree of allelic imbalance to obtain initially identified suspected non-malignant cells and suspected malignant cells; and, based on the initially identified suspected non-malignant cells and suspected malignant cells, performing a second cell clustering according to the tumor feature score and the non-tumor feature score to determine the final non-malignant cells and malignant cells. The invention also discloses a corresponding electronic device and computer-readable storage medium. The method integrates the supervised and unsupervised technical routes, does not depend on bulk (population-level) transcriptome sequencing data of tumor tissue and tumor-adjacent tissue, and improves the sensitivity of malignant cell identification.

Description

Malignant cell identification method and system
Technical Field
The invention relates to the technical field of cell identification, in particular to a malignant cell identification method and system.
Background
With the advent and continued improvement of single-cell transcriptome sequencing technology, it has become possible to study the genomic features of tumors at single-cell resolution. A prerequisite for such studies, however, is the accurate identification of malignant cells among the tens of thousands of cells in a single-cell sequencing dataset. The identification of malignant cells is therefore an important research topic in the field of single-cell transcriptome sequencing.
At present, malignant cells are identified mainly by two technical routes: a supervised route and an unsupervised route.
(1) The supervised route has three steps. First, feature genes of tumor tissue and of tumor-adjacent (paracancerous) tissue are identified from bulk transcriptome sequencing data of tumor tissue and tumor-adjacent tissue of the corresponding cancer type. Then, based on these feature genes and the single-cell transcriptome sequencing data, a malignant feature score and a non-malignant feature score are calculated for each cell, namely the median expression of the tumor-tissue feature genes and of the adjacent-tissue feature genes, respectively. Finally, all cells are divided into two groups based on the two feature scores, and the group with the higher malignant feature score is taken as the malignant cells. The problem with the supervised route is that it requires matched bulk transcriptome sequencing data of tumor tissue and tumor-adjacent tissue; tumor heterogeneity, the scarcity of tumor-adjacent samples in public data resources, and the additional cost of bulk transcriptome sequencing all limit its use in malignant cell identification.
(2) The unsupervised route has two steps. First, the direction and degree of copy number variation are estimated for each cell, over regions of a specified length, from the single-cell transcriptome sequencing data. Then, based on the copy number variation information, an unsupervised clustering method groups all cells into two classes, and the class with the greater degree of copy number variation is taken as the malignant cells. The unsupervised route has two problems. First, some copy number variations do not affect the ploidy of the whole genome and are not reflected in changes of gene expression values, so estimating copy number variation from single-cell gene expression values alone has low sensitivity. Second, identifying malignant cells solely from the estimated copy number variation pattern, without considering differences in feature gene expression, reduces the sensitivity of malignant cell identification.
Disclosure of Invention
To solve the above problems in the prior art, the invention provides a malignant cell identification method and system that combine the characteristics of the supervised and unsupervised technical routes and improve the sensitivity of malignant cell identification without depending on bulk transcriptome sequencing data of tumor tissue and tumor-adjacent tissue.
In a first aspect, the present invention provides a method for identifying malignant cells, comprising:
S1, obtaining single-cell transcriptome sequencing data;
S2, computing, from the single-cell transcriptome sequencing data, the data required to identify malignant cells by gene copy number variation, and determining from these data the direction and degree of gene copy number variation and the degree of allelic imbalance;
S3, performing a first cell clustering based on the direction and degree of gene copy number variation and the degree of allelic imbalance to determine the initially identified suspected non-malignant cells and suspected malignant cells;
S4, based on the initially identified suspected non-malignant cells and suspected malignant cells, performing a second cell clustering according to the tumor feature score and the non-tumor feature score to determine the final non-malignant cells and malignant cells.
Preferably, the data comprise gene expression values calculated from the single-cell transcriptome sequencing data, the genomic positions and mutation frequencies of mutations in gene regions, and haplotype information of the human genome, wherein:
the gene expression values calculated from the single-cell transcriptome sequencing data are used to measure the level of gene copy number variation of a genomic region; a high gene expression value indicates a high level of gene copy number variation in that region, and vice versa;
the genomic positions and mutation frequencies of mutations in a gene region are used to determine the location and degree of allelic imbalance caused by copy number variation; a high mutation frequency indicates a high degree of allelic imbalance caused by gene copy number variation, and vice versa;
the haplotype information of the human genome is used to improve the sensitivity of detecting allelic imbalance: mutations on the same haplotype are linked into a group, and the average mutation frequency of the group is taken as the degree of allelic imbalance of the region where the mutations are located; a high average mutation frequency indicates a high degree of allelic imbalance in that region, and vice versa.
Preferably, S2 comprises:
S21, calculating the gene expression value of each gene in each cell from the single-cell transcriptome sequencing data; normalizing each gene expression value by subtracting the mean expression value of that gene across all cells and dividing by the standard deviation of that gene's expression value across all cells; and determining the direction and degree of gene copy number variation of the gene in each cell from the normalized value: if the normalized gene expression value is positive, the gene has undergone copy number amplification, and the larger the value, the greater the degree of amplification; if the normalized gene expression value is negative, the gene has undergone copy number deletion, and the smaller the value, the greater the degree of deletion;
S22, for each gene, identifying the mutations occurring within its region from the single-cell transcriptome sequencing data, and determining the genomic positions and mutation frequencies of the mutations in the gene region;
S23, with reference to the haplotype information of the human genome, linking the mutations from S22 that lie on the same haplotype into a group; the difference between the mean mutation frequency of all mutations in a group and a first empirical value is used to measure the degree of allelic imbalance: if the difference is greater than 0, the gene carrying the group of mutations has undergone copy number amplification, and the larger the difference, the greater the degree of amplification; if the difference is less than 0, the gene carrying the group of mutations has undergone copy number deletion, and the smaller the difference, the greater the degree of deletion.
Preferably, the first empirical value is 0.5.
Preferably, S3 comprises:
S31, performing a first clustering of all cells according to the direction of gene copy number variation, based on two data indices, the gene copy number variation value and the allelic imbalance value, to obtain a plurality of first cell category groups;
S32, for each of the first cell category groups, taking the absolute values of the gene copy number variation values and of the allelic imbalance values of all cells in the group to obtain a plurality of first absolute values and a plurality of second absolute values, and averaging the first absolute values and the second absolute values respectively to obtain the gene copy number variation mean and the allelic imbalance mean of the group;
S33, taking the first cell category group whose product of the gene copy number variation mean and the allelic imbalance mean is lower than a first threshold as the initially identified suspected non-malignant cells, and the remaining first cell category group as the initially identified suspected malignant cells.
Preferably, the first clustering is k-means clustering with the number of clusters set to 2.
Preferably, S4 comprises:
S41, determining a criterion for identifying differentially expressed genes based on the initially identified suspected non-malignant cells and suspected malignant cells, and determining the malignant cell feature genes and the non-malignant cell feature genes according to the criterion;
S42, for each cell, taking the median expression of the malignant cell feature genes and the median expression of the non-malignant cell feature genes as the malignant cell score and the non-malignant cell score of the cell, respectively;
S43, performing a second clustering of all cells based on the malignant cell score and the non-malignant cell score to obtain a plurality of second cell category groups;
S44, for each of the second cell category groups, calculating the difference between the malignant cell score and the non-malignant cell score of each cell in the group, taking the second cell category group whose mean difference is higher than a second threshold as suspected malignant cells, and taking the remaining second cell category group as suspected non-malignant cells;
S45, adjusting the criterion of S41 and repeating S42 to S44, thereby determining the final non-malignant cells and malignant cells.
Preferably, the criterion of S41 is a screening standard for differentially expressed genes, comprising:
taking genes with log2FoldChange > 1 and FDR < 0.05 as malignant cell feature genes, and genes with log2FoldChange < -1 and FDR < 0.05 as non-malignant cell feature genes, where log2FoldChange is the base-2 logarithm of the ratio of the mean gene expression value in the suspected malignant cells to the mean gene expression value in the suspected non-malignant cells.
Preferably, the second clustering is k-means clustering with the number of clusters set to 2.
In a second aspect, the present invention provides a malignant cell identification system, comprising:
a single-cell transcriptome sequencing data acquisition module, configured to obtain single-cell transcriptome sequencing data;
a gene copy number variation and allelic imbalance determination module, configured to compute, from the single-cell transcriptome sequencing data, the data required to identify malignant cells by gene copy number variation, and to determine from these data the direction and degree of gene copy number variation and the degree of allelic imbalance;
a first clustering and identification module, configured to perform a first cell clustering based on the direction and degree of gene copy number variation and the degree of allelic imbalance to determine the initially identified suspected non-malignant cells and suspected malignant cells;
a second clustering and identification module, configured to perform, based on the initially identified suspected non-malignant cells and suspected malignant cells, a second cell clustering according to the tumor feature score and the non-tumor feature score to determine the final non-malignant cells and malignant cells.
In a third aspect, the invention provides an electronic device comprising a processor and a memory, the memory storing a plurality of instructions, and the processor being configured to read the instructions and perform the method according to the first aspect.
In a fourth aspect, the invention provides a computer-readable storage medium storing a plurality of instructions that can be read by a processor to perform the method according to the first aspect.
The malignant cell identification method, system and electronic device provided by the invention have the following beneficial effects:
in identifying malignant cells, the method removes the dependence on bulk transcriptome sequencing data of tumor tissue and tumor-adjacent tissue, and at the same time incorporates richer reference information, including gene copy number variation, the degree of allelic imbalance corrected using haplotype information of the human genome, and the feature gene expression values of malignant and non-malignant cells, thereby improving the sensitivity of malignant cell identification.
Drawings
FIG. 1 is a schematic flow chart of the method for identifying malignant cells according to the present invention.
FIG. 2 is a schematic diagram of a malignant cell identification system according to the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects the various parts of the terminal through various interfaces and lines, and performs the functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in fig. 1, the present embodiment provides a malignant cell identification method, including:
S1, obtaining single-cell transcriptome sequencing data;
S2, computing, from the single-cell transcriptome sequencing data, the data required to identify malignant cells by gene copy number variation, and determining from these data the direction and degree of gene copy number variation and the degree of allelic imbalance;
S3, performing a first cell clustering based on the direction and degree of gene copy number variation and the degree of allelic imbalance to determine the initially identified suspected non-malignant cells and suspected malignant cells;
S4, based on the initially identified suspected non-malignant cells and suspected malignant cells, performing a second cell clustering according to the tumor feature score and the non-tumor feature score to determine the final non-malignant cells and malignant cells.
As a preferred embodiment, the data comprise gene expression values calculated from the single-cell transcriptome sequencing data, the genomic positions and mutation frequencies of mutations in gene regions, and haplotype information of the human genome, wherein:
the gene expression values calculated from the single-cell transcriptome sequencing data are used to measure the level of gene copy number variation of a genomic region; a high gene expression value indicates a high level of gene copy number variation in that region, and vice versa;
the genomic positions and mutation frequencies of mutations in a gene region are used to determine the location and degree of allelic imbalance caused by copy number variation; a high mutation frequency indicates a high degree of allelic imbalance caused by gene copy number variation, and vice versa;
the haplotype information of the human genome is used to improve the sensitivity of detecting allelic imbalance: mutations on the same haplotype are linked into a group, and the average mutation frequency of the group is taken as the degree of allelic imbalance of the region where the mutations are located; a high average mutation frequency indicates a high degree of allelic imbalance in that region, and vice versa.
To improve the sensitivity of copy number variation identification, this embodiment therefore considers all three kinds of data: the gene expression values calculated from the single-cell transcriptome sequencing data, the positions and frequencies of mutations in gene regions, and the haplotype information of the human genome.
As a preferred embodiment, S2 comprises:
S21, calculating the gene expression value of each gene in each cell from the single-cell transcriptome sequencing data; normalizing each gene expression value by subtracting the mean expression value of that gene across all cells and dividing by the standard deviation of that gene's expression value across all cells; and determining the direction and degree of gene copy number variation of the gene in each cell from the normalized value: if the normalized gene expression value is positive, the gene has undergone copy number amplification, and the larger the value, the greater the degree of amplification; if the normalized gene expression value is negative, the gene has undergone copy number deletion, and the smaller the value, the greater the degree of deletion;
S22, for each gene, identifying the mutations occurring within its region from the single-cell transcriptome sequencing data, and determining the genomic positions and mutation frequencies of the mutations in the gene region;
S23, with reference to the haplotype information of the human genome, linking the mutations from S22 that lie on the same haplotype into a group; the difference between the mean mutation frequency of all mutations in a group and a first empirical value is used to measure the degree of allelic imbalance: if the difference is greater than 0, the gene carrying the group of mutations has undergone copy number amplification, and the larger the difference, the greater the degree of amplification; if the difference is less than 0, the gene carrying the group of mutations has undergone copy number deletion, and the smaller the difference, the greater the degree of deletion.
As a preferred embodiment, the first empirical value is between 0.45 and 0.55, preferably 0.5.
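The normalization of S21 and the imbalance measure of S23 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the input is assumed to be plain Python lists of numbers, and the use of the population standard deviation is an assumption, since the text does not say which form is meant.

```python
from statistics import mean, pstdev


def zscore_expression(values):
    """Normalize one gene's expression values across all cells (S21).

    Each value has the gene's mean subtracted and is divided by the
    gene's standard deviation (population form, an assumption).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene with constant expression carries no signal; return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def allelic_imbalance(mutation_freqs, empirical_value=0.5):
    """Mean mutation frequency of one haplotype-linked group of mutations
    minus the first empirical value (S23)."""
    return mean(mutation_freqs) - empirical_value


def direction(score):
    """Interpret the sign of a normalized value or imbalance difference."""
    if score > 0:
        return "amplification"
    if score < 0:
        return "deletion"
    return "neutral"
```

For example, `direction(allelic_imbalance([0.7, 0.8, 0.9]))` classifies a group of mutations with a mean frequency of about 0.8 as an amplification, because the mean exceeds the empirical value of 0.5.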
As a preferred embodiment, S3 comprises:
S31, performing a first clustering of all cells according to the direction of gene copy number variation, based on two data indices, the gene copy number variation value and the allelic imbalance value, to obtain a plurality of first cell category groups;
S32, for each of the first cell category groups, taking the absolute values of the gene copy number variation values and of the allelic imbalance values of all cells in the group to obtain a plurality of first absolute values and a plurality of second absolute values, and averaging the first absolute values and the second absolute values respectively to obtain the gene copy number variation mean and the allelic imbalance mean of the group;
S33, taking the first cell category group whose product of the gene copy number variation mean and the allelic imbalance mean is lower than a first threshold as the initially identified suspected non-malignant cells, and the remaining first cell category group as the initially identified suspected malignant cells. In this embodiment, the first threshold is not a fixed threshold but a relative one: of the two cell category groups, the one with the lower product of the gene copy number variation mean and the allelic imbalance mean is taken as the suspected non-malignant cells.
As a preferred embodiment, the first clustering is k-means clustering with the number of clusters set to 2.
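A rough sketch of S31 to S33 with the relative threshold is given below. It rests on several assumptions that are not in the text: each cell is summarized by a single pair (copy number variation value, allelic imbalance value), the clustering runs on the signed values, and the small k-means routine uses a deterministic initialization chosen only to keep the example reproducible. The names `kmeans2` and `first_pass` are hypothetical.

```python
from statistics import mean


def kmeans2(points, iters=50):
    """Tiny 2-cluster k-means on 2-D points.

    Initial centers are the points with the smallest and largest
    coordinate sum (an assumption made for reproducibility).
    """
    ordered = sorted(points, key=lambda p: p[0] + p[1])
    centers = [ordered[0], ordered[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared distance).
        new_labels = [
            min((0, 1), key=lambda k: (p[0] - centers[k][0]) ** 2
                + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                centers[k] = (mean(m[0] for m in members),
                              mean(m[1] for m in members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def first_pass(cells):
    """cells: list of (cnv_value, imbalance_value), one pair per cell.

    Returns one label per cell. The cluster with the lower product of
    mean |CNV| and mean |imbalance| is called suspected non-malignant.
    """
    labels = kmeans2(cells)
    products = {}
    for k in set(labels):
        members = [c for c, lab in zip(cells, labels) if lab == k]
        products[k] = (mean(abs(c[0]) for c in members)
                       * mean(abs(c[1]) for c in members))
    low = min(products, key=products.get)
    return ["suspected non-malignant" if lab == low else "suspected malignant"
            for lab in labels]
```

Because the threshold is relative, no numeric cutoff appears anywhere in the sketch; the decision is a comparison between the two clusters only.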
As a preferred embodiment, S4 comprises:
S41, determining a criterion for identifying differentially expressed genes based on the initially identified suspected non-malignant cells and suspected malignant cells, and determining the malignant cell feature genes and the non-malignant cell feature genes according to the criterion;
S42, for each cell, taking the median expression of the malignant cell feature genes and the median expression of the non-malignant cell feature genes as the malignant cell score and the non-malignant cell score of the cell, respectively;
S43, performing a second clustering of all cells based on the malignant cell score and the non-malignant cell score to obtain a plurality of second cell category groups;
S44, for each of the second cell category groups, calculating the difference between the malignant cell score and the non-malignant cell score of each cell in the group, taking the second cell category group whose mean difference is higher than a second threshold as suspected malignant cells, and taking the remaining second cell category group as suspected non-malignant cells. In this embodiment, the second threshold is not a fixed threshold but a relative one: of the two cell category groups, the one with the higher mean difference between the malignant cell score and the non-malignant cell score is taken as the suspected malignant cells;
S45, adjusting the criterion of S41 and repeating S42 to S44, thereby determining the final non-malignant cells and malignant cells.
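The scoring of S42 and the group decision of S44 can be illustrated with the short sketch below. It assumes the cluster labels of S43 are already available from any two-cluster method, that each cell's expression is a dictionary mapping gene name to value, and that the function names are hypothetical; the iteration of S45 is not shown.

```python
from statistics import mean, median


def cell_scores(expr, malignant_genes, non_malignant_genes):
    """Return (malignant score, non-malignant score) for one cell (S42).

    expr maps gene name to expression value; each score is the median
    expression of the corresponding feature gene list.
    """
    malignant_score = median(expr[g] for g in malignant_genes)
    non_malignant_score = median(expr[g] for g in non_malignant_genes)
    return malignant_score, non_malignant_score


def label_groups(scores, cluster_labels):
    """Apply the relative second threshold of S44.

    scores: list of (malignant score, non-malignant score) per cell.
    cluster_labels: cluster id per cell from the second clustering.
    The cluster with the higher mean score difference is called
    suspected malignant.
    """
    diffs = {}
    for (m, n), lab in zip(scores, cluster_labels):
        diffs.setdefault(lab, []).append(m - n)
    malignant_cluster = max(diffs, key=lambda k: mean(diffs[k]))
    return ["suspected malignant" if lab == malignant_cluster
            else "suspected non-malignant" for lab in cluster_labels]
```

Using the median rather than the mean keeps a single extreme feature gene from dominating a cell's score, which is consistent with the choice stated in S42.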
In a preferred embodiment, the criterion of S41 is a screening standard for differentially expressed genes, comprising:
taking genes with log2FoldChange > 1 and FDR < 0.05 as malignant cell feature genes, and genes with log2FoldChange < -1 and FDR < 0.05 as non-malignant cell feature genes, where log2FoldChange is the base-2 logarithm of the ratio of the mean gene expression value in the suspected malignant cells to the mean gene expression value in the suspected non-malignant cells.
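As an illustration of this screening standard, the sketch below selects feature genes from per-group expression values and precomputed FDR values. The data layout (dictionaries keyed by gene name), the function name, and the decision to skip genes whose group mean is not positive are assumptions; the text does not say how an undefined logarithm is handled.

```python
from math import log2
from statistics import mean


def select_feature_genes(malignant_expr, non_malignant_expr, fdr,
                         lfc_cut=1.0, fdr_cut=0.05):
    """Split genes into malignant and non-malignant feature genes.

    malignant_expr / non_malignant_expr: gene -> list of expression
    values in the suspected malignant / suspected non-malignant cells.
    fdr: gene -> FDR value computed elsewhere.
    """
    malignant_genes, non_malignant_genes = [], []
    for gene in fdr:
        mean_m = mean(malignant_expr[gene])
        mean_n = mean(non_malignant_expr[gene])
        if mean_m <= 0 or mean_n <= 0:
            # log2 of the ratio is undefined; skipping is an assumption.
            continue
        lfc = log2(mean_m / mean_n)
        if fdr[gene] < fdr_cut:
            if lfc > lfc_cut:
                malignant_genes.append(gene)
            elif lfc < -lfc_cut:
                non_malignant_genes.append(gene)
    return malignant_genes, non_malignant_genes
```

Exposing `lfc_cut` and `fdr_cut` as parameters mirrors S45, where the criterion is adjusted and the later steps are repeated.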
With the continuing fall in sequencing cost, transcriptome sequencing has become a very common analysis technique. Gene expression analysis almost always involves differential expression (DE) analysis, for which there are two main methods: fold change and the t-test.
In the transcriptome analysis of differentially expressed genes, the present invention uses the log2FoldChange and FDR values. log2FoldChange may also be written log2FC, where FC stands for fold change, that is, the ratio of expression between two samples (groups). Taking the base-2 logarithm of FC gives log2FC, which reduces the gap between very large and very small values. By default, genes whose absolute log2FC is greater than 1 are generally selected.
The FDR (false discovery rate) is obtained by correcting the significance p-value from the t-test. Because differential expression analysis of transcriptome sequencing performs independent statistical hypothesis tests on a large number of gene expression values, a false positive problem arises; the FDR is mainly used in the analysis of differentially expressed genes to control the proportion of false positives in the final results.
In transcriptome analysis, it is one of the core contents of the analysis to determine whether a transcript is expressed differently in different samples. Generally, transcripts whose expression amounts are more than twice different among different samples are transcripts having expression differences. In order to determine whether the difference in expression level between the two samples is due to various errors or is substantial, it is necessary to perform a hypothesis test based on the data of the expression levels of all genes in the two samples. Commonly used hypothesis testing methods include t-test, chi-square test, and the like. It is hypothesized that examination of p-value does not determine whether a transcript is differentially expressed, since transcriptome analysis is not performed on one or several transcripts, and it is all transcripts transcriptionally expressed in a sample that it analyzes. Therefore, the number of transcripts in a sample is determined by the hypothesis test. This can lead to a serious problem, as the low proportion of false positives in a single hypothesis test can accumulate to a very surprising extent. It was hypothesized that in the analyzed genomic gene samples:
(1) there are two samples, from which 10000 transcripts are obtained in total;
(2) the expression levels of 100 of these transcripts differ between the two samples;
(3) the differential expression analysis of an individual gene has a 1% false positive rate.
With a 1% false positive rate, analyzing the 10000 genes yields 100 false positive results; adding the 100 true results gives 200 results in total. In this example, 50% of the 200 differentially expressed genes obtained from a single analysis are false positives, which is clearly unacceptable. To address this issue, this embodiment introduces the FDR to control the proportion of false positives in the final analysis results.
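The arithmetic of this example can be reproduced with a short Python sketch. The function name is hypothetical, and, as in the text, the 1% rate is applied to all 10000 tests.

```python
def false_positive_share(n_tests, n_true_positives, per_test_fp_rate):
    """Return (expected false positives, total calls, false-positive share)."""
    # As in the text, the per-test rate is applied to all n_tests.
    expected_fp = n_tests * per_test_fp_rate
    total_calls = expected_fp + n_true_positives
    return expected_fp, total_calls, expected_fp / total_calls


fp, total, share = false_positive_share(10000, 100, 0.01)
print(f"false positives: {fp:.0f}, total calls: {total:.0f}, share: {share:.0%}")
```

The share grows with the number of tests and shrinks with the number of truly different transcripts, which is why an uncorrected per-test p-value is a poor screening criterion at transcriptome scale.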
The FDR is calculated by correcting the p-values of the hypothesis tests. In the differential expression analysis, the well-known Benjamini-Hochberg correction method is therefore applied to the significance p-values obtained from the original hypothesis tests, and the FDR is finally used as the key index for screening differentially expressed genes. FDR < 0.01, 0.05, or 0.1 is typically taken as the default criterion.
These two criteria (the log2FC threshold and the FDR threshold) are generally chosen from empirical values and are not fixed. Because the thresholds can be fine-tuned when the number of differential genes obtained in an experiment is too low or too high, the benchmark is adjusted in step S45; in this embodiment, the FDR threshold is adjusted from 0.05 to 0.01 or 0.1.
In general, the FDR is calculated as follows:
(1) arrange all p-values in ascending order; denote a p-value by P, its rank by i, and the total number of p-values by m;
(2) FDR(i) = P(i) * m / i;
(3) in descending order of i, perform in turn: FDR(i) = min{FDR(i), FDR(i+1)}.
In fact, the original BH algorithm finds the largest i satisfying P(i) ≤ (i/m) * FDR threshold, at which point all data with rank not greater than i can be considered significant. In this embodiment, so that the data can conveniently be analyzed with different FDR thresholds, the method of step (3) is adopted, which ensures that all significant data can be found directly from the FDR values whatever FDR threshold is selected.
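Steps (1) to (3) can be sketched in Python as follows, together with the original step-up rule for comparison. The function names and the sample p-values are hypothetical; this is a minimal illustration under those assumptions, not the implementation used in the embodiment.

```python
def bh_adjust(p_values):
    """Return the FDR value for each p-value, in the original input order."""
    m = len(p_values)
    # Step (1): rank the p-values in ascending order, remembering positions.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0  # has no effect for valid p-values, since FDR(m) = P(m) <= 1
    # Steps (2) and (3): walk the ranks from m down to 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        raw = p_values[idx] * m / rank          # FDR(i) = P(i) * m / i
        running_min = min(running_min, raw)     # FDR(i) = min{FDR(i), FDR(i+1)}
        adjusted[idx] = running_min
    return adjusted


def bh_step_up_count(p_values, threshold):
    """Original BH rule: the largest rank i with P(i) <= (i / m) * threshold."""
    m = len(p_values)
    largest = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * threshold:
            largest = rank
    return largest


p = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for illustration
fdr = bh_adjust(p)
for threshold in (0.01, 0.03, 0.05):
    significant = sum(1 for q in fdr if q <= threshold)
    print(threshold, significant, bh_step_up_count(p, threshold))
```

For these sample values both approaches select the same number of significant results at each threshold, while the adjusted values only need to be computed once.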
The selection of the FDR threshold is a very important step in transcriptome analysis; common thresholds include 0.01, 0.05, and 0.1. In practice, the threshold can be chosen flexibly according to actual needs. For example, when the number of differentially expressed genes obtained by transcriptome analysis is small, the degree of false positive accumulation is low, so the FDR threshold may be set somewhat higher to obtain a larger number of differential expression results and facilitate the subsequent analysis.
As a preferred embodiment, the second clustering is k-means clustering, and the number of clusters of the k-means clustering is specified to be 2.
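A two-cluster k-means over per-cell (malignant cell score, non-malignant cell score) pairs can be sketched in plain Python as below. The initialisation rule, the example scores, and the second threshold of 0 are assumptions chosen for illustration; the embodiment does not specify them here.

```python
def two_means(points, max_iter=50):
    """Lloyd's algorithm with k = 2 on (malignant score, non-malignant score) pairs."""
    diffs = [m - n for m, n in points]
    # Assumed deterministic start: the two cells with the most extreme score differences.
    centers = [points[diffs.index(min(diffs))], points[diffs.index(max(diffs))]]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for m, n in points:
            dist = [(m - cm) ** 2 + (n - cn) ** 2 for cm, cn in centers]
            new_labels.append(0 if dist[0] <= dist[1] else 1)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


def call_groups(points, labels, second_threshold=0.0):
    """Name each cluster by its mean (malignant - non-malignant) score difference."""
    calls = {}
    for c in (0, 1):
        diffs = [p[0] - p[1] for p, lab in zip(points, labels) if lab == c]
        if diffs:
            mean_diff = sum(diffs) / len(diffs)
            calls[c] = "suspected malignant" if mean_diff > second_threshold else "suspected non-malignant"
    return calls


# Made-up scores: three cells with high malignant scores, three with high non-malignant scores.
scores = [(5.0, 1.0), (4.5, 0.8), (5.2, 1.1), (0.9, 4.8), (1.1, 5.1), (1.0, 4.9)]
cell_labels = two_means(scores)
print(cell_labels, call_groups(scores, cell_labels))
```

Fixing the number of clusters at 2 matches the binary decision being made; which of the two clusters is called malignant is then decided by the score difference rather than by the cluster index.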
Example two
As shown in fig. 2, the present embodiment provides a malignant cell identification system including:
a single cell transcriptome sequencing data acquisition module 101, configured to acquire single cell transcriptome sequencing data;
a gene copy number variation and allele imbalance degree determination module 102, configured to calculate data required for identifying malignant cells by gene copy number variation based on the single cell transcriptome sequencing data, and determine a direction and a degree of gene copy number variation and a degree of allele imbalance based on the data;
a primary clustering and identifying module 103, configured to perform a first cell clustering based on the direction and degree of gene copy number variation and the degree of allele imbalance, so as to obtain the suspected non-malignant cells and suspected malignant cells determined for the first time;
and a secondary clustering and identifying module 104, configured to perform, based on the suspected non-malignant cells and suspected malignant cells determined for the first time, a second cell clustering according to the tumor characteristic score and the non-tumor characteristic score, so as to determine the final non-malignant cells and malignant cells.
The system can implement the identification method provided in the first embodiment, and the specific identification method can be referred to the description in the first embodiment, which is not described herein again.
The invention also provides a memory storing a plurality of instructions for implementing the method of embodiment one.
As shown in fig. 3, the present invention further provides an electronic device, which includes a processor 301 and a memory 302 connected to the processor 301, where the memory 302 stores a plurality of instructions, and the instructions can be loaded and executed by the processor, so as to enable the processor to execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A method for identifying malignant cells, comprising:
S1, obtaining single cell transcriptome sequencing data;
S2, calculating, based on the single cell transcriptome sequencing data, the data required for identifying malignant cells by gene copy number variation, and determining the direction and degree of gene copy number variation and the degree of allele imbalance based on the data;
S3, performing a first cell clustering based on the direction and degree of gene copy number variation and the degree of allele imbalance to obtain the suspected non-malignant cells and suspected malignant cells determined for the first time;
and S4, based on the suspected non-malignant cells and suspected malignant cells determined for the first time, performing a second cell clustering according to the tumor characteristic score and the non-tumor characteristic score to determine the final non-malignant cells and malignant cells.
2. The method of identifying malignant cells according to claim 1, wherein said data comprises gene expression values calculated based on the single cell transcriptome sequencing data, mutated genomic positions and mutation frequencies of gene regions, and haplotyping information of the human genome; wherein:
the gene expression values calculated based on the single cell transcriptome sequencing data are used to measure the level of gene copy number variation of a genome region; if the gene expression value is high, the level of gene copy number variation in the genome region is high, and vice versa;
the mutated genomic positions and mutation frequencies of a gene region are used to determine the location and degree of the allelic imbalance caused by gene copy number variation; if the mutation frequency is high, the degree of allelic imbalance caused by the gene copy number variation is high, and vice versa;
the haplotyping information of the human genome is used to improve the sensitivity of detecting allelic imbalance: based on this information, mutations on the same haplotype are linked into a group, and the average mutation frequency of the group of mutations is taken as the degree of allelic imbalance of the region where the mutations are located; if the average mutation frequency is high, the degree of allelic imbalance in the region where the mutations are located is high, and vice versa.
3. The method according to claim 2, wherein S2 comprises:
S21, calculating the gene expression value of each gene in each cell based on the single cell transcriptome sequencing data; normalizing each gene expression value by subtracting the average expression value of the gene across all cells and dividing by the standard deviation of the expression value of the gene across all cells; and determining the direction and degree of gene copy number variation of the gene in each cell based on the normalized gene expression value: if the normalized gene expression value is positive, gene copy number amplification has occurred in the gene, and the larger the normalized value, the greater the degree of amplification; if the normalized gene expression value is negative, gene copy number deletion has occurred in the gene, and the smaller the normalized value, the greater the degree of deletion;
S22, for each gene, identifying the mutations occurring within its region based on the single cell transcriptome sequencing data, and determining the mutated genomic positions and mutation frequencies of the gene region;
S23, linking the mutations of S22 that occur on the same haplotype into a group with reference to the haplotyping information of the human genome; the difference between the average mutation frequency of all mutations in a group and a first test value is used to measure the degree of allelic imbalance: if the difference is greater than 0, gene copy number amplification has occurred in the gene carrying the group of mutations, and the larger the difference, the greater the degree of amplification; if the difference is less than 0, gene copy number deletion has occurred in the gene carrying the group of mutations, and the smaller the difference, the greater the degree of deletion.
4. The method of claim 3, wherein the first test value is 0.5.
5. The method according to claim 4, wherein S3 includes:
S31, according to the direction of gene copy number variation, performing a first clustering on all cells based on two data indexes, namely the gene copy number variation value and the allele imbalance value, to obtain a plurality of first cell category groups;
S32, for each of the plurality of first cell category groups, taking the absolute values of the gene copy number variation values and of the allele imbalance values of all cells in the group to obtain a plurality of first absolute values and a plurality of second absolute values, and averaging the first absolute values and the second absolute values respectively to obtain a gene copy number variation average value and an allele imbalance average value;
S33, taking each first cell category group for which the product of the gene copy number variation average value and the allele imbalance average value is lower than a first threshold as the suspected non-malignant cells determined for the first time, and the remaining first cell category groups as the suspected malignant cells determined for the first time.
6. The method of claim 5, wherein the first cluster is a k-means cluster and the number of clusters of the k-means cluster is specified to be 2.
7. The method according to claim 6, wherein S4 includes:
S41, determining a benchmark for identifying differentially expressed genes based on the suspected non-malignant cells and suspected malignant cells determined for the first time, and determining malignant cell characteristic genes and non-malignant cell characteristic genes respectively based on the benchmark;
S42, for each cell, taking the median expression of the malignant cell characteristic genes and the median expression of the non-malignant cell characteristic genes as the malignant cell score and the non-malignant cell score of the cell, respectively;
S43, performing a second clustering on all cells based on the malignant cell score and the non-malignant cell score to obtain a plurality of second cell category groups;
S44, calculating the difference between the malignant cell score and the non-malignant cell score of each cell in each of the plurality of second cell category groups, taking each second cell category group whose mean difference is higher than a second threshold as suspected malignant cells, and the remaining second cell category groups as suspected non-malignant cells;
S45, adjusting the benchmark of S41 and repeating S42-S44, thereby determining the final non-malignant cells and malignant cells.
8. The method of claim 7, wherein the benchmark of S41 is a screening criterion for differential genes, comprising:
taking genes with log2FoldChange > 1 and FDR < 0.05 as malignant cell characteristic genes, and taking genes with log2FoldChange < -1 and FDR < 0.05 as non-malignant cell characteristic genes; wherein log2FoldChange is the log2 transformation of the ratio of the mean gene expression value in the suspected malignant cells to the mean gene expression value in the suspected non-malignant cells.
9. The method of claim 7, wherein the second cluster is a k-means cluster and the number of clusters of the k-means cluster is specified to be 2.
10. A system for identifying malignant cells for performing the method for identifying malignant cells according to any one of claims 1 to 9, comprising:
a single cell transcriptome sequencing data acquisition module (101) for acquiring single cell transcriptome sequencing data;
a gene copy number variation and allele imbalance degree determination module (102) for calculating data required for identifying malignant cells by gene copy number variation based on the single cell transcriptome sequencing data, and determining the direction, degree and allele imbalance degree of gene copy number variation based on the data;
a primary clustering and identifying module (103) for performing a first cell clustering based on the direction and degree of gene copy number variation and the degree of allele imbalance to determine a first determined suspected non-malignant cell and a suspected malignant cell;
and the secondary clustering and identifying module (104) is used for carrying out secondary cell clustering on the basis of the suspected non-malignant cells and the suspected malignant cells which are determined for the first time according to the tumor characteristic score and the non-tumor characteristic score to determine final non-malignant cells and malignant cells.
11. An electronic device comprising a processor and a memory, said memory storing a plurality of instructions, said processor being configured to read said instructions and to perform the identification method according to any one of claims 1 to 9.
12. A computer-readable storage medium storing a plurality of instructions readable by a processor to perform the identification method of any one of claims 1-9.
CN202210988485.4A 2022-08-17 2022-08-17 Malignant cell identification method and system Active CN115064209B (en)


Publications (2)

Publication Number Publication Date
CN115064209A true CN115064209A (en) 2022-09-16
CN115064209B CN115064209B (en) 2022-11-01


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant