CN109002689B - T cell data processing method and device - Google Patents
Abstract
The embodiment of the invention provides a T cell data processing method and a device, wherein the T cell data processing method comprises the following steps: acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics; calculating T cell receptor statistics for each set of T cell datasets; verifying the importance of each group of T cell receptor statistic by a cross-validation method, and screening out a characteristic group with high importance in a specified amount of sample data; and constructing a naive Bayes recognition network model according to the feature group with high importance.
Description
Technical Field
The invention relates to the field of data processing, in particular to a T cell data processing method and device.
Background
Computer technology is being applied in more and more fields and can improve the operating efficiency of each of them. In the medical field, too, further technology is needed to improve the efficiency of medical procedures.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a method and an apparatus for processing T cell data.
The T cell data processing method provided by the embodiment of the invention comprises the following steps:
acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;
calculating T cell receptor statistics for each set of T cell datasets;
verifying the importance of each group of T cell receptor statistic by a cross-validation method, and screening out a characteristic group with high importance in a specified amount of sample data;
and constructing a naive Bayes recognition network model according to the feature group with high importance.
Optionally, after the step of constructing a naive bayes recognition network model based on the set of features with high importance, the method further comprises:
calculating T cell receptor statistics of data to be judged, wherein the data to be judged comprises T cell data of a target object;
and inputting the T cell receptor statistic of the data to be judged into the naive Bayes recognition network model to recognize the target disease.
Optionally, the T cell receptor statistic is calculated by:
obtaining VJ family frequencies of T cell data to be calculated, wherein V represents a V gene in the T cell and J represents a J gene in the T cell;
calculating to obtain the internal homology of the VJ family according to the T cell data to be calculated;
and calculating the T cell receptor statistic of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family.
Optionally, the T cell receptor statistic of the T cell data to be calculated based on the VJ family frequency and the homology within the VJ family is represented by the following expression:
wherein f represents a VJ family frequency; c represents homology within the VJ family.
Optionally, the step of calculating homology within the VJ family from the T cell data to be calculated comprises:
obtaining the number of amino acid sequence types in the VJ family of the T cell data to be calculated;
calculating the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family of the T cell data to be calculated;
and calculating the product of the number of amino acid sequence types and the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family to obtain the homology within the VJ family.
Optionally, the step of calculating the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family of the T cell data to be calculated comprises:
aligning all amino acid sequences within the VJ family of the T cell data to be calculated pairwise;
scoring the alignment result of every two amino acid sequences using a scoring matrix to obtain the distance between each pair of amino acid sequences;
assembling the distances between every two amino acid sequences into a distance matrix of the amino acid sequences within the VJ family;
calculating the information entropy of the distance matrix of the amino acid sequences within the VJ family.
Optionally, when the alignment result of every two amino acid sequences is scored using a scoring matrix to obtain the distance between each pair of amino acid sequences, the scoring rule is as follows:
distance(a, a) = 0;
distance(a, b) = min(4, 4 - BLOSUM62(a, b));
wherein a and b represent different amino acids.
Optionally, the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method, and screening out a feature group with high importance in a specified amount of sample data, includes:
a. calculating the importance of each group of T cell receptor statistics by the cross-validation method, and ranking all features in the sample data by importance;
b. selecting the group of features ranked at the top of the ranking;
c. repeating steps a and b a set number of times, and selecting, from the top-ranked features screened out in those repetitions, a specified number of features as the feature group with high importance in the sample data.
Optionally, the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method, and screening out a feature group with high importance in a specified amount of sample data, includes:
calculating the importance of each group of T cell receptor statistics by the cross-validation method, and ranking all features in the sample data by importance;
selecting a specified number of top-ranked features as the feature group with high importance.
Optionally, the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method, and screening out a feature group with high importance in a specified amount of sample data, includes:
verifying the importance of each group of T cell receptor statistics by a random forest combined with cross-validation, and screening out a specified number of feature groups with high importance in the sample data.
An embodiment of the present invention further provides a T cell data processing apparatus, including:
the acquisition module is used for acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;
a first calculation module for calculating T cell receptor statistics for each set of T cell data sets;
the screening module is used for verifying the importance of each group of T cell receptor statistic by a cross-validation method and screening out a characteristic group with high importance in a specified amount of sample data;
and the construction module is used for constructing a naive Bayes recognition network model according to the feature group with high importance.
Compared with the prior art, the T cell data processing method and device provided by the embodiments of the invention construct the recognition network from the data of higher importance obtained by calculating on and screening multiple groups of sample data, so that the constructed naive Bayes recognition network model can better judge the data to be judged.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a flowchart of a T cell data processing method according to an embodiment of the present invention.
Fig. 3 is a flowchart of calculating T cell receptor statistics in the T cell data processing method according to an embodiment of the present invention.
Fig. 4 is a partial flowchart of a T cell data processing method according to an embodiment of the present invention.
Fig. 5 is a functional block diagram of a T cell data processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a block diagram of an electronic device 100. The electronic device 100 includes a memory 111, a memory controller 112, a processor 113, a peripheral interface 114, an input/output unit 115, and a display unit 116. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely exemplary and is not intended to limit the structure of the electronic device 100. For example, electronic device 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1. The electronic device 100 described in this embodiment may be a computing device having an image processing capability, such as a personal computer, an image processing server, an in-vehicle device, or a mobile electronic device.
The memory 111, the memory controller 112, the processor 113, the peripheral interface 114, the input/output unit 115 and the display unit 116 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 111 stores at least one software functional module in the form of software or firmware, or the module is built into the Operating System (OS) of the electronic device 100. The processor 113 is configured to execute executable modules stored in the memory.
The memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 is configured to store a program, and the processor 113 executes the program after receiving an execution instruction. The method executed by the electronic device 100, as defined by the process disclosed in any embodiment of the present invention, may be applied to the processor 113 or implemented by the processor 113.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The peripheral interface 114 couples various input/output devices to the processor 113 and the memory 111. In some embodiments, the peripheral interface 114, the processor 113, and the memory controller 112 may be implemented in a single chip. In other examples, they may each be implemented on a separate chip.
The input/output unit 115 is used by a user to provide input data. The input/output unit 115 may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit 116 provides an interactive interface (e.g., a user operation interface) between the electronic device 100 and a user, or is used to display image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen that supports single-point and multi-point touch operations, meaning that the touch display can sense touch operations generated simultaneously at one or more positions on it and send the sensed touch operations to the processor for calculation and processing.
Graves' disease (GD) is an organ-specific autoimmune disease with a population prevalence as high as 0.5%-2%. Graves' ophthalmopathy (GO) is the most common complication of GD: 25%-50% of GD patients develop GO in the course of the disease, manifested as eyeball protrusion, pain, impaired vision, etc. Existing treatment of GO mainly targets patients in the moderate-to-severe active phase, who are given hormone shock therapy to suppress immune and inflammatory reactions. However, the treatment is ineffective for some patients and has adverse reactions such as hypertension, infection and osteoporosis. GO therefore seriously affects the physical and mental health of patients, both through the disease itself and through its treatment, and optimizing the existing diagnosis and treatment scheme has long been a research hotspot in the field.
Under the traditional diagnostic system, by the time GO is diagnosed the fibroblasts in the patient's orbit have already been activated and have proliferated, and an irreversible tissue remodeling process has started. At this point, hormone shock therapy only ameliorates the orbital inflammatory response and fails to stop or reverse fibrotic tissue remodeling. Therefore, if the occurrence of GO in GD patients can be predicted early, the progress of the disease can be blocked by preventive treatment before the orbital fibroblasts are activated and proliferate, thereby fundamentally delaying or even preventing the occurrence of GO.
In recent years, many GO susceptibility studies have been reported, but there is no report of successful prediction of GO occurrence. Polymorphisms of genes such as HLA, CTLA-4, TNF, TSHR, TPO, PTPN12, MTHFR (methylenetetrahydrofolate reductase) and TSLP (thymic stromal lymphopoietin) may be related to GO, but because of factors such as gene linkage disequilibrium and defects in experimental design, these studies were not followed by prediction experiments. High TRAb titers, smoking, and the GD treatment modality correlate with GO occurrence, but they lack a clear causal association and cannot serve as predictive indicators.
A deeper understanding of the pathogenesis of GO helps to find appropriate levels and angles for prediction and intervention. GO is an autoimmune-mediated inflammatory disease in which the orbital tissues are heavily infiltrated by mononuclear cells, mainly CD4+ T cells. GWAS studies report GO susceptibility genes including HLA, CTLA-4 and IL-23R, among others, which are concentrated mainly in the pathways of T cell antigen presentation and activation. GO arises when, under the combined action of heredity and environment, the immune tolerance of T cells to self-antigens is broken, and antigen-specific T cells then cross-react with thyroid and orbital tissues, initiating the activation of orbital fibroblasts. T cell receptors (TCRs) are the molecular structures by which T cells specifically recognize and bind antigenic peptide-MHC complexes, and are important markers mediating T cell-specific immune responses. T cells undergo significant clonal expansion when stimulated by antigen, and the overall diversity and the dominant families of the TCRs become significantly biased. TCRs therefore play an important mediating role in the development and progression of GO in GD patients, and the inventors consider that TCRs characteristic of GO can help predict the development of GO in GD patients.
The TCR repertoire comprises the sum of all functionally diverse T cells of a subject and comprehensively reflects the cellular immune status. TCR data is high-dimensional: it carries information from multiple angles, such as diversity and clonal expansion, which makes it very difficult to reflect the real state of the TCR comprehensively. The TCR diversity in humans has been conservatively estimated as high as 1012-. This diversity arises from rearrangement of the V/D/J segments of TCR germline genes during T cell development and from nucleotide insertions during rearrangement. For convenience of TCR data analysis, the 80 Vα and 65 Vβ genes of the TCR are divided into 32 Vα and 24 Vβ subfamilies based on V gene homology. Thus, the frequency of each TCR VJ family can be calculated by aligning the high-throughput sequencing data. On the other hand, TCR diversity changes constantly with the external environment, and T cells undergo significant clonal proliferation when stimulated by antigens or antigenic determinants, which manifests as a TCR VJ family containing a large variety of TCR subfamilies that nevertheless have high homology and small structural differences. Therefore, the actual state of the TCR cannot be fully described by the VJ family frequency alone; factors such as the structure and homology of the VJ family should also be considered. Based on the above, the present application processes T cell data as described in the following embodiments.
Fig. 2 is a flowchart illustrating a T cell data processing method applied to the electronic device shown in fig. 1 according to an embodiment of the invention. The specific process shown in fig. 2 will be described in detail below.
Step S201, multiple sets of sample data are acquired.
In this embodiment, each set of the sample data includes T cell data sets corresponding to different features.
In this embodiment, each set of sample data may be T cell data corresponding to each feature acquired for a user or patient. Each feature may be a TCR data set.
In one application scenario, the T cell data processing method in this embodiment is used to build a data recognition network for predicting whether GD patients are likely to develop GO. In this application scenario, the sets of sample data may be T cell data sets corresponding to different features of GD patients. Each sample may carry multiple features, which may include TRBV12.5_TRBJ2.7, TRBV2_TRBJ2.3, TRBV5.1_TRBJ1.1, TRBV5.1_TRBJ1.2, etc., which are not listed one by one here.
Step S202, the T cell receptor statistics for each set of T cell datasets are calculated.
In this embodiment, the T cell receptor statistic can be calculated from the VJ family frequency and the internal homology of the T cell VJ family in the T cell data set. The T cell receptor statistic thus reflects both the frequency and the internal homology of the T cell receptors.
And step S203, verifying the importance of each group of T cell receptor statistic by a cross-validation method, and screening out a characteristic group with high importance in a specified amount of sample data.
And step S204, constructing a naive Bayes recognition network model according to the feature group with high importance.
The data of the target user can be identified through the naive Bayes identification network model.
Further, when the naive Bayes recognition network model is used to predict the condition of a target user, the data of that user corresponding to the feature group with high importance is used for the calculation and prediction.
According to the T cell data processing method, the recognition network is constructed from the data of higher importance obtained by calculating on and screening multiple groups of sample data, so that the constructed naive Bayes recognition network model can better judge the data to be judged.
In this example, as shown in fig. 3, T cell receptor statistics can be calculated by the following steps.
Step S301, VJ family frequencies of T cell data to be calculated are obtained.
Wherein V represents a V gene in the T cell, and J represents a J gene in the T cell.
Step S302, calculating the internal homology of VJ family according to the T cell data to be calculated.
And step S303, calculating T cell receptor statistics of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family.
Further, the T cell receptor statistic of the T cell data to be calculated from the VJ family frequency and the homology within the VJ family is represented by the following expression:
wherein f represents a VJ family frequency; c represents homology within the VJ family.
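Step S301 can be sketched as follows. This is a minimal illustration only: the record layout `(V gene, J gene, clone count)` and the function name `vj_family_frequencies` are assumptions made for the example, and the family label simply joins the two gene names with an underscore, as in the feature names used in this description.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (V gene, J gene, clone count).
Clonotype = Tuple[str, str, int]


def vj_family_frequencies(clonotypes: Iterable[Clonotype]) -> Dict[str, float]:
    """Return the share of all clone counts that falls into each V-J family."""
    counts: Counter = Counter()
    for v_gene, j_gene, clone_count in clonotypes:
        counts[f"{v_gene}_{j_gene}"] += clone_count
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


# Made-up counts, for illustration only.
sample = [
    ("TRBV9", "TRBJ1.1", 30),
    ("TRBV9", "TRBJ1.1", 10),
    ("TRBV2", "TRBJ2.3", 60),
]
freqs = vj_family_frequencies(sample)
print(freqs)
```

Whether frequencies are weighted by clone count, as here, or by the number of distinct clonotypes is a design choice the description leaves open.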
In this embodiment, the step of calculating the homology within the VJ family from the T cell data to be calculated includes:
obtaining the number of amino acid sequence types in the VJ family of the T cell data to be calculated;
calculating the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family of the T cell data to be calculated;
and calculating the product of the number of amino acid sequence types and the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family to obtain the homology within the VJ family.
Further, the calculation formula for homology within the VJ family can be expressed as:
c=v×e;
where v represents the number of amino acid sequence types in the VJ family and e represents the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family.
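The product c = v × e can be illustrated with a short sketch. The function name and the example sequences are hypothetical, and the entropy e is taken as a precomputed input here; v is counted as the number of distinct amino acid sequences in the family.

```python
from typing import Sequence


def vj_family_homology(amino_acid_sequences: Sequence[str], entropy: float) -> float:
    """c = v * e, where v is the number of distinct amino acid sequences."""
    v = len(set(amino_acid_sequences))
    return v * entropy


# Three reads, two distinct sequences (made-up), with an assumed entropy of 0.5.
c = vj_family_homology(["CASSLGQ", "CASSLGE", "CASSLGQ"], entropy=0.5)
print(c)  # 1.0
```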
In this embodiment, the step of calculating the information entropy of the pairwise distance matrix of the amino acid sequences within the VJ family of the T cell data to be calculated includes:
aligning all amino acid sequences within the VJ family of the T cell data to be calculated;
scoring every two amino acid sequences using a scoring matrix to obtain the distance between each pair of amino acid sequences;
assembling the distances between every two amino acid sequences into a distance matrix of the amino acid sequences within the VJ family;
calculating the information entropy of the distance matrix of the amino acid sequences within the VJ family.
Further, the calculation formula of the information entropy of the distance matrix of the amino acid sequences within the VJ family can be expressed as: e = -Σ d × log(d).
Wherein d represents the distance between amino acid sequences within the VJ family. In one embodiment, the distance between aligned amino acid sequences can be computed as a weighted Hamming distance. Further, gap penalties can be introduced to compensate for the differing lengths of the amino acid sequences.
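The entropy calculation can be sketched as below, reading the expression as e = -Σ d × log(d) summed over the pairwise distances. Several points are assumptions of this sketch rather than statements of the description: the sum runs over the upper triangle of the symmetric matrix, zero distances contribute nothing, the natural logarithm is used, and the optional `normalize` flag (which first rescales the distances to sum to 1) is an illustrative addition.

```python
import math
from typing import List, Sequence


def pairwise_distances(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Collect the upper triangle (i < j) of a symmetric distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def distance_entropy(matrix: Sequence[Sequence[float]], normalize: bool = False) -> float:
    """Evaluate e = -sum(d * log(d)) over the pairwise distances.

    With normalize=True the distances are first divided by their sum,
    so that they behave like a probability distribution (an assumption).
    """
    d = pairwise_distances(matrix)
    if normalize:
        total = sum(d)
        d = [x / total for x in d] if total > 0 else []
    # Zero distances are skipped, following the convention 0 * log(0) = 0.
    return -sum((x * math.log(x) for x in d if x > 0), 0.0)


matrix = [
    [0, 1, 1],
    [1, 0, 2],
    [1, 2, 0],
]
e_literal = distance_entropy(matrix)
e_normalized = distance_entropy(matrix, normalize=True)
print(e_literal, e_normalized)
```

With raw distances greater than 1 the literal expression is negative, which is one reason a normalized variant is offered alongside it.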
The distance between the amino acid sequences within the VJ family can be calculated as follows:
first, all amino acid sequences are aligned, for example using the ClustalW software;
then, the distance between each pair of amino acid sequences is scored using a scoring matrix; for example, each pair of aligned amino acids is scored using the BLOSUM62 substitution matrix, as follows:
distance(a, a) = 0;
distance(a, b) = min(4, 4 - BLOSUM62(a, b));
wherein a and b represent different amino acids. In one example, the penalties for gap opening and gap extension are both set to 8.
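The scoring rule and the weighted Hamming distance can be sketched as follows for two sequences that have already been aligned to equal length. Only a handful of BLOSUM62 entries are embedded to keep the example self-contained; a real run would load the full matrix. Charging a flat cost of 8 for every column where exactly one sequence has a gap is an assumption of this sketch, as are all function names.

```python
from typing import Dict, Tuple

# A few BLOSUM62 substitution scores, enough for the demo below.
BLOSUM62_SUBSET: Dict[Tuple[str, str], int] = {
    ("A", "S"): 1, ("I", "V"): 3, ("D", "E"): 2, ("K", "R"): 2,
    ("F", "Y"): 3, ("A", "W"): -3, ("G", "I"): -4,
}
GAP = "-"
GAP_PENALTY = 8  # assumption: each gapped column costs 8


def blosum62(a: str, b: str) -> int:
    """Look up a score in either orientation; raises KeyError if not embedded."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def residue_distance(a: str, b: str) -> int:
    """distance(a, a) = 0; distance(a, b) = min(4, 4 - BLOSUM62(a, b))."""
    if a == b:
        return 0
    return min(4, 4 - blosum62(a, b))


def sequence_distance(seq_a: str, seq_b: str) -> int:
    """Weighted Hamming distance between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue  # identical residues (or two gaps) add nothing
        if a == GAP or b == GAP:
            total += GAP_PENALTY
        else:
            total += residue_distance(a, b)
    return total


print(residue_distance("I", "V"))         # 1
print(residue_distance("A", "W"))         # 4 (capped)
print(sequence_distance("AIK-", "SVRD"))  # 3 + 1 + 2 + 8 = 14
```

Because of the cap, any pair with a negative BLOSUM62 score is treated as maximally distant (4), while conservative substitutions such as I/V stay close.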
In this embodiment, step S203 includes: verifying the importance of each group of T cell receptor statistics by a random forest combined with cross-validation, and screening out a specified number of feature groups with high importance in the sample data.
In this embodiment, the step S203 includes:
a. calculating the importance of each group of T cell receptor statistics by the cross-validation method, and ranking all features in the sample data by importance;
b. selecting the group of features ranked at the top of the ranking;
c. repeating steps a and b a set number of times, and selecting, from the top-ranked features screened out in those repetitions, a specified number of features as the feature group with high importance in the sample data.
In this embodiment, the top fifty, sixty, seventy, etc. features in the ranking may be selected each time. It will be appreciated that the number selected can be set by one skilled in the art as desired. Each feature is a TCR.
In one example, each set of sample data includes T cell data sets corresponding to a plurality of features. However, not every feature's T cell data set plays a leading role in identifying the target disease, so the features with higher importance are screened out and used to construct the naive Bayes recognition network model, which can improve recognition efficiency while reducing the amount of calculation.
Further, after the calculation has been repeated a plurality of times, a specified number of features can be selected as the feature group with high importance according to how often each feature appears in the top fifty, sixty, seventy, etc. For example, if feature A ranks in the top fifty in every ranking, feature A may be selected as a feature in the feature group with high importance.
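Steps a to c can be sketched as a simple stability selection. The importance scores themselves would come from the random forest and cross-validation runs; here they are supplied by a hypothetical callback `importance_for_run`, and the numbers in the demo are made up purely to exercise the logic.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical callback: run index -> {feature name: importance score}.
ImportanceFn = Callable[[int], Dict[str, float]]


def select_stable_features(
    importance_for_run: ImportanceFn, n_runs: int, top_k: int, n_select: int
) -> List[str]:
    """Keep the features that reach the top_k most often across repeated runs."""
    hits: Counter = Counter()
    for run in range(n_runs):
        importance = importance_for_run(run)
        # Step a: rank all features by importance, highest first.
        ranked = sorted(importance, key=lambda f: importance[f], reverse=True)
        # Step b: keep the top-ranked part of this run.
        hits.update(ranked[:top_k])
    # Step c: order by hit count (ties broken by name) and keep n_select.
    ordered = sorted(hits, key=lambda f: (-hits[f], f))
    return ordered[:n_select]


# Made-up importance scores for three runs.
runs = [
    {"TRBV9_TRBJ1.1": 0.9, "TRBV2_TRBJ2.3": 0.5, "TRBV5.1_TRBJ1.1": 0.1},
    {"TRBV9_TRBJ1.1": 0.8, "TRBV2_TRBJ2.3": 0.2, "TRBV5.1_TRBJ1.1": 0.6},
    {"TRBV9_TRBJ1.1": 0.7, "TRBV2_TRBJ2.3": 0.6, "TRBV5.1_TRBJ1.1": 0.3},
]
selected = select_stable_features(lambda i: runs[i], n_runs=3, top_k=2, n_select=2)
print(selected)  # ['TRBV9_TRBJ1.1', 'TRBV2_TRBJ2.3']
```

Counting appearances across runs, rather than trusting a single ranking, makes the selection less sensitive to the randomness of any one cross-validation split.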
In one embodiment, the feature group with high importance consists of the TCRs that most strongly affect GO (Graves' ophthalmopathy) and GH (Graves' hyperthyroidism). GO and GH are both conditions associated with Graves' disease (GD), an organ-specific autoimmune disease.
In this embodiment, 24 feature groups can be selected through the above screening, namely: TRBV12.5_TRBJ2.7, TRBV2_TRBJ2.3, TRBV5.1_TRBJ1.1, TRBV5.1_TRBJ1.2, TRBV6.5_TRBJ1.5, TRBV7.8_TRBJ2.7, TRBV7.9_TRBJ2.2, TRBV9_TRBJ1.1, TRBV9_TRBJ2.2, TRBV9_TRBJ2.3, TRBV11.2_TRBJ2.7, TRBV19_TRBJ1.5, TRBV19_ bj1.1, TRBV20.1_TRBJ1.3, TRBV6.6_ bj1.1, TRBV7.9_TRBJ2.7, TRBV 24.1.6 _ TRBV 1.6, TRBV 6.1.5 _ TRBJ1.3, TRBV 3.59bv 3.1.5 _ TRBV 1.1.7, TRBV 3.5 _ TRBV 1.592.1.7, TRBV 3.592 _ TRBV 1.1.7.
Of course, more or fewer feature sets may be screened out.
In this embodiment, the feature group with relatively high importance is screened out and used to construct the naive Bayes recognition network model, so that GO or GH can be predicted more accurately.
In this embodiment, step S203 includes: calculating the importance of each group of T cell receptor statistics by the cross-validation method, and ranking all features in the sample data by importance;
selecting a specified number of top-ranked features as the feature group with high importance.
In this embodiment, the potential target disease of a user may be estimated using the naive Bayes recognition network model constructed above. As shown in Fig. 4, the method further includes:
step S401, calculating T cell receptor statistic of data to be judged, wherein the data to be judged comprises T cell data of a target object.
And S402, inputting the T cell receptor statistic of the data to be judged into the naive Bayes recognition network model to recognize the target disease.
In this embodiment, the naive Bayes recognition network model can be used to perform classification prediction on the data to be judged. The classification prediction indicates whether the data to be judged corresponds to GH or GO.
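Steps S401 and S402 can be illustrated with a small naive Bayes classifier that returns a probability for each class. The description does not state which likelihood model is used, so the Gaussian per-feature likelihood below is an assumption, as are the class name, the variance floor, and the toy training values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian likelihood per feature (assumed)."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[str]) -> "GaussianNaiveBayes":
        by_class: Dict[str, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors: Dict[str, float] = {}
        self.means: Dict[str, List[float]] = {}
        self.variances: Dict[str, List[float]] = {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            self.means[label] = means
            # A small floor keeps the variance strictly positive.
            self.variances[label] = [
                sum((x - m) ** 2 for x in col) / len(col) + 1e-9
                for col, m in zip(columns, means)
            ]
        return self

    def predict_proba(self, row: Sequence[float]) -> Dict[str, float]:
        log_post: Dict[str, float] = {}
        for label, prior in self.priors.items():
            lp = math.log(prior)
            for x, m, var in zip(row, self.means[label], self.variances[label]):
                lp += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            log_post[label] = lp
        # Normalize in log space to avoid underflow.
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}

    def predict(self, row: Sequence[float]) -> str:
        proba = self.predict_proba(row)
        return max(proba, key=lambda k: proba[k])


# Toy T cell receptor statistics for two selected features (made-up values).
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["GH", "GH", "GO", "GO"]
model = GaussianNaiveBayes().fit(X, y)
proba = model.predict_proba([0.85, 0.85])
print(model.predict([0.85, 0.85]), proba)
```

The per-class probabilities play the same role as the GH and GO probabilities reported for each sample in the test results.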
The following are the results of tests performed on a number of cases using the T cell data processing method of this embodiment:
1) Diagnostic results for 17 GH and GO patients
The accuracy for GO is 70% and the accuracy for GH is 85.7%.
2) Prediction results for 7 patients whose GH progressed to GO
The prediction accuracy in the following table is 71.4% (5 of 7 cases).
Sample ID | GH probability | GO probability | Predicted result | GO actually developed |
WL0581 | 1.46E-02 | 9.85E-01 | GO | Yes |
WL0682 | 5.68E-01 | 4.32E-01 | GH | Yes |
WL0594 | 7.14E-03 | 9.93E-01 | GO | Yes |
WL0539 | 9.98E-01 | 2.23E-03 | GH | Yes |
WL0551 | 6.02E-06 | 1.00E+00 | GO | Yes |
WL0613 | 2.42E-01 | 7.58E-01 | GO | Yes |
WL0648 | 3.88E-08 | 1.00E+00 | GO | Yes |
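The accuracy for this table can be checked with a few lines. All seven patients went on to develop GO, so a prediction counts as correct when it is GO; the predicted labels are copied from the table, and the variable names are illustrative.

```python
# Predicted labels copied from the table; every patient actually developed GO.
predicted = {
    "WL0581": "GO", "WL0682": "GH", "WL0594": "GO", "WL0539": "GH",
    "WL0551": "GO", "WL0613": "GO", "WL0648": "GO",
}
correct = sum(1 for label in predicted.values() if label == "GO")
accuracy = correct / len(predicted)
print(f"{correct}/{len(predicted)} correct = {accuracy:.1%}")  # 5/7 correct = 71.4%
```

This agrees with the reported accuracy up to rounding.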
The predicted result in this embodiment is the result obtained by prediction with the naive Bayes recognition network model, and the true result is obtained by subsequently following up the actual condition of the corresponding patient.
The two tables show the results of prediction on a subset of cases; in practical application, the prediction accuracy may be higher than in these examples.
Please refer to Fig. 5, which is a functional block diagram of a T cell data processing apparatus, applied to the electronic device shown in Fig. 1, according to an embodiment of the present invention. The T cell data processing apparatus in this embodiment is used to perform each step of the above method embodiments. The T cell data processing apparatus includes:
an obtaining module 501, configured to obtain multiple sets of sample data, where each set of sample data includes T cell datasets corresponding to different features;
a first calculation module 502 for calculating T cell receptor statistics for each set of T cell data sets;
the screening module 503 is configured to verify the importance of each group of T cell receptor statistics by a cross-validation method, and screen out a feature group with high importance in a specified amount of sample data;
a constructing module 504, configured to construct a naive Bayes recognition network model according to the feature group with high importance.
For other details of the present embodiment, further reference may be made to the description of the above method embodiments, which are not repeated herein.
With the T cell data processing device of this embodiment, multiple groups of detection data are calculated and screened to obtain the data of higher importance, and the recognition network is constructed from that data. Because of this screening, the constructed naive Bayes recognition network model can better judge the data to be judged.
In this embodiment, the T cell data processing apparatus further includes:
the second calculation module is used for calculating the T cell receptor statistic of the data to be judged, wherein the data to be judged comprises the T cell data of the target object;
and the recognition module is used for inputting the T cell receptor statistic of the data to be judged into the naive Bayes recognition network model to recognize the target disease.
In this embodiment, the first calculation module 502 or the second calculation module is further configured for:
obtaining the VJ family frequencies of the T cell data to be calculated, wherein V represents a V gene in the T cell and J represents a J gene in the T cell;
calculating the internal homology of the VJ family from the T cell data to be calculated;
and calculating the T cell receptor statistic of the T cell data to be calculated from the VJ family frequency and the internal homology of the VJ family.
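The first of these steps can be sketched as follows. The sketch assumes that a VJ family is the set of clonotypes sharing the same V gene and J gene, and that its frequency is that family's share of the total read count; neither definition is spelled out in the text. The records and the function name `vj_family_frequencies` are hypothetical.

```python
from collections import Counter


def vj_family_frequencies(clonotypes):
    """Map each (V gene, J gene) pair to its share of the total read count."""
    counts = Counter()
    for v_gene, j_gene, _cdr3, count in clonotypes:
        counts[(v_gene, j_gene)] += count
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


# Hypothetical records: (V gene, J gene, amino acid sequence, read count).
clonotypes = [
    ("TRBV9", "TRBJ2-1", "CASSVGT", 30),
    ("TRBV9", "TRBJ2-1", "CASSAGT", 10),
    ("TRBV5-1", "TRBJ1-2", "CASSLGG", 60),
]

freqs = vj_family_frequencies(clonotypes)
for family, f in sorted(freqs.items()):
    print(family, round(f, 3))
```

How the frequency f is combined with the homology c into the final statistic depends on the expression given in the text, so that step is left out of the sketch.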
In this embodiment, the T cell receptor statistic calculated from the VJ family frequency and the internal homology of the VJ family is given by the following expression:
wherein f represents a VJ family frequency; c represents homology within the VJ family.
The first calculation module 502 or the second calculation module is further configured for:
obtaining the number of distinct amino acid sequences in the VJ family of the T cell data to be calculated;
calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family of the T cell data to be calculated;
and calculating the product of the number of distinct amino acid sequences and the information entropy of the distance matrix between the amino acid sequences within the VJ family to obtain the homology within the VJ family.
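A minimal sketch of this product is given below. The text does not define how the information entropy of a distance matrix is computed; the sketch assumes it is the Shannon entropy (in bits) of the distribution of off-diagonal distance values. The sequences, the matrix values, and the function names are all hypothetical.

```python
import math
from collections import Counter


def distance_entropy(matrix):
    """Shannon entropy (bits) of the off-diagonal distance values."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    if not values:
        return 0.0
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def vj_homology(sequences, matrix):
    """Number of distinct sequences times the distance-matrix entropy."""
    return len(set(sequences)) * distance_entropy(matrix)


# Hypothetical VJ family with three distinct sequences and their distances.
sequences = ["CASSVGT", "CASSAGT", "CASSLGT"]
matrix = [
    [0, 4, 3],
    [4, 0, 4],
    [3, 4, 0],
]

homology = vj_homology(sequences, matrix)
print(round(homology, 4))
```

Here the pairwise distances are 4, 3 and 4, so the entropy is that of a two-valued distribution with probabilities 2/3 and 1/3, and the homology is three times that value.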
The first calculation module 502 or the second calculation module is further configured for:
aligning all amino acid sequences within the VJ family of the T cell data to be calculated pairwise;
using a scoring matrix to score the alignment of every two amino acid sequences to obtain the distance between each pair of amino acid sequences;
assembling the distances between every two amino acid sequences into a distance matrix between the amino acid sequences within the VJ family;
calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family.
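The construction of the distance matrix can be sketched as below. The sketch assumes the sequences are already aligned to equal length and that the distance between two sequences is the sum of per-position residue distances; the text does not state how residue scores are combined. The `mismatch` scorer is a placeholder standing in for the scoring matrix, and all names are hypothetical.

```python
def sequence_distance(seq_a, seq_b, residue_distance):
    """Sum per-position residue distances over two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(residue_distance(a, b) for a, b in zip(seq_a, seq_b))


def distance_matrix(sequences, residue_distance):
    """Build the symmetric pairwise distance matrix for a VJ family."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sequence_distance(sequences[i], sequences[j], residue_distance)
            matrix[i][j] = matrix[j][i] = d
    return matrix


def mismatch(a, b):
    """Placeholder scorer: 0 for identical residues, 1 otherwise."""
    return 0 if a == b else 1


# Hypothetical aligned amino acid sequences from one VJ family.
aligned = ["CASSVGT", "CASSAGT", "CASSLGG"]
matrix = distance_matrix(aligned, mismatch)
for row in matrix:
    print(row)
```

Passing the residue scorer as a parameter keeps the matrix construction independent of the particular scoring rule, so a BLOSUM62-based scorer can be substituted for the placeholder.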
The first calculation module 502 is further configured to score each pair of amino acids according to the following rule:
distance(a, a) = 0;
distance(a, b) = min(4, 4 - BLOSUM62(a, b));
wherein a and b represent different amino acids.
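The scoring rule can be written directly as a small function. The sketch below hard-codes only a handful of BLOSUM62 substitution scores for illustration; a real implementation would load the full matrix. The names `BLOSUM62_SUBSET` and `residue_distance` are hypothetical.

```python
# A few BLOSUM62 substitution scores, keyed by unordered residue pair.
# Only a subset is listed; the full 20x20 matrix is needed in practice.
BLOSUM62_SUBSET = {
    frozenset("AS"): 1,
    frozenset("AT"): 0,
    frozenset("ST"): 1,
    frozenset("AG"): 0,
    frozenset("GT"): -2,
    frozenset("AW"): -3,
    frozenset("IL"): 2,
    frozenset("IV"): 3,
}


def residue_distance(a, b, scores=BLOSUM62_SUBSET):
    """distance(a, a) = 0; distance(a, b) = min(4, 4 - BLOSUM62(a, b))."""
    if a == b:
        return 0
    return min(4, 4 - scores[frozenset((a, b))])


for pair in ["AA", "AS", "AT", "GT", "IV"]:
    print(pair, residue_distance(pair[0], pair[1]))
```

Under this rule a substitution with a positive BLOSUM62 score yields a distance below 4, while any substitution scoring zero or less is capped at the maximum distance of 4.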
In this embodiment, the screening module 503 is further configured to:
a. after the T cell receptor statistics of each group are evaluated by the cross-validation method, obtaining the importance from the evaluation result, and ranking all the features in the sample data by importance;
b. selecting the subset of features ranked at the top of the ranking;
c. repeating steps a and b a set number of times, and selecting, from the top-ranked feature subsets screened out over those repetitions, the feature group with high importance in a specified number of sample data.
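The repeated screening can be sketched as a simple aggregation over rankings. The sketch assumes each repetition has already produced an importance ranking of the features, and that the final group consists of the features appearing most often among the top-ranked subsets; the text does not fix the aggregation rule. The feature names, rankings, and `select_stable_features` are hypothetical.

```python
from collections import Counter


def select_stable_features(rankings, top_k, n_final):
    """Count top-k appearances across repetitions and keep the most frequent."""
    hits = Counter()
    for ranking in rankings:
        hits.update(ranking[:top_k])
    # Sort by hit count (descending), then by name for a deterministic order.
    ordered = sorted(hits, key=lambda name: (-hits[name], name))
    return ordered[:n_final]


# Hypothetical importance rankings from three cross-validation repetitions,
# each listing VJ family features from most to least important.
rankings = [
    ["TRBV9_TRBJ2-1", "TRBV5-1_TRBJ1-2", "TRBV7-2_TRBJ2-7", "TRBV20-1_TRBJ1-1"],
    ["TRBV5-1_TRBJ1-2", "TRBV9_TRBJ2-1", "TRBV20-1_TRBJ1-1", "TRBV7-2_TRBJ2-7"],
    ["TRBV9_TRBJ2-1", "TRBV7-2_TRBJ2-7", "TRBV5-1_TRBJ1-2", "TRBV20-1_TRBJ1-1"],
]

selected = select_stable_features(rankings, top_k=2, n_final=2)
print(selected)
```

Repeating the ranking and keeping only features that recur near the top reduces the influence of any single train/validation split on the selected feature group.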
In this embodiment, the screening module 503 is further configured to:
verifying the importance of each group of T cell receptor statistics by a random forest combined with cross-validation, and screening out a feature group with high importance in a specified number of sample data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (7)
1. A method of T-cell data processing, comprising:
acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;
calculating T cell receptor statistics for each set of T cell datasets: obtaining VJ family frequency of T cell data to be calculated, calculating to obtain internal homology of a VJ family according to the T cell data to be calculated, and calculating to obtain T cell receptor statistic of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family; wherein V represents a V gene in the T cell, J represents a J gene in the T cell, and T cell receptor statistics of the T cell data to be calculated based on the VJ family frequency and homology within the VJ family are represented by the following expression:
wherein f represents a VJ family frequency; c represents homology within the VJ family;
verifying the importance of each group of T cell receptor statistic by a cross-validation method, and screening out a characteristic group with high importance in a specified amount of sample data;
and constructing a naive Bayes recognition network model according to the feature group with high importance.
2. The T-cell data processing method of claim 1, wherein the step of calculating the homology within the VJ family from the T-cell data to be calculated comprises:
obtaining the number of distinct amino acid sequences in the VJ family of the T cell data to be calculated;
calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family of the T cell data to be calculated;
and calculating the product of the number of distinct amino acid sequences and the information entropy of the distance matrix between the amino acid sequences within the VJ family to obtain the homology within the VJ family.
3. The T-cell data processing method according to claim 2, wherein the step of calculating the entropy of the distance matrix between the amino acid sequence and the amino acid sequence within the VJ family of the T-cell data to be calculated comprises:
pairwise alignment of all amino acid sequences within the VJ family of the T cell data to be calculated;
using a scoring matrix to score the comparison result of every two amino acid sequences to obtain the distance between each pair of amino acid sequences;
calculating the distance between every two amino acid sequences to obtain a distance matrix between the amino acid sequences in the VJ family;
calculating the entropy of information of the distance matrix between the amino acid sequences and the amino acid sequences in the VJ family.
4. The method of T-cell data processing according to claim 3, wherein two amino acid sequences are scored using a scoring matrix, and the scoring rule for the distance between each pair of amino acid sequences within the VJ family is:
distance(a, a) = 0;
distance(a, b) = min(4, 4 - BLOSUM62(a, b));
wherein a and b represent different amino acids.
5. The T cell data processing method according to claim 1, wherein the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method and screening out a feature group with high importance in a specified number of sample data comprises:
a. after the T cell receptor statistics of each group are evaluated by the cross-validation method, obtaining the importance from the evaluation result, and ranking all the features in the sample data by importance;
b. selecting the subset of features ranked at the top of the ranking;
c. repeating steps a and b a set number of times, and selecting, from the top-ranked feature subsets screened out over those repetitions, the feature group with high importance in a specified number of sample data.
6. The T cell data processing method according to claim 1, wherein the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method and screening out a feature group with high importance in a specified number of sample data comprises:
verifying the importance of each group of T cell receptor statistics by a random forest combined with cross-validation, and screening out a feature group with high importance in a specified number of sample data.
7. A T-cell data processing apparatus, comprising:
the acquisition module is used for acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;
a first calculation module for calculating T cell receptor statistics for each set of T cell datasets: obtaining VJ family frequency of T cell data to be calculated, calculating to obtain internal homology of a VJ family according to the T cell data to be calculated, and calculating to obtain T cell receptor statistic of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family; wherein V represents a V gene in the T cell, J represents a J gene in the T cell, and T cell receptor statistics of the T cell data to be calculated based on the VJ family frequency and homology within the VJ family are represented by the following expression:
wherein f represents a VJ family frequency; c represents homology within the VJ family;
the screening module is used for verifying the importance of each group of T cell receptor statistic by a cross-validation method and screening out a characteristic group with high importance in a specified amount of sample data;
and the construction module is used for constructing a naive Bayes recognition network model according to the feature group with high importance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810813090.4A CN109002689B (en) | 2018-07-23 | 2018-07-23 | T cell data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109002689A (en) | 2018-12-14 |
CN109002689B (en) | 2020-10-09 |
Family
ID=64596836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810813090.4A Active CN109002689B (en) | 2018-07-23 | 2018-07-23 | T cell data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002689B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||