CN109002689B

CN109002689B - T cell data processing method and device

Info

Publication number: CN109002689B
Application number: CN201810813090.4A
Authority: CN
Inventors: 施秉银; 王悦; 刘宇峰; 叶凯; 杨晓飞; 蔺佳栋
Original assignee: First Affiliated Hospital of Xian Jiaotong University
Current assignee: First Affiliated Hospital of Xian Jiaotong University
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2020-10-09
Anticipated expiration: 2038-07-23
Also published as: CN109002689A

Abstract

Embodiments of the present invention provide a T cell data processing method and device. The method includes: acquiring multiple sets of sample data, each set including T cell data sets corresponding to different features; calculating the T cell receptor statistic of each T cell data set; verifying the importance of each T cell receptor statistic by cross-validation and screening out a specified number of high-importance feature groups from the sample data; and building a naive Bayesian recognition network model from the high-importance feature groups.

Description

T cell data processing method and device

Technical Field

The present invention relates to the field of data processing, and in particular to a T cell data processing method and device.

Background

With the development of computer technology, computers are used in more and more fields and can improve working efficiency in each of them. The medical field likewise needs more technology to improve the efficiency of medical procedures.

Summary of the Invention

In view of this, the purpose of the embodiments of the present invention is to provide a T cell data processing method and device.

A T cell data processing method provided by an embodiment of the present invention includes:

acquiring multiple sets of sample data, each set of sample data including T cell data sets corresponding to different features;

calculating the T cell receptor statistic of each T cell data set;

verifying the importance of each T cell receptor statistic by cross-validation, and screening out a specified number of high-importance feature groups from the sample data; and

constructing a naive Bayesian recognition network model from the high-importance feature groups.

Optionally, after the step of constructing the naive Bayesian recognition network model from the high-importance feature groups, the method further includes:

calculating the T cell receptor statistics of data to be judged, the data to be judged including T cell data of a target subject; and

inputting the T cell receptor statistics of the data to be judged into the naive Bayesian recognition network model to identify the target disease.

Optionally, the T cell receptor statistic is calculated as follows:

obtaining the VJ family frequency of the T cell data to be calculated, where V denotes the V gene in the T cell and J denotes the J gene in the T cell;

calculating the homology within the VJ family from the T cell data to be calculated; and

calculating the T cell receptor statistic of the T cell data to be calculated from the VJ family frequency and the homology within the VJ family.

Optionally, the T cell receptor statistic calculated from the VJ family frequency and the homology within the VJ family is given by the following expression:

[Expression not reproduced in the source text.]

where f denotes the VJ family frequency and c denotes the homology within the VJ family.

Optionally, the step of calculating the homology within the VJ family from the T cell data to be calculated includes:

obtaining the number of distinct amino acid sequences in the VJ family of the T cell data to be calculated;

calculating the information entropy of the distance matrix between amino acid sequences within the VJ family of the T cell data to be calculated; and

calculating the product of the number of distinct amino acid sequences and the information entropy of the distance matrix between amino acid sequences within the VJ family, to obtain the homology within the VJ family.

Optionally, the step of calculating the information entropy of the distance matrix between amino acid sequences within the VJ family of the T cell data to be calculated includes:

aligning all amino acid sequences within the VJ family of the T cell data to be calculated pairwise;

scoring each pairwise alignment with a scoring matrix to obtain the distance between each pair of amino acid sequences;

calculating the distances for all pairs of amino acid sequences to obtain the distance matrix between amino acid sequences within the VJ family; and

calculating the information entropy of that distance matrix.

Optionally, when the pairwise alignments are scored with a scoring matrix to obtain the distance between each pair of amino acid sequences, the scoring rule is:

distance(a,a) = 0;

distance(a,b) = min(4, 4 - BLOSUM62(a,b));

where a and b denote different amino acids.

Optionally, the step of verifying the importance of each T cell receptor statistic by cross-validation and screening out a specified number of high-importance feature groups from the sample data includes:

a. evaluating each T cell receptor statistic by cross-validation, deriving an importance from the result, and ranking the features in the sample data by importance;

b. selecting the features of the top-ranked groups in the ranking; and

repeating steps a and b a set number of times, and selecting the specified number of high-importance feature groups from the top-ranked features obtained across those repetitions.

Evaluating each T cell receptor statistic by cross-validation, deriving an importance from the result, and ranking the features in the sample data by importance;

selecting the specified number of top-ranked, high-importance T cell receptor groups from the ranking.

Verifying the importance of each T cell receptor statistic by random forest and cross-validation, and screening out a specified number of high-importance feature groups from the sample data.

An embodiment of the present invention further provides a T cell data processing device, including:

an acquisition module, configured to acquire multiple sets of sample data, each set including T cell data sets corresponding to different features;

a first calculation module, configured to calculate the T cell receptor statistic of each T cell data set;

a screening module, configured to verify the importance of each T cell receptor statistic by cross-validation and screen out a specified number of high-importance feature groups from the sample data; and

a construction module, configured to construct a naive Bayesian recognition network model from the high-importance feature groups.

Compared with the prior art, the T cell data processing method and device of the embodiments of the present invention build the recognition network from data of relatively high importance, obtained by computing on and screening multiple sets of detection data. This screening allows the resulting naive Bayesian recognition network model to better judge the data that needs to be judged.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from them without creative effort.

FIG. 1 is a schematic block diagram of an electronic device provided by an embodiment of the present invention.

FIG. 2 is a flowchart of a T cell data processing method provided by an embodiment of the present invention.

FIG. 3 is a flowchart of calculating the T cell receptor statistic in the T cell data processing method provided by an embodiment of the present invention.

FIG. 4 is a partial flowchart of the T cell data processing method provided by an embodiment of the present invention.

FIG. 5 is a schematic diagram of the functional modules of a T cell data processing device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of configurations. Accordingly, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Note that like reference numerals and letters denote like items in the following drawings, so once an item is defined in one drawing it need not be defined or explained again in subsequent drawings. In the description of the present invention, the terms "first", "second", and so on are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance.

FIG. 1 is a schematic block diagram of an electronic device 100. The electronic device 100 includes a memory 111, a memory controller 112, a processor 113, a peripheral interface 114, an input/output unit 115, and a display unit 116. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the electronic device 100. For example, the electronic device 100 may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1. The electronic device 100 of this embodiment may be a computing device with image processing capability, such as a personal computer, an image processing server, a vehicle-mounted device, or a mobile electronic device.

The memory 111, memory controller 112, processor 113, peripheral interface 114, input/output unit 115, and display unit 116 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected through one or more communication buses or signal lines. The memory 111 stores at least one software function module in the form of software or firmware, or a software function module is built into the operating system (OS) of the electronic device 100. The processor 113 executes the executable modules stored in the memory.

The memory 111 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory 111 stores a program, and the processor 113 executes the program after receiving an execution instruction. The method performed by the electronic device 100, as defined by the process disclosed in any embodiment of the present invention, may be applied to, or implemented by, the processor 113.

The processor 113 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor.

The peripheral interface 114 couples various input/output devices to the processor 113 and the memory 111. In some embodiments, the peripheral interface 114, the processor 113, and the memory controller 112 may be implemented on a single chip. In other examples, they may each be implemented on a separate chip.

The input/output unit 115 allows the user to input data. It may be, but is not limited to, a mouse, a keyboard, and the like.

The display unit 116 provides an interactive interface (for example, a user operation interface) between the electronic device 100 and the user, or displays image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. A touch display may be a capacitive or resistive touch screen that supports single-point and multi-point touch operations, meaning that the touch display can sense touch operations occurring simultaneously at one or more positions on it and pass the sensed touch operations to the processor for calculation and processing.

Graves' disease (GD) is an organ-specific autoimmune disease with a population prevalence of 0.5%-2%. Graves' ophthalmopathy (GO) is the most common complication of GD: 25%-50% of GD patients develop GO during the course of the disease, presenting with proptosis, pain, visual impairment, and other symptoms. Existing GO treatment mainly targets patients with moderate-to-severe active disease, who are given pulse hormone therapy to suppress the immune and inflammatory responses. However, this treatment works poorly for some patients and carries adverse effects such as hypertension, infection, and osteoporosis. GO therefore seriously affects patients' physical and mental health, through both the disease itself and its treatment, and optimizing existing diagnosis and treatment has long been a research focus in the field.

Orbital fibroblasts are the target cells of GO. Under the traditional diagnostic system, by the time GO is diagnosed the patient's orbital fibroblasts have long since been activated and proliferated, and an irreversible tissue remodeling process has begun. At that point pulse hormone therapy can only relieve the orbital inflammatory response; it cannot stop or reverse fibrotic tissue remodeling. Therefore, if the onset of GO in GD patients could be predicted early, preventive treatment could block disease progression before the orbital fibroblasts are activated and proliferate, fundamentally delaying or even preventing GO.

Many studies of GO susceptibility have been reported in recent years, but none has successfully predicted the onset of GO. Polymorphisms in genes such as HLA, CTLA-4, TNF, TSHR, TPO, PTPN12, MTHFR (methylenetetrahydrofolate reductase), and TSLP (thymic stromal lymphopoietin) may be associated with GO, but because of factors such as linkage disequilibrium and flaws in experimental design, these studies were not followed by prediction-related trials. High TRAb titers, smoking, and the mode of GD treatment correlate with the onset of GO, but lack a clear causal link and cannot serve as predictive indicators.

A deeper understanding of GO pathogenesis helps identify suitable levels and angles for prediction and intervention. GO is an autoimmune-mediated inflammatory disease in which orbital tissue is heavily infiltrated by mononuclear cells, chiefly CD4+ T cells. GWAS studies report GO susceptibility genes including HLA, CTLA-4, and IL-23R, which are concentrated in the pathways of T cell antigen presentation and activation. GO arises when, under the combined influence of genetics and environment, the body's T cell immune tolerance to self-antigens breaks down; antigen-specific T cells then cross-react specifically in thyroid and orbital tissue and trigger the activation of orbital fibroblasts. The T cell receptor (TCR) is the molecular structure through which a T cell specifically recognizes and binds antigen peptide-MHC, and is an important marker of T cell-specific immune responses. After antigen stimulation, T cells undergo marked clonal expansion, and the overall diversity and dominant families of the TCR repertoire shift noticeably. TCR therefore plays an important mediating role in the onset and progression of GO in GD patients, and the inventors' research suggests that GO-characteristic TCRs can help predict the onset of GO in GD patients.

The TCR repertoire is the sum of all functionally diverse T cells in an individual and comprehensively reflects the state of cellular immunity. TCR data are high-dimensional, carrying information from multiple angles such as diversity and clonal expansion, which makes it very difficult to capture the repertoire's true state fully. By conservative estimate, TCR diversity in the human body is as high as 10^12 to 10^15. This diversity results from the rearrangement of the TCR germline V/D/J gene segments during T cell development and from nucleotide insertions and deletions during rearrangement. To simplify TCR data analysis, the 80 Vα and 65 Vβ genes of TCR are grouped by V gene homology into 32 Vα and 24 Vβ subfamilies. The frequency of each TCR VJ family can thus be calculated by alignment against high-throughput sequencing data. On the other hand, TCR diversity changes continually with the external environment: after stimulation by an antigen or antigenic determinant, T cells undergo marked clonal proliferation, so that each TCR subfamily within a TCR VJ family contains many variants that nevertheless share high homology and differ little in structure. VJ family frequency alone therefore cannot fully describe the true state of the TCR repertoire; factors such as VJ family structure and homology should also be considered. Based on the above research, the present application processes and studies T cell data through the following embodiments.

Please refer to FIG. 2, which is a flowchart of a T cell data processing method, provided by an embodiment of the present invention, applied to the electronic device shown in FIG. 1. The specific flow shown in FIG. 2 is described in detail below.

Step S201: acquire multiple sets of sample data.

In this embodiment, each set of sample data includes T cell data sets corresponding to different features.

In this embodiment, each set of sample data may be the T cell data corresponding to each feature collected for one user or patient. Each feature may be one TCR data group.

In one application scenario, the T cell data processing method of this embodiment is used to construct a data recognition network for predicting which GD patients are likely to develop GO. In this scenario, the multiple sets of sample data may be T cell data sets corresponding to different features of GD patients. Each sample may carry multiple groups of features, for example TRBV12.5_TRBJ2.7, TRBV2_TRBJ2.3, TRBV5.1_TRBJ1.1, and TRBV5.1_TRBJ1.2; they are not enumerated here.

Step S202: calculate the T cell receptor statistic of each T cell data set.

In this embodiment, the T cell receptor statistic can be calculated from the homology within the T cell VJ family and the VJ family frequency in the T cell data set. The statistic reflects both the frequency of the T cell receptors and their internal homology.

Step S203: verify the importance of each T cell receptor statistic by cross-validation, and screen out a specified number of high-importance feature groups from the sample data.

Step S204: construct a naive Bayesian recognition network model from the high-importance feature groups.

The naive Bayesian recognition network model can be used to identify data of a target user.

Further, when the naive Bayesian recognition network model is used to make a prediction about the target user's physical condition, the prediction is computed from that user's data for the high-importance feature groups.
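As a rough illustration of step S204 and the prediction step, the sketch below fits and applies a naive Bayes classifier over the selected features. The source does not say which class-conditional distribution is used; the Gaussian form, the variance floor `var_floor`, the function names, and the class labels are all assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor (an assumption) keeps variances strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))
```

Each row of `X` would hold one sample's T cell receptor statistics for the high-importance feature groups only, so the data to be judged must be reduced to the same features before prediction.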

In the T cell data processing method of the embodiments of the present invention, the recognition network is built from data of relatively high importance, obtained by computing on and screening multiple sets of detection data. This screening allows the resulting naive Bayesian recognition network model to better judge the data that needs to be judged.

In this embodiment, as shown in FIG. 3, the T cell receptor statistic can be calculated through the following steps.

Step S301: obtain the VJ family frequency of the T cell data to be calculated.

Here, V denotes the V gene in the T cell and J denotes the J gene in the T cell.

Step S302: calculate the homology within the VJ family from the T cell data to be calculated.

Step S303: calculate the T cell receptor statistic of the T cell data to be calculated from the VJ family frequency and the homology within the VJ family.
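Step S301 can be sketched as a simple tally. The source does not specify the input format or whether frequency is weighted by read count or by distinct clonotypes; the `(v_gene, j_gene, read_count)` tuples and the read-count weighting below are assumptions for illustration, with family labels following the `TRBV5.1_TRBJ1.1` pattern used in this document.

```python
from collections import Counter


def vj_family_frequencies(clonotypes):
    """Map each V_J family label to its share of the total read count.

    clonotypes: iterable of (v_gene, j_gene, read_count) tuples (assumed format).
    """
    counts = Counter()
    for v_gene, j_gene, read_count in clonotypes:
        counts[f"{v_gene}_{j_gene}"] += read_count
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}
```

The resulting value for a family plays the role of f when the T cell receptor statistic is computed in step S303.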

Further, the T cell receptor statistic of the T cell data to be calculated is expressed in terms of the VJ family frequency f and the homology c within the VJ family. [Expression not reproduced in the source text.]

In this embodiment, the step of calculating the homology within the VJ family from the T cell data to be calculated includes:

obtaining the number of distinct amino acid sequences in the VJ family of the T cell data to be calculated;

calculating the information entropy of the distance matrix between amino acid sequences within the VJ family; and

calculating the product of the number of distinct amino acid sequences and the information entropy of the distance matrix between amino acid sequences within the VJ family, to obtain the homology within the VJ family.

Further, the homology within the VJ family can be calculated as:

c = v × e;

where v denotes the number of distinct amino acid sequences in the VJ family, and e denotes the information entropy of the distance matrix between amino acid sequences within the VJ family.
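The formula c = v × e translates directly into code. In this minimal sketch the function name and argument layout are hypothetical; v is taken as the count of distinct sequences in the family, and e is assumed to have been computed separately from the family's distance matrix.

```python
def vj_family_homology(sequences, entropy):
    """Return c = v * e for one VJ family.

    sequences: amino acid sequences observed in the family (duplicates allowed).
    entropy:   e, the information entropy of the family's distance matrix,
               computed elsewhere.
    """
    v = len(set(sequences))  # number of distinct amino acid sequences
    return v * entropy
```

For instance, a family with three distinct sequences and an entropy of 0.5 yields c = 1.5.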

In this embodiment, the step of calculating the information entropy of the distance matrix between amino acid sequences within the VJ family of the T cell data to be calculated includes:

aligning all amino acid sequences within the VJ family of the T cell data to be calculated;

scoring each pair of amino acid sequences with a scoring matrix to obtain the distance between each pair;

calculating the distances for all pairs of amino acid sequences to obtain the distance matrix between amino acid sequences within the VJ family; and

calculating the information entropy of that distance matrix.

Further, the information entropy of the distance matrix between amino acid sequences within the VJ family can be expressed as: e = -∑ d × log(d).

Here, d denotes the distance between two amino acid sequences within the VJ family. In one embodiment, a weighted Hamming distance can be used to measure the distance between amino acid sequences. Further, a gap penalty mechanism can be introduced to handle amino acid sequences of unequal length.
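The entropy formula leaves two details open: which entries of the distance matrix are summed, and whether the distances are normalized first. The sketch below assumes the upper triangle is used (each pair counted once) and offers a `normalize` flag, both of which are choices made for this example rather than details from the source. The logarithm is taken as natural.

```python
import math


def distance_matrix_entropy(matrix, normalize=True):
    """Entropy e = -sum(d * log(d)) over the upper triangle of a distance matrix.

    normalize=True first rescales the pairwise distances to sum to 1
    (an assumption, so that the result is a Shannon entropy);
    normalize=False applies the formula to the raw distances.
    Zero entries contribute 0, following the convention 0 * log(0) = 0.
    """
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    if normalize:
        total = sum(values)
        if total == 0:
            return 0.0
        values = [d / total for d in values]
    return -sum(d * math.log(d) for d in values if d > 0)
```

With raw distances greater than 1 the literal formula gives negative values, which is one reason the normalized form is offered as the default here.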

The distance between amino acid sequences within the VJ family can be calculated as follows:

First, all amino acid sequences can be aligned using the ClustalW software;

Then, a scoring matrix is used to score the distance between each pair of amino acid sequences. For example, the BLOSUM62 substitution matrix is used to score pairs of amino acids according to the following rules:

distance(a, a) = 0;

distance(a, b) = min(4, 4 - BLOSUM62(a, b));

where a and b denote different amino acids. In one example, the gap opening and gap expansion penalties are both set to 8.
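The scoring rule above can be sketched in Python as follows. This is an illustrative sketch only: the table `BLOSUM62_EXCERPT` holds just a handful of BLOSUM62 entries (a real run would load the full matrix from a reference source), and the function names are hypothetical. Charging a flat penalty of 8 for each aligned position where exactly one sequence has a gap is one possible reading of the gap penalties mentioned in the text, not something the text specifies.

```python
# A few BLOSUM62 substitution scores, for illustration only.
BLOSUM62_EXCERPT = {
    ("I", "L"): 2,
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("A", "S"): 1,
    ("A", "R"): -1,
}

GAP_PENALTY = 8  # assumed flat cost per gapped position


def blosum62(a, b, table=BLOSUM62_EXCERPT):
    """Look up a substitution score; the table is treated as symmetric.

    Raises KeyError for pairs that are missing from the excerpt.
    """
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def residue_distance(a, b, score=blosum62):
    """distance(a, a) = 0; distance(a, b) = min(4, 4 - BLOSUM62(a, b))."""
    if a == b:
        return 0
    return min(4, 4 - score(a, b))


def aligned_distance(s1, s2, score=blosum62, gap="-", gap_penalty=GAP_PENALTY):
    """Sum per-position distances over two already aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(s1, s2):
        if a == gap and b == gap:
            continue  # column with gaps in both sequences carries no cost
        if a == gap or b == gap:
            total += gap_penalty
        else:
            total += residue_distance(a, b, score)
    return total
```

Because of the min(4, ...) cap, any substitution with a BLOSUM62 score of zero or below receives the maximum per-residue distance of 4, while favourable substitutions such as I/V receive a smaller distance.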

In this embodiment, step S203 includes: verifying the importance of each group of T cell receptor statistics using a random forest together with the cross-validation method, and screening out the feature groups of high importance in the specified number of sample data.

In this embodiment, step S203 includes:

In this embodiment, the top fifty, sixty, seventy, or some other number of features can be selected from each ranking. The number selected can be set by those skilled in the art as required. Each feature is a TCR.

In one example, each group of sample data includes T cell data sets corresponding to multiple features. However, not every feature's T cell data set plays a leading role in identifying the target disease. Screening out the features of relatively high importance and using them to build the naive Bayesian recognition network model can therefore improve recognition efficiency while reducing the amount of computation.

Further, after the calculation has been repeated multiple times, the feature groups of high importance in the specified number of sample data can be selected from the features that appeared among the top fifty, sixty, seventy, etc. For example, if feature A is ranked in the top fifty in every ranking, feature A can be selected as one feature of the high-importance feature group.
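The repeated ranking and selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: each round is represented as a dictionary of importance scores (for example, scores produced by a random forest in one cross-validation repetition), the function names are hypothetical, and a feature is kept only if it is in the top k in every round, as in the feature A example. The scores shown are invented for the illustration.

```python
def top_k(importances, k):
    """Return the set of the k features with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def stable_features(rounds, k):
    """Return the features that rank in the top k in every round."""
    if not rounds:
        return set()
    selected = top_k(rounds[0], k)
    for importances in rounds[1:]:
        selected &= top_k(importances, k)
    return selected


# Hypothetical importance scores from two repetitions.
example_rounds = [
    {"TRBV9_TRBJ1.1": 0.9, "TRBV2_TRBJ2.3": 0.5, "TRBV15_TRBJ1.3": 0.1},
    {"TRBV9_TRBJ1.1": 0.8, "TRBV2_TRBJ2.3": 0.2, "TRBV15_TRBJ1.3": 0.6},
]
example_selected = stable_features(example_rounds, k=2)
```

Requiring membership in every round is the strictest choice; a looser variant would keep features that appear in the top k in most rounds, which the text neither requires nor rules out.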

In one implementation, the high-importance feature groups are the TCRs that have a greater influence on GO (Graves' ophthalmopathy) and GH (Graves' hyperthyroidism). GO and GH are complications of Graves' disease (GD), an organ-specific autoimmune disease.

In this embodiment, 24 feature groups can be selected through the above screening, namely: TRBV12.5_TRBJ2.7, TRBV2_TRBJ2.3, TRBV5.1_TRBJ1.1, TRBV5.1_TRBJ1.2, TRBV6.5_TRBJ1.5, TRBV7.8_TRBJ2.7, TRBV7.9_TRBJ2.2, TRBV9_TRBJ1.1, TRBV9_TRBJ2.2, TRBV9_TRBJ2.3, TRBV11.2_TRBJ2.7, TRBV19_TRBJ1.5, TRBV19_TRBJ1.1, TRBV20.1_TRBJ1.3, TRBV6.6_TRBJ1.1, TRBV7.9_TRBJ2.7, TRBV24.1_TRBJ1.6, TRBV6.1_TRBJ1.5, TRBV10.2_TRBJ2.3, TRBV15_TRBJ1.3, TRBV7.3_TRBJ1.6, TRBV7.3_TRBJ1.6.

Of course, more or fewer feature groups can be screened out.

In this embodiment, screening out the feature groups of relatively high importance and using them to build the naive Bayesian recognition network model allows GO or GH to be predicted more accurately.

In this embodiment, step S203 includes: after each group of T cell receptor statistics is calculated by the cross-validation method, obtaining the importance from the calculation results, and ranking the features in the sample data by importance;

In this embodiment, the naive Bayesian recognition network model constructed above can also be used to estimate a user's potential target disease. As shown in FIG. 4, the method further includes:

Step S401: calculating the T cell receptor statistics of the data to be judged, where the data to be judged includes the T cell data of a target object.

Step S402: inputting the T cell receptor statistics of the data to be judged into the naive Bayesian recognition network model to identify the target disease.

In this embodiment, the naive Bayesian recognition network model can be used to perform classification prediction on the data to be judged. The classification prediction determines whether the data to be judged corresponds to GH or GO.
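To make the classification step concrete, the following is a minimal sketch of a naive Bayes classifier that outputs a GH probability and a GO probability for one sample. It assumes Gaussian class-conditional densities for each T cell receptor statistic, which the text does not specify; the class name, the variance floor, and the training values are hypothetical and are not data from this document.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with an independent Gaussian per feature and class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.params = {}
        for label, rows in by_class.items():
            columns = list(zip(*rows))
            means = [sum(c) / len(c) for c in columns]
            # Small floor keeps the variance strictly positive.
            variances = [
                sum((x - m) ** 2 for x in c) / len(c) + 1e-9
                for c, m in zip(columns, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.params[label] = (log_prior, means, variances)
        return self

    def predict_proba(self, row):
        log_post = {}
        for label, (log_prior, means, variances) in self.params.items():
            total = log_prior
            for x, m, v in zip(row, means, variances):
                total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            log_post[label] = total
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


# Hypothetical statistics for two selected VJ features per sample.
train_X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]]
train_y = ["GH", "GH", "GO", "GO"]
model = GaussianNaiveBayes().fit(train_X, train_y)
```

Calling `model.predict_proba(sample)` returns one probability per class, which corresponds to the GH probability and GO probability columns reported in the results, and `model.predict(sample)` returns the class with the larger probability.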

The following shows the results of testing multiple cases with the T cell data processing method of this embodiment:

1) Diagnostic results for 17 patients with GH and GO

The accuracy for GO is 70%, and the accuracy for GH is 85.7%.

2) Prediction results for 7 patients who progressed from GH to GO

The prediction accuracy in the following cases is 71.5%.

| Sample number | GH probability | GO probability | Prediction result | GO actually occurred |
| --- | --- | --- | --- | --- |
| WL0581 | 1.46E-02 | 9.85E-01 | GO | Yes |
| WL0682 | 5.68E-01 | 4.32E-01 | GH | Yes |
| WL0594 | 7.14E-03 | 9.93E-01 | GO | Yes |
| WL0539 | 9.98E-01 | 2.23E-03 | GH | Yes |
| WL0551 | 6.02E-06 | 1.00E+00 | GO | Yes |
| WL0613 | 2.42E-01 | 7.58E-01 | GO | Yes |
| WL0648 | 3.88E-08 | 1.00E+00 | GO | Yes |

The prediction results above were obtained using the naive Bayesian recognition network model; the actual results are data obtained afterwards by collecting the actual condition of the corresponding user.

The two tables above are the results of prediction calculations on some cases; in practical applications, the prediction accuracy may be higher than in these cases.

Please refer to FIG. 5, which is a schematic diagram of the functional modules of the T cell data processing apparatus shown in FIG. 1, provided by an embodiment of the present invention. The T cell data processing apparatus in this embodiment is used to execute the steps of the above method embodiment. The T cell data processing apparatus includes:

an acquisition module 501, configured to acquire multiple groups of sample data, where each group of sample data includes T cell data sets corresponding to different features;

a first calculation module 502, configured to calculate the T cell receptor statistics of each group of T cell data sets;

a screening module 503, configured to verify the importance of each group of T cell receptor statistics by the cross-validation method, and to screen out the feature groups of high importance in the specified number of sample data;

a construction module 504, configured to build a naive Bayesian recognition network model according to the feature groups of high importance.

For other details of this embodiment, reference may be made to the description in the above method embodiment, which is not repeated here.

The T cell data processing apparatus of the embodiment of the present invention builds the recognition network from data of relatively high importance, obtained by calculating and screening multiple groups of detection data. Screening the data enables the resulting naive Bayesian recognition network model to judge the data to be judged more accurately.

In this embodiment, the T cell data processing apparatus further includes:

a second calculation module, configured to calculate the T cell receptor statistics of the data to be judged, where the data to be judged includes the T cell data of a target object;

an identification module, configured to input the T cell receptor statistics of the data to be judged into the naive Bayesian recognition network model to identify the target disease.

In this embodiment, the first calculation module 502 or the second calculation module is further configured to:

In this embodiment, the second calculation module is configured to:

In this embodiment, the T cell receptor statistic of the T cell data to be calculated, obtained from the VJ family frequency and the homology within the VJ family, is represented by the following expression:

The first calculation module 502 or the second calculation module is further configured to:

score each pair of amino acids with a scoring matrix to obtain the distance between each pair of amino acid sequences.

The first calculation module 502 is further configured to:

distance(a, a) = 0;

distance(a, b) = min(4, 4 - BLOSUM62(a, b));

In this embodiment, the screening module 503 is further configured to:

repeat the execution of modules a and b above a set number of times, and select the feature groups of high importance in the specified number of sample data from the top-ranked features screened out over the set number of repetitions.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of apparatuses, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.

If the functions are implemented in the form of software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above descriptions are merely preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included within its protection scope. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.

The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of T-cell data processing, comprising:

acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;

calculating T cell receptor statistics for each set of T cell datasets: obtaining VJ family frequency of T cell data to be calculated, calculating to obtain internal homology of a VJ family according to the T cell data to be calculated, and calculating to obtain T cell receptor statistic of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family; wherein V represents a V gene in the T cell, J represents a J gene in the T cell, and T cell receptor statistics of the T cell data to be calculated based on the VJ family frequency and homology within the VJ family are represented by the following expression:

wherein f represents a VJ family frequency; c represents homology within the VJ family;

verifying the importance of each group of T cell receptor statistic by a cross-validation method, and screening out a characteristic group with high importance in a specified amount of sample data;

and constructing a naive Bayes recognition network model according to the feature group with high importance.

2. The T-cell data processing method of claim 1, wherein the step of calculating the homology within the VJ family from the T-cell data to be calculated comprises:

obtaining a number of amino acid sequence species in the VJ family of the T cell data to be calculated;

calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family of the T cell data to be calculated;

and calculating the product of the number of amino acid sequence species and the information entropy of the distance matrix between the amino acid sequences within the VJ family to obtain the homology within the VJ family.

3. The T-cell data processing method according to claim 2, wherein the step of calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family of the T-cell data to be calculated comprises:

pairwise alignment of all amino acid sequences within the VJ family of the T cell data to be calculated;

using a scoring matrix to score the comparison result of every two amino acid sequences to obtain the distance between each pair of amino acid sequences;

calculating the distance between every two amino acid sequences to obtain a distance matrix between the amino acid sequences in the VJ family;

calculating the information entropy of the distance matrix between the amino acid sequences within the VJ family.

4. The T-cell data processing method according to claim 3, wherein each pair of amino acid sequences is scored using a scoring matrix, and the scoring rule for the distance between each pair of amino acid sequences within the VJ family is:

distance(a, a) = 0;

distance(a, b) = min(4, 4 - BLOSUM62(a, b));

wherein a and b represent different amino acids.

5. The T-cell data processing method according to claim 1, wherein the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method and screening out a feature group of high importance in a specified number of sample data comprises:

a. after each group of T cell receptor statistics is calculated by the cross-validation method, obtaining the importance from the calculation result, and ranking all the features in the sample data by importance;

b. selecting the subset of features ranked at the top of the ranking;

c. repeating steps a and b a set number of times, and selecting a feature group of high importance in the specified number of sample data from the top-ranked features screened out over the set number of repetitions.

6. The T-cell data processing method according to claim 1, wherein the step of verifying the importance of each group of T cell receptor statistics by a cross-validation method and screening out a feature group of high importance in a specified number of sample data comprises:

verifying the importance of each group of T cell receptor statistics by a random forest and the cross-validation method, and screening out a feature group of high importance in the specified number of sample data.

7. A T-cell data processing apparatus, comprising:

the acquisition module is used for acquiring a plurality of groups of sample data, wherein each group of sample data comprises T cell data sets corresponding to different characteristics;

a first calculation module for calculating T cell receptor statistics for each set of T cell datasets: obtaining VJ family frequency of T cell data to be calculated, calculating to obtain internal homology of a VJ family according to the T cell data to be calculated, and calculating to obtain T cell receptor statistic of the T cell data to be calculated according to the VJ family frequency and the internal homology of the VJ family; wherein V represents a V gene in the T cell, J represents a J gene in the T cell, and T cell receptor statistics of the T cell data to be calculated based on the VJ family frequency and homology within the VJ family are represented by the following expression:

the screening module is used for verifying the importance of each group of T cell receptor statistic by a cross-validation method and screening out a characteristic group with high importance in a specified amount of sample data;

and the construction module is used for constructing a naive Bayes recognition network model according to the feature group with high importance.