Computer Science, 2022, Vol. 49, Issue (6): 127-133. doi: 10.11896/jsjkx.211100043
王宇飞, 陈文
WANG Yu-fei, CHEN Wen
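Abstract: Tri-training is a disagreement-based semi-supervised learning algorithm that combines semi-supervised learning with ensemble learning. It makes effective use of a small number of labeled samples and a large number of unlabeled samples, improving model performance through collaboration and iteration among classifiers. However, when labeled samples are scarce, the initial classifiers generated by Tri-training are insufficiently trained, and the collaborative labeling among classifiers may produce mislabeled, noisy data. To address these problems, a co-learning algorithm is proposed that combines DECORATE ensemble learning, diversity measurement, and confidence evaluation. Building on the DECORATE ensemble learning method, the algorithm trains base classifiers with varied preferences by adding differentiated artificial samples and labels, which improves generalization. It also measures and screens classifiers for diversity based on JS divergence, so as to maximize the diversity of the base classifiers, and during iteration it evaluates the confidence of pseudo-labeled samples with the label propagation algorithm to reduce noisy data. Classification experiments on UCI datasets show that the proposed algorithm achieves higher classification accuracy and F1 scores than Tri-training and its improved variants.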
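The abstract does not give the exact diversity formula, so the following is only a minimal sketch of how a JS-divergence-based diversity score between classifiers might be computed. It assumes each classifier outputs class-probability vectors for the same set of samples. The function names `js_divergence`, `pairwise_diversity`, and `select_most_diverse`, and the exhaustive search over subsets, are illustrative assumptions and not the authors' procedure.

```python
from itertools import combinations

import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_diversity(proba_a, proba_b):
    """Mean JS divergence between two classifiers' outputs on the same samples."""
    return float(np.mean([js_divergence(a, b) for a, b in zip(proba_a, proba_b)]))


def select_most_diverse(probas, k=3):
    """Hypothetical screening step: pick the k classifiers whose summed
    pairwise diversity is largest (brute force, fine for small ensembles)."""
    def score(idx):
        return sum(pairwise_diversity(probas[i], probas[j])
                   for i, j in combinations(idx, 2))

    return max(combinations(range(len(probas)), k), key=score)
```

With base-2 logarithms the score lies in [0, 1]: identical outputs give 0 and completely disjoint predictions give 1, which makes scores comparable across classifier pairs.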