DOI: 10.1145/3617695.3617707

Heterogeneous-training: A Semi-supervised Text Classification Method

Published: 02 November 2023

Abstract

With the advent of the information age, the volume of text data on the Internet keeps growing. Because text is the most widely distributed information carrier and accounts for the largest share of this data, text classification technology is essential for organizing and managing such massive data scientifically. In this paper, we propose a semi-supervised ensemble learning algorithm, Heterogeneous-training, and apply it to text classification. Building on the Tri-training algorithm, Heterogeneous-training improves on it by using different (heterogeneous) classifiers, dynamically updating the probability threshold, and adaptively editing the training data. Extensive experiments show that our method consistently outperforms the Tri-training algorithm on benchmark text classification datasets.
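
The abstract names three modifications to Tri-training but gives no implementation details, so the sketch below is only a plausible reading, not the authors' method. It assumes scikit-learn base learners (Naive Bayes, logistic regression, and a random forest are illustrative picks), accepts a pseudo-label when the two peer classifiers agree and their mean confidence clears a threshold, and "dynamically updates" that threshold with a simple geometric decay; the adaptive data-editing step is omitted.

```python
# Sketch of a Heterogeneous-training-style loop on top of Tri-training.
# ASSUMPTIONS (not from the paper): the choice of base learners, the
# peer-agreement acceptance rule, the geometric threshold decay, and
# the majority-vote combination; the adaptive editing step is omitted.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def heterogeneous_training(texts_l, y_l, texts_u,
                           rounds=5, threshold=0.95, decay=0.98):
    vec = TfidfVectorizer()
    X_l, X_u = vec.fit_transform(texts_l), vec.transform(texts_u)
    # Three *different* base classifiers; classic Tri-training instead
    # trains three copies of one learner on bootstrap samples.
    clfs = [MultinomialNB(),
            LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100)]
    pools = [(X_l, np.asarray(y_l)) for _ in clfs]  # per-learner pools
    for clf, (X, y) in zip(clfs, pools):
        clf.fit(X, y)
    for _ in range(rounds):
        for i, clf in enumerate(clfs):
            peers = [c for j, c in enumerate(clfs) if j != i]
            probas = [p.predict_proba(X_u) for p in peers]
            labels = [p.classes_[pr.argmax(axis=1)]
                      for p, pr in zip(peers, probas)]
            conf = np.mean([pr.max(axis=1) for pr in probas], axis=0)
            # Accept a pseudo-label only when both peers agree and
            # their mean confidence clears the current threshold.
            keep = (labels[0] == labels[1]) & (conf >= threshold)
            if keep.any():
                X_i, y_i = pools[i]
                pools[i] = (vstack([X_i, X_u[keep]]),
                            np.concatenate([y_i, labels[0][keep]]))
                clf.fit(*pools[i])
        threshold *= decay  # "dynamic" threshold: relax it each round
    return vec, clfs

def predict(vec, clfs, texts):
    # Majority vote of the three learners; with three voters, the
    # majority label is votes[1] whenever the two peers of clfs[0]
    # agree, otherwise votes[0] (which also serves as the tie-break).
    votes = np.stack([c.predict(vec.transform(texts)) for c in clfs])
    out = votes[0].copy()
    agree = votes[1] == votes[2]
    out[agree] = votes[1][agree]
    return out
```

Note that this is deliberately simpler than both the paper's method and Zhou and Li's original Tri-training, which only adds pseudo-labels when doing so provably reduces an error bound; the point here is just the shape of the loop.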


Published In

BDIOT '23: Proceedings of the 2023 7th International Conference on Big Data and Internet of Things
August 2023
232 pages
ISBN: 9798400708015
DOI: 10.1145/3617695

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Ensemble Learning
  2. Machine Learning
  3. Semi-supervised Learning
  4. Text Classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the Science and Technology Program of Sichuan Province
  • the Opening Project of Intelligent Policing Key Laboratory of Sichuan Province

Conference

BDIOT 2023

Acceptance Rates

Overall Acceptance Rate 75 of 136 submissions, 55%
