Research Article | Free Access | Just Accepted

A Noise-Oriented and Redundancy-Aware Instance Selection Framework

Online AM: 20 November 2024

Abstract

Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially under ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, demanding substantial energy consumption and contributing to carbon dioxide emissions. This article advances the state-of-the-art (SOTA) in instance selection (IS): a family of document filtering techniques designed to select the most representative documents for training. The objective is to maintain or enhance classification effectiveness while reducing the total training (fine-tuning) time. In prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets and currently the SOTA in IS. Nonetheless, important research questions remained unanswered in that work, mostly because E2SC targets redundancy alone. In this article, we take our research a step further by proposing biO-IS, a novel bi-objective IS framework that extends E2SC to remove redundant and noisy instances from the training set simultaneously. biO-IS estimates redundancy with scalable, fast, and calibrated weak classifiers and captures noise through a new entropy-based step. We also propose a novel iterative process to estimate near-optimal reduction rates for both steps. Our extended solution reduces the training sets by 41% on average (up to 60%) while maintaining effectiveness on all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline, not even our previous SOTA solution, achieved results of this quality considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets are available on GitHub: https://github.com/waashk/bio-is.
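
To make the two-step idea concrete, the sketch below shows one way the pipeline described in the abstract could look in code: an entropy-based step that flags noisy instances, followed by a redundancy step that prunes instances a fast, calibrated weak classifier already handles with high confidence. This is a minimal illustration under assumptions of our own (TF-IDF features, a calibrated logistic regression as the weak classifier, and fixed reduction rates in place of the paper's iterative rate estimation), not the authors' biO-IS implementation; the actual code is in the GitHub repository above.

import numpy as np
from scipy.stats import entropy
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def select_instances(texts, labels, noise_rate=0.05, redundancy_rate=0.35):
    """Return the indices of training instances to keep.

    Illustrative only: biO-IS estimates near-optimal values for both
    rates iteratively; here they are fixed for simplicity.
    """
    y = np.asarray(labels)
    n = len(y)
    X = TfidfVectorizer(max_features=50_000).fit_transform(texts)

    # Fast weak classifier with calibrated posterior probabilities.
    weak = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)
    weak.fit(X, y)
    proba = weak.predict_proba(X)

    # Noise step: a high-entropy predictive distribution means even the
    # calibrated model finds the instance inherently confusing.
    ent = entropy(proba.T)  # Shannon entropy of each row of proba
    noisy = set(np.argsort(-ent)[: int(noise_rate * n)])

    # Redundancy step: correctly classified instances predicted with the
    # highest calibrated confidence add the least new information, so
    # they are pruned first, up to the redundancy budget.
    cols = np.searchsorted(weak.classes_, y)  # column of each true label
    conf = proba[np.arange(n), cols]
    correct = weak.predict(X) == y
    budget = int(redundancy_rate * n)
    redundant = set()
    for i in np.argsort(-conf):  # most confident first
        if len(redundant) == budget:
            break
        if correct[i] and i not in noisy:
            redundant.add(i)

    return [i for i in range(n) if i not in noisy and i not in redundant]

Fine-tuning would then proceed only on the returned indices; with noise_rate + redundancy_rate around 0.4, the pruning budget roughly mirrors the 41% average reduction reported in the abstract, though the real framework tunes both rates per dataset.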

Published In

ACM Transactions on Information Systems, Just Accepted
EISSN: 1558-2868

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Online AM: 20 November 2024
Accepted: 11 November 2024
Revised: 09 October 2024
Received: 16 July 2024

Author Tags

1. Instance Selection
2. Document Filtering
3. Transformer-Based Text Classification