DOI: 10.1145/3589335.3651915
research-article
Open access

Robust Data-centric Graph Structure Learning for Text Classification

Published: 13 May 2024

Abstract

Over the past decades, text classification has evolved remarkably across diverse domains. Despite these advances, most existing model-centric methods generalize poorly on class-imbalanced datasets that contain highly similar textual information. Instead of developing new model architectures, data-centric approaches improve performance by manipulating the data structure. In this study, we investigate robust data-centric approaches for text classification on a dataset we collected: the metadata of survey papers about Large Language Models (LLMs). In our experiments, we explore four paradigms and observe that leveraging arXiv's co-category information on graphs classifies the text data more robustly than the other three paradigms: conventional machine-learning algorithms, fine-tuning of pre-trained language models, and zero-shot / few-shot classification using LLMs.
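
As a rough illustration of what "co-category information on graphs" could look like, the sketch below links two papers whenever they share at least one arXiv category. The abstract does not specify how the graph is constructed, so the edge rule, the function name `build_co_category_edges`, and the toy metadata are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations


def build_co_category_edges(paper_categories):
    """Return sorted undirected edges between papers sharing a category.

    paper_categories: dict mapping a paper id to a list of arXiv
    category strings. The "shared category" edge rule is an assumption.
    """
    # Invert the mapping: category -> ids of papers listed under it.
    papers_by_category = defaultdict(list)
    for paper_id, categories in paper_categories.items():
        for category in set(categories):
            papers_by_category[category].append(paper_id)

    # Connect every pair of papers that co-occur under some category.
    edges = set()
    for members in papers_by_category.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return sorted(edges)


# Hypothetical toy metadata; the ids and category lists are invented.
papers = {
    "p1": ["cs.CL", "cs.AI"],
    "p2": ["cs.CL"],
    "p3": ["cs.LG", "cs.AI"],
    "p4": ["cs.CV"],
}
edges = build_co_category_edges(papers)
print(edges)
```

An edge list of this kind could then serve as the adjacency structure for a graph neural network whose node features come from the paper text. That pairing is again an assumption about a typical setup rather than a description of the paper's pipeline.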

    Published In

    WWW '24: Companion Proceedings of the ACM Web Conference 2024
    May 2024
    1928 pages
    ISBN: 9798400701726
    DOI: 10.1145/3589335
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data-centric ai
    2. graph neural networks
    3. text classification

    Qualifiers

    • Research-article

    Conference

    WWW '24
    WWW '24: The ACM Web Conference 2024
    May 13 - 17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
