A Comprehensive Survey on Text, Audio, and Image Data Augmentation Using Multi-Modal LLMs for Deep Learning Applications
This repo contains all the relevant papers and information used in our study. It will be updated periodically as we revise our manuscript throughout the publication process.
The papers used in this study are organised and the links can be found below:
-
Ahmed, T., Pai, K. S., Devanbu, P., & Barr, E. Automatic semantic augmentation of language model prompts (for code summarization). Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024. paper code - NOCODE
-
Cai, X., Xiao, M., Ning, Z., & Zhou, Y. Resolving the imbalance issue in hierarchical disciplinary topic inference via LLM-based data augmentation. 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 2023. paper code - NOCODE
-
Cloutier, N. A., & Japkowicz, N. Fine-tuned generative LLM oversampling can improve performance over traditional techniques on multiclass imbalanced text classification. 2023 IEEE International Conference on Big Data (BigData), 2023. paper code
-
Santos, V. G., Santos, G. L., Lynn, T., & Benatallah, B. Identifying citizen-related issues from social media using LLM-based data augmentation. International Conference on Advanced Information Systems Engineering, 2024. paper code
-
Hu, L., He, H., Wang, D., Zhao, Z., Shao, Y., & Nie, L. LLM vs small model? Large language model-based text augmentation enhanced personality detection model. Proceedings of the AAAI Conference on Artificial Intelligence, 2024. paper code
-
Hua, J., Cui, X., Li, X., Tang, K., & Zhu, P. Multimodal fake news detection through data augmentation-based contrastive learning. Applied Soft Computing, 2023. paper code
-
Jung, H., Yeen, H., Lee, J., Kim, M., Bang, N., & Koo, M.-W. Enhancing task-oriented dialog system with subjective knowledge: A large language model-based data augmentation framework. Proceedings of The Eleventh Dialog System Technology Challenge, 2023. paper code
-
Lai, J., Yang, X., Luo, W., Zhou, L., Li, L., Wang, Y., & Shi, X. RumorLLM: A rumor large language model-based fake-news-detection data-augmentation approach. Applied Sciences, 2024. paper code
-
Meng, Z., Liu, T., Zhang, H., Feng, K., & Zhao, P. CEAN: Contrastive event aggregation network with LLM-based augmentation for event extraction. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024. paper code
-
Silva, K., Frommholz, I., Can, B., Blain, F., Sarwar, R., & Ugolini, L. Forged-GAN-BERT: Authorship attribution for LLM-generated forged novels. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2024. paper code
-
Wan, M., Safavi, T., Jauhar, S. K., Kim, Y., Counts, S., Neville, J., Suri, S., Shah, C., White, R. W., Yang, L., & others. TnT-LLM: Text mining at scale with large language models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024. paper code
-
Wu, S.-L., Chang, X., Wichern, G., Jung, J.-W., Germain, F., Le Roux, J., & Watanabe, S. Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
-
Zhang, J., Gao, H., Zhang, P., Feng, B., Deng, W., & Hou, Y. LA-UCL: LLM-augmented unsupervised contrastive learning framework for few-shot text classification. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024. paper code
-
Zhang, M., Jiang, G., Liu, S., Chen, J., & Zhang, M. LLM-assisted data augmentation for Chinese dialogue-level dependency parsing. Computational Linguistics, 2024. paper code
-
Zhao, H., Chen, H., Ruggles, T. A., Feng, Y., Singh, D., & Yoon, H.-J. Improving text classification with large language model-based data augmentation. Electronics, 2024. paper code
-
Kang, A., Chen, J. Y., Lee-Youngzie, Z., & Fu, S. Synthetic data generation with LLM for improved depression prediction. arXiv preprint arXiv:2411.17672, 2024. paper code
-
Song, S., Subramanyam, A., Madejski, I., & Grossman, R. L. Lab-RAG: Label boosted retrieval augmented generation for radiology report generation. arXiv preprint arXiv:2411.16523, 2024. paper code
-
Fischer, L., Gao, Y., Lintner, A., & Ebling, S. SwissADT: An audio description translation system for Swiss languages. arXiv preprint arXiv:2411.14967, 2024. paper code
-
Glazkova, A., & Zakharova, O. Evaluating LLM prompts for data augmentation in multi-label classification of ecological texts. arXiv preprint arXiv:2411.14896, 2024. paper code
-
Wen, Z., Guo, D., & Zhang, H. AIDBench: A benchmark for evaluating the authorship identification capability of large language models. arXiv preprint arXiv:2411.13226, 2024. paper code
-
Alyafeai, Z., et al. Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic. arXiv preprint arXiv:2412.04277, 2024. paper code
-
Liu, J., & Nguyen, A. Rephrasing electronic health records for pretraining clinical language models. arXiv preprint arXiv:2411.18940, 2024. paper code
-
Abane, A., Bekri, A., & Battou, A. FastRAG: Retrieval augmented generation for semi-structured data. arXiv preprint arXiv:2411.13773, 2024. paper code
-
Yang, M., Shi, B., Le, M., et al. AudioBox TTA-RAG: Improving zero-shot and few-shot text-to-audio with retrieval-augmented generation. arXiv preprint arXiv:2411.05141, 2024. paper code
-
Fuad, K. A. A., & Chen, L. LLM-Ref: Enhancing reference handling in technical writing with large language models. arXiv preprint arXiv:2411.00294, 2024. paper code
-
Wang, Z., Xu, G., & Ren, M. LLM-generated natural language meets scaling laws: New explorations and data augmentation methods. arXiv preprint arXiv:2407.00322, 2024. paper code
-
Dai, H., Liu, Z., & Wu, Z. AugGPT: Leveraging ChatGPT for text data augmentation. arXiv preprint arXiv:2302.13007, 2023. paper code
-
Cegin, J., Simko, J., & Brusilovsky, P. LLMs vs established text augmentation techniques for classification: When do the benefits outweigh the costs? arXiv preprint arXiv:2408.16502, 2024. paper code
-
Lee, N., Wattanawong, T., Kim, S., et al. LLM2LLM: Boosting LLMs with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024. paper code
-
Song, Y., Zhang, J., Tian, Z., et al. LLM-based privacy data augmentation guided by knowledge distillation with a distribution tutor for medical text classification. arXiv preprint arXiv:2402.16515, 2024. paper code
-
Yang, H., Zhao, X., Huang, S., et al. LATEX-GCL: Large language models (LLMs)-based data augmentation for text-attributed graph contrastive learning. arXiv preprint arXiv:2409.01145, 2024. paper code
-
Liu, Y., Zhu, Y., Gu, Z., et al. Improving topic relevance model by mix-structured summarization and LLM-based data augmentation. arXiv preprint arXiv:2404.02616, 2024. paper code
-
Cegin, J., Pecher, B., Simko, J., et al. Use random selection for now: Investigation of few-shot selection strategies in LLM-based text augmentation for classification. arXiv preprint arXiv:2410.10756, 2024. paper code
-
Jia, K., Wu, Y., & Li, R. Curriculum-style data augmentation for LLM-based metaphor detection. arXiv preprint arXiv:2412.02956, 2024. paper code
-
Jung, K., Seo, Y., Cho, S., et al. DALDA: Data augmentation leveraging diffusion model and LLM with adaptive guidance scaling. arXiv preprint arXiv:2409.16949, 2024. paper code
-
Cegin, J., Pecher, B., Simko, J., et al. Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation. arXiv preprint arXiv:2401.06643, 2024. paper code
-
Zeng, L. Leveraging large language models for code-mixed data augmentation in sentiment analysis. arXiv preprint arXiv:2411.00691, 2024. paper code
-
Litake, O., Yagnik, N., & Labhsetwar, S. Inditext boost: Text augmentation for low resource Indian languages. arXiv preprint arXiv:2401.13085, 2024. paper code
-
Sahu, G., Vechtomova, O., Bahdanau, D., & Laradji, I. H. PromptMix: A class boundary augmentation method for large language model distillation. arXiv preprint arXiv:2310.14192, 2023. paper code
-
Chowdhury, A. G., & Chadha, A. Generative data augmentation using LLMs improves distributional robustness in question answering. arXiv preprint arXiv:2309.06358, 2023. paper code
-
Wang, L., Yu, L., Zhang, Y., & Xie, H. Large language model-based augmentation for imbalanced node classification on text-attributed graphs. arXiv preprint arXiv:2410.16882, 2024. paper code
-
Sapkota, R., Meng, Z., & Karkee, M. Synthetic meets authentic: Leveraging LLM generated datasets for YOLO11 and YOLOv10-based apple detection through machine vision sensors. Smart Agricultural Technology, 2024. paper code
-
Yuan, J., Tang, R., Jiang, X., & Hu, X. Large language models for healthcare data augmentation: An example on patient-trial matching. AMIA Annual Symposium Proceedings, 2023. paper code
-
Li, H., Chen, B., Chen, J., et al. ITIMCA: Image-text information and cross-attention for multi-modal cassava leaf disease classification based on a novel multi-modal dataset in natural environments. Crop Protection, 2024. paper code
-
Liu, Y., Zhu, Y., Gu, Z., et al. Enhanced dual contrast representation learning with cell separation and merging for breast cancer diagnosis. Computer Vision and Image Understanding, 2024. paper code
-
Kirilenko, D., Andreychuk, A., Panov, A. I., & Yakovlev, K. Generative models for grid-based and image-based pathfinding. Artificial Intelligence, 2024. paper code
-
Jindal, N., Kumaresan, P. K., Ponnusamy, R., et al. MISTRA: Misogyny detection through text–image fusion and representation analysis. Natural Language Processing Journal, 2024. paper code
-
Li, J., Guan, Z., Wang, J., et al. Integrated image-based deep learning and language models for primary diabetes care. Nature Medicine, 2024. paper code
-
Liu, F., Zhu, T., Wu, X., et al. A medical multimodal large language model for future pandemics. NPJ Digital Medicine, 2023. paper code
-
Cortacero, K., McKenzie, B., Müller, S., et al. Evolutionary design of explainable algorithms for biomedical image segmentation. Nature Communications, 2023. paper code
-
Raminedi, S., Shridevi, S., & Won, D. Multi-modal transformer architecture for medical image analysis and automated report generation. Scientific Reports, 2024. paper code
-
Wang, Y., Shi, X., & Zhao, X. MLLM4Rec: Multimodal information enhancing LLM for sequential recommendation. Journal of Intelligent Information Systems, 2024. paper code
-
Bet, M., Mălan, A., Aldinucci, M., et al. DALLMi: Domain adaption for LLM-based multi-label classifier. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2024. paper code
-
Sheik, R., Sundara, K. P., & Nirmala, S. J. Neural data augmentation for legal overruling task: Small deep learning models vs. large language models. Neural Processing Letters, 2024. paper code
-
Wu, W., Qiu, X., Song, S., et al. Image augmentation agent for weakly supervised semantic segmentation. arXiv preprint arXiv:2412.20439, 2024. paper code
-
Qian, R., Yin, X., & Dou, D. Reasoning to attend: Try to understand how <SEG> token works. arXiv preprint arXiv:2412.17741, 2024. paper code
-
Yin, S., Fu, C., Zhao, S., et al. T2Vid: Translating long text into multi-image is the catalyst for video-LLMs. arXiv preprint arXiv:2411.19951, 2024. paper code
-
Lingenberg, T., Reuter, M., Sudhakaran, G., et al. DIAGen: Diverse image augmentation with generative models. arXiv preprint arXiv:2408.14584, 2024. paper code
-
Li, J., Zhang, F., Zhu, J., et al. ForgeryGPT: Multimodal large language model for explainable image forgery detection and localization. arXiv preprint arXiv:2410.10238, 2024. paper code
-
Sultan, O., Khasin, A., Shiran, G., et al. Visual editing with LLM-based tool chaining: An efficient distillation approach for real-time applications. arXiv preprint arXiv:2410.02952, 2024. paper code
-
Jin, J., Wang, X., Zhu, Q., et al. Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework. arXiv preprint arXiv:2408.09720, 2024. paper code
-
Hsieh, C., Moreira, C., Nobre, I. B., et al. DALL-M: Context-aware clinical data augmentation with LLMs. arXiv preprint arXiv:2407.08227, 2024. paper code
-
Liu, J., Huang, X., Zheng, J., et al. MM-Instruct: Generated visual instructions for large multimodal model alignment. arXiv preprint arXiv:2406.19736, 2024. paper code
-
Wu, S.-L., Chang, X., Wichern, G., et al. Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
-
Xu, D. AudioSetMix: Enhancing audio-language datasets with LLM-assisted augmentations. arXiv preprint arXiv:2405.11093, 2024. paper code
-
Dhingra, P., Agrawal, S., Veerappan, C. S., et al. Speech de-identification data augmentation leveraging large language model. IEEE International Conference on Asian Language Processing (IALP), 2024. paper code
-
Cai, Z., Ghosh, S., Adatia, A. P., et al. AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset. ACM International Conference on Multimedia, 2024. paper code
-
Ma, Z., Wu, W., Zheng, Z., et al. Leveraging speech PTM, text LLM, and emotional TTS for speech emotion recognition. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
-
Dhingra, P., Agrawal, S., Veerappan, C. S., et al. Speech de-identification data augmentation leveraging large language model. ICAICTA 2024 11th International Conference on Advanced Informatics: Concept, Theory and Application, 2024. paper code
-
Heakl, A., Zaghloul, Y., Ali, M., et al. ArzEn-LLM: Code-switched Egyptian Arabic-English translation and speech recognition using LLMs. Procedia Computer Science, 2024. paper code
-
Hashmi, E., Yayilgan, S. Y., Yamin, M. M., et al. Self-supervised hate speech detection in Norwegian texts with lexical and semantic augmentations. Expert Systems with Applications, 2024. paper code
-
Xu, F., Zhou, T., Nguyen, T., et al. Integrating augmented reality and LLM for enhanced cognitive support in critical audio communications. International Journal of Human-Computer Studies, 2024. paper code
-
Cook, A., & Karakuş, O. LLM-Commentator: Novel fine-tuning strategies of large language models for automatic commentary generation using football event data. Knowledge-Based Systems, 2024. paper code
-
Gkournelos, C., Konstantinou, C., & Makris, S. An LLM-based approach for enabling seamless human-robot collaboration in assembly. CIRP Annals, 2024. paper code
-
Alier, M., Pereira, J., García-Peñalvo, F. J., et al. LAMB: An open-source software framework to create AI assistants deployed and integrated into LMS. Computer Standards & Interfaces, 2025. paper code
-
Senthilselvi, A., Prawin, R., et al. Abstractive summarization of YouTube videos using Lamini-Flan-T5 LLM. ICAIT 2024 Second International Conference on Advances in Information Technology, 2024. paper code
-
Wang, M., Shafran, I., Soltau, H., et al. Retrieval augmented end-to-end spoken dialog models. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
-
Qiu, P., Wu, C., Zhang, X., et al. Towards building multilingual language model for medicine. Nature Communications, 2024. paper code
-
Hasebe, K., Fujimura, S., Kojima, T., et al. The effect of noise on deep learning for classification of pathological voice. The Laryngoscope, 2024. paper code
-
Ghosh, S., Kumar, S., Kong, Z., et al. Synthio: Augmenting small-scale audio classification datasets with synthetic data. arXiv preprint arXiv:2410.02056, 2024. paper code
-
Whitehouse, C., Choudhury, M., & Aji, A. F. LLM-powered data augmentation for enhanced cross-lingual performance. arXiv preprint arXiv:2305.14288, 2023. paper code
-
Ghosal, D., Majumder, N., Mehrish, A., & Poria, S. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023. paper code
-
Goel, A., Kong, Z., Valle, R., & Catanzaro, B. Audio dialogues: Dialogues dataset for audio and music understanding. arXiv preprint arXiv:2404.07616, 2024. paper code
-
Yang, D., Tian, J., Tan, X., et al. UniAudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023. paper code
-
Manco, I., Salamon, J., & Nieto, O. Augment, Drop & Swap: Improving diversity in LLM captions for efficient music-text representation learning. arXiv preprint arXiv:2409.11498, 2024. paper code
-
Li, B., Xie, Z., Xu, X., et al. DiveSound: LLM-assisted automatic taxonomy construction for diverse audio generation. arXiv preprint arXiv:2407.13198, 2024. paper code
-
Wang, Z., Tai, Y.-W., & Tang, C.-K. Audio-Agent: Leveraging LLMs for audio generation, editing, and composition. arXiv preprint arXiv:2410.03335, 2024. paper code
-
Shu, F., Zhang, L., Jiang, H., & Xie, C. Audio-visual LLM for video understanding. arXiv preprint arXiv:2312.06720, 2023. paper code
-
Lei, Z., Na, X., Xu, M., et al. Contextualization of ASR with LLM using phonetic retrieval-based augmentation. arXiv preprint arXiv:2409.15353, 2024. paper code
-
Huang, J., Ren, Y., Huang, R., et al. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023. paper code
-
Ok, H., Yoo, S., & Lee, J. AudioBERT: Audio knowledge augmented language model. arXiv preprint arXiv:2409.08199, 2024. paper code
-
Lu, Y., Xie, Y., Fu, R., et al. Codecfake: An initial dataset for detecting LLM-based deepfake audio. arXiv preprint arXiv:2406.08112, 2024. paper code
-
Das, N., Dingliwal, S., Ronanki, S., et al. SpeechVerse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295, 2024. paper code
-
Sridhar, A. K., Guo, Y., & Visser, E. Enhancing temporal understanding in audio question answering for large audio language models. arXiv preprint arXiv:2409.06223, 2024. paper code
-
Vallaeys, T., Shukor, M., Cord, M., & Verbeek, J. Improved baselines for data-efficient perceptual augmentation of LLMs. arXiv preprint arXiv:2403.13499, 2024. paper code
If you found our work useful for your research or work, please consider citing it:
Sapkota, R., Raza, S., Shoman, M., Paudel, A. and Karkee, M., 2025. Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey. arXiv preprint arXiv:2501.18648.