A Comprehensive Survey on Text, Audio, and Image Data Augmentation Using Multi-Modal LLMs for Deep Learning Applications

Data-Aug-Multi-Modal-LLM

This repo contains all the relevant papers and information used in our study. It will be updated periodically as we revise our manuscript throughout the publication process.

The papers used in this study are organised by modality, and their links can be found below:

Text Data Augmentation

Peer-Reviewed Papers

Ahmed, T., Pai, K. S., Devanbu, P., & Barr, E. Automatic semantic augmentation of language model prompts (for code summarization). Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024. paper code - NOCODE
Cai, X., Xiao, M., Ning, Z., & Zhou, Y. Resolving the imbalance issue in hierarchical disciplinary topic inference via LLM-based data augmentation. 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 2023. paper code - NOCODE
Cloutier, N. A., & Japkowicz, N. Fine-tuned generative LLM oversampling can improve performance over traditional techniques on multiclass imbalanced text classification. 2023 IEEE International Conference on Big Data (BigData), 2023. paper code
Santos, V. G., Santos, G. L., Lynn, T., & Benatallah, B. Identifying citizen-related issues from social media using LLM-based data augmentation. International Conference on Advanced Information Systems Engineering, 2024. paper code
Hu, L., He, H., Wang, D., Zhao, Z., Shao, Y., & Nie, L. LLM vs small model? Large language model-based text augmentation enhanced personality detection model. Proceedings of the AAAI Conference on Artificial Intelligence, 2024. paper code
Hua, J., Cui, X., Li, X., Tang, K., & Zhu, P. Multimodal fake news detection through data augmentation-based contrastive learning. Applied Soft Computing, 2023. paper code
Jung, H., Yeen, H., Lee, J., Kim, M., Bang, N., & Koo, M.-W. Enhancing task-oriented dialog system with subjective knowledge: A large language model-based data augmentation framework. Proceedings of The Eleventh Dialog System Technology Challenge, 2023. paper code
Lai, J., Yang, X., Luo, W., Zhou, L., Li, L., Wang, Y., & Shi, X. RumorLLM: A rumor large language model-based fake-news-detection data-augmentation approach. Applied Sciences, 2024. paper code
Meng, Z., Liu, T., Zhang, H., Feng, K., & Zhao, P. CEAN: Contrastive event aggregation network with LLM-based augmentation for event extraction. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024. paper code
Silva, K., Frommholz, I., Can, B., Blain, F., Sarwar, R., & Ugolini, L. Forged-GAN-BERT: Authorship attribution for LLM-generated forged novels. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2024. paper code
Wan, M., Safavi, T., Jauhar, S. K., Kim, Y., Counts, S., Neville, J., Suri, S., Shah, C., White, R. W., Yang, L., et al. TnT-LLM: Text mining at scale with large language models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024. paper code
Wu, S.-L., Chang, X., Wichern, G., Jung, J.-W., Germain, F., Le Roux, J., & Watanabe, S. Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
Zhang, J., Gao, H., Zhang, P., Feng, B., Deng, W., & Hou, Y. LA-UCL: LLM-augmented unsupervised contrastive learning framework for few-shot text classification. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024. paper code
Zhang, M., Jiang, G., Liu, S., Chen, J., & Zhang, M. LLM-assisted data augmentation for Chinese dialogue-level dependency parsing. Computational Linguistics, 2024. paper code
Zhao, H., Chen, H., Ruggles, T. A., Feng, Y., Singh, D., & Yoon, H.-J. Improving text classification with large language model-based data augmentation. Electronics, 2024. paper code

Preprints

Kang, A., Chen, J. Y., Lee-Youngzie, Z., & Fu, S. Synthetic data generation with LLM for improved depression prediction. arXiv preprint arXiv:2411.17672, 2024. paper code
Song, S., Subramanyam, A., Madejski, I., & Grossman, R. L. Lab-RAG: Label boosted retrieval augmented generation for radiology report generation. arXiv preprint arXiv:2411.16523, 2024. paper code
Fischer, L., Gao, Y., Lintner, A., & Ebling, S. SwissADT: An audio description translation system for Swiss languages. arXiv preprint arXiv:2411.14967, 2024. paper code
Glazkova, A., & Zakharova, O. Evaluating LLM prompts for data augmentation in multi-label classification of ecological texts. arXiv preprint arXiv:2411.14896, 2024. paper code
Wen, Z., Guo, D., & Zhang, H. AIDBench: A benchmark for evaluating the authorship identification capability of large language models. arXiv preprint arXiv:2411.13226, 2024. paper code
Alyafeai, Z., et al. Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic. arXiv preprint arXiv:2412.04277, 2024. paper code
Liu, J., & Nguyen, A. Rephrasing electronic health records for pretraining clinical language models. arXiv preprint arXiv:2411.18940, 2024. paper code
Abane, A., Bekri, A., & Battou, A. FastRAG: Retrieval augmented generation for semi-structured data. arXiv preprint arXiv:2411.13773, 2024. paper code
Yang, M., Shi, B., Le, M., et al. AudioBox TTA-RAG: Improving zero-shot and few-shot text-to-audio with retrieval-augmented generation. arXiv preprint arXiv:2411.05141, 2024. paper code
Fuad, K. A. A., & Chen, L. LLM-Ref: Enhancing reference handling in technical writing with large language models. arXiv preprint arXiv:2411.00294, 2024. paper code
Wang, Z., Xu, G., & Ren, M. LLM-generated natural language meets scaling laws: New explorations and data augmentation methods. arXiv preprint arXiv:2407.00322, 2024. paper code
Dai, H., Liu, Z., & Wu, Z. AugGPT: Leveraging ChatGPT for text data augmentation. arXiv preprint arXiv:2302.13007, 2023. paper code
Cegin, J., Simko, J., & Brusilovsky, P. LLMs vs established text augmentation techniques for classification: When do the benefits outweigh the costs? arXiv preprint arXiv:2408.16502, 2024. paper code
Lee, N., Wattanawong, T., Kim, S., et al. LLM2LLM: Boosting LLMs with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024. paper code
Song, Y., Zhang, J., Tian, Z., et al. LLM-based privacy data augmentation guided by knowledge distillation with a distribution tutor for medical text classification. arXiv preprint arXiv:2402.16515, 2024. paper code
Yang, H., Zhao, X., Huang, S., et al. LATEX-GCL: Large language models (LLMs)-based data augmentation for text-attributed graph contrastive learning. arXiv preprint arXiv:2409.01145, 2024. paper code
Liu, Y., Zhu, Y., Gu, Z., et al. Improving topic relevance model by mix-structured summarization and LLM-based data augmentation. arXiv preprint arXiv:2404.02616, 2024. paper code
Cegin, J., Pecher, B., Simko, J., et al. Use random selection for now: Investigation of few-shot selection strategies in LLM-based text augmentation for classification. arXiv preprint arXiv:2410.10756, 2024. paper code
Jia, K., Wu, Y., & Li, R. Curriculum-style data augmentation for LLM-based metaphor detection. arXiv preprint arXiv:2412.02956, 2024. paper code
Jung, K., Seo, Y., Cho, S., et al. DALDA: Data augmentation leveraging diffusion model and LLM with adaptive guidance scaling. arXiv preprint arXiv:2409.16949, 2024. paper code
Cegin, J., Pecher, B., Simko, J., et al. Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation. arXiv preprint arXiv:2401.06643, 2024. paper code
Zeng, L. Leveraging large language models for code-mixed data augmentation in sentiment analysis. arXiv preprint arXiv:2411.00691, 2024. paper code
Litake, O., Yagnik, N., & Labhsetwar, S. Inditext boost: Text augmentation for low resource Indian languages. arXiv preprint arXiv:2401.13085, 2024. paper code
Sahu, G., Vechtomova, O., Bahdanau, D., & Laradji, I. H. PromptMix: A class boundary augmentation method for large language model distillation. arXiv preprint arXiv:2310.14192, 2023. paper code
Chowdhury, A. G., & Chadha, A. Generative data augmentation using LLMs improves distributional robustness in question answering. arXiv preprint arXiv:2309.06358, 2023. paper code
Wang, L., Yu, L., Zhang, Y., & Xie, H. Large language model-based augmentation for imbalanced node classification on text-attributed graphs. arXiv preprint arXiv:2410.16882, 2024. paper code

Image Data Augmentation

Peer-Reviewed Papers

Sapkota, R., Meng, Z., & Karkee, M. Synthetic meets authentic: Leveraging LLM generated datasets for YOLO11 and YOLOv10-based apple detection through machine vision sensors. Smart Agricultural Technology, 2024. paper code
Yuan, J., Tang, R., Jiang, X., & Hu, X. Large language models for healthcare data augmentation: An example on patient-trial matching. AMIA Annual Symposium Proceedings, 2023. paper code
Li, H., Chen, B., Chen, J., et al. ITIMCA: Image-text information and cross-attention for multi-modal cassava leaf disease classification based on a novel multi-modal dataset in natural environments. Crop Protection, 2024. paper code
Liu, Y., Zhu, Y., Gu, Z., et al. Enhanced dual contrast representation learning with cell separation and merging for breast cancer diagnosis. Computer Vision and Image Understanding, 2024. paper code
Kirilenko, D., Andreychuk, A., Panov, A. I., & Yakovlev, K. Generative models for grid-based and image-based pathfinding. Artificial Intelligence, 2024. paper code
Jindal, N., Kumaresan, P. K., Ponnusamy, R., et al. MISTRA: Misogyny detection through text–image fusion and representation analysis. Natural Language Processing Journal, 2024. paper code
Li, J., Guan, Z., Wang, J., et al. Integrated image-based deep learning and language models for primary diabetes care. Nature Medicine, 2024. paper code
Liu, F., Zhu, T., Wu, X., et al. A medical multimodal large language model for future pandemics. NPJ Digital Medicine, 2023. paper code
Cortacero, K., McKenzie, B., Müller, S., et al. Evolutionary design of explainable algorithms for biomedical image segmentation. Nature Communications, 2023. paper code
Raminedi, S., Shridevi, S., & Won, D. Multi-modal transformer architecture for medical image analysis and automated report generation. Scientific Reports, 2024. paper code
Wang, Y., Shi, X., & Zhao, X. MLLM4Rec: Multimodal information enhancing LLM for sequential recommendation. Journal of Intelligent Information Systems, 2024. paper code
Bet, M., Mălan, A., Aldinucci, M., et al. DALLMi: Domain adaption for LLM-based multi-label classifier. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2024. paper code
Sheik, R., Sundara, K. P., & Nirmala, S. J. Neural data augmentation for legal overruling task: Small deep learning models vs. large language models. Neural Processing Letters, 2024. paper code

Preprints

Wu, W., Qiu, X., Song, S., et al. Image augmentation agent for weakly supervised semantic segmentation. arXiv preprint arXiv:2412.20439, 2024. paper code
Qian, R., Yin, X., & Dou, D. Reasoning to attend: Try to understand how <SEG> token works. arXiv preprint arXiv:2412.17741, 2024. paper code
Yin, S., Fu, C., Zhao, S., et al. T2Vid: Translating long text into multi-image is the catalyst for video-LLMs. arXiv preprint arXiv:2411.19951, 2024. paper code
Song, S., Subramanyam, A., Madejski, I., & Grossman, R. L. Lab-RAG: Label boosted retrieval augmented generation for radiology report generation. arXiv preprint arXiv:2411.16523, 2024. paper code
Lingenberg, T., Reuter, M., Sudhakaran, G., et al. DIAGen: Diverse image augmentation with generative models. arXiv preprint arXiv:2408.14584, 2024. paper code
Li, J., Zhang, F., Zhu, J., et al. ForgeryGPT: Multimodal large language model for explainable image forgery detection and localization. arXiv preprint arXiv:2410.10238, 2024. paper code
Sultan, O., Khasin, A., Shiran, G., et al. Visual editing with LLM-based tool chaining: An efficient distillation approach for real-time applications. arXiv preprint arXiv:2410.02952, 2024. paper code
Jin, J., Wang, X., Zhu, Q., et al. Pedestrian attribute recognition: A new benchmark dataset and a large language model augmented framework. arXiv preprint arXiv:2408.09720, 2024. paper code
Hsieh, C., Moreira, C., Nobre, I. B., et al. DALL-M: Context-aware clinical data augmentation with LLMs. arXiv preprint arXiv:2407.08227, 2024. paper code
Liu, J., Huang, X., Zheng, J., et al. MM-Instruct: Generated visual instructions for large multimodal model alignment. arXiv preprint arXiv:2406.19736, 2024. paper code

Audio/Voice Data Augmentation

Peer-Reviewed Papers

Wu, S.-L., Chang, X., Wichern, G., et al. Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
Dhingra, P., Agrawal, S., Veerappan, C. S., et al. Speech de-identification data augmentation leveraging large language model. IEEE International Conference on Asian Language Processing (IALP), 2024. paper code
Cai, Z., Ghosh, S., Adatia, A. P., et al. AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset. ACM International Conference on Multimedia, 2024. paper code
Ma, Z., Wu, W., Zheng, Z., et al. Leveraging speech PTM, text LLM, and emotional TTS for speech emotion recognition. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
Dhingra, P., Agrawal, S., Veerappan, C. S., et al. Speech de-identification data augmentation leveraging large language model. ICAICTA 2024 11th International Conference on Advanced Informatics: Concept, Theory and Application, 2024. paper code
Heakl, A., Zaghloul, Y., Ali, M., et al. ArzEn-LLM: Code-switched Egyptian Arabic-English translation and speech recognition using LLMs. Procedia Computer Science, 2024. paper code
Hashmi, E., Yayilgan, S. Y., Yamin, M. M., et al. Self-supervised hate speech detection in Norwegian texts with lexical and semantic augmentations. Expert Systems with Applications, 2024. paper code
Xu, F., Zhou, T., Nguyen, T., et al. Integrating augmented reality and LLM for enhanced cognitive support in critical audio communications. International Journal of Human-Computer Studies, 2024. paper code
Cook, A., & Karakuş, O. LLM-Commentator: Novel fine-tuning strategies of large language models for automatic commentary generation using football event data. Knowledge-Based Systems, 2024. paper code
Gkournelos, C., Konstantinou, C., & Makris, S. An LLM-based approach for enabling seamless human-robot collaboration in assembly. CIRP Annals, 2024. paper code
Alier, M., Pereira, J., García-Peñalvo, F. J., et al. LAMB: An open-source software framework to create AI assistants deployed and integrated into LMS. Computer Standards & Interfaces, 2025. paper code
Senthilselvi, A., Prawin, R., et al. Abstractive summarization of YouTube videos using Lamini-Flan-T5 LLM. ICAIT 2024 Second International Conference on Advances in Information Technology, 2024. paper code
Wang, M., Shafran, I., Soltau, H., et al. Retrieval augmented end-to-end spoken dialog models. ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024. paper code
Qiu, P., Wu, C., Zhang, X., et al. Towards building multilingual language model for medicine. Nature Communications, 2024. paper code
Hasebe, K., Fujimura, S., Kojima, T., et al. The effect of noise on deep learning for classification of pathological voice. The Laryngoscope, 2024. paper code

Preprints

Xu, D. AudioSetMix: Enhancing audio-language datasets with LLM-assisted augmentations. arXiv preprint arXiv:2405.11093, 2024. paper code
Ghosh, S., Kumar, S., Kong, Z., et al. Synthio: Augmenting small-scale audio classification datasets with synthetic data. arXiv preprint arXiv:2410.02056, 2024. paper code
Whitehouse, C., Choudhury, M., & Aji, A. F. LLM-powered data augmentation for enhanced cross-lingual performance. arXiv preprint arXiv:2305.14288, 2023. paper code
Ghosal, D., Majumder, N., Mehrish, A., & Poria, S. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023. paper code
Goel, A., Kong, Z., Valle, R., & Catanzaro, B. Audio dialogues: Dialogues dataset for audio and music understanding. arXiv preprint arXiv:2404.07616, 2024. paper code
Yang, D., Tian, J., Tan, X., et al. UniAudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023. paper code
Manco, I., Salamon, J., & Nieto, O. Augment, Drop & Swap: Improving diversity in LLM captions for efficient music-text representation learning. arXiv preprint arXiv:2409.11498, 2024. paper code
Li, B., Xie, Z., Xu, X., et al. DiveSound: LLM-assisted automatic taxonomy construction for diverse audio generation. arXiv preprint arXiv:2407.13198, 2024. paper code
Wang, Z., Tai, Y.-W., & Tang, C.-K. Audio-Agent: Leveraging LLMs for audio generation, editing, and composition. arXiv preprint arXiv:2410.03335, 2024. paper code
Shu, F., Zhang, L., Jiang, H., & Xie, C. Audio-visual LLM for video understanding. arXiv preprint arXiv:2312.06720, 2023. paper code
Lei, Z., Na, X., Xu, M., et al. Contextualization of ASR with LLM using phonetic retrieval-based augmentation. arXiv preprint arXiv:2409.15353, 2024. paper code
Huang, J., Ren, Y., Huang, R., et al. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023. paper code
Ok, H., Yoo, S., & Lee, J. AudioBERT: Audio knowledge augmented language model. arXiv preprint arXiv:2409.08199, 2024. paper code
Lu, Y., Xie, Y., Fu, R., et al. Codecfake: An initial dataset for detecting LLM-based deepfake audio. arXiv preprint arXiv:2406.08112, 2024. paper code
Das, N., Dingliwal, S., Ronanki, S., et al. SpeechVerse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295, 2024. paper code
Sridhar, A. K., Guo, Y., & Visser, E. Enhancing temporal understanding in audio question answering for large audio language models. arXiv preprint arXiv:2409.06223, 2024. paper code
Vallaeys, T., Shukor, M., Cord, M., & Verbeek, J. Improved baselines for data-efficient perceptual augmentation of LLMs. arXiv preprint arXiv:2403.13499, 2024. paper code

Citation

If you found our work useful for your research or work, please consider citing it:

Sapkota, R., Raza, S., Shoman, M., Paudel, A. and Karkee, M., 2025. Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey. arXiv preprint arXiv:2501.18648.
