Vol. 3 No. 2 (2023): Journal of AI-Assisted Scientific Discovery

Generative AI for Data Augmentation in Machine Learning

Naresh Dulam
Vice President Sr Lead Software Engineer, JP Morgan Chase, USA
Kishore Reddy Gade
Vice President, Lead Software Engineer, JP Morgan Chase, USA
Venkataramana Gosukonda
Senior Software Engineering Manager, Wells Fargo, USA

Published 06-09-2023

Keywords

  • Generative AI
  • data augmentation

How to Cite

[1]
Naresh Dulam, Kishore Reddy Gade, and Venkataramana Gosukonda, “Generative AI for Data Augmentation in Machine Learning”, Journal of AI-Assisted Scientific Discovery, vol. 3, no. 2, pp. 665–688, Sep. 2023, Accessed: Dec. 23, 2024. [Online]. Available: https://scienceacadpress.com/index.php/jaasd/article/view/232

Abstract

Generative Artificial Intelligence (AI) has become a powerful tool in machine learning, particularly for data augmentation. Data augmentation expands a dataset by creating new, synthetic samples that mirror the characteristics of the original data, which can improve model performance. As machine learning tasks grow more complex, particularly in image recognition, natural language processing, and speech recognition, the demand for diverse and extensive datasets continues to grow. Generative AI addresses this demand by generating high-quality synthetic data to supplement real-world datasets. Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models have become central to data augmentation. These methods generate data that both enlarges the dataset and introduces variety, helping to produce more robust machine learning models. GANs generate new samples by pitting two neural networks, a generator and a discriminator, against each other, while VAEs learn a compact latent representation of the data from which new instances are sampled. Diffusion models, in turn, have shown promise in producing highly realistic data by gradually refining noise into a usable sample. Using these generative models for data augmentation has improved model accuracy, generalization, and robustness across a range of machine learning tasks, especially where labelled data is limited. However, integrating generative AI also raises specific challenges. One primary concern is bias in the generated data, which can unintentionally skew model predictions. There are also ethical considerations, particularly around the use of synthetic data in sensitive applications and the potential for misuse.
Despite these challenges, the future of generative AI in data augmentation looks promising, with potential applications extending beyond traditional machine learning tasks. Its ability to create diverse datasets will continue to play a crucial role in advancing the field of machine learning, offering new solutions to data scarcity, bias, and generalization problems.
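
The augmentation workflow described in the abstract (fit a generative model to scarce real data, sample synthetic examples, and add them to the training set) can be illustrated with a deliberately simple stand-in. The sketch below is not the authors' method and does not implement a GAN, VAE, or diffusion model. It assumes a per-class diagonal Gaussian as the generative model so that the example stays self-contained. The function names `fit_class_gaussians` and `augment`, and the toy data, are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def fit_class_gaussians(samples):
    """Fit one independent Gaussian per feature for each class label.

    samples: list of (features, label) pairs, features being a list of floats.
    Returns {label: [(mean, stdev), ...]} with one pair per feature.
    Assumption: a diagonal Gaussian stands in for a learned generative model.
    """
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)

    model = {}
    for label, rows in by_label.items():
        columns = list(zip(*rows))  # transpose rows into per-feature columns
        model[label] = [
            (statistics.mean(col), statistics.pstdev(col)) for col in columns
        ]
    return model


def augment(samples, n_per_class, seed=0):
    """Return the real samples followed by n_per_class synthetic ones per label."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    model = fit_class_gaussians(samples)
    synthetic = []
    for label, params in model.items():
        for _ in range(n_per_class):
            features = [rng.gauss(mu, sigma) for mu, sigma in params]
            synthetic.append((features, label))
    return samples + synthetic


# Toy, made-up dataset: two classes with very few labelled examples.
real = [
    ([1.0, 2.0], "a"), ([1.2, 1.8], "a"), ([0.9, 2.1], "a"),
    ([5.0, 6.0], "b"), ([5.3, 5.7], "b"),
]
augmented = augment(real, n_per_class=10)
print(len(real), len(augmented))  # prints: 5 25
```

Because the synthetic samples are drawn from statistics of the real data, any imbalance or bias in that data carries over into the augmented set, which is the concern the abstract raises. A GAN, VAE, or diffusion model would replace the Gaussian fit and sampling steps while the surrounding workflow stays the same.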

