Vol. 3 No. 2 (2023): Journal of AI-Assisted Scientific Discovery

Autonomous Decision-Making and Self-Healing Infrastructure Management Using AI Agent Ecosystems in PaaS

Muthuraman Saminathan, Compunnel Software Group, USA
Akhil Reddy Bairi, BetterCloud, USA

Published 10-07-2023

Keywords

  • reinforcement learning
  • large language models
  • autonomous decision-making

How to Cite

[1]
Muthuraman Saminathan and Akhil Reddy Bairi, “Autonomous Decision-Making and Self-Healing Infrastructure Management Using AI Agent Ecosystems in PaaS”, Journal of AI-Assisted Scientific Discovery, vol. 3, no. 2, pp. 740–782, Jul. 2023, Accessed: Jan. 16, 2025. [Online]. Available: https://scienceacadpress.com/index.php/jaasd/article/view/281

Abstract

The rise of Platform-as-a-Service (PaaS) architectures has advanced application development and deployment by abstracting the complexities of infrastructure management. However, traditional approaches to resource scaling, fault detection, and system recovery in PaaS often require human intervention, resulting in inefficiencies and limited resilience. This paper explores the integration of reinforcement learning (RL) agents and large language model (LLM)-based agents to enable autonomous decision-making and self-healing infrastructure management within PaaS environments. By leveraging AI agent ecosystems, this research aims to address the challenges of resource optimization, real-time fault mitigation, and operational continuity in dynamic, high-availability cloud-native systems.

The proposed framework combines the adaptive learning capabilities of reinforcement learning with the contextual reasoning and decision-making strengths of LLMs. RL algorithms are employed to learn optimal resource allocation policies in dynamic workloads, while LLMs enhance decision-making processes by analyzing unstructured data, such as system logs and error messages, to infer actionable insights. The hybrid architecture fosters a symbiotic relationship between the two AI paradigms, enabling a cohesive ecosystem of agents capable of autonomously scaling resources, identifying and resolving faults, and preemptively mitigating system risks in PaaS environments.
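The division of labor described above can be illustrated with a minimal sketch. It is not the authors' implementation: `ClusterState`, `scaling_policy`, `interpret_logs`, and `decide` are hypothetical names, the threshold rule stands in for a learned RL policy, and the keyword matcher stands in for an LLM that reads logs. The one design choice shown, deferring scale-down while a fault is open, is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterState:
    cpu_utilization: float  # fraction of requested CPU in use, 0.0 to 1.0
    replicas: int


def scaling_policy(state: ClusterState) -> int:
    """Hypothetical stand-in for a learned RL policy: returns a replica delta."""
    if state.cpu_utilization > 0.8:
        return 1
    if state.cpu_utilization < 0.3 and state.replicas > 1:
        return -1
    return 0


def interpret_logs(lines: List[str]) -> Optional[str]:
    """Hypothetical stand-in for an LLM: maps log text to a remedy label."""
    for line in lines:
        lower = line.lower()
        if "oomkilled" in lower:
            return "raise_memory_limit"
        if "crashloopbackoff" in lower:
            return "restart_pod"
    return None


def decide(state: ClusterState, lines: List[str]) -> dict:
    """Combine both agents' outputs into one action proposal."""
    remedy = interpret_logs(lines)
    delta = scaling_policy(state)
    # Assumed coordination rule: never scale down while a fault is unresolved.
    if remedy is not None and delta < 0:
        delta = 0
    return {"replica_delta": delta, "remedy": remedy}
```

For example, `decide(ClusterState(0.9, 3), ["pod web-1 OOMKilled"])` proposes adding one replica and raising the memory limit, while a low-utilization cluster with an open fault is left at its current size.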

To demonstrate the practical feasibility of the approach, the study focuses on Kubernetes—a widely adopted container orchestration platform—as a case study. The proposed system is implemented using multi-agent frameworks where AI agents collaborate to monitor cluster states, predict resource demands, and execute self-healing actions through Kubernetes APIs. Advanced RL techniques, such as Proximal Policy Optimization (PPO) and Distributed Q-Learning, are evaluated for their efficiency in managing resource elasticity and fault tolerance. Concurrently, transformer-based LLMs fine-tuned for infrastructure management tasks are employed to interpret system logs and recommend corrective actions with minimal latency.
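For readers unfamiliar with PPO, the clipped surrogate objective at its core can be sketched in a few lines. This is the standard formulation from the PPO literature, not code from the paper; the function names are hypothetical and the clipping range `epsilon=0.2` is a commonly used default assumed here, since the abstract gives no hyperparameters.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_clipped_objective(
    ratios: Sequence[float],
    advantages: Sequence[float],
    epsilon: float = 0.2,  # assumed default; not stated in the abstract
) -> float:
    """Mean clipped surrogate objective over a batch.

    ratios[i] is pi_new(a|s) / pi_old(a|s) for sample i, and
    advantages[i] is the estimated advantage of that action.
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    terms = [
        # Take the pessimistic (smaller) of the unclipped and clipped terms,
        # which removes the incentive to move the policy far in one update.
        min(r * a, clip(r, 1.0 - epsilon, 1.0 + epsilon) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)
```

In an autoscaling setting, each sample would correspond to one scaling action taken in an observed cluster state; the clipping keeps a single batch of unusual workload data from shifting the scaling policy too abruptly.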

Performance evaluations conducted in simulated and real-world PaaS environments highlight the system’s capability to reduce mean time to recovery (MTTR), minimize resource under-utilization, and maintain service-level agreements (SLAs) under varying load conditions. Comparative analysis against traditional rule-based systems and standalone AI solutions reveals the superiority of the proposed hybrid AI agent ecosystem in achieving higher reliability, scalability, and cost-efficiency.
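As a point of reference for the MTTR metric, the sketch below computes it as the average time between fault detection and recovery. The incident representation (pairs of timestamps in seconds) and the function name are assumptions for illustration; the abstract does not specify how the authors measured recovery time.

```python
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[float, float]]) -> float:
    """Average recovery duration in seconds.

    Each incident is a (detected_at, recovered_at) pair of timestamps
    in seconds; this representation is assumed for illustration.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = []
    for detected_at, recovered_at in incidents:
        if recovered_at < detected_at:
            raise ValueError("recovery cannot precede detection")
        durations.append(recovered_at - detected_at)
    return sum(durations) / len(durations)
```

Two incidents lasting 120 and 60 seconds give an MTTR of 90 seconds; comparing this figure between a rule-based baseline and an agent-driven system is one way to quantify the recovery improvement the evaluation describes.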

The paper also discusses implementation challenges, including model convergence issues, computational overheads, and security implications. Potential solutions, such as federated training for decentralized environments and lightweight model architectures for edge deployments, are proposed to address these limitations. Moreover, the broader applicability of the framework to other cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), is explored to demonstrate its versatility.

By bridging the gap between reactive and proactive infrastructure management, this research underscores the transformative potential of combining reinforcement learning and LLM-based decision-making in PaaS ecosystems. The findings contribute to the advancement of autonomous cloud-native infrastructure and offer actionable insights for researchers and practitioners aiming to enhance the reliability and efficiency of next-generation cloud systems.

