Vol. 1 No. 1 (2021): Journal of AI-Assisted Scientific Discovery

Automated Data Pipeline Creation: Leveraging ML algorithms to design and optimize data pipelines

Muneer Ahmed Salamkar
Senior Associate at JP Morgan Chase, USA
Jayaram Immaneni
SRE Lead, JP Morgan Chase, USA

Published 02-06-2021

Keywords

  • Data engineering automation
  • ML-driven ETL processes

How to Cite

[1]
Muneer Ahmed Salamkar and Jayaram Immaneni, “Automated Data Pipeline Creation: Leveraging ML algorithms to design and optimize data pipelines”, Journal of AI-Assisted Scientific Discovery, vol. 1, no. 1, pp. 230–250, Jun. 2021, Accessed: Dec. 24, 2024. [Online]. Available: https://scienceacadpress.com/index.php/jaasd/article/view/220

Abstract

Automated data pipeline creation, powered by machine learning (ML) algorithms, changes how businesses design, manage, and optimize their data workflows. Traditionally, building and maintaining data pipelines is a manual, time-consuming, and error-prone task that requires constant adjustment to accommodate changes in data sources, formats, and processing needs. This approach leads to inefficiencies and delays, particularly as the volume and complexity of data continue to grow. By integrating ML, businesses can automate pipeline creation and optimization, reducing the time, effort, and cost involved. ML algorithms analyze historical data to identify patterns and trends, and use techniques such as reinforcement learning to continuously improve pipeline design and performance. As a result, pipelines become adaptive and self-optimizing, adjusting to new data requirements without manual intervention. The ability to detect bottlenecks, predict potential issues, and suggest performance improvements further enhances pipeline efficiency, scalability, and reliability. ML-powered pipelines can also self-correct, addressing problems before they cause significant disruption or downtime and keeping data flowing without interruption, which is crucial for maintaining reliability and minimizing the risk of system failures. Additionally, ML models provide real-time feedback that lets businesses fine-tune their pipelines continuously, keeping them resilient to changes in data sources or volume and able to scale with growing processing and analysis demands. Businesses benefit from streamlined workflows, reduced operational costs, improved scalability, and better insights, which together support faster, data-driven decision-making. By leveraging ML in data pipeline creation, organizations can keep pace with a fast-moving, data-centric environment.
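
To make the idea of detecting bottlenecks from historical data concrete, the following is a minimal sketch and not the authors' method. It assumes per-stage run times are logged, and it uses a simple z-score rule as a statistical stand-in for a learned model. The function name `find_bottlenecks`, the stage names, the durations, and the threshold are all hypothetical.

```python
from statistics import mean, stdev


def find_bottlenecks(history, latest, threshold=3.0):
    """Return stages whose latest run time is unusually slow versus history.

    history:   dict mapping stage name -> list of past durations (seconds)
    latest:    dict mapping stage name -> most recent duration (seconds)
    threshold: standard deviations above the historical mean that count as slow
    """
    flagged = []
    for stage, durations in history.items():
        # Skip stages with no recent run or too little history to estimate spread.
        if stage not in latest or len(durations) < 2:
            continue
        mu = mean(durations)
        sigma = stdev(durations)
        if sigma == 0:
            # Perfectly constant history: flag any run slower than that constant.
            if latest[stage] > mu:
                flagged.append(stage)
            continue
        z_score = (latest[stage] - mu) / sigma
        if z_score > threshold:
            flagged.append(stage)
    return flagged


# Hypothetical run-time log for a three-stage ETL pipeline (illustrative values).
history = {
    "extract": [10.0, 11.0, 9.5, 10.5],
    "transform": [30.0, 31.0, 29.0, 30.5],
    "load": [5.0, 5.2, 4.9, 5.1],
}
latest = {"extract": 10.2, "transform": 45.0, "load": 5.0}

print(find_bottlenecks(history, latest))  # ['transform']
```

In a system of the kind the abstract describes, the fixed threshold would presumably give way to a model trained on the same history, and the flagged stages would feed an automated tuning step rather than a printed list.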
