Vol. 2 No. 2 (2022): Journal of AI-Assisted Scientific Discovery

Leveraging in-memory computing for speeding up Apache Spark and Hadoop distributed data processing

Sarbaree Mishra
Program Manager at Molina Healthcare Inc., USA
Vineela Komandla
Vice President - Product Manager, JP Morgan Chase, USA
Srikanth Bandi
Software Engineer, JP Morgan Chase, USA

Published 26-09-2022

Keywords

  • data caching
  • cluster computing

How to Cite

[1]
Sarbaree Mishra, Vineela Komandla, and Srikanth Bandi, “Leveraging in-memory computing for speeding up Apache Spark and Hadoop distributed data processing”, Journal of AI-Assisted Scientific Discovery, vol. 2, no. 2, pp. 304–328, Sep. 2022, Accessed: Dec. 24, 2024. [Online]. Available: https://scienceacadpress.com/index.php/jaasd/article/view/239

Abstract

In-memory computing has emerged as a transformative approach in distributed data processing, revolutionizing frameworks like Apache Spark and Hadoop by addressing the limitations of traditional disk-based methods. These conventional approaches, while reliable, often encounter significant delays due to disk I/O bottlenecks, especially with the ever-increasing size and complexity of modern data workloads. In-memory computing overcomes these challenges by leveraging RAM to store and process data, significantly reducing latency and accelerating computation. Apache Spark capitalizes on this concept through its Resilient Distributed Dataset (RDD) model, which retains intermediate data in memory to optimize iterative tasks and minimize disk write operations. Similarly, Hadoop has evolved to enhance performance by integrating in-memory capabilities, such as YARN's memory-based caching. This approach is crucial for workloads requiring real-time analytics, iterative machine learning processes, and high-frequency data pipelines, where speed and responsiveness are paramount. Beyond faster processing, in-memory computing improves scalability and resource utilization by allowing more efficient partitioning, caching, and task execution. It also aligns seamlessly with advancements in hardware, such as high-speed RAM and solid-state drives, amplifying the performance gains. Moreover, optimized data partitioning, compression, and dynamic memory management ensure that systems can handle larger datasets while maintaining low latency and high throughput. This integration reduces processing overhead and empowers organizations to make faster, more informed decisions. By shifting the focus from disk-reliant operations to memory-centric processing, in-memory computing redefines the capabilities of distributed systems, ensuring they can meet the growing demands of modern data-driven applications.
It is not merely an enhancement to existing frameworks but a paradigm shift that enables businesses to unlock the full potential of their data, offering a robust foundation for scalability, adaptability, and efficiency in distributed computing environments.
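The caching idea at the heart of Spark's RDD model can be illustrated with a small, self-contained Python sketch. This is plain Python, not the Spark API; the `CachedDataset` class and its methods are hypothetical stand-ins. An expensive transformation is computed once, pinned in memory, and reused by later actions instead of being recomputed from disk each time.

```python
import time

class CachedDataset:
    """Toy stand-in for an RDD-like dataset: recompute by default, reuse when cached."""
    def __init__(self, load_fn, transform_fn):
        self._load = load_fn            # simulates reading input from disk
        self._transform = transform_fn  # simulates an expensive transformation
        self._cached = None             # in-memory copy, populated by cache()

    def cache(self):
        # Materialize the transformed data once and keep it in RAM.
        self._cached = [self._transform(x) for x in self._load()]
        return self

    def collect(self):
        # Cached: serve from memory. Uncached: reload and recompute every time.
        if self._cached is not None:
            return list(self._cached)
        return [self._transform(x) for x in self._load()]

def slow_load():
    time.sleep(0.05)  # pretend disk I/O latency
    return range(5)

ds = CachedDataset(slow_load, lambda x: x * x)

start = time.perf_counter()
first = ds.collect()            # uncached: pays the simulated disk cost
uncached_time = time.perf_counter() - start

ds.cache()                      # pin the transformed data in memory
start = time.perf_counter()
second = ds.collect()           # cached: served from RAM, no reload
cached_time = time.perf_counter() - start

print(first == second, cached_time < uncached_time)  # → True True
```

In actual Spark, the analogous calls are `rdd.cache()` or `rdd.persist()`, which retain partitions in executor memory so that iterative jobs skip rereading input from HDFS on every action.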

