Vol. 1 No. 2 (2021): Journal of AI-Assisted Scientific Discovery
Articles

The Big Data Ecosystem: An overview of critical technologies like Hadoop, Spark, and their roles in data processing landscapes

Muneer Ahmed Salamkar
Senior Associate at JP Morgan Chase, USA
Karthik Allam
Big Data Infrastructure Engineer, JP Morgan Chase, USA
Jayaram Immaneni
SRE Lead, JP Morgan Chase, USA

Published 06-09-2021

Keywords

  • AWS migration,
  • cloud performance tuning,
  • AWS billing optimization

How to Cite

[1]
Muneer Ahmed Salamkar, Karthik Allam, and Jayaram Immaneni, “The Big Data Ecosystem: An overview of critical technologies like Hadoop, Spark, and their roles in data processing landscapes”, Journal of AI-Assisted Scientific Discovery, vol. 1, no. 2, pp. 355–377, Sep. 2021, Accessed: Dec. 23, 2024. [Online]. Available: https://scienceacadpress.com/index.php/jaasd/article/view/218

Abstract

The ability to manage and process vast amounts of information effectively is essential for businesses seeking actionable insights and a competitive advantage. The big data ecosystem, a collection of interconnected tools and technologies, is pivotal to achieving this goal. Among its most significant frameworks are Hadoop and Spark, both open-source platforms that have reshaped data processing. With its distributed file system (HDFS) and batch processing model (MapReduce), Hadoop provides a scalable, fault-tolerant way to handle petabyte-scale datasets across clusters of servers. As data processing demands grew more complex and time-sensitive, Apache Spark emerged to complement Hadoop with in-memory processing, which substantially speeds up iterative and interactive analytics. While Hadoop is well suited to storing and managing large volumes of data, Spark excels at high-speed, near-real-time analytics and at workloads such as machine learning and stream processing. This complementarity has led to their widespread adoption in modern big data architectures, where they are often integrated to leverage each other's strengths. Additional technologies round out the ecosystem: Hive and Pig for querying and processing data, and Kafka and Flink for real-time data streaming. Together, these tools create a flexible and scalable infrastructure that enables organizations to handle massive volumes of data efficiently. In this evolving landscape, Hadoop and Spark remain central, helping businesses address large-scale data challenges and run the high-performance analytics that drive innovation and decision-making.
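
The MapReduce batch model mentioned in the abstract can be illustrated with a small, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show the map, shuffle, and reduce steps that a framework such as Hadoop would distribute across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input; a real job would read blocks from HDFS instead.
sample_lines = ["Hadoop stores data", "Spark processes data", "Hadoop and Spark"]
counts = word_count(sample_lines)
```

In this sketch every phase runs in one process and in memory. In a distributed setting, the map and reduce functions would run on many nodes and the shuffle would move data over the network, which is where the batch model's cost tends to concentrate.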
