Spark Ecosystem

Estimated reading: 2 minutes 880 views

Spark Ecosystem Overview

The Apache Spark ecosystem is an open-source distributed cluster-computing framework. Spark is a data processing engine developed to provide faster and easier analytics than Hadoop MapReduce. Background: Apache Spark started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.

types of ecosystem SHOWN BELOW

1.) Apache Spark Core

2.) Spark SQL

3.) Spark Streaming

4.) MLlib (Machine Learning Library)

5.) GraphX

Apache Spark Core

Spark Core is the underlying general execution engine for spark platform that all other functionality is built upon. It provides In-Memory computing and referencing datasets in external storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming

Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework above Spark because of the distributed memory-based Spark architecture. It is, according to benchmarks, done by the MLlib developers against the Alternating Least Squares (ALS) implementations. Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model the user-defined graphs by using Pregel abstraction API. It also provides an optimized runtime for this abstraction.