Spark RDD Overview

Spark RDD overview

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of data records that resides on multiple nodes.

Resilient Distributed Dataset (RDD)
RDDs are the building blocks of any Spark application. RDD stands for:

Resilient: Fault-tolerant and capable of rebuilding data on failure
Distributed: Data is distributed among the multiple nodes in a cluster
Dataset: Collection of partitioned data with values

An RDD is a layer of abstraction over a distributed collection of data. It is immutable in nature, and its transformations are evaluated lazily.
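To make the laziness concrete, here is a minimal sketch, assuming a local spark-shell session where the SparkContext sc is already provided: a transformation such as map returns a new RDD without touching the data, and Spark only computes when an action is called.

// spark-shell session: `sc` (the SparkContext) is provided by the shell
val numbers = sc.parallelize(1 to 10)   // an RDD[Int] distributed across the local cores
val doubled = numbers.map(_ * 2)        // transformation: builds a NEW RDD, nothing is computed yet
// `numbers` itself is left untouched (immutable); the work only runs when an action is invoked:
println(doubled.reduce(_ + _))          // 110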

How Spark RDDs work

Now you might be wondering how it works. The data in an RDD is split into chunks called partitions, which are distributed across the executor nodes in the cluster. RDDs are highly resilient: if an executor node fails, the lost partitions can be rebuilt on another node from the RDD's lineage of transformations, so processing continues. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes.
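As an illustration of how a dataset gets split into partitions, here is a small sketch, again assuming a spark-shell session with sc available; the partition count of 4 is an arbitrary choice for the example.

// Split a collection into 4 partitions (chunks) across the available executors
val data = sc.parallelize(1 to 12, numSlices = 4)
println(data.getNumPartitions)          // 4
// glom() gathers each partition's elements into an array so the chunks can be inspected
data.glom().collect().foreach(chunk => println(chunk.mkString(", ")))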

Moreover, once you create an RDD it is immutable. By immutable I mean an object whose state cannot be modified after it is created; instead, an RDD can be transformed into new RDDs.

There are two ways to create RDDs: by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase.
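Both approaches look roughly like this in a spark-shell session; the HDFS path below is only a placeholder, not a real dataset.

// 1. Parallelize an existing collection held in the driver program
val fromCollection = sc.parallelize(Seq("spark", "rdd", "example"))
println(fromCollection.count())         // 3

// 2. Reference a dataset in external storage (placeholder path)
val fromFile = sc.textFile("hdfs:///path/to/input.txt")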

RDD operations

Transformations: Operations applied to an RDD to create a new RDD, such as map or filter (see the sketch below).
Actions: Operations applied to an RDD that instruct Apache Spark to run the computation and return the result to the driver, such as count or collect.
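The following sketch chains a couple of transformations and then triggers them with actions; the sample sentences are made up for the example, and sc is assumed to come from a spark-shell session.

// Transformations: each call returns a new RDD and is evaluated lazily
val lines = sc.parallelize(Seq("spark makes rdds", "rdds are resilient", "actions return results"))
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 4)

// Actions: trigger the computation and send the result back to the driver
println(longWords.count())              // 6
longWords.collect().foreach(println)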
