Spark Hadoop Distributed File System


Reading data from an HDFS file

First, identify the name of the file to read, then load it as a DataFrame.

				
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/aakash/first.csv")
df.show()
				
			

Writing data to HDFS

First, identify the path and give a folder name for writing; Spark writes the output into that folder as part files.

				
file_location = "/FileStore/tables/second.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_location)
# df.show()
df.write.format("csv").option("header", "true").save("/user/aakash/first_data")
print("Written Successfully")
				
			

Writing all data as a single file

Using repartition(1), we can write the data to HDFS as a single file.

				
df.repartition(1).write.format("csv").option("header", "true").save("/user/aakash/first_repartition")
print("Written Successfully")
				
			

Writing all data as multiple files

Using repartition(n), we can write the data to HDFS as multiple files (here, five part files).

				
df.repartition(5).write.format("csv").option("header", "true").save("/user/aakash/first_repartition_single_File")
print("Written Successfully")
				
			

Writing data with partitionBy (like a group-by on disk)

PySpark's partitionBy() partitions the output based on column values while writing a DataFrame to disk or a file system.
When you write a DataFrame with partitionBy(), PySpark splits the records on the partition column
and stores each partition's data in its own sub-directory.

				
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/aakash/first_second.csv")
df.show()
df.write.partitionBy("MATNR").format("csv").option("header", "true").save("/user/aakash/second_data")
print("Written Successfully")
				
			
