Spark with the Hadoop Distributed File System (HDFS)
Reading the data from an HDFS file
First, identify the file name (the HDFS path) so we can read it as a DataFrame.
# Read a CSV file from HDFS into a DataFrame, treating the first row as the header
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/aakash/first.csv")
df.show()
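The snippets on this page assume an existing SparkSession named spark, which the pyspark shell and most notebooks provide automatically. For a standalone script, a minimal setup sketch might look like this; the application name here is just an example:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("hdfs-read-write-demo").getOrCreate()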
Writing the data to HDFS
First, identify the input path, then give a folder name for writing; Spark creates the folder and writes the output into it as part files.
file_location = "/FileStore/tables/second.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_location)
# df.show()
# Save the DataFrame as CSV; Spark writes one part file per partition into the folder
df.write.format("csv").option("header", "true").save("/user/aakash/first_data")
print("Written Successfully")
Writing all data as a single file
Using repartition(1), we can collapse the DataFrame into one partition so that Spark writes a single part file into HDFS.
# One partition in, one part file out
df.repartition(1).write.format("csv").option("header", "true").save("/user/aakash/first_repartition")
print("Written Successfully")
Writing all data as multiple files
Using repartition(5), we can spread the data across five partitions so that Spark writes five part files into HDFS.
# Five partitions in, five part files out
df.repartition(5).write.format("csv").option("header", "true").save("/user/aakash/first_repartition_single_File")
print("Written Successfully")
Writing the data with partitionBy (grouping data by a column)
PySpark's partitionBy() is used to partition the output based on column values while writing a DataFrame to disk. When you write a DataFrame to disk with partitionBy(), PySpark splits the records based on the partition column
and stores each partition's data in its own sub-directory.
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/user/aakash/first_second.csv")
df.show()
# One sub-directory per distinct MATNR value, e.g. MATNR=<value>/part-*.csv
df.write.partitionBy("MATNR").format("csv").option("header", "true").save("/user/aakash/second_data")
print("Written Successfully")