Spark S3 File System Connectivity

PySpark S3 Connection

To connect PySpark with S3, we first need to generate an accessKeyId and secretAccessKey (for example, in the AWS IAM console) and then set the keys in the Hadoop configuration.

accessKeyId = ""
secretAccessKey = ""
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", accessKeyId)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secretAccessKey)
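
Note: recent Hadoop distributions have removed the s3n connector in favor of s3a. A minimal sketch of the equivalent s3a configuration, assuming a hadoop-aws package matching your Hadoop version is on the classpath:

# Assumes the hadoop-aws module (matching your Hadoop version) is available.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", accessKeyId)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secretAccessKey)
# Paths are then addressed as s3a://bucket/key instead of s3n://bucket/key.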

Reading Data from an S3 Bucket

First identify the bucket name, then pass the full S3 path of the file to the reader.

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3n://sumitpysparktest/first/first_test.csv")
df.show()
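
After loading, it can be worth checking what inferSchema actually inferred before working with the data; for example:

# Print the column names and inferred types of the loaded DataFrame.
df.printSchema()
# Count the rows to confirm the file was read completely.
print(df.count())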

Writing Data to an S3 Bucket

First identify the bucket name and give a folder name for the output; Spark writes the data into that folder as part files.

file_location = "/FileStore/tables/second.csv"
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(file_location)
# df.show()
df.write.format("csv").option("header", "true").save("s3n://sumitpysparktest/first/data_write")
print("Written Successfully")
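
By default, save() fails if the target folder already exists. If re-running the job should replace the earlier output, a save mode can be set explicitly; a sketch:

# mode("overwrite") replaces any existing data at the target path.
df.write.format("csv").option("header", "true").mode("overwrite").save("s3n://sumitpysparktest/first/data_write")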

Writing All Data as a Single File

Using repartition(1), we can write all the data to S3 as a single part file.

df.repartition(1).write.format("csv").option("header", "true").save("s3n://pysparktestcodeshunger/write_today_repartition_24_08_2021")
print("Written Successfully")
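
As a sketch of an alternative, coalesce(1) produces the same single output file while avoiding the full shuffle that repartition(1) performs, assuming the data is small enough to fit in one partition (the output folder name below is illustrative):

# coalesce(1) merges existing partitions without a full shuffle.
df.coalesce(1).write.format("csv").option("header", "true").save("s3n://pysparktestcodeshunger/write_today_coalesce_24_08_2021")
print("Written Successfully")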

Writing Data as Multiple Files

Using repartition(n), the data is written to S3 as n part files; here repartition(5) produces five output files.

df.repartition(5).write.format("csv").option("header", "true").save("s3n://pysparktestcodeshunger/write_today_repartition_24_08_2021")
print("Written Successfully")
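
The number of part files written matches the DataFrame's partition count, which can be inspected beforehand:

# Each partition becomes one part file in the output folder.
print(df.rdd.getNumPartitions())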

Writing Data with partitionBy (Grouped by Column Values)

PySpark's partitionBy() is used to partition the output by column values while writing a DataFrame to disk or a file system. When you write a DataFrame with partitionBy(), PySpark splits the records on the partition column and stores each partition's data in its own sub-directory.

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3n://hereandthere/convertcsvselnew.csv")
df.show()
df.write.partitionBy("MATNR").format("csv").option("header", "true").save("s3n://hereandthere/write_today_partitonByColumn")
print("Written Successfully")
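
When the partitioned folder is read back from its base path, Spark's partition discovery recovers the MATNR column from the MATNR=<value> sub-directory names, and filters on that column only scan the matching sub-directories; a sketch ("some_value" is a placeholder):

# Reading the base path recovers MATNR from the sub-directory names.
df_back = spark.read.format("csv").option("header", "true").load("s3n://hereandthere/write_today_partitonByColumn")
# Filtering on the partition column only scans the matching sub-directories.
df_back.filter(df_back.MATNR == "some_value").show()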
