PySpark SQL overview
PySpark SQL is a Spark module that integrates relational processing with Spark's functional programming API, so you can query data using standard SQL syntax.
To run a SQL query with spark.sql, you must first register the DataFrame as a temporary view using createOrReplaceTempView; only then can you write SQL against it.
The return type of spark.sql is a DataFrame.
# Read the CSV into a DataFrame, inferring column types from the data
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/first_test.csv")

# Register the DataFrame as a temporary view so SQL can reference it
df.createOrReplaceTempView("test")

# Run a SQL query against the view; the result is a DataFrame
df2 = spark.sql("select * from test where id > 3")
df2.show()