Skip to content

You are viewing documentation for Immuta version 2020.3.

For the latest version, view our documentation for Immuta SaaS or the latest self-hosted version.

Using the Immuta SparkSession (Spark 2)

Audience: Data Users

Content Summary: This page outlines how to use the Immuta SparkSession with spark-submit, spark-shell, and pyspark.

Immuta SparkSession Background: For Spark 2, the Immuta SparkSession must be used in order to access Immuta data sources. Once the Immuta Spark Installation has been completed on your Spark cluster, then you are able to use the special Immuta Spark interfaces that are detailed below. For data storage technologies that support batch processing workloads, the Immuta SparkSession allows users to query data sources the same way that they query Hive tables with Spark SQL.

When querying metastore-backed data sources, such as Hive and Impala, the Immuta Session accesses the data directly in HDFS. Other data source types will pass through the Query Engine. In order to take advantage of the performance gains provided by directly acting on the files in HDFS in your Spark jobs, you must create Immuta data sources for metastore-backed data sources with tables that are persisted in HDFS.

For guidance on querying data sources across multiple clusters and/or remote databases, see Leveraging Data on Other Clusters and Databases.

Using the Immuta SparkSession

With spark-submit

Launch the special immuta-spark-submit interface, and submit jobs just like you would with spark-submit:

immuta-spark-submit <job>

With spark-shell

First, launch the special immuta-spark-shell interface:

immuta-spark-shell

Then, Use the immuta variable just like you would spark:

immuta.catalog.listTables().show()
val df = immuta.table("my_immuta_datasource")
df.show()
val df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()

Next, use the immuta format to specify partition information:

val df3 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "300")
    .option("numPartitions", "3")
    .load()
df3.show()

The immuta format also supports query pushdown:

val df4 = immuta.read.format("immuta")
    .option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
    .load()
df4.show()

Finally, specify the fetch size:

val df5 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("fetchsize", "500").load()
df5.show()

With pyspark

First, launch the special immuta-pyspark interface:

immuta-pyspark

Then, use the immuta variable just like you would spark:

immuta.catalog.listTables()
df = immuta.table("my_immuta_datasource")
df.show()
df2 = immuta.sql("SELECT * FROM my_immuta_datasource")
df2.show()

Finally, use the immuta format to specify partition information:

df3 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "300")
    .option("numPartitions", "3")
    .load()
df3.show()

The immuta format also supports query pushdown:

df4 = immuta.read.format("immuta")
    .option("dbtable", "(SELECT * FROM my_immuta_datasource) as my_immuta_datasource")
    .load()
df4.show()
To specify the fetch size:

df5 = immuta.read.format("immuta")
    .option("dbtable", "my_immuta_datasource")
    .option("fetchsize", "500").load()
df5.show()