Audience: Data Users
Content Summary: This page outlines how to use the Immuta SparkSession with spark-submit, spark-shell, and pyspark.
Immuta SparkSession Background: For Spark 2, the Immuta SparkSession must be used in order to access Immuta data sources. Once the Immuta Spark Installation has been completed on your Spark cluster, you can use the Immuta Spark interfaces detailed below. For data platforms that support batch processing workloads, the Immuta SparkSession allows users to query data sources the same way they query Hive tables with Spark SQL.
When querying metastore-backed data sources, such as Hive and Impala, the Immuta SparkSession accesses the data directly in HDFS; other data source types pass through the Query Engine. To take advantage of the performance gains of acting directly on HDFS files in your Spark jobs, you must create your Immuta data sources from metastore-backed tables that are persisted in HDFS.
For guidance on querying data sources across multiple clusters and/or remote databases, see Leveraging Data on Other Clusters and Databases.
Launch the special immuta-spark-submit interface, and submit jobs just like you would with spark-submit:
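A minimal sketch, assuming a user application packaged as my-spark-job.jar with entry class com.example.MySparkJob (both placeholders); immuta-spark-submit accepts the same arguments as spark-submit:

```bash
# Submit a Spark application through the Immuta interface. The class name,
# jar, and master settings below are placeholders for your own application.
immuta-spark-submit \
  --class com.example.MySparkJob \
  --master yarn \
  --deploy-mode client \
  my-spark-job.jar
```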
First, launch the special immuta-spark-shell interface:
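For example (a minimal sketch; standard spark-shell options are assumed to pass through unchanged):

```bash
# Launch the Immuta-enabled Spark shell.
immuta-spark-shell
```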
Then, use the immuta variable just like you would spark:
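A minimal sketch, assuming an Immuta data source exposed as a table named my_data_source (a placeholder):

```scala
// Query an Immuta data source with Spark SQL, using the immuta session
// exactly as you would use spark. The table name is a placeholder.
val df = immuta.sql("SELECT * FROM my_data_source")
df.printSchema()
df.show()
```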
Next, use the immuta format to specify partition information:
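A sketch of supplying partition information through the immuta format; the option names mirror Spark's JDBC-style partitioning options and are assumptions here, as are the table and column names:

```scala
// Read an Immuta data source with explicit partition information so the
// read is split across executors. Option names follow Spark's JDBC-style
// convention and are assumptions; table and column names are placeholders.
val partitioned = immuta.read
  .format("immuta")
  .option("dbtable", "my_data_source")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

partitioned.count()
```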
The immuta format also supports query pushdown:
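A sketch of query pushdown, using the JDBC-style pattern of passing a subquery as the table; the option name, alias, and SQL are assumptions for illustration:

```scala
// Push filtering and projection down to the underlying database rather than
// pulling the full table into Spark. The subquery-as-dbtable pattern follows
// Spark's JDBC convention and is an assumption here; the SQL is illustrative.
val pushed = immuta.read
  .format("immuta")
  .option("dbtable", "(SELECT col_a, col_b FROM my_data_source WHERE col_a > 100) AS pushed_query")
  .load()

pushed.show()
```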
Finally, specify the fetch size:
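A sketch, assuming a JDBC-style fetchsize option that controls how many rows are retrieved per round trip:

```scala
// Tune how many rows are fetched per round trip. The fetchsize option name
// follows Spark's JDBC convention and is an assumption; the value and table
// name are placeholders.
val tuned = immuta.read
  .format("immuta")
  .option("dbtable", "my_data_source")
  .option("fetchsize", "5000")
  .load()
```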
First, launch the special immuta-pyspark interface:
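For example (a minimal sketch; standard pyspark options are assumed to pass through unchanged):

```bash
# Launch the Immuta-enabled PySpark shell.
immuta-pyspark
```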
Then, use the immuta variable just like you would spark:
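A minimal sketch, assuming an Immuta data source exposed as a table named my_data_source (a placeholder):

```python
# Query an Immuta data source with Spark SQL, using the immuta session
# exactly as you would use spark. The table name is a placeholder.
df = immuta.sql("SELECT * FROM my_data_source")
df.printSchema()
df.show()
```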
Next, use the immuta format to specify partition information:
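As in the Scala example above, the option names mirror Spark's JDBC-style partitioning options and are assumptions, as are the table and column names:

```python
# Read an Immuta data source with explicit partition information so the read
# is split across executors. Option names follow Spark's JDBC-style
# convention and are assumptions; table and column names are placeholders.
partitioned = (
    immuta.read.format("immuta")
    .option("dbtable", "my_data_source")
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
partitioned.count()
```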
The immuta format also supports query pushdown:
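A sketch of query pushdown using the JDBC-style subquery pattern; the option name, alias, and SQL are assumptions for illustration:

```python
# Push filtering and projection down to the underlying database rather than
# pulling the full table into Spark. The subquery-as-dbtable pattern follows
# Spark's JDBC convention and is an assumption here; the SQL is illustrative.
pushed = (
    immuta.read.format("immuta")
    .option("dbtable", "(SELECT col_a, col_b FROM my_data_source WHERE col_a > 100) AS pushed_query")
    .load()
)
pushed.show()
```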