DBFS Access

This page outlines how to access DBFS in Databricks for non-sensitive data. Databricks Administrators should place the desired configuration in the Spark environment variables (recommended) or the immuta_conf.xml file (not recommended).

DBFS FUSE Mount

This feature (provided by Databricks) mounts DBFS to the local cluster filesystem at /dbfs. Although it is disabled when using process isolation, it can safely be enabled if raw, unfiltered data is not stored in DBFS and all users on the cluster are authorized to see each other’s files. When enabled, the entirety of DBFS becomes a scratch path where users can read and write files in /dbfs/path/to/my/file as though they were local files.

DBFS FUSE Mount limitation: This feature cannot be used in environments with E2 Private Link enabled.

For example,

%sh echo "I'm creating a new file in DBFS" > /dbfs/my/newfile.txt

In Python,

%python
with open("/dbfs/my/newfile.txt", "w") as f:
  f.write("I'm creating a new file in DBFS")

Note: This solution also works in R and Scala.
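Because the FUSE mount exposes DBFS as ordinary local files, the same standard-library calls can also create directories and read a file back. The sketch below is a minimal illustration, not part of the product: the helper name write_then_read and its dbfs_root parameter are hypothetical, and dbfs_root exists only so the function can be tried against a temporary directory off-cluster.

```python
from pathlib import Path


def write_then_read(relative_path: str, text: str, dbfs_root: str = "/dbfs") -> str:
    """Write text under the FUSE mount with plain file I/O and read it back.

    dbfs_root is a hypothetical parameter; on a cluster with the FUSE mount
    enabled, the default "/dbfs" is the mount point described above.
    """
    target = Path(dbfs_root) / relative_path
    # Parent directories are created exactly as they would be on a local disk.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target.read_text()
```

On a cluster with the mount enabled, write_then_read("my/newfile.txt", "I'm creating a new file in DBFS") would write to /dbfs/my/newfile.txt and return the text that was written.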

Enable DBFS FUSE Mount

To enable the DBFS FUSE mount, set this configuration: immuta.spark.databricks.dbfs.mount.enabled=true.

Mounting a bucket

Users can mount additional buckets to DBFS that can also be accessed using the FUSE mount.
Mounting a bucket is a one-time action, and the mount will be available to all clusters in the workspace from that point on.
Mounting must be performed from a non-Immuta cluster.
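Since mounting is a one-time action, a notebook that may be re-run can check for an existing mount before creating one. The following is a sketch under stated assumptions: it uses the Databricks dbutils.fs.mounts() and dbutils.fs.mount() calls, the helper name mount_bucket_once is hypothetical, the bucket and mount point in the usage note are placeholders, and it assumes the non-Immuta cluster already has credentials for the bucket (for example, through an instance profile).

```python
def mount_bucket_once(dbutils, source: str, mount_point: str) -> bool:
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created and False if one was already there.
    `dbutils` is passed in explicitly; in a notebook, pass the built-in object.
    """
    # Each entry returned by mounts() exposes its DBFS location as mountPoint.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source, mount_point)
    return True
```

For example, calling mount_bucket_once(dbutils, "s3a://my-bucket", "/mnt/my-bucket") from a non-Immuta cluster would make the bucket reachable at /dbfs/mnt/my-bucket through the FUSE mount.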

Scala DBUtils (and %fs magic) with Scratch Paths

Scratch paths work when performing arbitrary remote filesystem operations with %fs magic or Scala dbutils.fs functions. For example,

%fs put -f s3://my-bucket/my/scratch/path/mynewfile.txt "I'm creating a new file in S3"
%scala dbutils.fs.put("s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3")

Configure Scala DBUtils (and %fs magic) with Scratch Paths

To support %fs magic and Scala DBUtils with scratch paths, configure the following property:

       <property>
           <name>immuta.spark.databricks.scratch.paths</name>
           <value>s3://my-bucket/my/scratch/path</value>
       </property>

Configure DBUtils in Python

To use dbutils in Python, set this configuration: immuta.spark.databricks.py4j.strict.enabled=false.
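With that setting in place, the Scala scratch-path example shown earlier has a direct Python counterpart. The sketch below assumes the standard dbutils.fs.put and dbutils.fs.head calls; the helper name put_and_preview is hypothetical, and the path in the usage note is the same placeholder scratch path used above.

```python
def put_and_preview(dbutils, path: str, text: str) -> str:
    """Write `text` to `path`, overwriting any existing file, then read it back.

    `dbutils` is passed in explicitly; in a notebook, pass the built-in object.
    """
    # The third argument (overwrite=True) plays the role of -f in `%fs put -f`.
    dbutils.fs.put(path, text, True)
    # head returns the beginning of the file as a string.
    return dbutils.fs.head(path)
```

In a notebook cell this could be called as put_and_preview(dbutils, "s3://my-bucket/my/scratch/path/mynewfile.txt", "I'm creating a new file in S3").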

Example Workflow

This section illustrates the workflow for getting a file from a remote scratch path, editing it locally with Python, and writing it back to a remote scratch path.

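%python
import os
import shutil

# Remote file located under a configured scratch path.
s3ScratchFile = "s3://some-bucket/path/to/scratch/file"
# Fail fast with a KeyError if the local scratch directory variable is not set.
localScratchDir = os.environ["IMMUTA_LOCAL_SCRATCH_DIR"]
localScratchFile = os.path.join(localScratchDir, "myfile.txt")
localScratchFileCopy = os.path.join(localScratchDir, "myfile_copy.txt")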

Get the file from remote storage:

dbutils.fs.cp(s3ScratchFile, "file://{}".format(localScratchFile))

The downloaded localScratchFile is read-only and owned by root, so make a copy before editing it:

shutil.copy(localScratchFile, localScratchFileCopy)
with open(localScratchFileCopy, "a") as f:
    f.write("Some appended file content")

Write the new file back to remote storage:

dbutils.fs.cp("file://{}".format(localScratchFileCopy), s3ScratchFile)
