Immuta with Dask

Audience: Data Owners and Users

Content Summary: This page illustrates how to connect Dask to Immuta through an example using an IPython Notebook and the NYC TLC data set, which can be found on the NYC Taxi & Limousine Commission website.

Prerequisites and Notes:

  • The Immuta Virtual Filesystem must be mounted on all workers using a specific user's credentials.

  • If reading files from a Query-backed Data Source (backed by a database), the reported file size will be 0. See Query-backed Data Source Workaround below.

IPython Notebook Example

# Install pinned versions of Dask and its distributed scheduler
!conda install -y dask==0.13.0 distributed==1.15.2

# Connect to the Dask scheduler
from distributed import Client
client = Client('daskscheduler.immuta:8786')

import dask.dataframe as dd

# Read the NYC taxi CSV files from the Immuta Virtual Filesystem.
# blocksize=None loads each file as a single partition (required for
# query-backed data sources; see the workaround section below).
df = dd.read_csv('immuta/nytaxi/*.csv', blocksize=None, dtype={
        'VendorID': object,
        'RateCodeID': object,
        'Passenger_count': object,
        'Payment_type': object,
        'Trip_type ': object
    })

# Compute the average total fare per vendor
df.groupby(['VendorID'])['Total_amount'].mean().compute()

Query-backed Data Source Workaround

If you are using Dask with a Query-backed Data Source, the Virtual Filesystem cannot determine a file size, so Dask fails to load the files and compute proper block sizes. You must specify a blocksize of None. For example,

df = dd.read_csv('immuta/nytaxi/*.csv', blocksize=None, dtype={
        'VendorID': object,
        'RateCodeID': object,
        'Passenger_count': object,
        'Payment_type': object,
        'Trip_type ': object
    })

By setting blocksize to None, Dask will not partition the data into blocks; each partition will contain an entire file. (Dask's default blocksize is 128MB.) You must therefore use Immuta to expose files of a manageable size.
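If the underlying files cannot be made smaller, one option is to repartition the Dask DataFrame after loading so downstream operations work on smaller chunks. The sketch below is a hypothetical illustration: it uses a small in-memory DataFrame (via dd.from_pandas) to stand in for a whole-file partition produced by blocksize=None, since the real data lives behind the Immuta Virtual Filesystem.

# Hypothetical sketch: blocksize=None yields one partition per file,
# so repartition after loading to get manageable chunk sizes.
import pandas as pd
import dask.dataframe as dd

# Stand-in for a DataFrame read with blocksize=None (one whole-file partition)
pdf = pd.DataFrame({'VendorID': ['1', '2'] * 50,
                    'Total_amount': [10.0, 20.0] * 50})
df = dd.from_pandas(pdf, npartitions=1)

# Split the single large partition into several smaller ones
df = df.repartition(npartitions=4)

result = df.groupby('VendorID')['Total_amount'].mean().compute()

Repartitioning does not change the results of the computation; it only changes how the data is divided among workers.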