Skip to content

Determine Background Jobs' Estimated Completion

Audience: System Administrators

Content Summary: This page shows how to determine when Immuta background jobs will complete.

Query Overview

Below is a SQL query, which will help determine when some background jobs will be complete.

Note: This query unions the job and archive table. Adjust the top interval appropriately for your needs. In addition, collections of jobs aren’t typically differentiated. For example, if you hit LDAP three times, that is one collection of jobs, not three.

Caveats

  • Avg Time: The value for the average time will have a standard deviation applied to it.

  • Duration: How long it has been running for. If this reads towards 12 hours, then this value is longer than the reported value.

  • Until Complete: How long it will take to complete the remaining queue. This is assuming the job is the only active job to utilize all of the web pods.

  • When EDT: When the jobs will be completed. This is assuming the job is the only active job to utilize all of the web pods.

Instructions

Run the following SQL query to investigate Immuta's background jobs.

WITH
--FILTER by an INTERVAL or a START timestamp
vars AS (SELECT NOW() AS now_dt, NOW() - INTERVAL '4 hours' AS start, NOW()),
--vars AS (SELECT NOW() AS now_dt, TO_TIMESTAMP('2021-11-23 12:05:00', 'YYYY-MM-DD hh24:mi:ss')::timestamp without time zone at time zone 'EDT' AS start, NOW()),
all_jobs_raw AS (
    SELECT name, state, startedon, createdon, completedon, data FROM pgboss.job
    --UNION SELECT name, state, startedon, createdon, completedon, data FROM pgboss.archive
),
all_jobs AS (
    SELECT
        CASE
            WHEN name = 'automaticSubscription' AND data->>'profileId' IS NULL THEN 'autoSub - ByDataSource'
            WHEN name = 'automaticSubscription' AND data->>'profileId' IS NOT NULL THEN 'autoSub - ByUser'
            WHEN name ILIKE 'nativeSqlProfileRefresh_Snowflake%' THEN 'nativeSqlProfileRefresh_Snowflake%'
            WHEN name ILIKE 'nativeSqlProfileRefresh_Redshift%' THEN 'nativeSqlProfileRefresh_Redshift%'
            WHEN name ILIKE 'nativeSqlDataSourceSync_Snowflake%' THEN 'nativeSqlDataSourceSync_Snowflake%'
            WHEN name ILIKE 'nativeSqlDataSourceSync_Redshift%' THEN 'nativeSqlDataSourceSync_Redshift%'
            ELSE name
        END AS name, state, startedon, completedon
    FROM all_jobs_raw INNER JOIN vars ON TRUE
    WHERE all_jobs_raw.createdon > vars.start
    AND name NOT LIKE '\_\_%' AND name NOT ILIKE 'cleanUp%' AND name NOT ILIKE 'initiateScheduled%'
    AND name NOT IN (
        'bootstrapImmutaQueryEngine', 'checkForRollOver', 'checkInWithLicenseServer', 'clearMaxTTLTokens','columnEvolutionCheck', 'createPolicySearchRecords',
        'dataSourceTest', 'entitlements_cache_job_root', 'entitlements_cache_job_worker','executeCustomerMetricsQueries', 'expirePolicyAdjustments',
        'expirePolicyCertifications', 'loadIamGroups', 'nativeSqlMigration_Redshift', 'recomputeDslFromJsonPolicies','removeIamAuths', 'scheduleDbtUpdates',
        'scheduleSchemaEvolutionChecks', 'schemaProjectsQueryableMigration', 'visibilitySchemaUpdate') -- ignore list
    --AND (name ILIKE '%redshift%' OR name ILIKE '%bulk%') -- adjust to filter down to specific jobs
),
distinct_jobs AS ( SELECT DISTINCT name FROM all_jobs),
completed_jobs AS (
    SELECT
        name,
        COUNT(*) AS total,
        AVG(completedon - startedon) AS average_time,
        MIN(completedon) AS min_completedon,
        MAX(completedon) AS max_completedon
    FROM all_jobs
    WHERE state = 'completed'
    GROUP BY name),
active_jobs AS ( SELECT name, count(*) AS total FROM all_jobs WHERE state = 'active' GROUP BY name ),
queued_jobs AS ( SELECT name, count(*) AS total FROM all_jobs WHERE state = 'created' GROUP BY name ),
expired_jobs AS ( SELECT name, count(*) AS total FROM all_jobs WHERE state = 'expired' GROUP BY name ),
failed_jobs AS ( SELECT name, count(*) AS total FROM all_jobs WHERE state = 'failed' GROUP BY name )
SELECT
    distinct_jobs.name as "Job Name",
    --vars.start AT TIME ZONE 'EDT' AS "Filter EDT",
    completed_jobs.min_completedon AT TIME ZONE 'EDT' AS "Earliest Job EDT",
    --completed_jobs.max_completedon AT TIME ZONE 'EDT' AS "Latest Job EDT",
    TO_CHAR(completed_jobs.max_completedon - completed_jobs.min_completedon, 'HH24:MI:SS.MS' ) AS "Duration",
    TO_CHAR(completed_jobs.total, 'FM9,999,999,999') AS "# Completed",
    TO_CHAR(completed_jobs.average_time, 'MI:SS.MS') AS "Avg Time",
    TO_CHAR(active_jobs.total, 'FM9,999,999,999') AS "# Active",
    TO_CHAR(queued_jobs.total, 'FM9,999,999,999') AS "# Queued",
    TO_CHAR(expired_jobs.total, 'FM9,999,999,999') AS "# Expired",
    TO_CHAR(failed_jobs.total, 'FM9,999,999,999') AS "# Failed",
    ((queued_jobs.total / active_jobs.total) * completed_jobs.average_time) AS "Until Complete"
    --(vars.now_dt + ((queued_jobs.total / active_jobs.total) * completed_jobs.average_time)) AT TIME ZONE 'EDT' AS "When EDT"
FROM distinct_jobs
INNER JOIN vars ON TRUE
LEFT OUTER JOIN completed_jobs ON distinct_jobs.name = completed_jobs.name
LEFT OUTER JOIN active_jobs ON distinct_jobs.name = active_jobs.name
LEFT OUTER JOIN queued_jobs ON distinct_jobs.name = queued_jobs.name
LEFT OUTER JOIN expired_jobs ON distinct_jobs.name = expired_jobs.name
LEFT OUTER JOIN failed_jobs ON distinct_jobs.name = failed_jobs.name
ORDER BY queued_jobs.total DESC NULLS LAST;

Demo Video

For further instruction on how to monitor background jobs, refer to this video.