
1. Benefits of Using Multiple Tasks in Databricks Jobs

What are Multi-Task Jobs?

Jobs in Databricks can consist of multiple tasks that run in a specified order, with dependencies between them. This creates a workflow pipeline.

Key Benefits

  1. Modularity
    • Break complex workflows into smaller, manageable tasks (e.g., ingest → transform → analyze).
    • Easier debugging and maintenance.
  2. Parallel Execution
    • Independent tasks can run in parallel (e.g., processing different datasets simultaneously).
  3. Conditional Execution
    • Tasks can depend on the success/failure of previous tasks.
  4. Reusability
    • The same task can be reused across multiple jobs.
  5. Resource Optimization
    • Assign different clusters to different tasks based on workload needs.

Example Workflow

Task 1 (Ingest) → Task 2 (Clean) → Task 3 (Aggregate)
  • If Task 1 fails, downstream tasks (Task 2, Task 3) are skipped.
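
A hedged sketch of how this three-task pipeline could be expressed as a Jobs API 2.1 payload (the job name and notebook paths are hypothetical placeholders; cluster configuration is omitted for brevity):

  {
    "name": "example_pipeline",
    "tasks": [
      {
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Pipelines/ingest"}
      },
      {
        "task_key": "clean",
        "depends_on": [{"task_key": "ingest"}],
        "notebook_task": {"notebook_path": "/Pipelines/clean"}
      },
      {
        "task_key": "aggregate",
        "depends_on": [{"task_key": "clean"}],
        "notebook_task": {"notebook_path": "/Pipelines/aggregate"}
      }
    ]
  }

Because clean and aggregate declare depends_on, Databricks skips them automatically when ingest fails.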

2. Setting Up a Predecessor Task in Jobs

What is a Predecessor Task?

A task that must complete before another task (successor) can run.

How to Set Up

  1. In Databricks Jobs UI:
    • Create a new job with multiple tasks.
    • In the task settings, select “Depends on” and choose the predecessor task.
  2. Using Jobs API:
    {
      "task_key": "transform_data",
      "depends_on": [{"task_key": "ingest_data"}]
    }
    

Example Scenario

  • Task 1 (ingest_data): Loads raw data.
  • Task 2 (transform_data): Cleans and processes data (depends on ingest_data).

3. When to Use Predecessor Tasks

Common Scenarios

  1. Data Dependency
    • A task requires output from a previous task (e.g., raw data must be ingested before transformation).
  2. Error Handling
    • If an early task fails, downstream tasks should not execute (e.g., avoid processing incomplete data).
  3. Cost Optimization
    • Skip expensive computations if upstream validation fails.

Example

validate_input → (if valid) → process_data → generate_report
  • If validate_input fails, the pipeline stops early.
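
In Jobs API 2.1 terms, this early-stop behavior is the default: a dependent task runs only when all of its dependencies succeed, which corresponds to the task-level run_if value ALL_SUCCESS. A hedged fragment using the task names from the example above:

  {
    "task_key": "process_data",
    "depends_on": [{"task_key": "validate_input"}],
    "run_if": "ALL_SUCCESS"
  }

Other run_if values (for example ALL_DONE) exist for tasks that should run even after upstream failures, such as cleanup steps.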

4. Reviewing a Task’s Execution History

Why Review Execution History?

  • Debug failures.
  • Monitor performance (duration, resource usage).
  • Audit job runs.

How to Access

  1. Databricks UI:
    • Navigate to “Jobs” → Select job → “Runs” tab.
    • Click on a run to see task history.
  2. Key Details Available:
    • Start/end time.
    • Status (Success, Failed, Skipped).
    • Logs (stdout, stderr).
    • Cluster metrics (CPU, memory).
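
The same details can be pulled programmatically via the Jobs API (GET /api/2.1/jobs/runs/list, filtered by job_id). A hedged, abridged sketch of a single run entry in the response (timestamps are epoch milliseconds; all values and the URL are illustrative):

  {
    "run_id": 123,
    "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "FAILED"
    },
    "start_time": 1696125600000,
    "end_time": 1696125900000,
    "run_page_url": "https://<workspace-url>/#job/456/run/123"
  }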

Example Debugging Flow

  1. Find failed run → Check logs.
  2. Identify error (e.g., FileNotFound).
  3. Fix issue (e.g., correct input path).

5. CRON Scheduling for Jobs

What is CRON?

A time-based job scheduler originating on Unix systems. Databricks supports cron expressions for scheduling jobs, using the Quartz syntax, which adds a leading seconds field and uses ? for the day field you are not constraining.

Syntax

┌───────────── second (0 - 59)
│ ┌───────────── minute (0 - 59)
│ │ ┌───────────── hour (0 - 23)
│ │ │ ┌───────────── day of month (1 - 31, or ?)
│ │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ │ ┌───────────── day of week (1 - 7 or SUN-SAT, or ?)
│ │ │ │ │ │
* * * * * ?

Examples

Schedule                 | CRON Expression (Quartz)
Daily at 2 AM            | 0 0 2 * * ?
Every Monday at midnight | 0 0 0 ? * MON
Every 15 minutes         | 0 0/15 * * * ?

How to Set Up

  1. In Job settings → “Schedule” → “Cron Schedule”.
  2. Enter the expression (e.g., 0 0 0 * * ? for daily midnight runs).
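
The equivalent Jobs API setting is a job-level schedule block; a minimal sketch (the timezone is an example value):

  "schedule": {
    "quartz_cron_expression": "0 0 0 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }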

6. Debugging a Failed Task

Steps to Debug

  1. Check Run Logs:
    • Navigate to the failed run → “Logs” tab.
    • Look for errors (e.g., Exception: File not found).
  2. Reproduce Locally:
    • Run the notebook interactively with the same inputs.
  3. Common Issues:
    • Missing data/files.
    • Permission errors.
    • Syntax errors in code.

Example Fix

  • Error: AnalysisException: Table not found.
  • Solution: Correct table name or ensure table exists.

7. Setting Up a Retry Policy

Why Retry?

  • Handle transient failures (e.g., network issues).
  • Avoid manual intervention.

Configuration Options

  1. Number of Retries: Max attempts (default: 0).
  2. Retry Delay: Wait time between retries (e.g., 5 mins).

How to Set Up

  1. UI:
    • In task settings → “Retry Policy” → Set max retries and delay.
  2. API:
    {
      "retry_on_timeout": true,
      "max_retries": 3,
      "min_retry_interval_millis": 300000
    }
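
These fields sit at the task level in a Jobs API 2.1 job definition; a hedged sketch placing them in context (the task name and notebook path are hypothetical):

  {
    "task_key": "transform_data",
    "notebook_task": {"notebook_path": "/ETL/transform_data"},
    "max_retries": 3,
    "min_retry_interval_millis": 300000,
    "retry_on_timeout": true
  }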
    

Example

  • Task fails due to temporary API outage → Retries 3x with 5-minute gaps.

8. Creating Alerts for Failed Tasks

Why Alert?

  • Get notified immediately when a job fails.
  • Reduce downtime.

Alert Options

  1. Email Notifications:
    • Send alerts to individuals or groups.
  2. Webhooks:
    • Integrate with Slack, PagerDuty, etc.

How to Set Up

  1. UI:
    • Navigate to Jobs → Select job → “Alerts” tab.
    • Add email/webhook.
  2. API:
    {
      "email_notifications": {
        "on_failure": ["user@example.com"]
      }
    }
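
For the webhook option listed under Alert Options, the API references notification destinations (configured by a workspace admin) by ID rather than by raw URL; a hedged sketch with a placeholder ID:

  {
    "webhook_notifications": {
      "on_failure": [{"id": "<notification-destination-id>"}]
    }
  }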
    

Example

  • Job fails → Email sent to team@company.com.

9. Email Alerts for Failed Tasks

How It Works

  • Databricks sends an email to specified addresses when:
    • A task fails.
    • The entire job fails.

Configuration

  1. UI:
    • Job settings → “Notifications” → Add email.
  2. Limitations:
    • Only supports email (for advanced integrations, use webhooks).

Example Email Content

Subject: Job Failed - "daily_etl" (Run ID: 123)
Details: Task "transform_data" failed at 2023-10-01 02:00.
Error: FileNotFoundError: No such file: /data/input.csv

Summary Table: Key Concepts

Topic             | Key Takeaway
Multi-Task Jobs   | Break workflows into modular, parallelizable tasks with dependencies.
Predecessor Tasks | Ensure tasks run in order (e.g., ingest → transform).
CRON Scheduling   | Use Quartz expressions like 0 0 0 * * ? for daily runs.
Retry Policies    | Configure retries (e.g., 3 attempts) for transient failures.
Alerts            | Notify via email/webhook when jobs fail.