YAML Spec Format

Every section and field in the DagSmith YAML specification, explained.

Section Order

DagSmith YAML specs follow a conventional section order. All sections except metadata, dag, and gcp are optional.

yaml

variables:          # 1. Optional - key-value pairs for ${VAR} substitution
configurations:     # 2. Optional - reusable config values
metadata:           # 3. Required - documentation metadata
dag:                # 4. Required - DAG constructor arguments
gcp:                # 5. Required - GCP connection defaults
default_args:       # 6. Optional - Airflow default_args
user_defined_macros: # 7. Optional - Jinja macros
tasks:              # 8. Optional - operator/sensor/group specs
dependencies:       # 9. Optional - task execution order
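As a rough illustration of the required/optional split above, a loader could check for the three required sections before doing anything else. This is a minimal sketch, not DagSmith's actual code: `check_required_sections` is a hypothetical helper, and the spec is assumed to be already parsed into a plain dict.

```python
# Section names taken from the list above; the helper itself is hypothetical.
REQUIRED_SECTIONS = ("metadata", "dag", "gcp")


def check_required_sections(spec: dict) -> None:
    """Raise ValueError if any required top-level section is absent."""
    missing = [name for name in REQUIRED_SECTIONS if name not in spec]
    if missing:
        raise ValueError(f"Missing required sections: {missing}")
```

A spec containing only `metadata`, `dag`, and `gcp` would pass this check; every other section falls back to its default.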

1. Variables

optional

Key-value pairs for ${VAR_NAME} substitution throughout the entire YAML. Expansion happens before Pydantic validation, so variables work in every section.

Naming rules (strictly enforced)

Rule	Example
Must be ALL UPPERCASE	`VAR__PROJECT_ID__VAR`
Must begin with `VAR__`	`VAR__DATASET__VAR`
Must end with `__VAR`	`VAR__ENV__VAR`

yaml

variables:
  VAR__PROJECT_ID__VAR: "my-gcp-project-001"
  VAR__DATASET__VAR: "warehouse_tables"
  VAR__ENV__VAR: "prod"
  VAR__BUNDLE__VAR: "daily_load"

# Usage anywhere in the spec:
gcp:
  project_id: "${VAR__PROJECT_ID__VAR}"   # expands to "my-gcp-project-001"

configurations:
  base_path: "/home/airflow/gcs/dags/${VAR__PROJECT_ID__VAR}/"

When to use variables: Same spec deployed across environments (dev/staging/prod), repeated values like project_id in multiple tasks, paths with embedded identifiers.

When to skip: Single-environment DAGs with no repeated values, quick prototyping where indirection adds noise.

How expansion works

DagSmith reads the variables section from the raw YAML.
Variable names are checked against the naming rules; an invalid name is rejected with a clear error before any expansion happens.
Every ${VAR__...__VAR} reference is replaced with its value as a plain string.
The expanded YAML is then parsed and validated by Pydantic.
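The expansion steps can be approximated with a few lines of Python. This is a sketch under stated assumptions, not DagSmith's implementation: `expand_variables` is a hypothetical name, the regular expressions are a guess at the naming rules (digits and underscores are assumed to be allowed between the prefix and suffix), and references to undeclared variables are assumed to be left untouched.

```python
import re

# Assumed patterns: the source only states the uppercase, VAR__ prefix, and
# __VAR suffix rules; allowing digits and underscores in between is a guess.
NAME_RE = re.compile(r"VAR__[A-Z0-9_]+__VAR")
REF_RE = re.compile(r"\$\{([^}]+)\}")


def expand_variables(variables: dict, raw_text: str) -> str:
    """Validate variable names, then substitute ${NAME} references as strings."""
    invalid = [name for name in variables if not NAME_RE.fullmatch(name)]
    if invalid:
        # Names are rejected before any substitution takes place.
        raise ValueError(f"Invalid variable names: {invalid}")

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumption: unknown references are left as-is.
            return match.group(0)
        return str(variables[name])

    return REF_RE.sub(substitute, raw_text)
```

Because substitution runs on the raw text, the result is still a string; only afterwards would it be handed to a YAML parser and then to Pydantic for validation.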

2. Configurations

optional — default: base_path="/home/airflow/gcs/dags"

Reusable typed configuration values. Unlike variables, which are substituted as plain strings, configurations keep their YAML types. The section also accepts arbitrary additional keys.

yaml

configurations:
  base_path: "/home/airflow/gcs/dags/${VAR__PROJECT_ID__VAR}/"
  # custom_key: "any additional config"  # extra keys allowed

3. Metadata

required — all fields required

Documentation metadata rendered into the generated file's docstring header.

title required Descriptive DAG title

owner required Team or individual owner

email required Contact email

version required Semantic version string

jira required JIRA ticket reference

developer_name required Developer or DAG identifier

yaml

metadata:
  title: "Daily Account Activity Load"
  owner: "data-team@example.com"
  email: "data-team@example.com"
  version: "1.0.0"
  jira: "PROJ-1234"
  developer_name: "daily_load"

4. DAG

required — maps to airflow.DAG() constructor

dag_id required Unique DAG identifier

description optional Default: ""

schedule optional Cron expression or preset (@daily, @hourly, etc.). Default: None (manual triggers only). alias: schedule_interval

start_date optional YYYY-MM-DD or YYYY-MM-DD HH:MM:SS. Default: now

timezone optional IANA timezone string. Default: "UTC"

catchup optional Backfill past runs. Default: false

max_active_runs optional Must be ≥ 1. Default: 1

dagrun_timeout required Maximum seconds per DAG run (must be > 0)

is_paused_upon_creation optional Default: true

tags optional List of organizational tags. Default: []

params optional DAG-level Airflow Params for Trigger UI. Default: None

template_searchpath optional Jinja template search paths. Default: []

sla_miss_callback optional Dotted path to callable. Default: None

doc_md optional DAG documentation markdown. Default: None

yaml

dag:
  dag_id: "daily_account_load"
  description: "Load daily account activity into BigQuery."
  schedule: "0 6 * * *"                  # 6 AM daily
  start_date: "2026-01-02 12:13:14"
  timezone: "America/New_York"
  catchup: false
  max_active_runs: 1
  dagrun_timeout: 7200
  is_paused_upon_creation: true
  tags:
    - "warehouse:bigquery"
    - "module:daily_load"

  # DAG-level params (optional)
  params:
    env:
      type: "string"
      default: "PROD"
      enum: ["PROD", "PLE", "DEV"]
      title: "Environment"
      description: "Target environment"
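Airflow's DAG constructor takes start_date as a datetime and dagrun_timeout as a timedelta, so the string and integer values in the spec have to be converted at some point. The sketch below shows one way that conversion could look; the function names are hypothetical and timezone handling is left out for brevity.

```python
from datetime import datetime, timedelta

# The two formats documented for start_date.
START_DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_start_date(value: str) -> datetime:
    """Parse a start_date string in either documented format (naive datetime)."""
    for fmt in START_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(
        f"start_date must be YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, got {value!r}"
    )


def parse_dagrun_timeout(seconds: float) -> timedelta:
    """Convert dagrun_timeout seconds to a timedelta, enforcing the > 0 rule."""
    if seconds <= 0:
        raise ValueError("dagrun_timeout must be > 0")
    return timedelta(seconds=seconds)
```

With the example spec above, `parse_dagrun_timeout(7200)` gives a two-hour limit per DAG run.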

5. GCP

required — GCP connection defaults shared across tasks

project_id optional GCP project ID. Default: None

location optional GCP location (e.g. "US", "us-east4"). Default: None

gcp_conn_id optional Airflow connection ID. Default: "google_cloud_default". alias: google_cloud_conn_id

impersonation_chain optional Service account email or chained list. Default: None

deferrable optional Enable async mode for sensors. Default: false

yaml

gcp:
  project_id: "my-gcp-project-001"
  location: "US"
  gcp_conn_id: "google_cloud_default"
  # impersonation_chain: "sa@project.iam.gserviceaccount.com"
  # deferrable: false

6. Default Args

optional — applied to every task in the DAG

owner Default: "airflow"

depends_on_past Default: false

retries Must be ≥ 0. Default: 3

retry_delay Seconds, must be ≥ 30. Default: 60. alias: retry_delay_seconds

sla SLA in seconds, must be > 0. Default: None. alias: sla_seconds

deferrable Default: true

email List of alert recipients. Also accepts comma-separated string. Default: []

email_on_failure Default: false

email_on_retry Default: false

on_failure_callback Dotted path to callable. Default: None

on_success_callback Dotted path to callable. Default: None

on_retry_callback Dotted path to callable. Default: None

yaml

default_args:
  owner: "airflow"
  depends_on_past: false
  retries: 3
  retry_delay: 60
  deferrable: true
  email:
    - "data-team@example.com"
  email_on_failure: true
  email_on_retry: false
  # on_failure_callback: "mypackage.callbacks.on_failure"
  # on_success_callback: "mypackage.callbacks.on_success"
  # on_retry_callback: "mypackage.callbacks.on_retry"
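Two of the rules above lend themselves to a short illustration: email accepts either a list or a comma-separated string, and retry_delay must be at least 30 seconds. The helpers below are hypothetical and only sketch how such normalization and validation might be written.

```python
def normalize_emails(value) -> list:
    """Accept None, a comma-separated string, or a list; return a clean list."""
    if value is None:
        return []
    if isinstance(value, str):
        value = value.split(",")
    return [item.strip() for item in value if item.strip()]


def check_retry_delay(seconds: float = 60) -> float:
    """Enforce the documented lower bound of 30 seconds."""
    if seconds < 30:
        raise ValueError("retry_delay must be >= 30 seconds")
    return seconds
```

Under this sketch, `"a@example.com, b@example.com"` and a two-item list produce the same result.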

7. User Defined Macros

optional — Jinja macros injected into the DAG constructor

All keys become macro names available as {{ macro_name }} in Jinja templates. Values must be scalars (str, int, float, bool, None).

yaml

user_defined_macros:
  project_name: "my-gcp-project-001"     # {{ project_name }} in templates
  fact_dataset: "warehouse_tables"        # {{ fact_dataset }}
  bundle: "daily_load"
  env: "prod"
  datastore: "BQ"
  latency: 1
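The scalar-only rule for macro values could be checked along these lines. `validate_macros` is a hypothetical helper written for illustration; DagSmith's own validation presumably lives in its Pydantic models.

```python
# The scalar types listed above.
SCALAR_TYPES = (str, int, float, bool, type(None))


def validate_macros(macros: dict) -> dict:
    """Return macros unchanged, or raise if any value is not a scalar."""
    bad = {
        key: type(value).__name__
        for key, value in macros.items()
        if not isinstance(value, SCALAR_TYPES)
    }
    if bad:
        raise ValueError(f"Macro values must be scalars; offending keys: {bad}")
    return macros
```

A list or nested mapping as a macro value would fail this check, while the example section above would pass.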

8. Tasks

optional — defaults to []

A list of operator, sensor, and TaskGroup specs. Each task uses the operator field as a discriminator to select the correct Pydantic model. See the Operators & Sensors page for details on each type.
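Pydantic supports discriminated unions, where one field selects which model validates the rest. The standard-library sketch below mimics that idea with a plain registry so it runs without Pydantic installed. All class names, the registry, and `ExampleSensor` are hypothetical; the real operator-specific fields are documented on the Operators & Sensors page and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class BaseTaskSpec:
    task_id: str
    operator: str


@dataclass
class BigQueryInsertJobSpec(BaseTaskSpec):
    """Placeholder model; real operator-specific fields are omitted."""


@dataclass
class ExampleSensorSpec(BaseTaskSpec):
    """Placeholder for a sensor model; `ExampleSensor` is not a real operator."""
    timeout: float


# The `operator` value acts as the discriminator.
SPEC_REGISTRY = {
    "BigQueryInsertJobOperator": BigQueryInsertJobSpec,
    "ExampleSensor": ExampleSensorSpec,
}


def load_task(raw: dict) -> BaseTaskSpec:
    """Pick the model class from raw['operator'] and build it from the dict."""
    operator = raw.get("operator")
    if operator not in SPEC_REGISTRY:
        raise ValueError(
            f"Unknown operator {operator!r}; expected one of {sorted(SPEC_REGISTRY)}"
        )
    return SPEC_REGISTRY[operator](**raw)
```

The benefit of dispatching on a single field is that an unsupported operator fails early with a targeted message instead of a long list of unrelated field errors.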

Common base fields (all task types)

task_id required Unique task identifier

operator required Operator/sensor class name (e.g. BigQueryInsertJobOperator)

trigger_rule optional Default: "all_success". Options: all_success, all_failed, all_done, one_success, one_failed, none_failed, etc.

retries optional Task-level override of default_args.retries

retry_delay optional Task-level override. alias: retry_delay_seconds

doc_md optional Task documentation markdown

params optional Dict of task-level Jinja params

kwargs optional Additional operator kwargs dict

on_failure_callback optional Dotted path to callable. Overrides DAG-level callback.

on_success_callback optional Dotted path to callable

on_retry_callback optional Dotted path to callable

Sensor base fields (sensors only)

All sensors inherit these additional fields from BaseSensorOperatorSpec:

poke_interval optional Seconds between pokes, > 0. Default: 60.0

timeout required Maximum seconds sensor can run, > 0

mode optional "poke" or "reschedule". Default: "poke"

soft_fail optional Mark the task SKIPPED instead of FAILED on timeout. Default: false

exponential_backoff optional Exponentially increasing poke_interval. Default: false

max_wait optional Max backoff seconds, > 0. Default: None

silent_fail optional Log exceptions as warnings. Default: false
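The constraints on the sensor fields can be summarized in a small dataclass. This is only a sketch of the documented defaults and bounds: `SensorSettings` is a hypothetical stand-in and is not the actual BaseSensorOperatorSpec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorSettings:
    """Hypothetical stand-in mirroring the sensor base fields listed above."""
    timeout: float                      # required, must be > 0
    poke_interval: float = 60.0         # must be > 0
    mode: str = "poke"                  # "poke" or "reschedule"
    soft_fail: bool = False
    exponential_backoff: bool = False
    max_wait: Optional[float] = None    # must be > 0 when set
    silent_fail: bool = False

    def __post_init__(self) -> None:
        if self.timeout <= 0:
            raise ValueError("timeout must be > 0")
        if self.poke_interval <= 0:
            raise ValueError("poke_interval must be > 0")
        if self.mode not in ("poke", "reschedule"):
            raise ValueError('mode must be "poke" or "reschedule"')
        if self.max_wait is not None and self.max_wait <= 0:
            raise ValueError("max_wait must be > 0")
```

Only `timeout` has to be supplied; everything else falls back to the defaults listed above.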

9. Dependencies

optional — defaults to [] (all tasks execute independently in parallel)

Defines task execution order using the >> (downstream) and << (upstream) operators. Dependency strings may reference both task_id and group_id values.

yaml

dependencies:
  - "task_a >> task_b >> task_c"            # sequential chain
  - "[task_x, task_y] >> task_z"            # fan-in (z waits for both x and y)
  - "task_z >> [task_a, task_b]"            # fan-out (both a and b run after z)
  - "task_c << [task_a, task_b]"            # fan-in (reverse notation)
  - "staging_group >> transform_group"      # task group references

Validation: All names in dependency strings are validated against declared task_id and group_id values at load time. An unknown name raises a ValueError with the list of valid names.
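To make the notation and the name check concrete, here is a minimal sketch of how a dependency string could be split into edges and validated. The function names are hypothetical and the parsing is deliberately simple; DagSmith's real parser may differ.

```python
import re


def _names(token: str) -> list:
    """Split 'a' or '[a, b]' into a list of names."""
    return [n.strip() for n in token.strip().strip("[]").split(",") if n.strip()]


def parse_dependency(line: str) -> list:
    """Return (upstream, downstream) pairs for one dependency string."""
    tokens = re.split(r"(>>|<<)", line)
    groups = [_names(t) for t in tokens[0::2]]   # operands
    ops = tokens[1::2]                           # operators between them
    edges = []
    for left, op, right in zip(groups, ops, groups[1:]):
        ups, downs = (left, right) if op == ">>" else (right, left)
        edges.extend((u, d) for u in ups for d in downs)
    return edges


def validate_names(lines: list, valid: set) -> None:
    """Raise ValueError listing the valid names if any reference is unknown."""
    used = {name for line in lines for edge in parse_dependency(line) for name in edge}
    unknown = sorted(used - set(valid))
    if unknown:
        raise ValueError(f"Unknown names {unknown}; valid names: {sorted(valid)}")
```

For example, `"[task_x, task_y] >> task_z"` yields two edges, one from each upstream task into `task_z`, and a misspelled name fails validation before any DAG code is generated.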
