YAML Spec Format

Every section and field in the DagSmith YAML specification, explained.

Section Order

DagSmith YAML specs follow a conventional section order. All sections except metadata, dag, and gcp are optional.

yaml
variables:          # 1. Optional - key-value pairs for ${VAR} substitution
configurations:     # 2. Optional - reusable config values
metadata:           # 3. Required - documentation metadata
dag:                # 4. Required - DAG constructor arguments
gcp:                # 5. Required - GCP connection defaults
default_args:       # 6. Optional - Airflow default_args
user_defined_macros: # 7. Optional - Jinja macros
tasks:              # 8. Optional - operator/sensor/group specs
dependencies:       # 9. Optional - task execution order

1. Variables

optional

Key-value pairs for ${VAR_NAME} substitution throughout the entire YAML. Expansion happens before Pydantic validation, so variables work in every section.

Naming rules (strictly enforced)

Rule | Example
Must be ALL UPPERCASE | VAR__PROJECT_ID__VAR
Must begin with VAR__ | VAR__DATASET__VAR
Must end with __VAR | VAR__ENV__VAR
yaml
variables:
  VAR__PROJECT_ID__VAR: "my-gcp-project-001"
  VAR__DATASET__VAR: "warehouse_tables"
  VAR__ENV__VAR: "prod"
  VAR__BUNDLE__VAR: "daily_load"

# Usage anywhere in the spec:
gcp:
  project_id: "${VAR__PROJECT_ID__VAR}"   # expands to "my-gcp-project-001"

configurations:
  base_path: "/home/airflow/gcs/dags/${VAR__PROJECT_ID__VAR}/"
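The naming rules can be checked with a single regular expression. The Python sketch below is an illustration only, not DagSmith's actual validator: the function name is hypothetical, and the character set allowed between the VAR__ prefix and the __VAR suffix (uppercase letters, digits, underscores) is an assumption.

```python
import re

# Assumed pattern: VAR__ prefix, __VAR suffix, uppercase body.
# Allowing digits and underscores in the body is a guess, not a documented rule.
VARIABLE_NAME = re.compile(r"VAR__[A-Z0-9_]+__VAR")


def is_valid_variable_name(name: str) -> bool:
    """Return True if the name satisfies all three naming rules."""
    return VARIABLE_NAME.fullmatch(name) is not None


print(is_valid_variable_name("VAR__PROJECT_ID__VAR"))  # True
print(is_valid_variable_name("var__project_id__var"))  # False (not uppercase)
print(is_valid_variable_name("PROJECT_ID"))            # False (no prefix/suffix)
```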

When to use variables: Same spec deployed across environments (dev/staging/prod), repeated values like project_id in multiple tasks, paths with embedded identifiers.

When to skip: Single-environment DAGs with no repeated values, quick prototyping where indirection adds noise.

How expansion works

  1. DagSmith reads the variables section from the raw YAML.
  2. Variable names are checked against the naming rules; an invalid name is rejected with a clear error before any expansion happens.
  3. Every ${VAR__...__VAR} reference is replaced with its value as a plain string.
  4. The expanded YAML is then parsed and validated by Pydantic.
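The expansion flow described above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not DagSmith's implementation: expand_variables is a hypothetical name, the variables are passed in as an already-parsed mapping, and leaving unknown ${...} references untouched is an assumed behavior that the spec does not document.

```python
import re
from typing import Mapping

VARIABLE_NAME = re.compile(r"VAR__[A-Z0-9_]+__VAR")  # assumed name pattern
REFERENCE = re.compile(r"\$\{([^}]+)\}")              # matches ${...}


def expand_variables(raw_yaml: str, variables: Mapping[str, str]) -> str:
    """Replace ${NAME} references in raw YAML text with plain-string values."""
    # Reject invalid names up front, before any substitution happens.
    invalid = [name for name in variables if not VARIABLE_NAME.fullmatch(name)]
    if invalid:
        raise ValueError(f"Invalid variable names: {invalid}")

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Assumption: references to undefined variables are left as-is.
        return str(variables.get(name, match.group(0)))

    return REFERENCE.sub(substitute, raw_yaml)


raw = 'gcp:\n  project_id: "${VAR__PROJECT_ID__VAR}"\n'
print(expand_variables(raw, {"VAR__PROJECT_ID__VAR": "my-gcp-project-001"}))
```

Because substitution runs on the raw text, the result still has to go through a YAML parser and schema validation afterwards.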

2. Configurations

optional — default: base_path="/home/airflow/gcs/dags"

Reusable typed configuration values. Unlike variables, which are substituted as plain strings, configurations are preserved as typed values. The section also accepts arbitrary additional keys.

yaml
configurations:
  base_path: "/home/airflow/gcs/dags/${VAR__PROJECT_ID__VAR}/"
  # custom_key: "any additional config"  # extra keys allowed
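To illustrate the difference from variables, the Python sketch below merges a parsed configurations section over the documented base_path default and keeps every value's type. The function name load_configurations and the extra keys used in the example are hypothetical.

```python
from typing import Any, Dict, Optional

DEFAULT_BASE_PATH = "/home/airflow/gcs/dags"  # default stated for this section


def load_configurations(section: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Apply the base_path default and keep any extra keys with their types."""
    merged: Dict[str, Any] = {"base_path": DEFAULT_BASE_PATH}
    merged.update(section or {})
    return merged


# Extra keys (hypothetical names) keep their YAML types: int stays int, bool stays bool.
config = load_configurations({"max_files": 10, "dry_run": False})
print(config["base_path"], type(config["max_files"]).__name__)
```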

3. Metadata

required — all fields required

Documentation metadata rendered into the generated file's docstring header.

title | required | Descriptive DAG title
owner | required | Team or individual owner
email | required | Contact email
version | required | Semantic version string
jira | required | JIRA ticket reference
developer_name | required | Developer or DAG identifier
yaml
metadata:
  title: "Daily Account Activity Load"
  owner: "data-team@example.com"
  email: "data-team@example.com"
  version: "1.0.0"
  jira: "PROJ-1234"
  developer_name: "daily_load"
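As a rough picture of how the metadata could end up in a docstring header, the Python sketch below checks that all six fields are present and renders them as text. The layout of the header and the function name render_docstring_header are assumptions; the real generated format may differ.

```python
from typing import Mapping

REQUIRED_FIELDS = ("title", "owner", "email", "version", "jira", "developer_name")


def render_docstring_header(metadata: Mapping[str, str]) -> str:
    """Render metadata as a module docstring (hypothetical layout)."""
    missing = [name for name in REQUIRED_FIELDS if not metadata.get(name)]
    if missing:
        raise ValueError(f"Missing required metadata fields: {missing}")

    lines = ['"""', metadata["title"], ""]
    # Every field after the title becomes one "name: value" line.
    for name in REQUIRED_FIELDS[1:]:
        lines.append(f"{name}: {metadata[name]}")
    lines.append('"""')
    return "\n".join(lines)


print(render_docstring_header({
    "title": "Daily Account Activity Load",
    "owner": "data-team@example.com",
    "email": "data-team@example.com",
    "version": "1.0.0",
    "jira": "PROJ-1234",
    "developer_name": "daily_load",
}))
```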

4. DAG

required — maps to airflow.DAG() constructor

dag_id | required | Unique DAG identifier
description | optional | Default: ""
schedule | optional | Cron expression or preset (@daily, @hourly, etc.). Default: None (manual only). Alias: schedule_interval
start_date | optional | YYYY-MM-DD or YYYY-MM-DD HH:MM:SS. Default: now
timezone | optional | IANA timezone string. Default: "UTC"
catchup | optional | Backfill past runs. Default: false
max_active_runs | optional | Must be ≥ 1. Default: 1
dagrun_timeout | required | Maximum seconds per DAG run (must be > 0)
is_paused_upon_creation | optional | Default: true
tags | optional | List of organizational tags. Default: []
params | optional | DAG-level Airflow Params for Trigger UI. Default: None
template_searchpath | optional | Jinja template search paths. Default: []
sla_miss_callback | optional | Dotted path to callable. Default: None
doc_md | optional | DAG documentation markdown. Default: None
yaml
dag:
  dag_id: "daily_account_load"
  description: "Load daily account activity into BigQuery."
  schedule: "0 6 * * *"                  # 6 AM daily
  start_date: "2026-01-02 12:13:14"
  timezone: "America/New_York"
  catchup: false
  max_active_runs: 1
  dagrun_timeout: 7200
  is_paused_upon_creation: true
  tags:
    - "warehouse:bigquery"
    - "module:daily_load"

  # DAG-level params (optional)
  params:
    env:
      type: "string"
      default: "PROD"
      enum: ["PROD", "PLE", "DEV"]
      title: "Environment"
      description: "Target environment"
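The constraints on the dag section can be sketched as plain Python checks. The example below is an approximation, not DagSmith's Pydantic model: the function names are hypothetical, timezone handling is left out, and converting dagrun_timeout from seconds to a timedelta is an assumption based on airflow.DAG accepting a timedelta for that argument.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

START_DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_start_date(value: str) -> datetime:
    """Accept either of the two documented start_date formats."""
    for fmt in START_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(
        f"start_date must be YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, got {value!r}"
    )


def check_dag_section(dag: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the documented constraints and normalize time values."""
    if not dag.get("dag_id"):
        raise ValueError("dag_id is required")
    timeout = dag.get("dagrun_timeout")
    if timeout is None or timeout <= 0:
        raise ValueError("dagrun_timeout is required and must be > 0")
    max_active_runs = dag.get("max_active_runs", 1)
    if max_active_runs < 1:
        raise ValueError("max_active_runs must be >= 1")

    checked = dict(dag)
    checked["max_active_runs"] = max_active_runs
    checked["dagrun_timeout"] = timedelta(seconds=timeout)  # assumed conversion
    if "start_date" in dag:
        checked["start_date"] = parse_start_date(dag["start_date"])
    return checked


result = check_dag_section(
    {"dag_id": "daily_account_load", "dagrun_timeout": 7200,
     "start_date": "2026-01-02 12:13:14"}
)
print(result["dagrun_timeout"], result["start_date"])
```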

5. GCP

required — GCP connection defaults shared across tasks

project_id | optional | GCP project ID. Default: None
location | optional | GCP location (e.g. "US", "us-east4"). Default: None
gcp_conn_id | optional | Airflow connection ID. Default: "google_cloud_default". Alias: google_cloud_conn_id
impersonation_chain | optional | Service account email or chained list. Default: None
deferrable | optional | Enable async mode for sensors. Default: false
yaml
gcp:
  project_id: "my-gcp-project-001"
  location: "US"
  gcp_conn_id: "google_cloud_default"
  # impersonation_chain: "sa@project.iam.gserviceaccount.com"
  # deferrable: false
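Several fields in this spec accept an alias, such as gcp_conn_id and google_cloud_conn_id here. The Python sketch below shows one way a loader might resolve a canonical name against its alias. The helper resolve_alias is hypothetical, and treating "both spellings present" as an error is an assumption rather than documented behavior.

```python
from typing import Any, Mapping


def resolve_alias(section: Mapping[str, Any], name: str, alias: str,
                  default: Any = None) -> Any:
    """Return the value stored under the canonical name or its alias."""
    has_name = name in section
    has_alias = alias in section
    if has_name and has_alias:
        # Assumption: supplying both spellings is ambiguous and rejected.
        raise ValueError(f"Use either {name!r} or {alias!r}, not both")
    if has_name:
        return section[name]
    if has_alias:
        return section[alias]
    return default


gcp = {"project_id": "my-gcp-project-001", "google_cloud_conn_id": "my_conn"}
print(resolve_alias(gcp, "gcp_conn_id", "google_cloud_conn_id",
                    default="google_cloud_default"))  # my_conn
```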

6. Default Args

optional — applied to every task in the DAG

owner | Default: "airflow"
depends_on_past | Default: false
retries | Must be ≥ 0. Default: 3
retry_delay | Seconds, must be ≥ 30. Default: 60. Alias: retry_delay_seconds
sla | SLA in seconds, must be > 0. Default: None. Alias: sla_seconds
deferrable | Default: true
email | List of alert recipients. Also accepts comma-separated string. Default: []
email_on_failure | Default: false
email_on_retry | Default: false
on_failure_callback | Dotted path to callable. Default: None
on_success_callback | Dotted path to callable. Default: None
on_retry_callback | Dotted path to callable. Default: None
yaml
default_args:
  owner: "airflow"
  depends_on_past: false
  retries: 3
  retry_delay: 60
  deferrable: true
  email:
    - "data-team@example.com"
  email_on_failure: true
  email_on_retry: false
  # on_failure_callback: "mypackage.callbacks.on_failure"
  # on_success_callback: "mypackage.callbacks.on_success"
  # on_retry_callback: "mypackage.callbacks.on_retry"
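Two of the default_args rules lend themselves to a short illustration: email accepts either a list or a comma-separated string, and retry_delay is given in seconds with a minimum of 30. The Python sketch below uses hypothetical helper names, and the conversion to a timedelta is an assumption about what is ultimately handed to Airflow.

```python
from datetime import timedelta
from typing import List, Sequence, Union


def normalize_email(value: Union[str, Sequence[str], None]) -> List[str]:
    """Turn a list or a comma-separated string into a list of recipients."""
    if value is None:
        return []
    if isinstance(value, str):
        return [part.strip() for part in value.split(",") if part.strip()]
    return list(value)


def check_retry_delay(seconds: float = 60) -> timedelta:
    """Enforce the documented minimum and convert seconds to a timedelta."""
    if seconds < 30:
        raise ValueError("retry_delay must be >= 30 seconds")
    return timedelta(seconds=seconds)


print(normalize_email("data-team@example.com, oncall@example.com"))
print(check_retry_delay(60))
```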

7. User Defined Macros

optional — Jinja macros injected into the DAG constructor

All keys become macro names available as {{ macro_name }} in Jinja templates. Values must be scalars (str, int, float, bool, None).

yaml
user_defined_macros:
  project_name: "my-gcp-project-001"     # {{ project_name }} in templates
  fact_dataset: "warehouse_tables"        # {{ fact_dataset }}
  bundle: "daily_load"
  env: "prod"
  datastore: "BQ"
  latency: 1
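The scalar-only rule for macro values can be expressed as a simple type check. The following Python sketch is illustrative; check_macros is a hypothetical name, and the error message format is invented.

```python
from typing import Any, Dict

# The scalar types listed for macro values.
SCALAR_TYPES = (str, int, float, bool, type(None))


def check_macros(macros: Dict[str, Any]) -> Dict[str, Any]:
    """Reject macro values that are not scalars (e.g. lists or dicts)."""
    bad = {
        key: type(value).__name__
        for key, value in macros.items()
        if not isinstance(value, SCALAR_TYPES)
    }
    if bad:
        raise ValueError(f"Macro values must be scalars, got: {bad}")
    return macros


check_macros({"project_name": "my-gcp-project-001", "latency": 1, "env": "prod"})
```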

8. Tasks

optional — defaults to []

A list of operator, sensor, and TaskGroup specs. Each task uses the operator field as a discriminator to select the correct Pydantic model. See the Operators & Sensors page for details on each type.
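The discriminator idea can be sketched without Pydantic by mapping the operator value to a spec class. In the Python example below, plain dataclasses and a dictionary stand in for the real Pydantic models; the class names, the configuration field, and the ExampleSensor entry are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Type


@dataclass
class BaseTaskSpec:
    task_id: str
    operator: str
    trigger_rule: str = "all_success"
    kwargs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class BigQueryInsertJobSpec(BaseTaskSpec):
    configuration: Dict[str, Any] = field(default_factory=dict)  # hypothetical field


@dataclass
class ExampleSensorSpec(BaseTaskSpec):  # hypothetical sensor spec
    timeout: float = 3600.0


# The operator value selects which spec class validates the task.
SPEC_REGISTRY: Dict[str, Type[BaseTaskSpec]] = {
    "BigQueryInsertJobOperator": BigQueryInsertJobSpec,
    "ExampleSensor": ExampleSensorSpec,
}


def build_task_spec(raw: Dict[str, Any]) -> BaseTaskSpec:
    operator = raw.get("operator")
    if operator not in SPEC_REGISTRY:
        raise ValueError(
            f"Unknown operator {operator!r}; expected one of {sorted(SPEC_REGISTRY)}"
        )
    return SPEC_REGISTRY[operator](**raw)


spec = build_task_spec({"task_id": "load", "operator": "BigQueryInsertJobOperator"})
print(type(spec).__name__)  # BigQueryInsertJobSpec
```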

Common base fields (all task types)

task_id | required | Unique task identifier
operator | required | Operator/sensor class name (e.g. BigQueryInsertJobOperator)
trigger_rule | optional | Default: "all_success". Options: all_success, all_failed, all_done, one_success, one_failed, none_failed, etc.
retries | optional | Task-level override of default_args.retries
retry_delay | optional | Task-level override. Alias: retry_delay_seconds
doc_md | optional | Task documentation markdown
params | optional | Dict of task-level Jinja params
kwargs | optional | Additional operator kwargs dict
on_failure_callback | optional | Dotted path to callable. Overrides DAG-level callback.
on_success_callback | optional | Dotted path to callable
on_retry_callback | optional | Dotted path to callable

Sensor base fields (sensors only)

All sensors inherit these additional fields from BaseSensorOperatorSpec:

poke_interval | optional | Seconds between pokes, > 0. Default: 60.0
timeout | required | Maximum seconds sensor can run, > 0
mode | optional | "poke" or "reschedule". Default: "poke"
soft_fail | optional | SKIPPED instead of FAILED on timeout. Default: false
exponential_backoff | optional | Exponentially increasing poke_interval. Default: false
max_wait | optional | Max backoff seconds, > 0. Default: None
silent_fail | optional | Log exceptions as warnings. Default: false
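The sensor base fields and their constraints can be mirrored in a small dataclass. This Python sketch is a stand-in for illustration and is not the BaseSensorOperatorSpec model itself; the class name SensorBaseFields is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorBaseFields:
    timeout: float                      # required, must be > 0
    poke_interval: float = 60.0         # must be > 0
    mode: str = "poke"                  # "poke" or "reschedule"
    soft_fail: bool = False
    exponential_backoff: bool = False
    max_wait: Optional[float] = None    # must be > 0 when set
    silent_fail: bool = False

    def __post_init__(self) -> None:
        if self.timeout <= 0:
            raise ValueError("timeout must be > 0")
        if self.poke_interval <= 0:
            raise ValueError("poke_interval must be > 0")
        if self.mode not in ("poke", "reschedule"):
            raise ValueError('mode must be "poke" or "reschedule"')
        if self.max_wait is not None and self.max_wait <= 0:
            raise ValueError("max_wait must be > 0")


sensor = SensorBaseFields(timeout=3600, mode="reschedule")
print(sensor.poke_interval)  # 60.0
```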

9. Dependencies

optional — defaults to [] (all tasks execute independently in parallel)

Defines task execution order using the >> (downstream) and << (upstream) operators. Dependency strings can reference both task_id and group_id values.

yaml
dependencies:
  - "task_a >> task_b >> task_c"            # sequential chain
  - "[task_x, task_y] >> task_z"            # fan-in (z waits for both x and y)
  - "task_z >> [task_a, task_b]"            # fan-out (both a and b run after z)
  - "task_c << [task_a, task_b]"            # fan-in (reverse notation)
  - "staging_group >> transform_group"      # task group references

Validation: All names in dependency strings are validated against declared task_id and group_id values at load time. An unknown name raises a ValueError with the list of valid names.
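A dependency string can be turned into (upstream, downstream) edges with a small parser. The Python sketch below is one possible approach, assuming adjacent operands are linked pairwise from left to right. The function names are hypothetical and the error message only imitates the behavior described above.

```python
import re
from typing import List, Set, Tuple


def _parse_operand(text: str) -> List[str]:
    """Parse 'task_a' or '[task_a, task_b]' into a list of names."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    names = [name.strip() for name in text.split(",") if name.strip()]
    if not names:
        raise ValueError("Dependency operand is empty")
    return names


def parse_dependency(expr: str, known_names: Set[str]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) edges for one dependency string."""
    # Splitting on a capture group keeps the operators in the result list.
    parts = re.split(r"(>>|<<)", expr)
    operands = [_parse_operand(part) for part in parts[0::2]]
    operators = parts[1::2]

    for names in operands:
        for name in names:
            if name not in known_names:
                raise ValueError(
                    f"Unknown name {name!r}; valid names: {sorted(known_names)}"
                )

    edges: List[Tuple[str, str]] = []
    for left, op, right in zip(operands, operators, operands[1:]):
        upstream, downstream = (left, right) if op == ">>" else (right, left)
        edges.extend((u, d) for u in upstream for d in downstream)
    return edges


known = {"task_a", "task_b", "task_c"}
print(parse_dependency("task_a >> task_b >> task_c", known))
print(parse_dependency("task_c << [task_a, task_b]", known))
```

Fan-in and fan-out both reduce to the same rule: every name on the upstream side gets an edge to every name on the downstream side.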