python dag definition and compilation
Enables users to define workflows as Python code (DAGs) that are parsed, validated, and compiled into an internal task graph representation. The system uses dynamic Python execution to instantiate DAG objects from .py files in the DAG folder, extracting task dependencies through operator instantiation and bitshift operators (>> and <<). DAG serialization converts the graph into JSON for storage in the metadata database, enabling stateless scheduler restarts and multi-scheduler deployments.
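The dependency-capture mechanism behind the bitshift operators can be sketched in a few lines of plain Python. This is an illustrative model of the technique, not Airflow's actual operator classes:

```python
# Minimal model of how ">>" and "<<" can record task dependencies.
# Illustrative sketch only, not Airflow's BaseOperator implementation.
class Task:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.upstream: set[str] = set()
        self.downstream: set[str] = set()

    def __rshift__(self, other: "Task") -> "Task":
        # self >> other: run `other` after `self`
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)
        return other  # returning `other` enables chaining: a >> b >> c

    def __lshift__(self, other: "Task") -> "Task":
        # self << other: run `self` after `other`
        other >> self
        return other

extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load       # chained dependencies
print(load.upstream)               # {'transform'}
```

Because `__rshift__` returns its right-hand operand, chains like `a >> b >> c` build the full edge list in one expression; the upstream/downstream sets are what the compiler walks to produce the internal task graph.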
Unique: Uses Python as the DSL itself rather than a separate configuration language, enabling full programmatic control with loops, conditionals, and function composition. DAG serialization to JSON (not pickle) enables scheduler statelessness and multi-version deployments. Dynamic task mapping via expand() allows a single task definition to generate hundreds of parallel instances based on runtime data.
vs alternatives: More flexible than YAML- or configuration-driven orchestrators for complex logic, and comparable in expressiveness to other Python-native frameworks (Prefect, Dagster), but requires more operational discipline around code review and testing than declarative alternatives.
distributed task execution with pluggable executors
Executes tasks across distributed workers using a pluggable executor architecture that abstracts the underlying compute infrastructure. The system supports LocalExecutor (single machine), CeleryExecutor (distributed via message broker), KubernetesExecutor (pod-per-task), and custom executors. Tasks are queued with metadata, workers poll for assignments, and execution results are reported back via XCom (cross-communication) to the metadata database. The Supervisor process manages task lifecycle on each worker, spawning task runner subprocesses and capturing logs.
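The executor abstraction described above can be modeled as a small interface plus one swappable implementation. The names (`Executor`, `SequentialExecutor`, `queue`, `run_all`) are illustrative assumptions, not Airflow's BaseExecutor API:

```python
# Sketch of a pluggable executor interface: scheduling code depends only
# on the abstract class, so implementations can be swapped freely.
from abc import ABC, abstractmethod
from typing import Callable

class Executor(ABC):
    @abstractmethod
    def queue(self, task_id: str, fn: Callable[[], object]) -> None: ...
    @abstractmethod
    def run_all(self) -> dict[str, object]: ...

class SequentialExecutor(Executor):
    """Runs queued tasks in-process, like a local single-machine executor."""
    def __init__(self):
        self._queue: list[tuple[str, Callable[[], object]]] = []

    def queue(self, task_id, fn):
        self._queue.append((task_id, fn))

    def run_all(self):
        # The result dict keyed by task_id stands in for XCom writes
        # to the metadata database.
        return {task_id: fn() for task_id, fn in self._queue}

executor: Executor = SequentialExecutor()  # a Celery- or K8s-backed class
executor.queue("extract", lambda: [1, 2, 3])  # could be dropped in here
executor.queue("count", lambda: 3)
results = executor.run_all()
```

A distributed implementation would replace the in-memory list with a message broker or pod launches, while the scheduling code above it stays unchanged; that is the decoupling the paragraph describes.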
Unique: Pluggable executor architecture decouples task scheduling from execution infrastructure, allowing the same DAG code to run on a laptop (LocalExecutor), a Celery cluster, or Kubernetes without modification. A Supervisor process on each worker manages task lifecycle with subprocess isolation, enabling graceful shutdown and resource cleanup. The XCom system provides lightweight inter-task communication via the database, avoiding the need for external message passing for small payloads.
vs alternatives: More flexible executor abstraction than Prefect (which is cloud-first) or Dagster (which couples execution to deployment), but requires more operational overhead than managed services like AWS Step Functions or Google Cloud Workflows.
monitoring, alerting, and sla enforcement
Provides built-in monitoring and alerting for DAG runs and task instances. SLA (Service Level Agreement) definitions on DAGs and tasks trigger alerts when execution exceeds time thresholds. The system integrates with external alerting systems (email, Slack, PagerDuty) via callback functions. Metrics are exposed in Prometheus format for integration with monitoring stacks. Deadline-based scheduling allows enforcing hard deadlines with automatic alerting. Task retry logic with exponential backoff provides automatic recovery from transient failures.
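The retry-with-exponential-backoff behavior mentioned above can be sketched as a generic wrapper. This is a hedged illustration of the pattern, not Airflow's internal retry machinery; the sleep is stubbed out so the example runs instantly:

```python
# Sketch of retry with exponential backoff. `sleep` is injectable so the
# computed delays (1s, 2s, 4s, ...) can be inspected without waiting.
def run_with_retries(fn, max_retries=3, base_delay=1.0, backoff=2.0,
                     sleep=lambda s: None):
    delays = []
    for attempt in range(max_retries + 1):
        try:
            return fn(), delays
        except Exception:
            if attempt == max_retries:
                raise                 # retries exhausted: surface the failure
            delay = base_delay * backoff ** attempt
            delays.append(delay)
            sleep(delay)              # real code would call time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, delays = run_with_retries(flaky)   # succeeds on the third attempt
```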
Unique: Built-in SLA and deadline enforcement with pluggable alerting backends removes the need for external monitoring tools for basic alerting, while Prometheus-format metrics plug into existing monitoring stacks.
vs alternatives: More integrated monitoring than Prefect (which requires external tools) or Dagster (which has limited built-in alerting). Comparable to managed services (AWS Step Functions, Google Cloud Workflows) but with more customization options.
dag versioning and multi-version deployments
Enables running multiple versions of the same DAG simultaneously, allowing zero-downtime DAG updates. When a DAG definition changes, Airflow creates a new version while keeping the old version active for in-flight runs. The system tracks DAG version in the database, allowing queries to return results for specific versions. This enables gradual rollout of DAG changes: new runs use the new version while old runs continue with the old version. Version cleanup policies prevent unbounded growth of old versions.
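The version-pinning behavior can be modeled with a small store that appends a version on each code change and pins each run to the version current at its start. All names here are hypothetical, chosen for illustration:

```python
# Sketch of DAG version pinning: new runs use the latest version while
# in-flight runs keep the version they started with. Illustrative only.
class DagVersionStore:
    def __init__(self):
        self._versions: dict[str, list[str]] = {}  # dag_id -> source hashes

    def register(self, dag_id: str, source_hash: str) -> int:
        versions = self._versions.setdefault(dag_id, [])
        if not versions or versions[-1] != source_hash:
            versions.append(source_hash)   # code changed: new version
        return len(versions)               # 1-based version number

    def start_run(self, dag_id: str) -> int:
        # A run is pinned to the version that is latest when it starts.
        return len(self._versions[dag_id])

store = DagVersionStore()
store.register("etl", "hash-a")
run1 = store.start_run("etl")      # pinned to version 1
store.register("etl", "hash-b")    # code changed -> version 2 created
run2 = store.start_run("etl")      # new run uses version 2; run1 keeps 1
```

A cleanup policy would then delete versions with no remaining pinned runs, which is what keeps version history from growing without bound.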
Unique: Automatic DAG versioning on code changes enables zero-downtime updates without manual version management. In-flight runs continue with their original version while new runs use the new version. Version history provides audit trail of DAG modifications.
vs alternatives: More sophisticated than simple code replacement (which interrupts in-flight runs) but less flexible than manual version management. Comparable to Prefect's deployment versioning but with automatic version creation.
plugin system for custom operators, hooks, and executors
Extensibility mechanism allowing developers to create custom operators, hooks, executors, and other Airflow components without modifying core code. Plugins are discovered via entry points or by placing Python files in the plugins directory. The system provides base classes (BaseOperator, BaseHook, BaseExecutor) that plugins extend. Custom plugins are automatically registered and available in DAG definitions. This enables organizations to build proprietary operators for internal systems.
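The "automatically registered" behavior can be sketched with a base class that records every subclass at definition time. Airflow's real mechanism uses entry points and a plugins folder; the registry below is a simplified stand-in:

```python
# Sketch of automatic plugin registration via a base-class hook.
# Illustrative model, not Airflow's plugin manager.
class OperatorRegistry:
    operators: dict[str, type] = {}

class BaseOperator:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every subclass is registered the moment its class body is
        # executed, so no explicit registration call is needed.
        OperatorRegistry.operators[cls.__name__] = cls

class MyInternalOperator(BaseOperator):
    """A hypothetical proprietary operator for an internal system."""
    def execute(self):
        return "ran internal system task"

op_cls = OperatorRegistry.operators["MyInternalOperator"]
```

Merely importing the module containing `MyInternalOperator` makes it available by name, which mirrors how dropping a file in the plugins directory exposes custom components to DAG definitions.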
Unique: Entry point-based plugin discovery enables dynamic registration without modifying core code. Base classes provide clear extension points for operators, hooks, and executors. Plugins are automatically available in DAG definitions without explicit imports.
vs alternatives: More flexible than provider packages (which are published to PyPI) for internal-only extensions. Comparable to Prefect's custom tasks but with more mature plugin infrastructure.
sla monitoring and deadline-based alerts
Enables defining Service Level Agreements (SLAs) for tasks and DAGs, with automatic monitoring and alerting when SLAs are breached. SLAs are defined as timedelta values (e.g., task must complete within 1 hour of execution_date). The scheduler evaluates SLAs at each heartbeat and triggers alert callbacks when deadlines are missed. Supports custom alert handlers (email, Slack, webhooks) via callback functions.
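The scheduler-side SLA evaluation reduces to comparing elapsed time against a timedelta and firing a callback on breach. A minimal sketch, with hypothetical names (`check_sla`, `on_miss`) standing in for the real callback plumbing:

```python
# Sketch of an SLA check as run at each scheduler heartbeat:
# compare elapsed time against the SLA timedelta, alert on breach.
from datetime import datetime, timedelta

def check_sla(start: datetime, now: datetime, sla: timedelta, on_miss) -> bool:
    """Return True (and invoke the on_miss callback) if the SLA is breached."""
    if now - start > sla:
        on_miss(f"SLA of {sla} missed; elapsed {now - start}")
        return True
    return False

alerts: list[str] = []
start = datetime(2024, 1, 1, 0, 0)
one_hour = timedelta(hours=1)
check_sla(start, start + timedelta(minutes=50), one_hour, alerts.append)  # ok
check_sla(start, start + timedelta(minutes=70), one_hour, alerts.append)  # miss
```

Swapping `alerts.append` for a function that posts to email, Slack, or a webhook is exactly the custom-callback extension point the paragraph describes.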
Unique: Implements SLA monitoring at the scheduler level, enabling automatic deadline tracking without external monitoring tools. Supports custom alert callbacks, allowing teams to integrate SLA alerts with existing notification systems.
vs alternatives: More integrated than external SLA tools because SLAs are defined in DAG code and monitored by the scheduler; more flexible than cloud-native SLA services because alert logic is custom Python code.
database-backed state management and recovery
Uses a relational database (PostgreSQL, MySQL, SQLite) to persist all Airflow state: DAG definitions, task instances, execution history, connections, and variables. The database schema includes tables for dag, dag_run, task_instance, xcom, log, and connection. State is serialized to JSON for complex objects (DAG definitions, task parameters). The scheduler can recover from crashes by querying the database for incomplete tasks and resuming execution.
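Crash recovery in this design is just a query for non-terminal task states. A minimal sketch using stdlib sqlite3 (an in-memory database for brevity; a real deployment uses PostgreSQL or MySQL, and the column set here is simplified):

```python
# Sketch of database-backed recovery: persist task state, then after a
# simulated scheduler crash, query for tasks that still need attention.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")
db.executemany(
    "INSERT INTO task_instance VALUES (?, ?)",
    [("extract", "success"), ("transform", "running"), ("load", "queued")],
)
db.commit()

# Recovery after a restart: the database is the single source of truth,
# so incomplete work is whatever is not in a terminal state.
rows = db.execute(
    "SELECT task_id FROM task_instance "
    "WHERE state NOT IN ('success', 'failed')"
).fetchall()
to_resume = [task_id for (task_id,) in rows]   # ['transform', 'load']
```

Because no scheduler-local state is needed to produce `to_resume`, any scheduler instance can perform the recovery, which is what makes stateless restarts and multi-scheduler deployments possible.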
Unique: Uses a relational database as the single source of truth for all Airflow state, enabling stateless scheduler restarts and multi-scheduler deployments. Serializes complex objects (DAG definitions, task parameters) to JSON, enabling schema-less storage of dynamic data.
vs alternatives: More reliable than in-memory state because state is persisted across restarts; more scalable than file-based state because database queries are optimized for large datasets.
scheduler-driven dag run instantiation and task queuing
The SchedulerJobRunner process continuously parses DAG files, evaluates scheduling rules (cron expressions, asset dependencies, deadlines), and instantiates DagRun objects when conditions are met. For each DagRun, the scheduler traverses the task dependency graph, evaluates task-level scheduling rules, and queues TaskInstance objects to the executor's queue. The scheduler uses a heartbeat-based loop (default 1s) with database-backed state to track which DagRuns and TaskInstances have been processed, enabling recovery after restarts. Asset-based scheduling allows DAGs to trigger when upstream datasets (assets) are updated.
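One pass of the queuing step described above amounts to selecting every task whose upstream tasks have all succeeded. A minimal sketch with hardcoded dependency data standing in for the database-backed state:

```python
# Sketch of one scheduler pass: queue every not-yet-scheduled task whose
# upstream tasks have all succeeded. Illustrative model only.
def schedulable(deps: dict[str, set[str]], states: dict[str, str]) -> list[str]:
    return [
        task for task, upstream in deps.items()
        if states[task] == "none"
        and all(states[u] == "success" for u in upstream)
    ]

deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
states = {"extract": "success", "transform": "none", "load": "none"}
ready = schedulable(deps, states)   # only "transform" is runnable now
```

Repeating this evaluation on every heartbeat, with `states` read from and written back to the database, is what lets any scheduler instance resume the loop after a restart.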
Unique: Decouples scheduling logic from execution via database-backed task queue, enabling multiple independent schedulers and stateless restarts. Supports multiple scheduling modes: time-based (cron), asset-based (data dependencies), and deadline-based (SLA enforcement). DAG file parsing happens in scheduler process, not in workers, centralizing parsing errors and reducing worker overhead.
vs alternatives: More sophisticated scheduling than cron-only systems (Unix cron, simple schedulers), with asset-based triggering comparable to dbt's manifest-based scheduling. The scheduling loop is conceptually simpler than Prefect's distributed scheduler, but large deployments require careful tuning of parsing and heartbeat intervals.
+7 more capabilities