declarative expectation definition with fluent api
Enables data teams to define data quality rules declaratively using a fluent Python API that chains expectation methods (e.g., expect_column_values_to_be_in_set, expect_table_row_count_to_be_between). Expectations are serialized as JSON and stored in ExpectationSuite objects, allowing version control and reuse across validation runs. The library ships 50+ built-in expectation types covering schema, distribution, and custom-metric checks.
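A minimal sketch of the fluent flow, assuming the GX 0.16+ fluent datasource API (method names vary across releases; the file path and column names are placeholders):

```python
import great_expectations as gx

# Create or load a local Data Context (the project configuration root).
context = gx.get_context()

# Read a batch into a Validator via the default pandas datasource.
# "orders.csv" and the column names below are illustrative placeholders.
validator = context.sources.pandas_default.read_csv("orders.csv")

# Chain expectation methods; each call validates the batch immediately
# and records the expectation in the working ExpectationSuite.
validator.expect_column_values_to_be_in_set("status", ["open", "shipped", "closed"])
validator.expect_table_row_count_to_be_between(min_value=1, max_value=1_000_000)
validator.expect_column_values_to_not_be_null("order_id")

# Persist the suite as JSON so it can be version-controlled and reused.
validator.save_expectation_suite(discard_failed_expectations=False)
```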
Unique: Uses a composable ExpectationSuite system where expectations are first-class JSON objects with metric providers, enabling expectations to be version-controlled, shared across teams, and executed against multiple execution engines (Pandas, SQL, Spark) without code changes
vs alternatives: More expressive and reusable than dbt tests (which are SQL-only) because it supports multiple data sources and provides a unified expectation language across engines; more maintainable than custom validation scripts because expectations are declarative and self-documenting
multi-engine validation execution with metric providers
Executes expectations against data using pluggable execution engines (Pandas, SQL, Spark, Databricks) by translating expectation definitions into engine-specific queries through a Metric Provider system. Each expectation maps to metrics (e.g., column_values, table_row_count) that are computed differently per engine: SQL expectations compile to WHERE clauses, Pandas uses vectorized operations, and Spark uses the DataFrame API. The Validator class orchestrates metric computation and result aggregation.
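A hedged sketch of engine portability, assuming the GX 0.16+ fluent datasource API and an already-saved suite named "orders_suite" (datasource names, connection string, and table/file names are placeholders):

```python
import pandas as pd
import great_expectations as gx

context = gx.get_context()

# SQL engine: metrics for the suite compile to queries pushed down to
# the warehouse rather than pulling rows into memory.
warehouse = context.sources.add_sql(
    name="warehouse", connection_string="postgresql://user:pass@host/db"
)
orders_table = warehouse.add_table_asset(name="orders", table_name="orders")
sql_results = context.get_validator(
    batch_request=orders_table.build_batch_request(),
    expectation_suite_name="orders_suite",
).validate()

# Pandas engine: the identical suite runs as vectorized DataFrame
# operations; no expectation definitions change between the two runs.
pandas_source = context.sources.add_pandas(name="local")
orders_asset = pandas_source.add_dataframe_asset(name="orders_df")
pandas_results = context.get_validator(
    batch_request=orders_asset.build_batch_request(dataframe=pd.read_csv("orders.csv")),
    expectation_suite_name="orders_suite",
).validate()
```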
Unique: Implements a Metric Provider abstraction layer that decouples expectation definitions from execution engines, allowing the same ExpectationSuite to execute against Pandas, SQL, Spark, and Databricks without modification by translating metrics to engine-native operations
vs alternatives: More scalable than Pandera (Pandas-only) for large datasets because it pushes computation to the database; more flexible than dbt tests because it supports non-SQL data sources and provides a unified validation language across engines
gx cloud integration with centralized validation management
Provides cloud-hosted validation management through GX Cloud, which centralizes expectations, validation runs, and data quality insights across teams. GX Cloud agents run validation checkpoints on a schedule and report results to the cloud backend, enabling web-based dashboards, team collaboration, and audit trails. The cloud platform supports role-based access control, validation scheduling, and integration with data sources (Snowflake, Redshift, Databricks) without requiring local infrastructure.
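A hedged connection sketch, assuming GX Cloud's documented credential environment variables (the checkpoint name is a placeholder, and cloud-mode flags differ across GX versions):

```python
import os
import great_expectations as gx

# GX Cloud credentials; the values here are placeholders.
os.environ["GX_CLOUD_ORGANIZATION_ID"] = "<your-org-id>"
os.environ["GX_CLOUD_ACCESS_TOKEN"] = "<your-access-token>"

# With cloud credentials set, get_context returns a cloud-backed Data
# Context that reads/writes suites, checkpoints, and results in GX Cloud.
context = gx.get_context(cloud_mode=True)

# Run a checkpoint defined in the cloud UI; results land in the dashboard.
result = context.run_checkpoint(checkpoint_name="nightly_orders_checkpoint")
```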
Unique: Provides a cloud-hosted SaaS platform that centralizes validation management, expectations, and results with web-based dashboards and team collaboration features, eliminating the need for teams to manage local GX infrastructure
vs alternatives: More managed than open-source GX Core because it eliminates infrastructure overhead; more collaborative than local deployments because it provides web-based dashboards and team access control
custom metric provider system for domain-specific validation
Enables teams to define custom metrics by subclassing MetricProvider and implementing compute methods for each execution engine (Pandas, SQL, Spark). Custom metrics are registered with the MetricProvider registry and can be used in expectations without modifying core GX code. The system supports metric parameters (e.g., 'column_name', 'threshold') and caching of metric results to avoid redundant computation.
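A hedged sketch following the GX 0.x custom-expectation pattern; the metric name and SKU regex are illustrative, and SQL/Spark support would add matching engine partials:

```python
from great_expectations.execution_engine import PandasExecutionEngine
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)

class ColumnValuesAreValidSku(ColumnMapMetricProvider):
    """Custom column-map metric; defining the class registers it."""

    # Metric name that a custom ColumnMapExpectation references as its
    # map_metric; "column_values.valid_sku" is an illustrative name.
    condition_metric_name = "column_values.valid_sku"

    @column_condition_partial(engine=PandasExecutionEngine)
    def _pandas(cls, column, **kwargs):
        # Vectorized pandas check: True where the value matches the pattern.
        return column.str.match(r"^SKU-\d{6}$")
```

Adding methods decorated with `@column_condition_partial(engine=SqlAlchemyExecutionEngine)` and `@column_condition_partial(engine=SparkDFExecutionEngine)` extends the same metric to SQL and Spark without touching the pandas path.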
Unique: Implements a MetricProvider registry system that allows custom metrics to be defined once and executed across multiple engines (Pandas, SQL, Spark) by implementing engine-specific compute methods, enabling domain-specific validation without modifying core GX code
vs alternatives: More extensible than fixed expectation sets because custom metrics can implement arbitrary validation logic; more maintainable than custom validation scripts because metrics are registered and reusable across expectations
automated data profiling with rule-based profiler
Generates ExpectationSuites automatically by analyzing data distributions using the Rule-Based Profiler, which applies heuristic rules to infer expectations (e.g., 'if a column has <10 unique values, expect values to be in set'). The profiler computes statistical metrics (cardinality, nullness, data types, value ranges) and applies configurable rules to suggest expectations. Results are stored as ExpectationSuites that can be reviewed, edited, and deployed without manual definition.
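A sketch using UserConfigurableProfiler, the simpler heuristic profiler shipped alongside the RuleBasedProfiler in GX 0.x (the RuleBasedProfiler proper takes a fuller rule configuration); the file and suite handling are illustrative:

```python
import great_expectations as gx
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("orders.csv")

# The profiler computes cardinality, nullness, types, and value ranges,
# then applies heuristics (e.g., low-cardinality columns get
# expect_column_values_to_be_in_set) to emit candidate expectations.
profiler = UserConfigurableProfiler(profile_dataset=validator)
suite = profiler.build_suite()

# Review and edit the generated suite before deploying it.
context.add_or_update_expectation_suite(expectation_suite=suite)
```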
Unique: Uses a Rule-Based Profiler that applies domain-specific heuristics (e.g., 'if cardinality < 10, expect values in set') to infer expectations from data samples, enabling one-click expectation generation without manual definition or ML model training
vs alternatives: More interpretable than ML-based anomaly detection (e.g., Evidently) because rules are explicit and auditable; faster than manual expectation writing because it analyzes data distributions automatically; more practical than schema inference tools because it generates executable validation rules, not just schema definitions
checkpoint-based validation orchestration with scheduling
Organizes validation runs into Checkpoints, which bundle a set of ExpectationSuites, data assets, and validation actions (e.g., send alert, update metadata) into a single executable unit. Checkpoints can be scheduled via Airflow, Prefect, or cron, and support conditional actions based on validation results (e.g., 'if validation fails, trigger PagerDuty alert'). The Checkpoint system stores validation history and provides a unified interface for monitoring data quality across pipelines.
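A hedged sketch of a checkpoint with a conditional alert action, assuming the GX 0.16-style checkpoint API (names, connection string, and webhook are placeholders); a scheduler like Airflow or cron simply invokes checkpoint.run():

```python
import great_expectations as gx

context = gx.get_context()

# Placeholder datasource/asset for the data being validated.
warehouse = context.sources.add_sql(
    name="warehouse", connection_string="postgresql://user:pass@host/db"
)
orders = warehouse.add_table_asset(name="orders", table_name="orders")

checkpoint = context.add_or_update_checkpoint(
    name="nightly_orders",
    validations=[{
        "batch_request": orders.build_batch_request(),
        "expectation_suite_name": "orders_suite",
    }],
    action_list=[
        # Persist results and refresh Data Docs on every run.
        {"name": "store_validation_result",
         "action": {"class_name": "StoreValidationResultAction"}},
        {"name": "update_data_docs",
         "action": {"class_name": "UpdateDataDocsAction"}},
        # Conditional alert: only fires when validation fails.
        {"name": "notify_slack",
         "action": {"class_name": "SlackNotificationAction",
                    "slack_webhook": "https://hooks.slack.com/services/<placeholder>",
                    "notify_on": "failure",
                    "renderer": {
                        "module_name": "great_expectations.render.renderer.slack_renderer",
                        "class_name": "SlackRenderer",
                    }}},
    ],
)
result = checkpoint.run()  # same call whether triggered by Airflow or cron
```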
Unique: Implements a Checkpoint abstraction that decouples validation logic from orchestration, allowing the same checkpoint to be triggered by Airflow, Prefect, or manual API calls while maintaining consistent action execution and result tracking
vs alternatives: More orchestration-agnostic than dbt tests (which are tightly coupled to dbt) because checkpoints work with any scheduler; more comprehensive than simple data quality monitors because they include action execution and result history tracking
data context system with pluggable store backends
Provides a DataContext abstraction that manages configuration, expectations, validation results, and metadata through pluggable store backends (e.g., TupleFilesystemStoreBackend, TupleS3StoreBackend, TupleGCSStoreBackend, DatabaseStoreBackend). The context system supports both file-based (YAML config) and cloud-based (GX Cloud) deployments, with stores handling persistence of expectations, validation results, and Data Docs. Stores are backend-agnostic, allowing teams to swap storage without changing application code.
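A hedged sketch of swapping the expectations store to S3 via add_store, using GX 0.x backend class names (bucket and prefix are placeholders; the same dict shape appears under stores: in great_expectations.yml):

```python
import great_expectations as gx

context = gx.get_context()

# Register an S3-backed expectations store; only the store_backend block
# changes when migrating to GCS (TupleGCSStoreBackend) or a database
# (DatabaseStoreBackend), so calling code stays the same.
context.add_store(
    store_name="expectations_s3_store",
    store_config={
        "class_name": "ExpectationsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-gx-bucket",   # placeholder bucket name
            "prefix": "expectations",   # key prefix inside the bucket
        },
    },
)
```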
Unique: Implements a pluggable Store system that abstracts persistence, allowing expectations and validation results to be stored in FileSystem, S3, GCS, or databases without changing application code, enabling seamless migration between storage backends
vs alternatives: More flexible than dbt's artifact storage (which is file-only) because it supports multiple backends; more scalable than local file storage because it enables cloud-native deployments with centralized metadata management
automated data docs generation with customizable renderers
Generates HTML documentation of expectations, validation results, and data quality metrics using a Site Builder that composes Page Renderers for different content types (ExpectationSuite pages, validation result pages, data asset pages). Renderers transform ExpectationSuite and ValidationResult objects into HTML using Jinja2 templates, with support for custom CSS and JavaScript. Data Docs are published to FileSystem, S3, or GCS and can be embedded in data catalogs or served as standalone sites.
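A hedged sketch of triggering docs generation; the site dict below mirrors the data_docs_sites block of great_expectations.yml and is shown only to illustrate the config shape (paths are placeholders; class names follow GX 0.x conventions):

```python
import great_expectations as gx

# Illustrative site configuration: a SiteBuilder that renders suite and
# validation-result pages to static HTML through a filesystem backend.
# Swapping the store_backend publishes the site to S3 or GCS instead.
local_site_config = {
    "class_name": "SiteBuilder",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "uncommitted/data_docs/local_site",
    },
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
}

context = gx.get_context()

# Render all ExpectationSuites and ValidationResults to HTML, then open
# the generated index page locally.
context.build_data_docs()
context.open_data_docs()
```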
Unique: Uses a composable Site Builder and Page Renderer system that transforms ExpectationSuite and ValidationResult objects into static HTML documentation with customizable Jinja2 templates, enabling auto-generated data quality documentation that stays in sync with validation logic
vs alternatives: More automated than manual documentation because it generates docs from expectations and validation results; more customizable than fixed-format reports because renderers are template-based and extensible