Introduction to ETL
What is ETL?
ETL (Extract, Transform, Load) is a powerful data processing module that enables you to build sophisticated data transformation pipelines. Unlike simple scripts or one-off database queries, the ETL module provides a visual, composable, and reusable framework for creating complex data workflows.
The Ambience ETL module allows you to: - Extract data from multiple sources (databases, files, APIs, etc.) - Transform and enrich data through sequential processing steps - Load results into databases, files, or other systems - Automate recurring data operations - Integrate with other Ambience modules (Scheduler, Workflows, Datasets, Dashboards)
Why Use ETL?
Reusable Transformation Logic
ETL chains are designed once and can be: - Executed on demand through the ETL Designer interface - Scheduled to run automatically via the Scheduler module - Called from workflows as part of business process automation - Used by datasets and dashboards for real-time data retrieval - Imported by other chainsets to share common logic
Visual Design & Maintenance
- Build complex pipelines by composing individual steps
- See the complete transformation flow at a glance
- Test chains interactively with immediate feedback
- Debug with step-by-step execution and logging
- Document chains with descriptions and test inputs
Streaming Architecture
The ETL module uses a reactive streaming model: - Process large datasets efficiently without loading everything into memory - Handle backpressure automatically - Support both batch and real-time processing - Scale to millions of records
Integration & Orchestration
ETL chains excel at integrating disparate systems: - Connect databases, APIs, files, and cloud services - Orchestrate multi-step data workflows - Handle errors and retries gracefully - Coordinate with other Ambience modules
When to Use ETL vs Other Approaches
| Use ETL When | Use Other Approaches When |
|---|---|
| Multi-step data transformation required | Simple CRUD operations suffice |
| Reusable logic needed across contexts | One-off data manipulation |
| Integration between multiple systems | Single system operation |
| Scheduled or automated processing | Manual, ad-hoc queries |
| Complex business rules and validation | Simple filtering or aggregation |
| Need to test and debug transformations | Straightforward database operations |
ETL + Workflows: Use workflows for state-dependent business processes (approvals, lifecycle management) and ETL for data transformation within those processes.
ETL + Scheduler: Use the scheduler to trigger ETL chains at specific times or intervals for automated data operations.
ETL + Datasets/Dashboards: Use ETL chains to provide on-demand data retrieval and transformation for interactive visualizations.
Common Use Cases
Data Migration & Synchronization
- Migrate data from legacy systems to modern databases
- Synchronize data between multiple systems
- Transform data formats during migration
- Validate and cleanse data during transfer
- Handle incremental updates and deltas
Report Generation & Distribution
- Extract data from operational databases
- Aggregate and calculate metrics
- Generate Excel, PDF, or other report formats
- Email reports to recipients
- Schedule daily, weekly, or monthly reports
Data Validation & Cleansing
- Validate incoming data against business rules
- Detect and flag data quality issues
- Standardize formats (dates, phone numbers, addresses)
- Remove duplicates and inconsistencies
- Enrich data with reference information
API Integration & Orchestration
- Call external REST APIs
- Transform API responses to internal formats
- Aggregate data from multiple APIs
- Handle authentication and rate limiting
- Retry failed requests with backoff
Log Processing & Monitoring
- Parse log files from various sources
- Extract metrics and error patterns
- Aggregate statistics
- Alert on anomalies or thresholds
- Archive processed logs
Scheduled Maintenance & Cleanup
- Archive old records
- Delete expired data
- Rebuild indexes
- Generate backup snapshots
- Clean up temporary files
Real-time Data Enrichment
- Receive data from message queues or webhooks
- Enrich with reference data from databases
- Apply business rules and transformations
- Route to appropriate destinations
- Trigger downstream processes
Multi-source Data Aggregation
- Extract from multiple databases
- Join and merge related data
- Resolve conflicts and duplicates
- Calculate derived metrics
- Load to data warehouse or analytics platform
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ ETL Chainset │
│ (Container for related chains) │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Chain: "ProcessOrders" │ │
│ │ (Sequence of transformation steps) │ │
│ │ │ │
│ │ Step 1: MongoDB Reader ──→ [Records Stream] │ │
│ │ Step 2: Filter Valid ──→ [Filtered] │ │
│ │ Step 3: Enrich Data ──→ [Enriched] │ │
│ │ Step 4: Calculate Total ──→ [Calculated] │ │
│ │ Step 5: MongoDB Writer ──→ [Stored] │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Chain: "SendNotification" │ │
│ │ (Called by other chains) │ │
│ │ │ │
│ │ Step 1: Compose Email ──→ [Email Doc] │ │
│ │ Step 2: Send Mail ──→ [Sent] │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Key Components
Chainset: A container that holds related chains. Chainsets can import other chainsets to reuse common chains.
Chain: A sequence of steps that transform data. Chains can call other chains, enabling composition and reusability.
Step: An individual transformation operation (e.g., read from database, filter records, calculate values, write to file). A comprehensive suite of steps is available across a wide range of categories.
Record: The unit of data flowing through a chain. Records are typically JSON documents that get transformed as they pass through steps.
ETL Step Categories
ETL provides a comprehensive suite of steps organised into categories covering every common data transformation need:
- Array - Array manipulation and operations
- String - Text processing and formatting
- MongoDB - Database operations
- Structure - Control flow and chain composition
- Number - Mathematical operations and calculations
- Date/Time - Date and timestamp operations
- File - File system operations
- JSON - JSON parsing and manipulation
- Mail - Email operations
- REST - HTTP API calls
- Validation - Data validation
- And many more categories…
Finding the right step is easy: the Dictionary (accessible from the ETL Designer toolbar) lists every available step with its description and parameters. The Add Step dialog provides a search bar — type a keyword to filter steps instantly. Each step also has built-in help accessible directly from its edit dialog.
Next Steps
- Quick Start Guide - Create your first ETL chain in 10 minutes
- Core Concepts - Understand chainsets, chains, steps, and records
- ETL Management - Manage chainsets and access control
- ETL Designer - Design and test ETL chains
- Examples - Complete real-world examples