Introduction to ETL

What is ETL?

ETL (Extract, Transform, Load) is a powerful data processing module that enables you to build sophisticated data transformation pipelines. Unlike simple scripts or one-off database queries, the ETL module provides a visual, composable, and reusable framework for creating complex data workflows.

The Ambience ETL module allows you to: - Extract data from multiple sources (databases, files, APIs, etc.) - Transform and enrich data through sequential processing steps - Load results into databases, files, or other systems - Automate recurring data operations - Integrate with other Ambience modules (Scheduler, Workflows, Datasets, Dashboards)

Why Use ETL?

Reusable Transformation Logic

ETL chains are designed once and can be: - Executed on demand through the ETL Designer interface - Scheduled to run automatically via the Scheduler module - Called from workflows as part of business process automation - Used by datasets and dashboards for real-time data retrieval - Imported by other chainsets to share common logic

Visual Design & Maintenance

Build complex pipelines by composing individual steps
See the complete transformation flow at a glance
Test chains interactively with immediate feedback
Debug with step-by-step execution and logging
Document chains with descriptions and test inputs

Streaming Architecture

The ETL module uses a reactive streaming model: - Process large datasets efficiently without loading everything into memory - Handle backpressure automatically - Support both batch and real-time processing - Scale to millions of records

Integration & Orchestration

ETL chains excel at integrating disparate systems: - Connect databases, APIs, files, and cloud services - Orchestrate multi-step data workflows - Handle errors and retries gracefully - Coordinate with other Ambience modules

When to Use ETL vs Other Approaches

Use ETL When	Use Other Approaches When
Multi-step data transformation required	Simple CRUD operations suffice
Reusable logic needed across contexts	One-off data manipulation
Integration between multiple systems	Single system operation
Scheduled or automated processing	Manual, ad-hoc queries
Complex business rules and validation	Simple filtering or aggregation
Need to test and debug transformations	Straightforward database operations

ETL + Workflows: Use workflows for state-dependent business processes (approvals, lifecycle management) and ETL for data transformation within those processes.

ETL + Scheduler: Use the scheduler to trigger ETL chains at specific times or intervals for automated data operations.

ETL + Datasets/Dashboards: Use ETL chains to provide on-demand data retrieval and transformation for interactive visualizations.

Common Use Cases

Data Migration & Synchronization

Migrate data from legacy systems to modern databases
Synchronize data between multiple systems
Transform data formats during migration
Validate and cleanse data during transfer
Handle incremental updates and deltas

Report Generation & Distribution

Extract data from operational databases
Aggregate and calculate metrics
Generate Excel, PDF, or other report formats
Email reports to recipients
Schedule daily, weekly, or monthly reports

Data Validation & Cleansing

Validate incoming data against business rules
Detect and flag data quality issues
Standardize formats (dates, phone numbers, addresses)
Remove duplicates and inconsistencies
Enrich data with reference information

API Integration & Orchestration

Call external REST APIs
Transform API responses to internal formats
Aggregate data from multiple APIs
Handle authentication and rate limiting
Retry failed requests with backoff

Log Processing & Monitoring

Parse log files from various sources
Extract metrics and error patterns
Aggregate statistics
Alert on anomalies or thresholds
Archive processed logs

Scheduled Maintenance & Cleanup

Archive old records
Delete expired data
Rebuild indexes
Generate backup snapshots
Clean up temporary files

Real-time Data Enrichment

Receive data from message queues or webhooks
Enrich with reference data from databases
Apply business rules and transformations
Route to appropriate destinations
Trigger downstream processes

Multi-source Data Aggregation

Extract from multiple databases
Join and merge related data
Resolve conflicts and duplicates
Calculate derived metrics
Load to data warehouse or analytics platform

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                      ETL Chainset                       │
│  (Container for related chains)                         │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │              Chain: "ProcessOrders"             │    │
│  │  (Sequence of transformation steps)             │    │
│  │                                                  │    │
│  │  Step 1: MongoDB Reader  ──→  [Records Stream]  │    │
│  │  Step 2: Filter Valid    ──→  [Filtered]        │    │
│  │  Step 3: Enrich Data     ──→  [Enriched]        │    │
│  │  Step 4: Calculate Total ──→  [Calculated]      │    │
│  │  Step 5: MongoDB Writer  ──→  [Stored]          │    │
│  └────────────────────────────────────────────────┘    │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │         Chain: "SendNotification"               │    │
│  │  (Called by other chains)                       │    │
│  │                                                  │    │
│  │  Step 1: Compose Email   ──→  [Email Doc]       │    │
│  │  Step 2: Send Mail       ──→  [Sent]            │    │
│  └────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

Key Components

Chainset: A container that holds related chains. Chainsets can import other chainsets to reuse common chains.

Chain: A sequence of steps that transform data. Chains can call other chains, enabling composition and reusability.

Step: An individual transformation operation (e.g., read from database, filter records, calculate values, write to file). A comprehensive suite of steps is available across a wide range of categories.

Record: The unit of data flowing through a chain. Records are typically JSON documents that get transformed as they pass through steps.

ETL Step Categories

ETL provides a comprehensive suite of steps organised into categories covering every common data transformation need:

Array - Array manipulation and operations
String - Text processing and formatting
MongoDB - Database operations
Structure - Control flow and chain composition
Number - Mathematical operations and calculations
Date/Time - Date and timestamp operations
File - File system operations
JSON - JSON parsing and manipulation
Mail - Email operations
REST - HTTP API calls
Validation - Data validation
And many more categories…

Finding the right step is easy: the Dictionary (accessible from the ETL Designer toolbar) lists every available step with its description and parameters. The Add Step dialog provides a search bar — type a keyword to filter steps instantly. Each step also has built-in help accessible directly from its edit dialog.

Next Steps

Quick Start Guide - Create your first ETL chain in 10 minutes
Core Concepts - Understand chainsets, chains, steps, and records
ETL Management - Manage chainsets and access control
ETL Designer - Design and test ETL chains
Examples - Complete real-world examples

Next: Core Concepts