Introduction to ETL

What is ETL?

ETL (Extract, Transform, Load) is a powerful data processing module that enables you to build sophisticated data transformation pipelines. Unlike simple scripts or one-off database queries, the ETL module provides a visual, composable, and reusable framework for creating complex data workflows.

The Ambience ETL module allows you to: - Extract data from multiple sources (databases, files, APIs, etc.) - Transform and enrich data through sequential processing steps - Load results into databases, files, or other systems - Automate recurring data operations - Integrate with other Ambience modules (Scheduler, Workflows, Datasets, Dashboards)

Why Use ETL?

Reusable Transformation Logic

ETL chains are designed once and can be: - Executed on demand through the ETL Designer interface - Scheduled to run automatically via the Scheduler module - Called from workflows as part of business process automation - Used by datasets and dashboards for real-time data retrieval - Imported by other chainsets to share common logic

Visual Design & Maintenance

  • Build complex pipelines by composing individual steps
  • See the complete transformation flow at a glance
  • Test chains interactively with immediate feedback
  • Debug with step-by-step execution and logging
  • Document chains with descriptions and test inputs

Streaming Architecture

The ETL module uses a reactive streaming model: - Process large datasets efficiently without loading everything into memory - Handle backpressure automatically - Support both batch and real-time processing - Scale to millions of records

Integration & Orchestration

ETL chains excel at integrating disparate systems: - Connect databases, APIs, files, and cloud services - Orchestrate multi-step data workflows - Handle errors and retries gracefully - Coordinate with other Ambience modules

When to Use ETL vs Other Approaches

Use ETL When Use Other Approaches When
Multi-step data transformation required Simple CRUD operations suffice
Reusable logic needed across contexts One-off data manipulation
Integration between multiple systems Single system operation
Scheduled or automated processing Manual, ad-hoc queries
Complex business rules and validation Simple filtering or aggregation
Need to test and debug transformations Straightforward database operations

ETL + Workflows: Use workflows for state-dependent business processes (approvals, lifecycle management) and ETL for data transformation within those processes.

ETL + Scheduler: Use the scheduler to trigger ETL chains at specific times or intervals for automated data operations.

ETL + Datasets/Dashboards: Use ETL chains to provide on-demand data retrieval and transformation for interactive visualizations.

Common Use Cases

Data Migration & Synchronization

  • Migrate data from legacy systems to modern databases
  • Synchronize data between multiple systems
  • Transform data formats during migration
  • Validate and cleanse data during transfer
  • Handle incremental updates and deltas

Report Generation & Distribution

  • Extract data from operational databases
  • Aggregate and calculate metrics
  • Generate Excel, PDF, or other report formats
  • Email reports to recipients
  • Schedule daily, weekly, or monthly reports

Data Validation & Cleansing

  • Validate incoming data against business rules
  • Detect and flag data quality issues
  • Standardize formats (dates, phone numbers, addresses)
  • Remove duplicates and inconsistencies
  • Enrich data with reference information

API Integration & Orchestration

  • Call external REST APIs
  • Transform API responses to internal formats
  • Aggregate data from multiple APIs
  • Handle authentication and rate limiting
  • Retry failed requests with backoff

Log Processing & Monitoring

  • Parse log files from various sources
  • Extract metrics and error patterns
  • Aggregate statistics
  • Alert on anomalies or thresholds
  • Archive processed logs

Scheduled Maintenance & Cleanup

  • Archive old records
  • Delete expired data
  • Rebuild indexes
  • Generate backup snapshots
  • Clean up temporary files

Real-time Data Enrichment

  • Receive data from message queues or webhooks
  • Enrich with reference data from databases
  • Apply business rules and transformations
  • Route to appropriate destinations
  • Trigger downstream processes

Multi-source Data Aggregation

  • Extract from multiple databases
  • Join and merge related data
  • Resolve conflicts and duplicates
  • Calculate derived metrics
  • Load to data warehouse or analytics platform

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                      ETL Chainset                       │
│  (Container for related chains)                         │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │              Chain: "ProcessOrders"             │    │
│  │  (Sequence of transformation steps)             │    │
│  │                                                  │    │
│  │  Step 1: MongoDB Reader  ──→  [Records Stream]  │    │
│  │  Step 2: Filter Valid    ──→  [Filtered]        │    │
│  │  Step 3: Enrich Data     ──→  [Enriched]        │    │
│  │  Step 4: Calculate Total ──→  [Calculated]      │    │
│  │  Step 5: MongoDB Writer  ──→  [Stored]          │    │
│  └────────────────────────────────────────────────┘    │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │         Chain: "SendNotification"               │    │
│  │  (Called by other chains)                       │    │
│  │                                                  │    │
│  │  Step 1: Compose Email   ──→  [Email Doc]       │    │
│  │  Step 2: Send Mail       ──→  [Sent]            │    │
│  └────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

Key Components

Chainset: A container that holds related chains. Chainsets can import other chainsets to reuse common chains.

Chain: A sequence of steps that transform data. Chains can call other chains, enabling composition and reusability.

Step: An individual transformation operation (e.g., read from database, filter records, calculate values, write to file). A comprehensive suite of steps is available across a wide range of categories.

Record: The unit of data flowing through a chain. Records are typically JSON documents that get transformed as they pass through steps.

ETL Step Categories

ETL provides a comprehensive suite of steps organised into categories covering every common data transformation need:

  • Array - Array manipulation and operations
  • String - Text processing and formatting
  • MongoDB - Database operations
  • Structure - Control flow and chain composition
  • Number - Mathematical operations and calculations
  • Date/Time - Date and timestamp operations
  • File - File system operations
  • JSON - JSON parsing and manipulation
  • Mail - Email operations
  • REST - HTTP API calls
  • Validation - Data validation
  • And many more categories…

Finding the right step is easy: the Dictionary (accessible from the ETL Designer toolbar) lists every available step with its description and parameters. The Add Step dialog provides a search bar — type a keyword to filter steps instantly. Each step also has built-in help accessible directly from its edit dialog.

Next Steps