Best Practices
ETL Design Principles
Keep Chains Focused and Single-Purpose
Each chain should have a clear, single purpose. Avoid creating monolithic chains that do too many unrelated things.
Good: - “ImportCustomers” - Imports customer data from external source - “ValidateCustomers” - Validates customer data quality - “EnrichCustomers” - Adds reference data to customers
Avoid: - “ProcessEverything” - Imports, validates, enriches, generates reports, sends emails
Benefits: - Easier to understand and maintain - Easier to test independently - Easier to reuse in different contexts - Easier to debug when issues occur
Use Meaningful Names
Choose descriptive names for chainsets, chains, and steps that clearly indicate their purpose.
Chainset Names: - ✅ “CustomerDataManagement” - ✅ “DailyReportGeneration” - ❌ “ETL1”, “MyChainset”, “Test”
Chain Names: - ✅ “ImportOrdersFromAPI” - ✅ “ValidateEmailAddresses” - ❌ “Chain1”, “Process”, “DoStuff”
Step Names: - ✅ “Filter Active Customers” - ✅ “Calculate Order Total” - ❌ “Filter”, “Step 3”, “Temp”
Break Complex Logic into Reusable Chains
When a chain becomes complex, break it into smaller chains connected with Chain Call steps.
Example:
Main Chain: "ProcessOrder"
1. MongoDB Reader (read order)
2. Chain Call "ValidateOrder"
3. Chain Call "EnrichOrderData"
4. Chain Call "CalculatePricing"
5. Chain Call "ApplyDiscounts"
6. MongoDB Writer (save processed order)
Supporting Chains:
- "ValidateOrder" - Check required fields, validate formats
- "EnrichOrderData" - Add customer info, product details
- "CalculatePricing" - Compute subtotals, taxes
- "ApplyDiscounts" - Apply promotional discounts
Benefits: - Each sub-chain can be tested independently - Sub-chains can be reused in other contexts - Main chain is easier to understand at a glance - Changes to one aspect don’t affect others
Use Typed Functions for Reusable Logic
When a chain is called from multiple places, make it a typed function using structure.FunctionSignature and structure.FunctionCall. This gives callers an explicit contract and avoids relying on implicit field name conventions.
How to create a typed function:
- Add
structure.FunctionSignatureas the first step of the chain.
- Parameters: the input field names callers must supply (one per line, up to 5)
- Returns: the output field names the chain will produce (one per line, up to 5)
- Name the chain with an
Fnprefix:Fn_validateEmail,Fn_lookupEmployee - Callers use
structure.FunctionCall, mapping their local fields toin0…in4and receiving results asout0…out4
Example:
Function chain: "Fn_lookupManager"
Step 1: structure.FunctionSignature
Parameters: employeeId
Returns: managerId, managerEmail
Step 2: mongodb.AggregationDefinition / string.Substitute / mongodb.Reader
Step 3: structure.AddFields
{ "managerId": "${_id}", "managerEmail": "${email}" }
Calling chain:
Step N: structure.FunctionCall
Function Name: Fn_lookupManager
In: employeeId → in0
Out: out0 → managerId, out1 → managerEmail
When to use typed functions vs plain Chain Calls: - Use structure.FunctionCall when the chain is reused from 2+ places, when you need explicit field remapping, or when you want a documented contract. - Use structure.ChainCall for one-off sub-chains where the full record flows through unchanged.
Use Chain Imports for Common Operations
Create a “CommonUtilities” or “SharedFunctions” chainset with frequently used chains, then import it into other chainsets.
Example:
CommonUtilities Chainset:
- "FormatAddress"
- "ValidateEmail"
- "CalculateTax"
- "SendNotificationEmail"
- "LogError"
OrderProcessing Chainset (imports CommonUtilities):
- Chains can call "FormatAddress", "ValidateEmail", etc.
CustomerManagement Chainset (imports CommonUtilities):
- Chains can call the same shared functions
Benefits: - Write once, use everywhere - Update in one place, affects all users - Consistent behavior across chainsets - Reduces duplication
Design for Testability
Always create test inputs for your chains. Include test cases for: - Happy path - Normal, valid data - Edge cases - Empty values, nulls, boundary conditions - Error cases - Invalid data that should be caught - Large datasets - Performance testing with many records
Example Test Inputs:
{
"name": "Valid Order",
"description": "Normal order with all required fields",
"testJSON": { "orderId": "ORD-001", "customerId": "C123", ... }
}
{
"name": "Missing Customer",
"description": "Order without customer ID - should fail validation",
"testJSON": { "orderId": "ORD-002", "items": [...] }
}
{
"name": "Large Order",
"description": "Order with 100 items - performance test",
"testJSON": { "orderId": "ORD-003", "items": [... 100 items ...] }
}
Handle Errors Explicitly
Don’t assume data is always valid or operations always succeed. Design chains to handle errors gracefully.
Validation Pattern:
1. Read data
2. Validate required fields → Filter invalid to error collection
3. Validate formats → Filter invalid to error collection
4. Validate business rules → Filter invalid to error collection
5. Process valid records
6. Write results
Error Logging Pattern:
1. Try operation
2. On error:
- Add error message to record
- Add timestamp
- Write to error collection
- Optionally send alert
Document Complex Chains
Add descriptions to chains explaining: - Purpose - What the chain does - Input - Expected input format - Output - What the chain produces - Dependencies - What it requires (databases, external APIs) - Side effects - What it modifies (databases, files, emails sent)
Use test inputs to document expected input formats with examples.
Performance Optimization
Understand Step Structure Patterns
Step structure patterns affect performance:
1 => N (Expansion): - Creates more records than input - Can significantly increase memory usage - Examples: Array Explode, String Split, MongoDB Reader
N => 1 (Reduction): - Reduces record count - May require buffering records in memory - Examples: Array Collect, Aggregation, Merge
Best Practice: Place N => 1 steps early and 1 => N steps late to minimize records in flight.
Filter Early
Apply filters as early as possible in the chain to reduce the number of records processed by subsequent steps.
Good:
1. MongoDB Reader (reads 10,000 records)
2. Filter Active Only (reduces to 1,000 records)
3. Enrich with Customer Data (processes 1,000)
4. Calculate Metrics (processes 1,000)
Avoid:
1. MongoDB Reader (reads 10,000 records)
2. Enrich with Customer Data (processes 10,000)
3. Calculate Metrics (processes 10,000)
4. Filter Active Only (reduces to 1,000)
Optimize Database Queries
When using MongoDB Reader or similar steps: - Use specific queries to limit results at the source - Create indexes on frequently queried fields - Use projections to retrieve only needed fields - Consider aggregation pipelines for complex queries
Example:
Instead of:
1. MongoDB Reader (read all customers)
2. Filter where status = "active"
Use:
1. MongoDB Reader with query: { status: "active" }
Batch vs Streaming Operations
Understand when to use batch operations vs streaming:
Streaming (default): - Records processed one at a time - Low memory usage - Good for large datasets - Examples: Most transformation steps
Batch: - Records collected and processed together - Higher memory usage - Required for aggregations, sorting - Examples: Aggregation, Sort, Distinct
Monitor Performance
Use debug logging and execution times to identify bottlenecks: - Enable debug logging on suspected slow steps - Check execution time in Results panel - Look for steps that process many records - Consider breaking slow chains into smaller chunks
Error Handling Strategies
Validation Steps
Use the dedicated validation steps to accumulate issues and check them all before deciding how to proceed:
1. Read data
2. validation.MandatoryFields
Fields: requiredField1, requiredField2
(adds an issue for each missing field)
3. filter.Compare: someValue > 0 → field: checkPositive
4. validation.AddValidationIssue
IsValid Field: checkPositive
Message: "someValue must be greater than zero"
5. validation.IsValid
Result Field: isValid
6. structure.ChainIfElse
Boolean Field: isValid
True Chain: ProcessValidRecord
False Chain: HandleValidationErrors
validation.AddValidationIssue appends a message to the validationIssues array only when its boolean field is false — the record flows through unchanged when the check passes. This lets you run all checks before branching, so you can surface all issues at once rather than stopping at the first failure.
Conditional Routing
For branching logic, use structure.ChainIf or structure.ChainIfElse with a boolean field set by a preceding compare step:
Main Chain:
1. Read orders
2. filter.Compare: amount > 1000 → field: isHighValue
3. structure.ChainIfElse
Boolean Field: isHighValue
True Chain: HighValueOrderProcess
False Chain: StandardOrderProcess
Use structure.ChainIf when only one branch needs special handling and the other should pass through unchanged:
2. filter.Compare: status == 'error' → field: hasError
3. structure.ChainIf
Boolean Field: hasError
Call If: true
Chain Name: HandleOrderError
(records without errors continue unchanged)
Error Logging
Create a standard error logging pattern:
Error Logging Chain:
1. structure.AddFields: { "errorTimestamp": (current timestamp), "errorSource": (chain name) }
3. MongoDB Writer: write to "errors" collection
4. Optionally: Chain Call "SendErrorAlert"
Use this chain whenever errors are detected.
Graceful Degradation
Design chains to continue processing even when some operations fail:
1. Read records
2. Try to enrich with external API
- On success: continue with enriched data
- On failure: continue with original data, log warning
3. Process records (works with or without enrichment)
Testing Strategies
Create Comprehensive Test Cases
For each chain, create test inputs covering:
- Happy Path - Valid data, expected flow
- Empty Input - No records
- Single Record - Boundary case
- Multiple Records - Normal volume
- Large Dataset - Performance test
- Invalid Data - Missing fields, wrong types
- Edge Cases - Nulls, empty strings, special characters
Test Incrementally
When building a chain: 1. Add one step 2. Create test input 3. Run and verify output 4. Add next step 5. Run and verify output 6. Repeat
Don’t wait until the entire chain is built to test.
Use Step Disable for Diagnostic Steps
When you add a diagnostic step (e.g. a mongodb.TeeWriter to capture intermediate records), disable it rather than deleting it when you no longer need it to run. A disabled step is skipped during execution but stays in the chain with its configuration intact — ready to be re-enabled at any time.
This is especially useful for: - Keeping a mongodb.TeeWriter attached to a chain for future debugging sessions - Temporarily bypassing a slow or broken step without losing its configuration - Retaining an alternative implementation alongside the active one
To disable a step: select it → “More Actions” → “Disable Step”.
Use Debug Logging
Enable debug logging to see: - Exact input to each step - Step parameters being used - Exact output from each step - Any warnings or errors
This helps verify each step is doing what you expect.
Test Error Paths
Don’t just test the happy path. Verify error handling: - What happens with invalid data? - Are errors logged correctly? - Do error records go to the right place? - Are error messages helpful?
Integration Testing
Test chains in context: - Test with real database connections - Test with actual external APIs - Test with realistic data volumes - Test scheduled execution
Chain Organization
Structure Chainsets by Domain
Organize chainsets around business domains or functional areas:
CustomerDataManagement
- ImportCustomers
- ValidateCustomers
- EnrichCustomers
- ExportCustomers
OrderProcessing
- ImportOrders
- ValidateOrders
- ProcessOrders
- GenerateInvoices
ReportGeneration
- DailySalesReport
- WeeklySummaryReport
- MonthlyAnalytics
Use Groups Effectively
Within a chainset, use groups to organize chains:
By Purpose: - Import - Validation - Processing - Export - Utilities
By Frequency: - Hourly - Daily - Weekly - On-Demand
By Domain: - Customers - Orders - Products - Inventory
Naming Conventions
Establish and follow naming conventions:
Chainsets: - PascalCase: “CustomerDataManagement” - Descriptive: Indicates domain or purpose
Chains: - PascalCase: “ImportCustomersFromAPI” - Verb-Noun format: “ValidateEmailAddresses” - Specific: Indicates what the chain does
Steps: - Descriptive phrases: “Filter Active Customers” - Action-oriented: “Calculate Order Total” - Clear purpose: “Write to Error Collection”
Manage Dependencies
When using chain imports: - Document which chainsets import which - Avoid circular dependencies - Keep imports shallow (avoid deep chains of imports) - Consider creating a dependency diagram for complex setups
Version Control Considerations
When using external version control: - Export chainsets as JSON files - Commit with meaningful messages - Tag releases - Document breaking changes - Consider separate branches for development/production
Security Best Practices
Role-Based Access Control
Use roles to control who can: - View chainsets (read access) - Modify chainsets (ownership) - Execute chains (via scheduler, workflows, datasets)
Workgroup Isolation
Use workgroups to: - Separate chainsets by department or team - Control visibility of sensitive data - Organize chainsets by project or client
Credential Management
Never hardcode credentials in chains: - ❌ Don’t: Hardcode passwords in step parameters - ✅ Do: Use Ambience Secrets module - ✅ Do: Use database connection pools with stored credentials - ✅ Do: Use environment-specific configuration
Sensitive Data Handling
Be careful with sensitive data: - Don’t log sensitive data in error messages - Use appropriate database security - Consider encryption for sensitive fields - Follow data retention policies - Comply with regulations (GDPR, etc.)
Common Patterns
Read-Transform-Write Pattern
The most common ETL pattern:
1. Read from source (MongoDB Reader, File Reader, REST API)
2. Transform data (various transformation steps)
3. Write to destination (MongoDB Writer, File Writer, Send Mail)
Incremental Processing Pattern
Process only new or changed records:
1. Read last processed timestamp from control collection
2. Query source for records since last timestamp
3. Process records
4. Update control collection with new timestamp
Error Handling Pattern
Separate valid and invalid records:
1. Read data
2. Validate
3. Filter valid → Process → Write to main collection
4. Filter invalid → Add error info → Write to error collection
Aggregation Pattern
Summarize data:
1. Read detail records
2. Group by key fields
3. Calculate aggregates (sum, count, average)
4. Write summary records
Enrichment Pattern
Add reference data:
1. Read main records
2. For each record:
- Look up reference data (Chain Call to lookup chain)
- Merge reference data into record
3. Write enriched records
Fan-Out Pattern
Process one record multiple ways:
Main Chain:
1. Read record
2. Chain Call "ProcessTypeA" (if type = A)
3. Chain Call "ProcessTypeB" (if type = B)
4. Chain Call "ProcessTypeC" (if type = C)
Save-Branch-Restore Pattern
Preserve the input record across a 1 => N / N => 1 round-trip using job state:
1. config.WriteState
Field: (empty — saves entire record)
Key: originalInput
2. array.Unwind
Field: items ← 1 => N
3. (process each item)
4. array.Collect
Field: processedItems ← N => 1
5. config.ReadState
Key: originalInput
Field: (empty — merges saved record back)
Use this whenever a 1 => N step would cause original fields to be unavailable after aggregation. See Core Concepts for detail on job state scope and caveats.
Notification Pattern
Alert on specific conditions:
1. Process records
2. Filter records meeting alert criteria
3. Compose notification message
4. Send email/notification
Troubleshooting Tips
Chain Doesn’t Run
- Check all required step parameters are filled in
- Verify database connections are valid
- Check role permissions
- Look for error messages in Results panel
Unexpected Results
- Enable debug logging on steps
- Check step parameters carefully
- Verify field names match your data
- Test with simpler input first
Performance Issues
- Check for N => 1 steps late in chain (move earlier)
- Check for 1 => N steps early in chain (move later)
- Optimize database queries
- Consider breaking into smaller chains
- Check for unnecessary transformations
Memory Issues
- Reduce record expansion (1 => N operations)
- Process in smaller batches
- Avoid collecting large arrays
- Check for memory leaks in custom steps
Debugging Workflow
- Isolate the problem - Which step is causing issues?
- Enable debug logging - See exact inputs/outputs
- Test with minimal data - Use single record test input
- Check parameters - Verify all settings are correct
- Test incrementally - Run to each step, verify output
- Check logs - Look for warnings or errors
- Ask for help - Consult documentation or colleagues
Summary
Following these best practices will help you: - ✅ Design maintainable and reusable ETL chains - ✅ Optimize performance for large datasets - ✅ Handle errors gracefully - ✅ Test thoroughly and debug effectively - ✅ Organize chainsets for team collaboration - ✅ Secure sensitive data appropriately
Remember: Start simple, test frequently, and iterate. Build complex solutions from simple, well-tested building blocks.