Best Practices

ETL Design Principles

Keep Chains Focused and Single-Purpose

Each chain should have a clear, single purpose. Avoid creating monolithic chains that do too many unrelated things.

Good: - “ImportCustomers” - Imports customer data from external source - “ValidateCustomers” - Validates customer data quality - “EnrichCustomers” - Adds reference data to customers

Avoid: - “ProcessEverything” - Imports, validates, enriches, generates reports, sends emails

Benefits: - Easier to understand and maintain - Easier to test independently - Easier to reuse in different contexts - Easier to debug when issues occur

Use Meaningful Names

Choose descriptive names for chainsets, chains, and steps that clearly indicate their purpose.

Chainset Names: - ✅ “CustomerDataManagement” - ✅ “DailyReportGeneration” - ❌ “ETL1”, “MyChainset”, “Test”

Chain Names: - ✅ “ImportOrdersFromAPI” - ✅ “ValidateEmailAddresses” - ❌ “Chain1”, “Process”, “DoStuff”

Step Names: - ✅ “Filter Active Customers” - ✅ “Calculate Order Total” - ❌ “Filter”, “Step 3”, “Temp”

Break Complex Logic into Reusable Chains

When a chain becomes complex, break it into smaller chains connected with Chain Call steps.

Example:

Main Chain: "ProcessOrder"
  1. MongoDB Reader (read order)
  2. Chain Call "ValidateOrder"
  3. Chain Call "EnrichOrderData"
  4. Chain Call "CalculatePricing"
  5. Chain Call "ApplyDiscounts"
  6. MongoDB Writer (save processed order)

Supporting Chains:
  - "ValidateOrder" - Check required fields, validate formats
  - "EnrichOrderData" - Add customer info, product details
  - "CalculatePricing" - Compute subtotals, taxes
  - "ApplyDiscounts" - Apply promotional discounts

Benefits: - Each sub-chain can be tested independently - Sub-chains can be reused in other contexts - Main chain is easier to understand at a glance - Changes to one aspect don’t affect others

Use Typed Functions for Reusable Logic

When a chain is called from multiple places, make it a typed function using structure.FunctionSignature and structure.FunctionCall. This gives callers an explicit contract and avoids relying on implicit field name conventions.

How to create a typed function:

  1. Add structure.FunctionSignature as the first step of the chain.
  • Parameters: the input field names callers must supply (one per line, up to 5)
  • Returns: the output field names the chain will produce (one per line, up to 5)
  1. Name the chain with an Fn prefix: Fn_validateEmail, Fn_lookupEmployee
  2. Callers use structure.FunctionCall, mapping their local fields to in0in4 and receiving results as out0out4

Example:

Function chain: "Fn_lookupManager"
  Step 1: structure.FunctionSignature
    Parameters: employeeId
    Returns: managerId, managerEmail

  Step 2: mongodb.AggregationDefinition / string.Substitute / mongodb.Reader

  Step 3: structure.AddFields
    { "managerId": "${_id}", "managerEmail": "${email}" }

Calling chain:
  Step N: structure.FunctionCall
    Function Name: Fn_lookupManager
    In: employeeId → in0
    Out: out0 → managerId, out1 → managerEmail

When to use typed functions vs plain Chain Calls: - Use structure.FunctionCall when the chain is reused from 2+ places, when you need explicit field remapping, or when you want a documented contract. - Use structure.ChainCall for one-off sub-chains where the full record flows through unchanged.

Use Chain Imports for Common Operations

Create a “CommonUtilities” or “SharedFunctions” chainset with frequently used chains, then import it into other chainsets.

Example:

CommonUtilities Chainset:
  - "FormatAddress"
  - "ValidateEmail"
  - "CalculateTax"
  - "SendNotificationEmail"
  - "LogError"

OrderProcessing Chainset (imports CommonUtilities):
  - Chains can call "FormatAddress", "ValidateEmail", etc.
  
CustomerManagement Chainset (imports CommonUtilities):
  - Chains can call the same shared functions

Benefits: - Write once, use everywhere - Update in one place, affects all users - Consistent behavior across chainsets - Reduces duplication

Design for Testability

Always create test inputs for your chains. Include test cases for: - Happy path - Normal, valid data - Edge cases - Empty values, nulls, boundary conditions - Error cases - Invalid data that should be caught - Large datasets - Performance testing with many records

Example Test Inputs:

{
  "name": "Valid Order",
  "description": "Normal order with all required fields",
  "testJSON": { "orderId": "ORD-001", "customerId": "C123", ... }
}

{
  "name": "Missing Customer",
  "description": "Order without customer ID - should fail validation",
  "testJSON": { "orderId": "ORD-002", "items": [...] }
}

{
  "name": "Large Order",
  "description": "Order with 100 items - performance test",
  "testJSON": { "orderId": "ORD-003", "items": [... 100 items ...] }
}

Handle Errors Explicitly

Don’t assume data is always valid or operations always succeed. Design chains to handle errors gracefully.

Validation Pattern:

1. Read data
2. Validate required fields → Filter invalid to error collection
3. Validate formats → Filter invalid to error collection
4. Validate business rules → Filter invalid to error collection
5. Process valid records
6. Write results

Error Logging Pattern:

1. Try operation
2. On error:
   - Add error message to record
   - Add timestamp
   - Write to error collection
   - Optionally send alert

Document Complex Chains

Add descriptions to chains explaining: - Purpose - What the chain does - Input - Expected input format - Output - What the chain produces - Dependencies - What it requires (databases, external APIs) - Side effects - What it modifies (databases, files, emails sent)

Use test inputs to document expected input formats with examples.

Performance Optimization

Understand Step Structure Patterns

Step structure patterns affect performance:

1 => N (Expansion): - Creates more records than input - Can significantly increase memory usage - Examples: Array Explode, String Split, MongoDB Reader

N => 1 (Reduction): - Reduces record count - May require buffering records in memory - Examples: Array Collect, Aggregation, Merge

Best Practice: Place N => 1 steps early and 1 => N steps late to minimize records in flight.

Filter Early

Apply filters as early as possible in the chain to reduce the number of records processed by subsequent steps.

Good:

1. MongoDB Reader (reads 10,000 records)
2. Filter Active Only (reduces to 1,000 records)
3. Enrich with Customer Data (processes 1,000)
4. Calculate Metrics (processes 1,000)

Avoid:

1. MongoDB Reader (reads 10,000 records)
2. Enrich with Customer Data (processes 10,000)
3. Calculate Metrics (processes 10,000)
4. Filter Active Only (reduces to 1,000)

Optimize Database Queries

When using MongoDB Reader or similar steps: - Use specific queries to limit results at the source - Create indexes on frequently queried fields - Use projections to retrieve only needed fields - Consider aggregation pipelines for complex queries

Example:

Instead of:
  1. MongoDB Reader (read all customers)
  2. Filter where status = "active"
  
Use:
  1. MongoDB Reader with query: { status: "active" }

Batch vs Streaming Operations

Understand when to use batch operations vs streaming:

Streaming (default): - Records processed one at a time - Low memory usage - Good for large datasets - Examples: Most transformation steps

Batch: - Records collected and processed together - Higher memory usage - Required for aggregations, sorting - Examples: Aggregation, Sort, Distinct

Monitor Performance

Use debug logging and execution times to identify bottlenecks: - Enable debug logging on suspected slow steps - Check execution time in Results panel - Look for steps that process many records - Consider breaking slow chains into smaller chunks

Error Handling Strategies

Validation Steps

Use the dedicated validation steps to accumulate issues and check them all before deciding how to proceed:

1. Read data

2. validation.MandatoryFields
   Fields: requiredField1, requiredField2
   (adds an issue for each missing field)

3. filter.Compare: someValue > 0 → field: checkPositive
4. validation.AddValidationIssue
   IsValid Field: checkPositive
   Message: "someValue must be greater than zero"

5. validation.IsValid
   Result Field: isValid

6. structure.ChainIfElse
   Boolean Field: isValid
   True Chain: ProcessValidRecord
   False Chain: HandleValidationErrors

validation.AddValidationIssue appends a message to the validationIssues array only when its boolean field is false — the record flows through unchanged when the check passes. This lets you run all checks before branching, so you can surface all issues at once rather than stopping at the first failure.

Conditional Routing

For branching logic, use structure.ChainIf or structure.ChainIfElse with a boolean field set by a preceding compare step:

Main Chain:
  1. Read orders
  2. filter.Compare: amount > 1000 → field: isHighValue
  3. structure.ChainIfElse
       Boolean Field: isHighValue
       True Chain: HighValueOrderProcess
       False Chain: StandardOrderProcess

Use structure.ChainIf when only one branch needs special handling and the other should pass through unchanged:

  2. filter.Compare: status == 'error' → field: hasError
  3. structure.ChainIf
       Boolean Field: hasError
       Call If: true
       Chain Name: HandleOrderError
     (records without errors continue unchanged)

Error Logging

Create a standard error logging pattern:

Error Logging Chain:
  1. structure.AddFields: { "errorTimestamp": (current timestamp), "errorSource": (chain name) }
  3. MongoDB Writer: write to "errors" collection
  4. Optionally: Chain Call "SendErrorAlert"

Use this chain whenever errors are detected.

Graceful Degradation

Design chains to continue processing even when some operations fail:

1. Read records
2. Try to enrich with external API
   - On success: continue with enriched data
   - On failure: continue with original data, log warning
3. Process records (works with or without enrichment)

Testing Strategies

Create Comprehensive Test Cases

For each chain, create test inputs covering:

  1. Happy Path - Valid data, expected flow
  2. Empty Input - No records
  3. Single Record - Boundary case
  4. Multiple Records - Normal volume
  5. Large Dataset - Performance test
  6. Invalid Data - Missing fields, wrong types
  7. Edge Cases - Nulls, empty strings, special characters

Test Incrementally

When building a chain: 1. Add one step 2. Create test input 3. Run and verify output 4. Add next step 5. Run and verify output 6. Repeat

Don’t wait until the entire chain is built to test.

Use Step Disable for Diagnostic Steps

When you add a diagnostic step (e.g. a mongodb.TeeWriter to capture intermediate records), disable it rather than deleting it when you no longer need it to run. A disabled step is skipped during execution but stays in the chain with its configuration intact — ready to be re-enabled at any time.

This is especially useful for: - Keeping a mongodb.TeeWriter attached to a chain for future debugging sessions - Temporarily bypassing a slow or broken step without losing its configuration - Retaining an alternative implementation alongside the active one

To disable a step: select it → “More Actions” → “Disable Step”.

Use Debug Logging

Enable debug logging to see: - Exact input to each step - Step parameters being used - Exact output from each step - Any warnings or errors

This helps verify each step is doing what you expect.

Test Error Paths

Don’t just test the happy path. Verify error handling: - What happens with invalid data? - Are errors logged correctly? - Do error records go to the right place? - Are error messages helpful?

Integration Testing

Test chains in context: - Test with real database connections - Test with actual external APIs - Test with realistic data volumes - Test scheduled execution

Chain Organization

Structure Chainsets by Domain

Organize chainsets around business domains or functional areas:

CustomerDataManagement
  - ImportCustomers
  - ValidateCustomers
  - EnrichCustomers
  - ExportCustomers

OrderProcessing
  - ImportOrders
  - ValidateOrders
  - ProcessOrders
  - GenerateInvoices

ReportGeneration
  - DailySalesReport
  - WeeklySummaryReport
  - MonthlyAnalytics

Use Groups Effectively

Within a chainset, use groups to organize chains:

By Purpose: - Import - Validation - Processing - Export - Utilities

By Frequency: - Hourly - Daily - Weekly - On-Demand

By Domain: - Customers - Orders - Products - Inventory

Naming Conventions

Establish and follow naming conventions:

Chainsets: - PascalCase: “CustomerDataManagement” - Descriptive: Indicates domain or purpose

Chains: - PascalCase: “ImportCustomersFromAPI” - Verb-Noun format: “ValidateEmailAddresses” - Specific: Indicates what the chain does

Steps: - Descriptive phrases: “Filter Active Customers” - Action-oriented: “Calculate Order Total” - Clear purpose: “Write to Error Collection”

Manage Dependencies

When using chain imports: - Document which chainsets import which - Avoid circular dependencies - Keep imports shallow (avoid deep chains of imports) - Consider creating a dependency diagram for complex setups

Version Control Considerations

When using external version control: - Export chainsets as JSON files - Commit with meaningful messages - Tag releases - Document breaking changes - Consider separate branches for development/production

Security Best Practices

Role-Based Access Control

Use roles to control who can: - View chainsets (read access) - Modify chainsets (ownership) - Execute chains (via scheduler, workflows, datasets)

Workgroup Isolation

Use workgroups to: - Separate chainsets by department or team - Control visibility of sensitive data - Organize chainsets by project or client

Credential Management

Never hardcode credentials in chains: - ❌ Don’t: Hardcode passwords in step parameters - ✅ Do: Use Ambience Secrets module - ✅ Do: Use database connection pools with stored credentials - ✅ Do: Use environment-specific configuration

Sensitive Data Handling

Be careful with sensitive data: - Don’t log sensitive data in error messages - Use appropriate database security - Consider encryption for sensitive fields - Follow data retention policies - Comply with regulations (GDPR, etc.)

Common Patterns

Read-Transform-Write Pattern

The most common ETL pattern:

1. Read from source (MongoDB Reader, File Reader, REST API)
2. Transform data (various transformation steps)
3. Write to destination (MongoDB Writer, File Writer, Send Mail)

Incremental Processing Pattern

Process only new or changed records:

1. Read last processed timestamp from control collection
2. Query source for records since last timestamp
3. Process records
4. Update control collection with new timestamp

Error Handling Pattern

Separate valid and invalid records:

1. Read data
2. Validate
3. Filter valid → Process → Write to main collection
4. Filter invalid → Add error info → Write to error collection

Aggregation Pattern

Summarize data:

1. Read detail records
2. Group by key fields
3. Calculate aggregates (sum, count, average)
4. Write summary records

Enrichment Pattern

Add reference data:

1. Read main records
2. For each record:
   - Look up reference data (Chain Call to lookup chain)
   - Merge reference data into record
3. Write enriched records

Fan-Out Pattern

Process one record multiple ways:

Main Chain:
  1. Read record
  2. Chain Call "ProcessTypeA" (if type = A)
  3. Chain Call "ProcessTypeB" (if type = B)
  4. Chain Call "ProcessTypeC" (if type = C)

Save-Branch-Restore Pattern

Preserve the input record across a 1 => N / N => 1 round-trip using job state:

1. config.WriteState
   Field: (empty — saves entire record)
   Key: originalInput

2. array.Unwind
   Field: items           ← 1 => N

3. (process each item)

4. array.Collect
   Field: processedItems  ← N => 1

5. config.ReadState
   Key: originalInput
   Field: (empty — merges saved record back)

Use this whenever a 1 => N step would cause original fields to be unavailable after aggregation. See Core Concepts for detail on job state scope and caveats.

Notification Pattern

Alert on specific conditions:

1. Process records
2. Filter records meeting alert criteria
3. Compose notification message
4. Send email/notification

Troubleshooting Tips

Chain Doesn’t Run

  • Check all required step parameters are filled in
  • Verify database connections are valid
  • Check role permissions
  • Look for error messages in Results panel

Unexpected Results

  • Enable debug logging on steps
  • Check step parameters carefully
  • Verify field names match your data
  • Test with simpler input first

Performance Issues

  • Check for N => 1 steps late in chain (move earlier)
  • Check for 1 => N steps early in chain (move later)
  • Optimize database queries
  • Consider breaking into smaller chains
  • Check for unnecessary transformations

Memory Issues

  • Reduce record expansion (1 => N operations)
  • Process in smaller batches
  • Avoid collecting large arrays
  • Check for memory leaks in custom steps

Debugging Workflow

  1. Isolate the problem - Which step is causing issues?
  2. Enable debug logging - See exact inputs/outputs
  3. Test with minimal data - Use single record test input
  4. Check parameters - Verify all settings are correct
  5. Test incrementally - Run to each step, verify output
  6. Check logs - Look for warnings or errors
  7. Ask for help - Consult documentation or colleagues

Summary

Following these best practices will help you: - ✅ Design maintainable and reusable ETL chains - ✅ Optimize performance for large datasets - ✅ Handle errors gracefully - ✅ Test thoroughly and debug effectively - ✅ Organize chainsets for team collaboration - ✅ Secure sensitive data appropriately

Remember: Start simple, test frequently, and iterate. Build complex solutions from simple, well-tested building blocks.