Scheduler Integration

Overview

The Scheduler module allows you to run ETL chains automatically at specified times or intervals. This is ideal for: - Recurring data processing tasks - Automated report generation - Data synchronization between systems - Maintenance and cleanup operations - Monitoring and alerting

Creating a Scheduled Job

Step 1: Create the ETL Chain

First, create and test your ETL chain in the ETL Designer:

Example Chain: "DailySalesReport"
  1. mongodb.AggregationDefinition / string.Substitute / mongodb.Reader
     (read yesterday's sales)
  2. Aggregate by product category
  3. Calculate totals and percentages
  4. Generate Excel report
  5. Send email with report

Test the chain thoroughly with test inputs before scheduling.

Step 2: Create a Scheduler Job

Navigate to the Scheduler module
Click Add Job
Configure the job:

Name: “Daily Sales Report Job”
Type: Select “ETL”
Chainset: Select your chainset
Chain: Select your chain
Schedule: Configure when to run (e.g., “Every day at 8:00 AM”)
Enabled: Check to activate

Step 3: Configure Job Parameters

If your ETL chain expects input parameters, configure them in the job:

Example: Pass date range to the chain

{
  "startDate": "${yesterday}",
  "endDate": "${today}",
  "reportType": "daily"
}

The scheduler will pass this JSON as the initial record to your chain.

Step 4: Monitor Execution

After the job runs: 1. Check job history in the Scheduler module 2. View execution status (success/failure) 3. See execution time and any error messages 4. Review ETL chain results in ETL Designer

Schedule Patterns

Fixed Time Schedules

Run at specific times: - Daily at 8:00 AM - Morning reports - Every Monday at 9:00 AM - Weekly summaries - First day of month at 6:00 AM - Monthly processing

Interval Schedules

Run at regular intervals: - Every 5 minutes - Real-time monitoring - Every hour - Hourly data synchronization - Every 6 hours - Periodic updates

Cron Expressions

Use cron syntax for complex schedules: - 0 8 * * 1-5 - Weekdays at 8:00 AM - 0 0 1 * * - First day of each month at midnight - */15 * * * * - Every 15 minutes

Common Patterns

Note: The chain examples below use ${yesterday}, ${lastHour}, etc. in aggregation strings. These are resolved by the string.Substitute step at runtime, meaning the record must already contain fields named yesterday, lastHour, etc. The scheduler job configuration should pass those values as parameters (e.g. "yesterday": "${yesterday}") so the scheduler resolves them before the chain starts. Any date values not in the scheduler variables table (such as 30DaysAgo) must be computed inside the chain using date.Today and date.Add-Subtract before the string.Substitute step.

Daily Report Generation

Schedule: Every day at 8:00 AM

ETL Chain: "GenerateDailyReport"
  1. mongodb.AggregationDefinition
     Pool: your-pool-name
     Collection: orders
     Aggregation: [{"$match": {"date": "${yesterday}"}}]

  2. string.Substitute
     Field: aggregation

  3. mongodb.Reader (reads previous day's data)
  
  4. Aggregate Sales by Category
     - Groups by product category
     - Calculates totals
  
  5. Generate Excel Report
     - Creates formatted spreadsheet
     - Includes charts and summaries
  
  6. Compose Email
     - Subject: "Daily Sales Report - ${yesterday}"
     - Body: Summary text
     - Attachment: Excel file
  
  7. Send Mail
     - Recipients: management team
     - Send report

Hourly Data Synchronization

Schedule: Every hour

ETL Chain: "SyncCustomerData"
  1. mongodb.AggregationDefinition
     Pool: your-pool-name
     Collection: customers
     Aggregation: [{"$match": {"lastModified": {"$gte": "${lastHour}"}}}]

  2. string.Substitute
     Field: aggregation

  3. mongodb.Reader (reads recently modified customers)
  
  4. Transform to External Format
     - Map fields to external system schema
     - Format dates and values
  
  5. REST API Call
     - POST to external system
     - Send customer updates
  
  6. Log Results
     - Write sync status to log collection
     - Record success/failure

Weekly Cleanup

Schedule: Every Sunday at 2:00 AM

ETL Chain: "CleanupOldRecords"
  1. mongodb.AggregationDefinition
     Pool: your-pool-name
     Collection: your-collection
     Aggregation: [{"$match": {"createdDate": {"$lt": "${30DaysAgo}"}}}]

  2. string.Substitute
     Field: aggregation

  3. mongodb.Reader (reads records older than 30 days)
  
  4. Archive to File
     - Export to JSON file
     - Store in archive location
  
  5. MongoDB Delete
     - Remove archived records from database
     - Free up space
  
  6. Send Summary Email
     - Report records archived
     - Report space freed

Monitoring and Alerting

Schedule: Every 5 minutes

ETL Chain: "MonitorSystemHealth"
  1. Check Database Connections
     - Test connectivity
     - Measure response time
  
  2. Check Disk Space
     - Query system metrics
     - Calculate usage percentages
  
  3. Check Error Logs
     - Read recent error entries
     - Count errors by type
  
  4. Filter Alerts
     - Identify issues requiring attention
     - Filter by severity
  
  5. Send Alert Email (if issues found)
     - Notify operations team
     - Include issue details

Passing Parameters to Scheduled Chains

Using Scheduler Variables

The scheduler resolves built-in variables when constructing the job parameters JSON. Use them in your scheduler job configuration:

Variable	Description	Example
`${today}`	Current date	“2026-03-03”
`${yesterday}`	Previous day	“2026-03-02”
`${lastHour}`	One hour ago timestamp	“2026-03-03T12:00:00Z”
`${lastWeek}`	Seven days ago	“2026-02-24”
`${lastMonth}`	One month ago	“2026-02-03”

These are scheduler-only placeholders — they are resolved before the initial record is passed to the ETL chain. Inside the chain, access the resolved values as ordinary record fields using string.Substitute. For example, if the job parameters include "reportDate": "${yesterday}", the chain receives reportDate = "2026-03-02" and can substitute it in an aggregation string as ${reportDate}.

For ETL chains that are not scheduler-triggered, compute dates using date.Today and date.Add-Subtract, or timestamp.Now.

Custom Parameters

Define custom parameters in the job configuration:

{
  "reportType": "sales",
  "region": "EMEA",
  "includeDetails": true,
  "emailRecipients": ["manager@example.com", "analyst@example.com"]
}

Access these in your ETL chain using field references.

Dynamic Date Ranges

Calculate date ranges dynamically:

{
  "startDate": "${firstDayOfMonth}",
  "endDate": "${lastDayOfMonth}",
  "reportPeriod": "monthly"
}

Error Handling in Scheduled Jobs

Chain-Level Error Handling

Design ETL chains to handle errors gracefully:

1. Try to process records
2. On error:
   - Log error details
   - Write failed records to error collection
   - Continue processing remaining records
3. Send summary email
   - Report success count
   - Report error count
   - Include error details if any

Job-Level Error Handling

When a scheduled job fails, it is not automatically restarted. Instead, the job owner and any users with the mod-production-manager privilege receive a notification.

Automatic restart is intentionally not supported: by the time a failure is detected, the job may be in an indeterminate state. Rollback is often not feasible — for example, emails that have been sent or MQTT messages that have been published cannot be recalled. Restarting from an unknown midpoint could cause duplicate side-effects or data corruption.

If a failed job needs to be re-run, it should be triggered manually after the root cause has been investigated and resolved.

Monitoring Job Health

Regularly review: - Job execution history - Success/failure rates - Execution duration trends - Error patterns

Set up alerts for: - Job failures - Unusually long execution times - Repeated errors - Jobs not running on schedule

Performance Considerations

Avoid Peak Times

Schedule resource-intensive jobs during off-peak hours: - Run large data processing at night - Avoid scheduling during business hours - Stagger multiple jobs to avoid conflicts

Optimize Chain Performance

For scheduled chains: - Filter data at the source (database queries) - Process only changed data (incremental processing) - Use efficient step patterns - Avoid unnecessary transformations

Incremental Processing

Instead of processing all data every time:

1. Read last processed timestamp from control collection
2. Query only records modified since last run
3. Process new/changed records
4. Update control collection with current timestamp

This is much more efficient than full reprocessing.

Parallel Processing

For very large datasets, consider: - Breaking into multiple smaller jobs - Processing different data segments in parallel - Using separate chainsets to enable concurrent execution

Best Practices

Test Before Scheduling

Create and test the ETL chain manually
Test with realistic data volumes
Test error scenarios
Verify email notifications work
Only then schedule the job

Start with Disabled Jobs

When creating a new scheduled job: 1. Create it in disabled state 2. Verify the schedule is correct 3. Test manually first 4. Enable only when confident

Monitor Initially

When first scheduling a job: - Monitor the first few executions closely - Check results are as expected - Verify timing is appropriate - Adjust schedule if needed

Document Job Purpose

For each scheduled job, document: - What it does - Why it runs at that schedule - Who depends on it - What to do if it fails - Contact person for issues

Use Meaningful Names

Name jobs clearly: - ✅ “Daily Sales Report - 8AM” - ✅ “Hourly Customer Sync to CRM” - ❌ “Job1”, “Test”, “ETL”

Set Up Alerts

Configure alerts for: - Job failures - Jobs taking too long - Jobs not producing expected results - Critical jobs not running

Review Regularly

Periodically review scheduled jobs: - Are they still needed? - Is the schedule still appropriate? - Are they performing well? - Can they be optimized? - Should they be consolidated?

Troubleshooting

Job Doesn’t Run

Check: - Job is enabled - Schedule is correct (check timezone) - Chainset is enabled - Chain exists and is valid - User has appropriate privileges

Job Fails

Check: - ETL chain results for error messages - Database connections are valid - External APIs are accessible - Required data exists - Parameters are correct

Job Runs Too Long

Optimize: - Add filters to reduce data volume - Use incremental processing - Optimize database queries - Remove unnecessary steps - Consider breaking into smaller jobs

Unexpected Results

Verify: - Input parameters are correct - Date variables are as expected - Chain logic is correct - Test chain manually with same parameters - Check for data changes

Example: Complete Scheduled Report

Here’s a complete example of a scheduled daily report:

Scheduler Job Configuration:

Name: Daily Sales Report
Type: ETL
Chainset: ReportGeneration
Chain: DailySalesReport
Schedule: Every day at 8:00 AM
Enabled: Yes
Parameters:
{
  "reportDate": "${yesterday}",
  "recipients": ["sales@example.com", "management@example.com"]
}

ETL Chain: “DailySalesReport”

1. JSON Record (get parameters)
   - Receives: { reportDate: "2026-03-02", recipients: [...] }

2. mongodb.AggregationDefinition
   Pool: your-pool-name
   Collection: orders
   Aggregation: [{"$match": {"orderDate": "${reportDate}", "status": "completed"}}]

3. string.Substitute
   Field: aggregation

4. mongodb.Reader

5. Aggregate by Product
   - Group by: productCategory
   - Sum: orderTotal
   - Count: orderCount

6. Sort by Total Descending
   - Sort field: orderTotal
   - Order: descending

7. Generate Excel
   - Template: sales_report_template.xlsx
   - Sheet: Daily Sales
   - Include charts

8. Compose Email
   - To: ${recipients}
   - Subject: "Daily Sales Report - ${reportDate}"
   - Body: "Please find attached the daily sales report."
   - Attachment: generated Excel file

9. Send Mail
   - Send the composed email

10. timestamp.Now
    - To Field: now

11. MongoDB Writer (log completion)
    - Pool: your-pool-name
    - Collection: report_log
    - Document: { reportType: "daily_sales", date: "${reportDate}", 
                  status: "sent", timestamp: "${now}" }

This chain runs automatically every morning, generates the report, and emails it to the sales and management teams.

Next Steps

Workflow Integration - Use ETL in business processes
Dataset Integration - Provide data for visualizations
Examples - More complete examples
Best Practices - Design effective ETL chains

Next: Workflow Integration