Skip to content

[Feature]: Add Watermark Support for Event-Time Processing #197

@eneshoxha

Description

@eneshoxha

Summary

Implement watermark-based event-time processing to handle out-of-order events and enable accurate windowing based on event timestamps rather than processing time.

Problem Statement

Currently, Cortex.Streams uses processing time for all window operations (TumblingWindow, SlidingWindow, SessionWindow). This works well for real-time scenarios but fails when:

  • Out-of-order events: Events arrive late due to network delays, distributed system behavior, or replay scenarios
  • Historical data processing: Replaying events from Kafka/Pulsar with original timestamps
  • Accurate analytics: Business requirements demand event-time accuracy (e.g., "orders placed in the last hour" vs "orders received in the last hour")
  • Multi-source correlation: Joining streams from different sources with varying latencies

Current Behavior

// Current: Uses processing time (DateTime.UtcNow when event is processed)
.TumblingWindow(
    keySelector: e => e.SensorId,
    timestampSelector: e => e.EventTime,  // This is stored but window boundaries use processing time
    windowSize: TimeSpan.FromMinutes(1))

Impact

Without watermarks, late-arriving events are either:

  • Dropped silently (data loss)
  • Assigned to wrong windows (incorrect results)
  • Cannot be handled gracefully (no side output for late data)

Acceptance Criteria

  • Events are windowed based on event-time, not processing time
  • Out-of-order events within the watermark threshold are correctly assigned to windows
  • Late events (past watermark) can be handled via configuration
  • Watermarks propagate through the operator chain
  • Idle sources emit watermarks periodically
  • Backward compatible - existing streams without watermarks continue to work
  • Performance overhead is minimal (<5% for in-order streams)

Technical Considerations

  1. Watermark Propagation: Watermarks must flow through the operator chain. Each operator should track the minimum watermark of its inputs.

  2. Idle Source Handling: Sources with no events should still emit watermarks periodically to avoid blocking downstream windows.

  3. Memory Management: Window state should be cleaned up after watermark + allowed lateness passes.

  4. Multi-Source Streams: When joining streams, use the minimum watermark across all sources.

  5. Checkpointing Integration: Watermark state should be included in checkpoints.

References

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestfeatureThis label is in use for minor version increments

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions