Skip to content

Conversation

@rogercoll
Copy link
Contributor

@rogercoll rogercoll commented Jan 7, 2026

A channel of errors is provided to the tracer start up function, this is required to react over errors produced during the asynchronous execution of monitorPIDEventsMap. Controller shutdown function is run after an unrecoverable error to stop tracer routines.

Even the refactored fatal/panics are very unluckily to happen, we need to take into account that they affect the whole binary. Meaning that when running the profiler as part of a bigger binary like the Otel collector, a fatal will bring down all the other components.

Fixes #888

return func() {
if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &oldLimit); err != nil {
log.Fatalf("Failed to set old rlimit: %v", err)
log.Errorf("Failed to set old rlimit: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change the function signature to return a func() error instead, but given this function execution is normally deferred, it will make the code more verbose just to propagate the error.

Instead of Fatal changed to just log the error with Error severity to prevent exiting the whole binary, but open to propagate the error upstream if preferred.

cancelReporting()
if err := otlpGrpcConn.Close(); err != nil {
log.Fatalf("Stopping connection of OTLP client client failed: %v", err)
log.Errorf("Stopping connection of OTLP client client failed: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fatalf to Errorf -> graceful shutdown

Copy link
Member

@christos68k christos68k Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not shutdown the profiler, only the reporter runloop will stop but the other subsystems (e.g. tracer) will keep running.

@rogercoll rogercoll marked this pull request as ready for review January 13, 2026 18:31
@rogercoll rogercoll requested review from a team as code owners January 13, 2026 18:31
Copy link
Member

@christos68k christos68k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one pass, will go over it again

if err := impl.InterpreterOffsets.Update(unsafe.Pointer(&key), unsafe.Pointer(&value),
cebpf.UpdateAny); err != nil {
log.Fatalf("Failed to place interpreter range in map: %v", err)
return fmt.Errorf("Failed to place interpreter range in map: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're changing the semantics here as we're now allowing the profiler receiver to continue execution instead of triggering shutdown. I need to dig a bit more here to understand the context and side-effects (UpdateInterpreterOffsets is called on-demand by interpreter Loader functions).

mapVal, ok := maps[nameTag]
if !ok {
log.Fatalf("Map %v is not available", nameTag)
return nil, fmt.Errorf("Map %v is not available", nameTag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK as it will trigger orderly Controller shutdown.

cancelReporting()
if err := otlpGrpcConn.Close(); err != nil {
log.Fatalf("Stopping connection of OTLP client client failed: %v", err)
log.Errorf("Stopping connection of OTLP client client failed: %v", err)
Copy link
Member

@christos68k christos68k Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not shutdown the profiler, only the reporter runloop will stop but the other subsystems (e.g. tracer) will keep running.

log.Info("Attached sched monitor")

if err := startTraceHandling(ctx, trc); err != nil {
if err := c.startTraceHandling(ctx, trc); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to stop the reporter before returning (as AFAICT there's nothing else that will do that) and also the other subsystems started before this point, to avoid leaving errant goroutines running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reporter routine loop should exit once the context provided during the Start function is cancelled: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/main/reporter/collector_reporter.go#L59

After these changes https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/1051/files, the Shutdown should terminate the reporter loop. I will try to add a test case for this.

rogercoll and others added 5 commits January 14, 2026 14:43
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
@rogercoll rogercoll marked this pull request as draft January 15, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure graceful shutdown by removing log.Fatal and panic calls

2 participants