-
Notifications
You must be signed in to change notification settings - Fork 364
Remove panics/fatals #1047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Remove panics/fatals #1047
Conversation
| return func() { | ||
| if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &oldLimit); err != nil { | ||
| log.Fatalf("Failed to set old rlimit: %v", err) | ||
| log.Errorf("Failed to set old rlimit: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change the function signature to return a func() error instead, but given this function execution is normally deferred, it will make the code more verbose just to propagate the error.
Instead of Fatal changed to just log the error with Error severity to prevent exiting the whole binary, but open to propagate the error upstream if preferred.
| cancelReporting() | ||
| if err := otlpGrpcConn.Close(); err != nil { | ||
| log.Fatalf("Stopping connection of OTLP client client failed: %v", err) | ||
| log.Errorf("Stopping connection of OTLP client client failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fatalf to Errorf -> graceful shutdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not shutdown the profiler, only the reporter runloop will stop but the other subsystems (e.g. tracer) will keep running.
christos68k
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did one pass, will go over it again
| if err := impl.InterpreterOffsets.Update(unsafe.Pointer(&key), unsafe.Pointer(&value), | ||
| cebpf.UpdateAny); err != nil { | ||
| log.Fatalf("Failed to place interpreter range in map: %v", err) | ||
| return fmt.Errorf("Failed to place interpreter range in map: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're changing the semantics here as we're now allowing the profiler receiver to continue execution instead of triggering shutdown. I need to dig a bit more here to understand the context and side-effects (UpdateInterpreterOffsets is called on-demand by interpreter Loader functions).
| mapVal, ok := maps[nameTag] | ||
| if !ok { | ||
| log.Fatalf("Map %v is not available", nameTag) | ||
| return nil, fmt.Errorf("Map %v is not available", nameTag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK as it will trigger orderly Controller shutdown.
| cancelReporting() | ||
| if err := otlpGrpcConn.Close(); err != nil { | ||
| log.Fatalf("Stopping connection of OTLP client client failed: %v", err) | ||
| log.Errorf("Stopping connection of OTLP client client failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not shutdown the profiler, only the reporter runloop will stop but the other subsystems (e.g. tracer) will keep running.
| log.Info("Attached sched monitor") | ||
|
|
||
| if err := startTraceHandling(ctx, trc); err != nil { | ||
| if err := c.startTraceHandling(ctx, trc); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to stop the reporter before returning (as AFAICT there's nothing else that will do that) and also the other subsystems started before this point, to avoid leaving errant goroutines running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reporter routine loop should exit once the context provided during the Start function is cancelled: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/main/reporter/collector_reporter.go#L59
After these changes https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/1051/files, the Shutdown should terminate the reporter loop. I will try to add a test case for this.
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
A channel of errors is provided to the tracer start up function, this is required to react over errors produced during the asynchronous execution of
monitorPIDEventsMap. Controller shutdown function is run after an unrecoverable error to stop tracer routines.Even the refactored
fatal/panicsare very unluckily to happen, we need to take into account that they affect the whole binary. Meaning that when running the profiler as part of a bigger binary like the Otel collector, a fatal will bring down all the other components.Fixes #888