@EdJoPaTo

While my machine was under memory pressure and the OOM killer was active, it killed a number of sssd services too. sssd restarted them itself, but at some point failed to do so. It then terminated itself with exit code 1. Restart=on-abnormal does not treat a non-zero exit code as abnormal, so systemd never restarted it. As a result I could no longer connect to this machine via ssh and needed IT to restore access. (See the sssd log below.)
While the other sssd services have Restart=on-failure, the main one uses Restart=on-abnormal, which in this case left the system inaccessible for me. I would expect a system to recover from this, and with this change it should.
See also the systemd.service man page for the different Restart= settings.
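
To make the difference between the two settings concrete, here is a minimal sketch of the restart table from the systemd.service man page as a shell function. The function name `would_restart` and the cause labels (`clean-exit`, `unclean-exit`, `unclean-signal`, `timeout`, `watchdog`) are hypothetical names chosen for this illustration and are not systemd identifiers. It only models the table and ignores other factors such as start rate limiting.

```shell
# would_restart POLICY CAUSE
# Prints "yes" if a service with Restart=POLICY would be restarted after
# terminating for CAUSE, otherwise "no".
# CAUSE labels are illustrative: clean-exit, unclean-exit, unclean-signal,
# timeout, watchdog (the last one meaning the systemd service watchdog).
would_restart() {
    policy=$1
    cause=$2
    case "$policy:$cause" in
        always:*) echo yes ;;
        on-success:clean-exit) echo yes ;;
        on-failure:unclean-exit|on-failure:unclean-signal) echo yes ;;
        on-failure:timeout|on-failure:watchdog) echo yes ;;
        on-abnormal:unclean-signal|on-abnormal:timeout|on-abnormal:watchdog) echo yes ;;
        on-abort:unclean-signal) echo yes ;;
        on-watchdog:watchdog) echo yes ;;
        *) echo no ;;
    esac
}

# The case from this PR: the main process exited with status=1/FAILURE.
would_restart on-abnormal unclean-exit   # prints "no"
would_restart on-failure unclean-exit    # prints "yes"
```

An exit with status 1 falls under the unclean exit code row, which on-failure covers and on-abnormal does not.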

This PR is about the systemd service, which should restart sssd.
There are other issues about sssd ceasing to work, but those are about sssd itself, so I consider them related but different. They deal with root causes that do not apply here: my problem was the OOM killer and memory pressure, not sssd being unstable on its own. This change should also mitigate those issues, but it won't solve their underlying causes, so they should stay open.


As a short-term workaround, I added this systemd service override:

# /etc/systemd/system/sssd.service.d/override.conf
[Service]
Restart=on-failure
OOMScoreAdjust=-500

An OOMScoreAdjust set on this service is also inherited by the child services it starts. If this is something you would like for the service as well, I can create another PR for it. Personally I consider Restart=on-failure a fix and OOMScoreAdjust an addition, so it deserves a separate discussion.
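
One way to check that the adjustment really reaches the child processes is to read `oom_score_adj` from `/proc` for the main process and its direct children. The following is a minimal sketch, assuming a Linux system with `/proc` mounted and `pgrep` available. The function name `oom_adj_tree` is hypothetical.

```shell
# oom_adj_tree PID
# Prints "PID ADJ" for the given process and each of its direct children,
# where ADJ is the value of /proc/PID/oom_score_adj.
oom_adj_tree() {
    parent=$1
    # pgrep exits non-zero when there are no children; that is not an error here.
    children=$(pgrep -P "$parent" || true)
    for pid in "$parent" $children; do
        # A child may exit between pgrep and the read; skip it quietly.
        if adj=$(cat "/proc/$pid/oom_score_adj" 2>/dev/null); then
            printf '%s %s\n' "$pid" "$adj"
        fi
    done
}
```

With the override above in place, a call such as `oom_adj_tree "$(systemctl show --property=MainPID --value sssd.service)"` would be expected to list -500 for sssd and for each child it started.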


Existing restart conditions in this repo before this PR. Note that sssd.service is the only one using on-abnormal.

$ rg 'Restart='
src/sysv/systemd/sssd.service.in
26:Restart=on-abnormal

src/sysv/systemd/sssd-sudo.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-ifp.service.in
15:Restart=on-failure

src/sysv/systemd/sssd-ssh.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-autofs.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-pam.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-pac.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-nss.service.in
17:Restart=on-failure

Journal log of the situation (some things are redacted with ###).
The system rebooted on Jan 24, the OOM killer was active on Jan 26, and sssd was manually restarted on Jan 27.

$ sudo journalctl -fn200 --unit sssd.service
Jan 24 07:16:28 ### systemd[1]: Starting sssd.service - System Security Services Daemon...
Jan 24 07:16:28 ### sssd[1537]: Starting up
Jan 24 07:16:28 ### sssd_be[1746]: Starting up
Jan 24 07:16:29 ### sssd_pam[1800]: Starting up
Jan 24 07:16:29 ### sssd_nss[1799]: Starting up
Jan 24 07:16:29 ### sssd_pac[1801]: Starting up
Jan 24 07:16:29 ### systemd[1]: Started sssd.service - System Security Services Daemon.
Jan 26 17:49:50 ### sssd[1537]: Child [1746] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:00 ### sssd_be[921889]: Starting up
Jan 26 17:50:20 ### sssd[1537]: Child [1799] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:20 ### sssd[1537]: Child [1800] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:30 ### sssd[1537]: Child [1801] ('pac':'pac') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:31 ### sssd[1537]: Child [921889] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:45 ### sssd_be[921899]: Starting up
Jan 26 17:50:45 ### sssd_pac[921898]: Starting up
Jan 26 17:50:50 ### sssd_pac[921910]: Starting up
Jan 26 17:50:50 ### sssd_nss[921915]: Starting up
Jan 26 17:50:57 ### sssd_pam[921916]: Starting up
Jan 26 17:51:28 ### sssd[1537]: Child [921899] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:51:50 ### sssd_be[921922]: Starting up
Jan 26 17:51:51 ### sssd[1537]: Child [921915] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:04 ### sssd_nss[921926]: Starting up
Jan 26 17:52:19 ### sssd[1537]: Child [921916] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:23 ### sssd[1537]: Child [921922] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:34 ### sssd_pam[921929]: Starting up
Jan 26 17:52:34 ### sssd_be[921930]: Starting up
Jan 26 17:52:56 ### sssd_pam[921932]: Starting up
Jan 26 17:52:56 ### sssd_nss[921933]: Starting up
Jan 26 17:53:06 ### sssd[1537]: Child [921930] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:53:36 ### sssd_nss[921939]: Starting up
Jan 26 17:53:38 ### sssd[1537]: Exiting the SSSD. Could not restart critical service [nss].
Jan 26 17:53:38 ### sssd_be[921937]: Starting up
Jan 26 17:53:59 ### sssd_be[921937]: Shutting down (status = 0)
Jan 26 17:54:03 ### sssd_pac[921910]: Shutting down (status = 0)
Jan 26 17:54:08 ### systemd[1]: sssd.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 17:54:08 ### systemd[1]: sssd.service: Failed with result 'exit-code'.
Jan 26 17:54:09 ### systemd[1]: sssd.service: Consumed 11min 35.042s CPU time, 92.8M memory peak, 0B memory swap peak.
Jan 27 08:57:40 ### systemd[1]: Starting sssd.service - System Security Services Daemon...
Jan 27 08:57:40 ### sssd[1040051]: Starting up
Jan 27 08:57:40 ### sssd_be[1040052]: Starting up
Jan 27 08:57:41 ### sssd_nss[1040053]: Starting up
Jan 27 08:57:41 ### sssd_pam[1040054]: Starting up
Jan 27 08:57:41 ### sssd_pac[1040055]: Starting up
Jan 27 08:57:41 ### systemd[1]: Started sssd.service - System Security Services Daemon.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request changes the systemd restart policy for the main sssd service from on-abnormal to on-failure. The detailed description provides excellent justification for this change, highlighting a real-world scenario where an OOM event caused sssd to terminate with an exit code that on-abnormal does not handle, leading to system inaccessibility. Changing to on-failure ensures that sssd will be restarted on any failure, including non-zero exit codes, which significantly improves the service's resilience. This also aligns the main service's behavior with all other sssd sub-services, which already use on-failure. The change is correct, well-reasoned, and a clear improvement for system stability.

@Squiddim

Squiddim commented Jan 28, 2026

Duplicate of #8407, which I created a day earlier…

@EdJoPaTo
Author

ah lol, 15h ago, this wasn't there when I checked for other issues yesterday. Thank you, @Squiddim, for the same thing!
