systemd: Restart=on-failure #8408

EdJoPaTo · 2026-01-28T09:35:44Z

While my machine was under memory pressure, the OOM killer killed a bunch of sssd services too. sssd restarted them itself, but at some point failed to do so and terminated with exit code 1. systemd does not treat a non-zero exit code as abnormal, so it never restarted sssd. As a result I could no longer connect to this machine via ssh and needed IT to regain access. (See the sssd log below.)
While the other sssd services use Restart=on-failure, the main one uses Restart=on-abnormal, which in this case left the system inaccessible for me. I would expect a system to recover from this, which this change should achieve.
Also see the man pages about the different systemd service Restart= settings.
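
As a rough illustration of why the two settings behave differently here, the restart table from systemd.service(5) can be modelled in a few lines of Python. This is only a sketch of the documented table, not systemd's implementation: the names `RESTART_TABLE` and `would_restart` and the cause labels are made up for this example, and it ignores SuccessExitStatus=, RestartPreventExitStatus=, RestartForceExitStatus= and start rate limiting.

```python
# Simplified model of the Restart= table in systemd.service(5).
# Keys are exit causes; values are the Restart= settings that trigger a restart.
RESTART_TABLE = {
    "clean-exit": {"always", "on-success"},
    "unclean-exit-code": {"always", "on-failure"},
    "unclean-signal": {"always", "on-failure", "on-abnormal", "on-abort"},
    "timeout": {"always", "on-failure", "on-abnormal"},
    "watchdog": {"always", "on-failure", "on-abnormal", "on-watchdog"},
}


def would_restart(policy: str, cause: str) -> bool:
    """Return True if a unit with Restart=<policy> is restarted after <cause>."""
    return policy in RESTART_TABLE[cause]


# sssd exited with status 1, i.e. an unclean exit code.
for policy in ("on-abnormal", "on-failure"):
    print(policy, would_restart(policy, "unclean-exit-code"))
# prints:
# on-abnormal False
# on-failure True
```

Both settings cover signals, timeouts and watchdog events; the only row where they differ is the unclean exit code, which is exactly the case I hit.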

This PR is about the systemd service which should restart sssd.
There are other issues about sssd ceasing to work, but those concern sssd itself, so I think they are related but different. Those issues are about root causes that do not apply here: my problem was the OOM killer and memory pressure, not sssd being unstable on its own. This change should also mitigate the other issues but won’t solve their underlying cause, so they should stay open.

As a short-term workaround, this systemd service override was added:

# /etc/systemd/system/sssd.service.d/override.conf
[Service]
Restart=on-failure
OOMScoreAdjust=-500

OOMScoreAdjust set on this service also propagates to the child processes it starts. If you would like this for the service as well, I can create another PR for it. Personally, I think Restart=on-failure is a fix while OOMScoreAdjust is an addition, so it deserves a separate discussion.
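
One way to check that the adjustment actually reached the child processes is to read `/proc/<pid>/oom_score_adj` for each of them. The following is a minimal sketch, assuming a Linux procfs; the helper names `read_oom_score_adj` and `oom_scores` and the `proc_root` parameter are hypothetical and exist only for this example.

```python
from pathlib import Path


def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the oom_score_adj value the kernel reports for one process."""
    text = Path(proc_root, str(pid), "oom_score_adj").read_text()
    return int(text.strip())


def oom_scores(pids, proc_root="/proc"):
    """Map each PID to its oom_score_adj value."""
    return {pid: read_oom_score_adj(pid, proc_root) for pid in pids}


# Usage idea: pass the PID of the sssd monitor and the PIDs of its children
# (for example as shown by `systemctl status sssd.service`) and compare the
# values against the OOMScoreAdjust= setting from the override.
```

The value is inherited across fork, so the children started by the sssd monitor should report the same number as the monitor itself.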

Existing restart conditions in this repo before this PR. Note that sssd.service is the only one using on-abnormal.

$ rg 'Restart='
src/sysv/systemd/sssd.service.in
26:Restart=on-abnormal

src/sysv/systemd/sssd-sudo.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-ifp.service.in
15:Restart=on-failure

src/sysv/systemd/sssd-ssh.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-autofs.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-pam.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-pac.service.in
17:Restart=on-failure

src/sysv/systemd/sssd-nss.service.in
17:Restart=on-failure

Log of the situation from the journal (some things are censored with ###).
The system rebooted on Jan 24, the OOM killer went around on Jan 26, and sssd was manually restarted on Jan 27.

$ sudo journalctl -fn200 --unit sssd.service
Jan 24 07:16:28 ### systemd[1]: Starting sssd.service - System Security Services Daemon...
Jan 24 07:16:28 ### sssd[1537]: Starting up
Jan 24 07:16:28 ### sssd_be[1746]: Starting up
Jan 24 07:16:29 ### sssd_pam[1800]: Starting up
Jan 24 07:16:29 ### sssd_nss[1799]: Starting up
Jan 24 07:16:29 ### sssd_pac[1801]: Starting up
Jan 24 07:16:29 ### systemd[1]: Started sssd.service - System Security Services Daemon.
Jan 26 17:49:50 ### sssd[1537]: Child [1746] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:00 ### sssd_be[921889]: Starting up
Jan 26 17:50:20 ### sssd[1537]: Child [1799] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:20 ### sssd[1537]: Child [1800] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:30 ### sssd[1537]: Child [1801] ('pac':'pac') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:31 ### sssd[1537]: Child [921889] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:50:45 ### sssd_be[921899]: Starting up
Jan 26 17:50:45 ### sssd_pac[921898]: Starting up
Jan 26 17:50:50 ### sssd_pac[921910]: Starting up
Jan 26 17:50:50 ### sssd_nss[921915]: Starting up
Jan 26 17:50:57 ### sssd_pam[921916]: Starting up
Jan 26 17:51:28 ### sssd[1537]: Child [921899] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:51:50 ### sssd_be[921922]: Starting up
Jan 26 17:51:51 ### sssd[1537]: Child [921915] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:04 ### sssd_nss[921926]: Starting up
Jan 26 17:52:19 ### sssd[1537]: Child [921916] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:23 ### sssd[1537]: Child [921922] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:52:34 ### sssd_pam[921929]: Starting up
Jan 26 17:52:34 ### sssd_be[921930]: Starting up
Jan 26 17:52:56 ### sssd_pam[921932]: Starting up
Jan 26 17:52:56 ### sssd_nss[921933]: Starting up
Jan 26 17:53:06 ### sssd[1537]: Child [921930] ('###':'###') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Jan 26 17:53:36 ### sssd_nss[921939]: Starting up
Jan 26 17:53:38 ### sssd[1537]: Exiting the SSSD. Could not restart critical service [nss].
Jan 26 17:53:38 ### sssd_be[921937]: Starting up
Jan 26 17:53:59 ### sssd_be[921937]: Shutting down (status = 0)
Jan 26 17:54:03 ### sssd_pac[921910]: Shutting down (status = 0)
Jan 26 17:54:08 ### systemd[1]: sssd.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 17:54:08 ### systemd[1]: sssd.service: Failed with result 'exit-code'.
Jan 26 17:54:09 ### systemd[1]: sssd.service: Consumed 11min 35.042s CPU time, 92.8M memory peak, 0B memory swap peak.
Jan 27 08:57:40 ### systemd[1]: Starting sssd.service - System Security Services Daemon...
Jan 27 08:57:40 ### sssd[1040051]: Starting up
Jan 27 08:57:40 ### sssd_be[1040052]: Starting up
Jan 27 08:57:41 ### sssd_nss[1040053]: Starting up
Jan 27 08:57:41 ### sssd_pam[1040054]: Starting up
Jan 27 08:57:41 ### sssd_pac[1040055]: Starting up
Jan 27 08:57:41 ### systemd[1]: Started sssd.service - System Security Services Daemon.
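
The decisive entry in this log is the `Main process exited, code=exited, status=1/FAILURE` line: a plain non-zero exit code rather than a signal, timeout or watchdog event. As a small sketch, such a line can be picked apart like this; the regular expression and the helper names `parse_main_exit` and `is_unclean_exit_code` are assumptions made for this example, based only on the message format visible above.

```python
import re

# Matches e.g. "Main process exited, code=exited, status=1/FAILURE".
EXIT_RE = re.compile(
    r"Main process exited, code=(?P<code>\w+), status=(?P<status>\d+)/(?P<name>\S+)"
)


def parse_main_exit(line):
    """Return (code, status, name) from a systemd exit message, or None."""
    m = EXIT_RE.search(line)
    if m is None:
        return None
    return m["code"], int(m["status"]), m["name"]


def is_unclean_exit_code(line):
    """True if the main process exited on its own with a non-zero status."""
    parsed = parse_main_exit(line)
    return parsed is not None and parsed[0] == "exited" and parsed[1] != 0


line = ("Jan 26 17:54:08 ### systemd[1]: sssd.service: "
        "Main process exited, code=exited, status=1/FAILURE")
print(parse_main_exit(line))       # ('exited', 1, 'FAILURE')
print(is_unclean_exit_code(line))  # True
```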

gemini-code-assist

Code Review

This pull request changes the systemd restart policy for the main sssd service from on-abnormal to on-failure. The detailed description provides excellent justification for this change, highlighting a real-world scenario where an OOM event caused sssd to terminate with an exit code that on-abnormal does not handle, leading to system inaccessibility. Changing to on-failure ensures that sssd will be restarted on any failure, including non-zero exit codes, which significantly improves the service's resilience. This also aligns the main service's behavior with all other sssd sub-services, which already use on-failure. The change is correct, well-reasoned, and a clear improvement for system stability.

Squiddim · 2026-01-28T09:38:11Z

Duplicate of #8407, which I created a day earlier…

EdJoPaTo · 2026-01-28T09:44:07Z

ah lol, 15h ago, this wasn't there when I checked for other issues yesterday. Thank you, @Squiddim, for the same thing!

systemd: Restart=on-failure

b5c349a

gemini-code-assist bot reviewed Jan 28, 2026

alexey-tikhonov requested a review from aplopez January 29, 2026 13:36

alexey-tikhonov assigned aplopez Jan 29, 2026
