Segfault in free-threaded Python 3.14t during cluster shutdown (logging race in Cythonized cluster.so) #717

@dkropachev

Description

Summary

The test libev (3.14t) integration test job segfaults roughly 95% of the way through the cqlengine tests (during test_ifexists.py) on free-threaded Python 3.14t. The crash is a thread-safety race: logging calls format Host/EndPoint objects with %s/%r while those objects are concurrently being torn down during cluster shutdown.

Stack Trace

Fatal Python error: Segmentation fault

<Cannot show all threads while the GIL is disabled>
Stack (most recent call first):
  File ".../logging/__init__.py", line 1154 in emit
  File ".../concurrent/futures/thread.py", line 73 in run
  ...

Current thread's C stack trace (most recent call first):
  ... at _PyUnicodeWriter_WriteStr+0x77
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x13b1cc
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x14771a (PyObject_VectorcallMethod)
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x1121f9

Root Cause

The crash is a race between cluster shutdown and executor threads that log Host objects via %s/%r formatting. With the GIL disabled in 3.14t, this pattern is no longer safe.

The race:

  • Thread A (main): Cluster.shutdown() → Session.shutdown() → iterates/clears _pools and shuts down pools, potentially triggering cleanup of Host/EndPoint objects
  • Thread B (executor worker): still running a submitted task (e.g. run_add_or_renew_pool(), on_down_potentially_blocking(), or a future callback), hits a logging call like log.debug("... %s", host), which calls Host.__str__() → str(self.endpoint) → DefaultEndPoint.__str__() → "%s:%d" % (self._address, self._port)

The segfault occurs in _PyUnicodeWriter_WriteStr because the endpoint's _address string (or the endpoint object itself) is being garbage collected by Thread A while Thread B is trying to format it.
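A minimal sketch of the formatting chain involved, using simplified stand-ins for the driver's Host and DefaultEndPoint classes (names mirror the driver, but the bodies here are hypothetical and carry far less state than the real classes):

```python
class DefaultEndPoint:
    # Simplified stand-in for the driver's DefaultEndPoint.
    def __init__(self, address, port=9042):
        self._address = address
        self._port = port

    def __str__(self):
        # Reads mutable attributes at format time; with the GIL disabled,
        # this read can race with teardown freeing _address in another thread.
        return "%s:%d" % (self._address, self._port)


class Host:
    # Simplified stand-in for the driver's Host.
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __str__(self):
        return str(self.endpoint)


host = Host(DefaultEndPoint("127.0.0.1"))
# log.debug("Added pool for host %s to session", host) ultimately performs:
message = "Added pool for host %s to session" % host
```

Every %s of a Host walks this two-object chain, so there are multiple attribute reads that can each observe partially torn-down state.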

Shutdown order in Cluster.shutdown() (cluster.py:1772):

  1. self.is_shutdown = True
  2. self.scheduler.shutdown()
  3. self.control_connection.shutdown()
  4. Session shutdown → pool shutdown
  5. self.executor.shutdown() ← executor tasks may still be running during steps 2-4

The executor is shut down last, so in-flight tasks submitted before is_shutdown was set can still be executing during pool/session teardown.
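A drain-first shutdown order can be sketched roughly as below (a simplified stand-in assuming a concurrent.futures.ThreadPoolExecutor, as the driver uses; this is not the driver's actual code):

```python
from concurrent.futures import ThreadPoolExecutor


class Cluster:
    # Simplified stand-in illustrating a reordered shutdown, not driver code.
    def __init__(self):
        self.is_shutdown = False
        self.executor = ThreadPoolExecutor(max_workers=2)

    def shutdown(self):
        if self.is_shutdown:
            return
        self.is_shutdown = True
        # Drain the executor FIRST: wait for in-flight tasks so no worker
        # can still be formatting a Host when pools/sessions are torn down.
        self.executor.shutdown(wait=True)
        # ... scheduler, control connection, session/pool teardown go here ...
```

The trade-off is that shutdown() now blocks on in-flight tasks; the current order (executor last) avoids that wait but leaves the window described above.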

Likely logging call sites involved:

  • cluster.py:3247 — log.debug("Removed connection pool for %r", host) in remove_pool()
  • cluster.py:3236 — log.debug("Added pool for host %s to session", host) in run_add_or_renew_pool()
  • cluster.py:1955-1958 — log.debug("... %s", host) in _start_reconnector(), called from the @run_in_executor-decorated on_down_potentially_blocking()
  • cluster.py:1843-1852 — log.error/debug/info("... %s", host) in _on_up_future_completed()
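The reason the Python-level frame in the traceback is logging's emit() is that the stdlib logging module defers %-formatting: log.debug() only stores (msg, args) on the LogRecord, and str(host) runs later inside Handler.emit() via record.getMessage(). A small self-contained illustration (Host here is a hypothetical stand-in):

```python
import io
import logging


class Host:
    # Hypothetical stand-in; the real Host.__str__ reads endpoint state.
    def __init__(self, label):
        self.label = label

    def __str__(self):
        return self.label


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
log = logging.getLogger("demo")
log.addHandler(handler)
log.setLevel(logging.DEBUG)
log.propagate = False

# The %s is NOT formatted here; logging stores (msg, args) on the record
# and formats them inside Handler.emit() -> record.getMessage().
log.debug("Added pool for host %s to session", Host("127.0.0.1:9042"))
```

So even though the formatting happens on the same executor thread that called log.debug(), it happens deep inside the handler, after cluster shutdown may have progressed further.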

Observed In

logs_58180451190.zip

Possible Fixes

  1. Reorder shutdown: shut down (or at least drain) the executor before shutting down sessions/pools, so no executor tasks are running during teardown
  2. Defensive string caching: cache str(host) / repr(host) results so they don't access mutable state during formatting
  3. Guard logging calls: check is_shutdown before logging in executor-submitted callbacks, or catch exceptions in __str__/__repr__
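For fix 2, defensive string caching on the endpoint could look roughly like this (a hypothetical simplified class; the real DefaultEndPoint carries more state):

```python
class DefaultEndPoint:
    # Hypothetical sketch of fix 2: the formatted string is computed once
    # at construction, so __str__ touches no mutable state at format time.
    def __init__(self, address, port=9042):
        self._address = address
        self._port = port
        self._str = "%s:%d" % (address, port)  # immutable snapshot

    def __str__(self):
        return self._str


ep = DefaultEndPoint("10.1.2.3")
ep._address = None  # simulate teardown clearing state in another thread
assert str(ep) == "10.1.2.3:9042"  # cached string is unaffected
```

This removes the multi-step read of _address/_port during formatting, though the endpoint object itself must still be kept alive for the duration of the log call.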
