Summary
The `test libev (3.14t)` integration test job segfaults about 95% of the way through the cqlengine tests (during test_ifexists.py) on free-threaded Python 3.14t. The crash is a thread-safety race: logging calls format Host/EndPoint objects while those objects are being concurrently torn down during cluster shutdown.
Stack Trace
```
Fatal Python error: Segmentation fault
<Cannot show all threads while the GIL is disabled>
Stack (most recent call first):
  File ".../logging/__init__.py", line 1154 in emit
  File ".../concurrent/futures/thread.py", line 73 in run
  ...
Current thread's C stack trace (most recent call first):
  ... at _PyUnicodeWriter_WriteStr+0x77
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x13b1cc
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x14771a (PyObject_VectorcallMethod)
  ... cassandra/cluster.cpython-314t-x86_64-linux-gnu.so, at +0x1121f9
```
Root Cause
The crash is a race condition between cluster shutdown and executor threads that log with `%s`/`%r` formatting of `Host` objects. With the GIL disabled in 3.14t, this pattern is no longer safe.
The race:
- Thread A (main): `Cluster.shutdown()` → `Session.shutdown()` → iterates/clears `_pools` and shuts down pools, potentially triggering cleanup of `Host`/`EndPoint` objects
- Thread B (executor worker): still running a submitted task (e.g. `run_add_or_renew_pool()`, `on_down_potentially_blocking()`, or a future callback), hits a logging call like `log.debug("... %s", host)`, which calls `Host.__str__()` → `str(self.endpoint)` → `DefaultEndPoint.__str__()` → `"%s:%d" % (self._address, self._port)`
The segfault occurs in `_PyUnicodeWriter_WriteStr` because the endpoint's `_address` string (or the endpoint object itself) is being garbage collected by Thread A while Thread B is trying to format it.
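The formatting chain above can be sketched with simplified stand-ins for the driver's classes (these are not the real `Host`/`DefaultEndPoint` implementations, just the path the crash walks through):

```python
# Simplified sketch of the formatting chain: str(host) reads live
# endpoint attributes at call time, which is the window the race hits.

class DefaultEndPoint:
    def __init__(self, address, port=9042):
        self._address = address
        self._port = port

    def __str__(self):
        # Reads both attributes at format time; if either is torn down
        # concurrently, this is where the crash surfaces.
        return "%s:%d" % (self._address, self._port)

class Host:
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __str__(self):
        return str(self.endpoint)

host = Host(DefaultEndPoint("127.0.0.1"))
print(str(host))  # 127.0.0.1:9042
```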
Shutdown order in `Cluster.shutdown()` (cluster.py:1772):
1. `self.is_shutdown = True`
2. `self.scheduler.shutdown()`
3. `self.control_connection.shutdown()`
4. Session shutdown → pool shutdown
5. `self.executor.shutdown()` ← executor tasks may still be running during steps 2-4
The executor is shut down last, so in-flight tasks submitted before is_shutdown was set can still be executing during pool/session teardown.
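A deterministic, single-process sketch of this window (hypothetical `EndPoint` and task names; an event forces the ordering). Under the GIL the stale access surfaces as a Python exception; on free-threaded 3.14t the equivalent window can corrupt memory mid-format instead:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class EndPoint:
    def __init__(self, address, port):
        self._address = address
        self._port = port

    def __str__(self):
        return "%s:%d" % (self._address, self._port)

torn_down = threading.Event()

def in_flight_task(endpoint):
    # Simulates a task submitted before shutdown that is still running
    # while the main thread tears state down.
    torn_down.wait()
    return str(endpoint)

executor = ThreadPoolExecutor(max_workers=1)
ep = EndPoint("127.0.0.1", 9042)
future = executor.submit(in_flight_task, ep)

# Main thread plays the shutdown path: tear the endpoint down while the
# task is still in flight, then let the task proceed.
del ep._address
torn_down.set()
executor.shutdown()

error = None
try:
    future.result()
except AttributeError as exc:
    error = exc
print("worker failed formatting endpoint:", error)
```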
Likely logging call sites involved:
- cluster.py:3247 — `log.debug("Removed connection pool for %r", host)` in `remove_pool()`
- cluster.py:3236 — `log.debug("Added pool for host %s to session", host)` in `run_add_or_renew_pool()`
- cluster.py:1955-1958 — `log.debug("... %s", host)` in `_start_reconnector()`, called from `@run_in_executor`-decorated `on_down_potentially_blocking()`
- cluster.py:1843-1852 — `log.error/debug/info("... %s", host)` in `_on_up_future_completed()`
Observed In
- Job: `test libev (3.14t)` in PR #704
- CI run: https://github.com/scylladb/python-driver/actions/runs/22279731008/job/64448351909
- Not specific to the PR #704 ("Fix factory() returning dead connections on close during setup") changes — the PR only touches `connection.py:factory()`. This is a pre-existing race exposed by free-threaded Python.
Possible Fixes
- Reorder shutdown: shut down (or at least drain) the executor before shutting down sessions/pools, so no executor tasks are running during teardown
- Defensive string caching: cache `str(host)`/`repr(host)` results so they don't access mutable state during formatting
- Guard logging calls: check `is_shutdown` before logging in executor-submitted callbacks, or catch exceptions in `__str__`/`__repr__`
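A minimal sketch of the string-caching idea (hypothetical class, not the driver's API): precompute the display string at construction so `__str__` never reads mutable attributes later. This trades freshness for safety, which is acceptable if endpoints are effectively immutable after creation.

```python
class CachedEndPoint:
    def __init__(self, address, port=9042):
        self._address = address
        self._port = port
        # Computed eagerly; __str__ below never touches _address/_port
        # again, so concurrent teardown cannot race with formatting.
        self._description = "%s:%d" % (address, port)

    def __str__(self):
        return self._description

ep = CachedEndPoint("127.0.0.1")
del ep._address   # simulate concurrent teardown of endpoint state
print(str(ep))    # still 127.0.0.1:9042
```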