Skip to content

fix: thread-safety bugs and cleanup from code review#6

Merged
jsulmont merged 9 commits intomainfrom
jan/cleanup-tests
Mar 10, 2026
Merged

fix: thread-safety bugs and cleanup from code review#6
jsulmont merged 9 commits intomainfrom
jan/cleanup-tests

Conversation

@jsulmont
Copy link
Owner

Summary

Fixes from the 2026-03-10 code review. The first four are concurrency/correctness bugs, the rest are cleanup.

  • Move constructor data races: acquire other.mutex_ before reading members in the initializer list
  • Mutex-escaping references: get_node_state() returns a lightweight NodeStateSnapshot by value; get_metric_name() returns std::string
  • Deadlock in STATE publish: release mutex before blocking MQTT send in publish_state_birth/publish_state_death
  • Fragile STATE JSON parsing: tolerate whitespace in "online" : true instead of requiring exact "online":true
  • Remove dead code (rebirth() death payload, PayloadBuilder::build() timestamp fallback)
  • Use SEQ_NUMBER_MAX constant consistently
  • Remove unused includes, fix missing EOF newlines

Test plan

  • test_move_constructor_race — concurrent move with reader thread contention
  • test_mutex_escaping_refs — snapshot/string ownership survives map mutation
  • test_state_publish_deadlock — STATE publish under message flood
  • test_state_json_parsing — 10 whitespace variations of STATE JSON
  • All 11 tests pass with local mosquitto broker

Jan van Lindt added 9 commits March 10, 2026 14:50
 EdgeNode and HostApplication move constructors were reading members in
 the initializer list before locking other.mutex_ in the body — a data
 race if the source is accessed concurrently. Move all member transfers
 into the body under the lock. Also adds missing node_states_ transfer
 in HostApplication and primary_host_online_ in EdgeNode.
 get_node_state() returned a reference_wrapper into mutex-protected map
 data; get_metric_name() returned a string_view. Both were readable
 after the lock was released, racing with MQTT callback mutations.

 get_node_state() now returns a lightweight NodeStateSnapshot (scalars
 only). get_metric_name() returns std::string. Callers that need alias
 resolution use get_metric_name() separately.
 publish_state_birth/death held mutex_ while publish_raw_message blocked
 on a 5s future. If the Paho callback thread needed mutex_ to deliver
 on_message_arrived before firing the send-success callback, deadlock.

 Copy topic/payload/qos under the lock, release, then publish — matching
 the pattern EdgeNode already uses.
 - Use SEQ_NUMBER_MAX constant consistently instead of literal 256
 - Remove dead death_payload_data_ update in rebirth() (connect() overwrites it)
 - Remove unreachable timestamp fallback in PayloadBuilder::build()
 - Remove unused includes (memory, thread, algorithm, vector)
 - Document fire-and-forget QoS 0 semantics on command publish methods
 - Fix missing newlines at end of files
@jsulmont jsulmont merged commit 2d64b87 into main Mar 10, 2026
5 checks passed
@jsulmont jsulmont deleted the jan/cleanup-tests branch March 11, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant