Conversation

@josephjclark (Collaborator) commented Feb 1, 2026

Following #4333 and my final comment, I really don't think version history should be saved to the Workflow table. It's too hard to maintain there, and we can't run compaction properly.

This PR forces the provisioner and sandbox merge to look up the actual version history from the table.

Also, for good measure, since all version histories changed recently, I've added a migration to remove all existing history entries: a hard reset on all of them. The CLI won't mind - it's incompatible with those versions anyway.
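Concretely, the lookup now reads from the workflow_versions table. A minimal sketch of the idea (not the actual implementation; history_for/1 and latest_hash/1 are the real entry points, but the module path and column names here are assumptions):

  defmodule Lightning.Workflows.WorkflowVersions do
    import Ecto.Query
    alias Lightning.Repo

    # Sketch: read history from the workflow_versions table instead of
    # a version_history column on workflows. Column names are assumptions.
    def history_for(%{id: workflow_id}) do
      from(v in "workflow_versions",
        where: v.workflow_id == type(^workflow_id, :binary_id),
        order_by: [asc: v.inserted_at],
        select: v.hash
      )
      |> Repo.all()
    end

    # An empty history yields nil here, which callers must handle.
    def latest_hash(workflow), do: workflow |> history_for() |> List.last()
  end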

Tests are AI-generated and I haven't even looked at them yet - I'll take a closer look before opening this up for review.

Changelog deliberately omitted

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review
    with Claude Code)
  • I have implemented and tested all related authorization policies.
    (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation bot moved this to New Issues in v2 Feb 1, 2026
@josephjclark (Collaborator Author) commented:
I'm a bit worried about some of the diffs in this PR. I'll have to review more closely with a clear head tomorrow.

@josephjclark josephjclark force-pushed the update-workflow-history branch from e964a9a to 6b267cb on February 9, 2026 15:49
@@ -0,0 +1,33 @@
defmodule Lightning.Repo.Migrations.RemoveVersionHistoryFromWorkflows do
@josephjclark (Collaborator Author) commented:
@midigofrank if I wanted to basically reset all the versions in history, just to clear the slate, then I guess you wouldn't do it as a migration?

Maybe we should drop this. I was going to do it just as a precaution - but if it's too complex we should probably skip it.

I suppose the other way to look at it is a job that removes all versions timestamped before a certain date

A Contributor replied:
I agree we don't need to reset the version history here. The code works without it: history_for/1 and latest_hash/1 now query the table directly and handle empty results fine. But I think we still need this migration for the column drop and constraint removal; since the version_history field was removed from the schema, the column should go too. So I'd suggest keeping this migration but removing these two lines:

  # Clean slate: remove all workflow_versions rows since they were out of sync
  execute "DELETE FROM workflow_versions"

If we want a clean slate later, a separate mix task with a date cutoff sounds like the right approach.
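For what it's worth, a sketch of the trimmed migration (the constraint name and the column's original type are placeholders; match whatever the earlier migration actually created):

  defmodule Lightning.Repo.Migrations.RemoveVersionHistoryFromWorkflows do
    use Ecto.Migration

    def up do
      # Placeholder constraint name: drop whatever constraint the original
      # version_history migration added.
      execute "ALTER TABLE workflows DROP CONSTRAINT IF EXISTS workflows_version_history_check"

      alter table(:workflows) do
        remove :version_history
      end
    end

    def down do
      alter table(:workflows) do
        # Type is an assumption; restore the column as it was before.
        add :version_history, {:array, :string}, default: [], null: false
      end
    end
  end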

@josephjclark josephjclark marked this pull request as ready for review February 10, 2026 14:06
midigofrank and others added 4 commits February 10, 2026 15:25
* Refactor CircleCI to build-then-fan-out pattern

Previously, all 6 CI jobs ran the full build independently in parallel,
causing cache race conditions and redundant compilation that contributed
to flaky tests.

Now uses a single compile job that persists to workspace, with 6
lightweight check jobs that attach the pre-built workspace:

- compile: Full build, deps, PLT - runs once
- check_formatting, check_credo, check_dialyzer, check_sobelow: Fast checks
- test_elixir, test_javascript: Test suites

Benefits:
- Eliminates cache contention between parallel jobs
- Single compilation instead of 6x redundant builds
- Faster feedback on simple failures
- More reliable and debuggable CI runs

Also adds reusable executors and commands for cleaner config.

* Update changelog for CircleCI refactor

Add entry for PR #4378

* Fix workspace persist path error

Remove ~/.mix from persist_to_workspace paths - it's outside the
workspace root and already handled by the deps cache.

* Fix git safe.directory error during checkout

Add git safe.directory config before checkout to avoid "dubious
ownership" error when running as root in the lightning user's home.

* Install Hex/Rebar in downstream jobs

~/.mix isn't part of the workspace, so downstream jobs need to install
Hex and Rebar. This is a fast operation that doesn't require compilation.

* Add Node.js to test_elixir job

Lightning.AdaptorService calls npm during startup, so Node must be
available for Elixir tests to run.

* Fix ownership after cache restore

Cache is restored as root, so _build and deps directories need chown
before lightning user can compile.

* Add libsodium to test_elixir job

The enacl NIF requires libsodium at runtime.

* Fix cache path for Hex/Rebar

~/.mix expands to /root/.mix but Hex is installed to /home/lightning/.mix.
Use explicit path and bump cache version to v7 for clean slate.

* Fix ownership of .mix directory after cache restore

Cache restores /home/lightning/.mix as root. Need to chown all of
/home/lightning, not just the project directory.

* Fix rebar install and consolidate lint checks into single job

- Fix rebar3 installing for root instead of lightning user, which was
  causing all Erlang deps to recompile on every CI run
- Combine formatting, credo, and sobelow into single lint job to reduce
  environment spin-up overhead (~30s per job saved)
- Use when: always to run all lint checks even if one fails, with a
  sentinel file to track failures

* Use date-based cache keys instead of version numbers

* Move changelog entry
* fix .gitignore to preserve test fixture adaptor_registry_cache.json

The blanket ignore rule for adaptor_registry_cache.json was causing the
test fixture copy to be excluded by tools that respect .gitignore (e.g.
rsync), leading to test failures.

* Fix cascade delete race condition causing StaleEntryError on edge retargeting

When a job is deleted and its edge retargeted in the same save, PostgreSQL's
ON DELETE CASCADE removes the edge before Ecto can update it, causing a
StaleEntryError.

Change edge FK constraints from CASCADE to partial SET NULL (PG 15+, see
the sketch after this commit list) so deleted jobs nullify edge references
instead of cascading. Add orphan tracking to both save paths (save_workflow
and provisioner) that precisely captures which edges were affected, then
cleans up only those edges after the save completes - preserving legitimate
NULL target edges.

* Update provisioner test for partial SET NULL FK behavior

The test previously expected StaleEntryError when deleting a job and its
edge in the same import. With partial SET NULL FKs, the operation now
succeeds correctly - the FK nullifies the edge reference instead of
cascade-deleting it, allowing Ecto to process the edge deletion.

* More tests

* Clean up provisioner_test.exs

* Handle source_job edge orphaning when jobs are deleted

Previously only target_job_id references were tracked and cleaned up
when jobs were deleted. Edges where the deleted job was the source
(source_job_id) were nullified but not tracked for cleanup, leaving
orphaned edges with no source. This matters when replacing the first
job in a chain — both the trigger edge target and the downstream edge
source need handling.

- Track source-orphaned edge IDs alongside target-orphaned in both
  save_workflow and provisioner paths
- Broaden cleanup to delete edges missing either target or source
  (without trigger fallback)
- Restore two unrelated tests removed from provisioner "new project"
  block (collections, trigger edge enabled)
- Add 6 new tests covering source-job deletion and retargeting
  scenarios across both save paths

* Add changelog entry for edge retargeting StaleEntryError fix
* reuse existing version_chip component

* update changelog
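For reference, a sketch of the partial SET NULL change described in the cascade-fix commits above. This uses the PostgreSQL 15+ column-list syntax; the table, column, and constraint names are assumptions, and the composite reference requires a unique index on jobs (workflow_id, id):

  defmodule Lightning.Repo.Migrations.EdgesPartialSetNullSketch do
    use Ecto.Migration

    # Sketch only: real table/column/constraint names may differ.
    def up do
      execute """
      ALTER TABLE workflow_edges
        DROP CONSTRAINT IF EXISTS workflow_edges_target_job_id_fkey,
        ADD CONSTRAINT workflow_edges_target_job_id_fkey
          FOREIGN KEY (workflow_id, target_job_id)
          REFERENCES jobs (workflow_id, id)
          ON DELETE SET NULL (target_job_id)
      """
    end

    def down do
      execute """
      ALTER TABLE workflow_edges
        DROP CONSTRAINT IF EXISTS workflow_edges_target_job_id_fkey,
        ADD CONSTRAINT workflow_edges_target_job_id_fkey
          FOREIGN KEY (target_job_id) REFERENCES jobs (id)
          ON DELETE CASCADE
      """
    end
  end

On delete of a job, only target_job_id is set to NULL while workflow_id stays intact, so the save path can find and clean up precisely the orphaned edges afterwards.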
@elias-ba (Contributor) left a comment:
This is looking good, Joe, nicely done 👏🏽. I've added a comment with my thoughts on the migration file.

The provisioning API still returns version_history (via
WorkflowVersions.history_for), but the field was removed from the
Workflow schema so it can no longer be extracted via Map.take on
the struct. Use the query function instead.
@elias-ba (Contributor) commented:
Hey @josephjclark, I saw a few Elixir tests failing and wanted to help fix them. I already pushed a commit to fix the pull test in cli_deploy_test.exs (the expected state was missing version_history).

But there's one remaining failure I think is important for you to be aware of. The provisioning API still returns version_history via WorkflowVersions.history_for at provisioning_json.ex:82-84. The CLI round-trips it back during deploy, and the server rejects it with: "extraneous parameters: version_history"

Since the field was removed from the Workflow schema, the server no longer accepts it. Two options:

  1. Stop returning version_history from the provisioning API (remove lines 82-87 in provisioning_json.ex)
  2. Keep returning it but strip it from incoming deploy payloads (see the sketch below)

What do you think?
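For option 2, the strip could be a small helper in the incoming-params path. A sketch only: the function name is hypothetical, and it assumes workflows arrive as a list of maps in the payload:

  # Hypothetical helper: drop the read-only field from each workflow in an
  # incoming deploy payload before it is cast.
  defp strip_version_history(%{"workflows" => workflows} = params)
       when is_list(workflows) do
    %{params | "workflows" => Enum.map(workflows, &Map.delete(&1, "version_history"))}
  end

  defp strip_version_history(params), do: params

Option 1 is simpler, though, if nothing downstream still reads version_history from the provisioning payload.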
