Add assisted installer deployment method for spoke TNF clusters #51

Open
gamado wants to merge 4 commits into openshift-eng:main from gamado:assisted

Conversation

Contributor

@gamado gamado commented Feb 15, 2026

Deploy a spoke TNF cluster via ACM/MCE assisted installer on an existing hub cluster. This adds a third deployment path alongside dev-scripts (IPI) and kcli methods.

New Ansible roles:

  • assisted/acm-install: installs ACM operator, MultiClusterHub, AgentServiceConfig with auto-detected RHCOS ISO, enables TNF support, and configures provisioning for external BMH management
  • assisted/assisted-spoke: creates spoke libvirt network and VMs, deploys cluster resources (ClusterDeployment, AgentClusterInstall, InfraEnv, BareMetalHost with fencing credentials), monitors installation, and extracts spoke credentials

Usage: make deploy fencing-assisted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verified by:
Spoke Cluster Verification

Nodes:
NAME STATUS ROLES AGE VERSION
spoke-tnf-master-0 Ready control-plane,master,worker 46m v1.34.2
spoke-tnf-master-1 Ready control-plane,master,worker 27m v1.34.2

Cluster Operators: 34/34 Available, 0 Progressing, 0 Degraded

Full List of Resources:
* Clone Set: kubelet-clone [kubelet]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]
* spoke-tnf-master-0_redfish (stonith:fence_redfish): Started
* spoke-tnf-master-1_redfish (stonith:fence_redfish): Started
* Clone Set: etcd-clone [etcd]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Result: PASS

gamado and others added 2 commits February 11, 2026 21:48
@openshift-ci openshift-ci bot requested review from jaypoulz and slintes February 15, 2026 16:55

openshift-ci bot commented Feb 15, 2026

Hi @gamado. Thanks for your PR.

I'm waiting for an openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 15, 2026
Contributor

@fonta-rh fonta-rh left a comment


Review: Add Assisted Installer Deployment for Spoke TNF Clusters

Solid architecture — clean separation between acm-install (hub-side) and assisted-spoke (spoke-side) with good reuse of existing roles (proxy-setup, common/update-cluster-inventory). The RHCOS ISO auto-detection with fallback and the BMH fencing credential setup are well done.

Found 5 critical issues that should be fixed before merge, 8 important issues for follow-up, and 5 minor suggestions. Details below.


Critical Issues

C1. SSH key hardcodes id_rsa.pub — will fail on ed25519-only hosts

File: roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml

The task uses cat ~/.ssh/id_rsa.pub but the project convention is ~/.ssh/id_ed25519.pub (see CLAUDE.md, kcli role's prepare.yml). Modern RHEL 9 defaults to ed25519. If the hypervisor was provisioned with ed25519 only, this task silently fails.

Suggestion: Use the same multi-key detection pattern from the kcli role or common/:

- name: Get SSH public key
  shell: |
    for key in ~/.ssh/id_ed25519.pub ~/.ssh/id_rsa.pub ~/.ssh/id_ecdsa.pub; do
      [ -f "$key" ] && cat "$key" && exit 0
    done
    echo "ERROR: No SSH public key found" >&2 && exit 1
  register: ssh_pub_key
  changed_when: false

C2. pull_secret_path duplicated across both roles

Files:

  • roles/assisted/acm-install/vars/main.yml:23
  • roles/assisted/assisted-spoke/vars/main.yml:13

Both independently define pull_secret_path: /opt/dev-scripts/pull_secret.json. If the path changes, both must be updated independently.

Suggestion: Define once as a playbook-level variable in assisted-install.yml (or in the vars file loaded by vars_files), remove from both role vars/main.yml.
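
Something like the following in assisted-install.yml would keep a single definition visible to both roles (a sketch only; the hosts pattern and role list are illustrative):

# assisted-install.yml -- shared variables defined once at play level
- hosts: hypervisor            # illustrative hosts pattern
  vars:
    pull_secret_path: /opt/dev-scripts/pull_secret.json
  roles:
    - assisted/acm-install
    - assisted/assisted-spoke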


C3. Hub release image extraction duplicated between roles

Files:

  • roles/assisted/acm-install/tasks/agent-service-config.yml:1-14 — fetches hub_release_image and hub_ocp_version
  • roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml:1-20 — fetches the same data

Same oc get clusterversion calls, same parsing, same variable names. Runs twice during a single deployment.

Suggestion: Move to a pre_tasks block in assisted-install.yml and pass as facts, or create a shared task file under roles/assisted/common/.
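
A rough sketch of the pre_tasks option, assuming the playbook-level hub_kubeconfig variable from C2 (task names and parsing are illustrative):

pre_tasks:
  - name: Get hub release image and version once per deployment
    shell: |
      oc get clusterversion version \
        -o jsonpath='{.status.desired.image} {.status.desired.version}'
    environment:
      KUBECONFIG: "{{ hub_kubeconfig }}"
    register: hub_clusterversion
    changed_when: false

  - name: Expose hub release facts to both roles
    set_fact:
      hub_release_image: "{{ hub_clusterversion.stdout.split(' ')[0] }}"
      hub_ocp_version: "{{ hub_clusterversion.stdout.split(' ')[1] }}"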


C4. RHCOS ISO fallback logic has edge cases that pass silently

File: roles/assisted/acm-install/tasks/agent-service-config.yml

Two issues:

  1. The primary extraction pipes oc image info -o json 2>/dev/null to Python. If oc image info fails, Python gets empty stdin → JSONDecodeError → rc=1. But failed_when only checks "'FAILED' in rhcos_iso_extraction.stdout", not rc. The task succeeds with empty stdout.
  2. The set_fact uses rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout). default() only triggers when the variable is undefined (i.e., fallback was skipped). If fallback runs but produces empty output, rhcos_iso_url becomes "" — the failed_when checks for 'FAILED' and 'NEEDS_FALLBACK' but not empty string.

Suggestion:

- name: Set RHCOS ISO URL fact
  set_fact:
    rhcos_iso_url: >-
      {{ (rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout)) | trim }}
  failed_when: >-
    rhcos_iso_url == 'FAILED' or
    rhcos_iso_url == 'NEEDS_FALLBACK' or
    rhcos_iso_url == ''

And add failed_when: rhcos_iso_extraction.rc != 0 or 'FAILED' in rhcos_iso_extraction.stdout to the primary extraction task.


C5. VM cleanup can cause stale disk reuse

File: roles/assisted/assisted-spoke/tasks/cleanup.yml

virsh undefine --remove-all-storage is wrapped in 2>/dev/null || true + failed_when: false. If storage removal fails (locked disk, permission denied), the VM definition is removed but the disk persists. On re-deployment, qemu-img create uses creates: guard and skips disk creation — the new VM boots from the OLD disk.

Suggestion: Separate the storage cleanup from VM undefine, or explicitly delete the qcow2 files in the cleanup task:

- name: Remove spoke VM disk images
  file:
    path: "{{ spoke_vm_image_dir }}/{{ spoke_cluster_name }}-master-{{ item }}.qcow2"
    state: absent
  loop: "{{ range(spoke_ctlplanes) | list }}"
  become: true

Important Issues

I6. Deploy script hardcodes spoke path

File: scripts/deploy-fencing-assisted.sh:42

Success message says KUBECONFIG=~/spoke-tnf/auth/kubeconfig but spoke_cluster_name is configurable in vars/assisted.yml. A user with spoke_cluster_name: my-cluster gets the wrong path in the output.

Suggestion: Either source the vars file to get the actual name, or note that the path depends on the configured spoke_cluster_name.


I7. No input validation for spoke-specific variables

File: (missing — no validate.yml in assisted-spoke)

Both dev-scripts/install-dev and kcli/kcli-install have dedicated validate.yml files. The assisted roles validate hub health but not spoke inputs (CIDR format, VIP within CIDR, non-empty cluster name, valid BMC credentials format).

Suggestion: Add a validate.yml to assisted-spoke that checks at minimum:

  • spoke_api_vip and spoke_ingress_vip are within spoke_network_cidr
  • spoke_cluster_name is non-empty and DNS-safe
  • spoke_ctlplanes >= 2
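
A minimal validate.yml covering those checks might look like this (variable names as above; the CIDR membership checks assume the ansible.utils.ipaddr filter, which needs the netaddr Python package on the controller):

- name: Validate spoke deployment inputs
  assert:
    that:
      - spoke_cluster_name is defined and spoke_cluster_name | length > 0
      - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')
      - spoke_ctlplanes | int >= 2
      - spoke_api_vip | ansible.utils.ipaddr(spoke_network_cidr)
      - spoke_ingress_vip | ansible.utils.ipaddr(spoke_network_cidr)
    fail_msg: Spoke variables in vars/assisted.yml failed validation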

I8. No cluster state tracking — must integrate with existing system

Existing roles update cluster-vm-state.json via common/tasks/cluster-state.yml. The assisted roles skip this entirely, which means the deployment tracking system has no awareness of spoke clusters. This breaks make info, lifecycle management, and any tooling that depends on cluster-vm-state.json to know what's deployed.

The assisted-spoke role must call common/tasks/cluster-state.yml (or equivalent) to register spoke VMs and their state. This is a project convention — all deployment methods must support it.
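
The call interface of that task file isn't visible in this diff, so the following is only an assumed shape (the cluster_name/cluster_state variable names are hypothetical):

- name: Record spoke cluster state (assumed interface)
  include_role:
    name: common
    tasks_from: cluster-state
  vars:
    cluster_name: "{{ spoke_cluster_name }}"   # hypothetical variable name
    cluster_state: deployed                    # hypothetical variable name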


I9. failed_when: false too broadly applied

Multiple locations silence legitimate errors:

  • cleanup.yml:3-7 — namespace delete timeout silenced entirely
  • enable-watch-all-namespaces.yml:17 — Provisioning CR patch hides API errors
  • retrieve-credentials.yml:63-78 — VM start failure hidden, triggers 2-minute blind pause

Suggestion: At minimum, log the error output. For the VM start case, if virsh start fails, don't wait 120 seconds.
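
For the namespace-delete case, something along these lines keeps the task non-fatal but visible (task and variable names are illustrative):

- name: Delete spoke namespace on cleanup (non-fatal)
  command: oc delete namespace "{{ spoke_cluster_name }}" --timeout=120s
  register: ns_delete
  ignore_errors: true

- name: Surface namespace delete errors instead of hiding them
  debug:
    msg: "Namespace delete did not complete cleanly: {{ ns_delete.stderr | default('') }}"
  when: ns_delete is failed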


I10. Wait loops provide no diagnostics on timeout

When wait for assisted-service pod or wait for agents to register times out, the user sees only the retry count. Consider adding a rescue block that dumps pod status, events, or logs.
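
For the assisted-service wait, a rescue along these lines would give the user something to act on (the namespace and label selector here are assumptions about a typical MCE install):

- name: Wait for assisted-service with diagnostics on timeout
  block:
    - name: Wait for assisted-service pod to become Ready
      command: >
        oc wait pod -n multicluster-engine -l app=assisted-service
        --for=condition=Ready --timeout=600s
      changed_when: false
  rescue:
    - name: Collect pod status and recent events for troubleshooting
      shell: |
        oc get pods -n multicluster-engine -o wide
        oc get events -n multicluster-engine --sort-by=.lastTimestamp | tail -n 20
      register: assisted_diag
      changed_when: false
      failed_when: false
    - name: Report diagnostics and fail the play
      fail:
        msg: "assisted-service never became Ready:\n{{ assisted_diag.stdout }}"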


I11. hub_auth_dir defined but never used

File: roles/assisted/assisted-spoke/vars/main.yml:21

Dead variable.


I12. storage.yml destructive rm -rf on re-run

File: roles/assisted/acm-install/tasks/storage.yml

rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*

Runs every time the role executes (not just on cleanup), wiping existing assisted-service data. This belongs in cleanup.yml, not storage.yml. Gate it with a when: force_cleanup or move to cleanup.
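
If the wipe stays in storage.yml, it could at least be made opt-in (force_cleanup is an assumed variable name, default off):

- name: Wipe existing assisted-service storage (destructive, opt-in only)
  shell: rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
  become: true
  when: force_cleanup | default(false) | bool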


I13. No README for the new roles

File: (missing — no README.md in either assisted/acm-install/ or assisted/assisted-spoke/)

Every existing role in this project has a comprehensive README (dev-scripts/install-dev/README.md, kcli/kcli-install/README.md, etc.). READMEs are mandatory for new roles — they document prerequisites, variables, usage examples, and known limitations. The assisted.yml.template comments are not a substitute.


Suggestions

S14. changed_when: true on network/VM creation masks idempotency. Use actual command output to determine change status.
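
For example, the network-definition task could derive change status from what the command actually did (spoke_network_name and the XML path are illustrative):

- name: Define spoke libvirt network only if it does not already exist
  shell: |
    if virsh net-info "{{ spoke_network_name }}" >/dev/null 2>&1; then
      echo EXISTS
    else
      virsh net-define /tmp/{{ spoke_network_name }}.xml && echo CREATED
    fi
  register: spoke_net_define
  changed_when: "'CREATED' in spoke_net_define.stdout"
  become: true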

S15. setup-ksushy.yml assertion message assumes 2 hub nodes. Hubs can have 3+ nodes (e.g., 3-node HA).

S16. Template should document DHCP range constraints — VIPs must be outside .50-.150 range.

S17. Missing hub_network_cidr in template — if hub network differs from 192.168.111.0/24, nftables rules silently fail.

S18. Nearly all new files lack a trailing newline. Minor but should be fixed for consistency.


Strengths

  • Architecture is sound — clean separation between acm-install and assisted-spoke with the playbook orchestrating both
  • Good reuse of existing roles — proxy-setup and common/update-cluster-inventory in post_tasks
  • RHCOS ISO auto-detection with fallback is genuinely useful
  • Operational comments are excellent (e.g., "Requires both a ConfigMap AND an annotation on AgentServiceConfig")
  • assisted.yml.template is well-structured as user-facing documentation
  • BMH creation with fencing credentials correctly uses bmac.agent-install.openshift.io/fencing-credentials-secret-name
  • Makefile integration correctly chains fencing-ipi before fencing-assisted

- C1: SSH key detection now tries ed25519 first, falls back to rsa/ecdsa
- C2: Deduplicate pull_secret_path and hub_kubeconfig to playbook level
- C3: Move hub release image extraction to playbook pre_tasks (run once)
- C4: RHCOS ISO extraction checks rc and catches empty string failures
- C5: Explicit disk cleanup prevents stale qcow2 reuse on re-deploy
- Remove unused hub_auth_dir variable (I11)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci bot commented Feb 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gamado
Once this PR has been reviewed and has the lgtm label, please ask for approval from fonta-rh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

gamado commented Feb 18, 2026

Both clusters are healthy:

Spoke cluster (spoke-tnf):

  • 2 nodes: spoke-tnf-master-0 and spoke-tnf-master-1, both Ready
  • OCP 4.22.0-0.nightly, all cluster operators Available=True, Degraded=False
  • etcd operator healthy

Hub cluster:

  • 2 nodes: master-0 and master-1, both Ready
  • Pacemaker cluster TNF running with quorum
  • All resources started: kubelet-clone, STONITH fencing (fence_redfish), etcd-clone
  • Corosync, Pacemaker, pcsd all active/enabled

The critical fixes (C1-C5) are fully validated. Ready to start on the important issues whenever you are.

- I6: Parse spoke_cluster_name from vars/assisted.yml in deploy script
  instead of hardcoding ~/spoke-tnf/auth/kubeconfig
- I8: Add cluster state tracking (deploying/deployed) to assisted-install.yml
  using common/cluster-state.yml, consistent with dev-scripts and kcli methods
- I9: Replace blanket failed_when:false with ignore_errors:true in cleanup,
  conditional error checking in enable-watch-all-namespaces, and remove
  failed_when:false from virsh start in retrieve-credentials
- I10: Add block/rescue diagnostic handlers to all 9 wait loops across
  install-operator, agent-service-config, wait-for-install, and
  retrieve-credentials, dumping relevant status on timeout
- I13: Add README.md for acm-install and assisted-spoke roles
- S16: Document DHCP range constraints in assisted.yml.template
- S17: Expose hub_network_cidr in assisted.yml.template
- S18: Add trailing newlines to all 22 new files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor Author

gamado commented Feb 18, 2026

Review Fixes Status

Two commits pushed addressing the review findings:

Commit 1: 3a61c01 — Critical Fixes (C1-C5)

| Issue | Status | Description |
| --- | --- | --- |
| C1 | Fixed | SSH key detection: multi-key loop (ed25519, rsa, ecdsa) instead of hardcoded id_rsa.pub |
| C2 | Fixed | pull_secret_path and hub_kubeconfig moved to playbook-level vars, removed from both roles |
| C3 | Fixed | Hub release image extraction moved to playbook pre_tasks, removed duplication from both roles |
| C4 | Fixed | RHCOS ISO extraction: added rc check on primary, empty string check on set_fact |
| C5 | Fixed | Explicit disk image removal after VM undefine to prevent stale disk reuse |

Commit 2: 38bfca3 — Important Fixes + Suggestions (I6, I8, I9, I10, I13, S16-S18)

| Issue | Status | Description |
| --- | --- | --- |
| I6 | Fixed | Deploy script parses spoke_cluster_name from vars/assisted.yml instead of hardcoding path |
| I7 | Skipped | Input validation for spoke variables — deferred |
| I8 | Fixed | Cluster state tracking added to assisted-install.yml (deploying/deployed phases via common/cluster-state.yml) |
| I9 | Fixed | failed_when: false replaced: ignore_errors: true in cleanup, conditional check in provisioning patch, removed from virsh start |
| I10 | Fixed | Block/rescue diagnostic handlers added to all 9 wait loops across 4 files |
| I11 | Fixed | hub_auth_dir removed (was fixed in C2 commit) |
| I12 | Skipped | Destructive rm -rf in storage.yml — deferred |
| I13 | Fixed | READMEs added for acm-install and assisted-spoke roles |
| S14 | Skipped | changed_when: true idempotency — deferred |
| S15 | Skipped | setup-ksushy.yml assertion message — deferred |
| S16 | Fixed | DHCP range constraints documented in assisted.yml.template |
| S17 | Fixed | hub_network_cidr exposed in assisted.yml.template |
| S18 | Fixed | Trailing newlines added to all 22 new files |

Summary

  • 14 of 18 review findings addressed
  • 4 deferred (I7, I12, S14, S15 — low risk, can be follow-up)
  • End-to-end deployment validated: make deploy fencing-assisted completed successfully (ok=155, changed=48, failed=0)
  • Both hub and spoke clusters verified healthy (nodes Ready, all COs available, Pacemaker fencing operational)

@gamado gamado requested a review from fonta-rh February 18, 2026 18:09
Contributor Author

gamado commented Feb 19, 2026

Environment

  • Platform: AWS c5.metal (RHEL 9.6)
  • OCP Version: 4.22.0-0.nightly-2026-02-17-112443
  • Deployment method: make deploy fencing-assisted
  • Date: 2026-02-19

Deployment Summary

Full end-to-end deployment completed successfully:

PLAY RECAP *********************************************************************
ec2-user@51.85.92.84  : ok=163  changed=50  unreachable=0  failed=0  skipped=18  rescued=0  ignored=0

Hub Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (master-0, master-1) |
| Cluster Operators | All healthy (Available=True, Degraded=False) |
| Pacemaker | TNF cluster running, partition with quorum |
| STONITH | fence_redfish started on both nodes |
| etcd | etcd-clone started on both nodes |
| kubelet | kubelet-clone started on both nodes |
| Corosync/Pacemaker/pcsd | active/enabled |

Spoke Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (spoke-tnf-master-0, spoke-tnf-master-1) |
| Cluster Operators | 34/34 healthy (Available=True, Degraded=False) |
| OCP Version | 4.22.0-0.nightly-2026-02-17-112443 |

Fixes Validated

All fixes from the two review-fix commits were validated in this run:

| Fix | Validated |
| --- | --- |
| C1 - SSH key multi-key detection | Yes — key detected and used |
| C2 - Playbook-level pull_secret_path / hub_kubeconfig | Yes — no duplication errors |
| C3 - Hub release extraction in pre_tasks | Yes — version extracted once, passed to both roles |
| C4 - RHCOS ISO fallback error handling | Yes — ISO URL resolved |
| C5 - Explicit disk cleanup | Yes — clean VM creation |
| I6 - Dynamic spoke path in deploy script | Yes — output shows correct ~/spoke-tnf/auth/kubeconfig |
| I8 - Cluster state tracking | Yes — DEPLOYING/DEPLOYED phases logged for ASSISTED method |
| I9 - failed_when: false replaced | Yes — no silent failures, rescued=0 |
| I10 - Wait loop diagnostics | Yes — all waits passed cleanly, rescue blocks available but not triggered |
| I13 - Role READMEs | Yes — present in repo |
| S16 - DHCP range documented | Yes — template updated |
| S17 - hub_network_cidr in template | Yes — template updated |
| S18 - Trailing newlines | Yes — all files fixed |
