Add assisted installer deployment method for spoke TNF clusters #51

Open
gamado wants to merge 4 commits into openshift-eng:main from gamado:assisted

Conversation

Contributor

@gamado gamado commented Feb 15, 2026

Deploy a spoke TNF cluster via ACM/MCE assisted installer on an existing hub cluster. This adds a third deployment path alongside dev-scripts (IPI) and kcli methods.

New Ansible roles:

  • assisted/acm-install: installs ACM operator, MultiClusterHub, AgentServiceConfig with auto-detected RHCOS ISO, enables TNF support, and configures provisioning for external BMH management
  • assisted/assisted-spoke: creates spoke libvirt network and VMs, deploys cluster resources (ClusterDeployment, AgentClusterInstall, InfraEnv, BareMetalHost with fencing credentials), monitors installation, and extracts spoke credentials

Usage: make deploy fencing-assisted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verified by:
Spoke Cluster Verification

Nodes:
NAME STATUS ROLES AGE VERSION
spoke-tnf-master-0 Ready control-plane,master,worker 46m v1.34.2
spoke-tnf-master-1 Ready control-plane,master,worker 27m v1.34.2

Cluster Operators: 34/34 Available, 0 Progressing, 0 Degraded

Full List of Resources:
* Clone Set: kubelet-clone [kubelet]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]
* spoke-tnf-master-0_redfish (stonith:fence_redfish): Started
* spoke-tnf-master-1_redfish (stonith:fence_redfish): Started
* Clone Set: etcd-clone [etcd]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Result: PASS

gamado and others added 2 commits February 11, 2026 21:48
@openshift-ci openshift-ci bot requested review from jaypoulz and slintes February 15, 2026 16:55

openshift-ci bot commented Feb 15, 2026

Hi @gamado. Thanks for your PR.

I'm waiting for an openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 15, 2026
Contributor

@fonta-rh fonta-rh left a comment


Review: Add Assisted Installer Deployment for Spoke TNF Clusters

Solid architecture — clean separation between acm-install (hub-side) and assisted-spoke (spoke-side) with good reuse of existing roles (proxy-setup, common/update-cluster-inventory). The RHCOS ISO auto-detection with fallback and the BMH fencing credential setup are well done.

Found 5 critical issues that should be fixed before merge, 8 important issues for follow-up, and 5 minor suggestions. Details below.


Critical Issues

C1. SSH key hardcodes id_rsa.pub — will fail on ed25519-only hosts

File: roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml

The task uses cat ~/.ssh/id_rsa.pub but the project convention is ~/.ssh/id_ed25519.pub (see CLAUDE.md, kcli role's prepare.yml). Modern RHEL 9 defaults to ed25519. If the hypervisor was provisioned with ed25519 only, this task silently fails.

Suggestion: Use the same multi-key detection pattern from the kcli role or common/:

- name: Get SSH public key
  shell: |
    for key in ~/.ssh/id_ed25519.pub ~/.ssh/id_rsa.pub ~/.ssh/id_ecdsa.pub; do
      [ -f "$key" ] && cat "$key" && exit 0
    done
    echo "ERROR: No SSH public key found" >&2 && exit 1
  register: ssh_pub_key
  changed_when: false

C2. pull_secret_path duplicated across both roles

Files:

  • roles/assisted/acm-install/vars/main.yml:23
  • roles/assisted/assisted-spoke/vars/main.yml:13

Both independently define pull_secret_path: /opt/dev-scripts/pull_secret.json. If the path changes, both must be updated independently.

Suggestion: Define once as a playbook-level variable in assisted-install.yml (or in the vars file loaded by vars_files), remove from both role vars/main.yml.
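
Something like the following in assisted-install.yml would keep a single definition visible to both roles (a sketch only; the hosts pattern and role list are illustrative):

# assisted-install.yml -- shared variables defined once at play level
- hosts: hypervisor            # illustrative hosts pattern
  vars:
    pull_secret_path: /opt/dev-scripts/pull_secret.json
  roles:
    - assisted/acm-install
    - assisted/assisted-spoke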


C3. Hub release image extraction duplicated between roles

Files:

  • roles/assisted/acm-install/tasks/agent-service-config.yml:1-14 — fetches hub_release_image and hub_ocp_version
  • roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml:1-20 — fetches the same data

Same oc get clusterversion calls, same parsing, same variable names. Runs twice during a single deployment.

Suggestion: Move to a pre_tasks block in assisted-install.yml and pass as facts, or create a shared task file under roles/assisted/common/.
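
A rough sketch of the pre_tasks option, assuming the playbook-level hub_kubeconfig variable from C2 (task names and parsing are illustrative):

pre_tasks:
  - name: Get hub release image and version once per deployment
    shell: |
      oc get clusterversion version \
        -o jsonpath='{.status.desired.image} {.status.desired.version}'
    environment:
      KUBECONFIG: "{{ hub_kubeconfig }}"
    register: hub_clusterversion
    changed_when: false

  - name: Expose hub release facts to both roles
    set_fact:
      hub_release_image: "{{ hub_clusterversion.stdout.split(' ')[0] }}"
      hub_ocp_version: "{{ hub_clusterversion.stdout.split(' ')[1] }}"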


C4. RHCOS ISO fallback logic has edge cases that pass silently

File: roles/assisted/acm-install/tasks/agent-service-config.yml

Two issues:

  1. The primary extraction pipes oc image info -o json 2>/dev/null to Python. If oc image info fails, Python gets empty stdin → JSONDecodeError → rc=1. But failed_when only checks "'FAILED' in rhcos_iso_extraction.stdout", not rc. The task succeeds with empty stdout.
  2. The set_fact uses rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout). default() only triggers when the variable is undefined (i.e., fallback was skipped). If fallback runs but produces empty output, rhcos_iso_url becomes "" — the failed_when checks for 'FAILED' and 'NEEDS_FALLBACK' but not empty string.

Suggestion:

- name: Set RHCOS ISO URL fact
  set_fact:
    rhcos_iso_url: >-
      {{ (rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout)) | trim }}
  failed_when: >-
    rhcos_iso_url == 'FAILED' or
    rhcos_iso_url == 'NEEDS_FALLBACK' or
    rhcos_iso_url == ''

And add failed_when: rhcos_iso_extraction.rc != 0 or 'FAILED' in rhcos_iso_extraction.stdout to the primary extraction task.


C5. VM cleanup can cause stale disk reuse

File: roles/assisted/assisted-spoke/tasks/cleanup.yml

virsh undefine --remove-all-storage is wrapped in 2>/dev/null || true + failed_when: false. If storage removal fails (locked disk, permission denied), the VM definition is removed but the disk persists. On re-deployment, qemu-img create uses creates: guard and skips disk creation — the new VM boots from the OLD disk.

Suggestion: Separate the storage cleanup from VM undefine, or explicitly delete the qcow2 files in the cleanup task:

- name: Remove spoke VM disk images
  file:
    path: "{{ spoke_vm_image_dir }}/{{ spoke_cluster_name }}-master-{{ item }}.qcow2"
    state: absent
  loop: "{{ range(spoke_ctlplanes) | list }}"
  become: true

Important Issues

I6. Deploy script hardcodes spoke path

File: scripts/deploy-fencing-assisted.sh:42

Success message says KUBECONFIG=~/spoke-tnf/auth/kubeconfig but spoke_cluster_name is configurable in vars/assisted.yml. A user with spoke_cluster_name: my-cluster gets the wrong path in the output.

Suggestion: Either source the vars file to get the actual name, or note that the path depends on the configured spoke_cluster_name.


I7. No input validation for spoke-specific variables

File: (missing — no validate.yml in assisted-spoke)

Both dev-scripts/install-dev and kcli/kcli-install have dedicated validate.yml files. The assisted roles validate hub health but not spoke inputs (CIDR format, VIP within CIDR, non-empty cluster name, valid BMC credentials format).

Suggestion: Add a validate.yml to assisted-spoke that checks at minimum:

  • spoke_api_vip and spoke_ingress_vip are within spoke_network_cidr
  • spoke_cluster_name is non-empty and DNS-safe
  • spoke_ctlplanes >= 2
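
A minimal validate.yml covering those checks might look like this (variable names as above; the CIDR membership checks assume the ansible.utils.ipaddr filter, which needs the netaddr Python package on the controller):

- name: Validate spoke deployment inputs
  assert:
    that:
      - spoke_cluster_name is defined and spoke_cluster_name | length > 0
      - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')
      - spoke_ctlplanes | int >= 2
      - spoke_api_vip | ansible.utils.ipaddr(spoke_network_cidr)
      - spoke_ingress_vip | ansible.utils.ipaddr(spoke_network_cidr)
    fail_msg: Spoke variables in vars/assisted.yml failed validation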

I8. No cluster state tracking — must integrate with existing system

Existing roles update cluster-vm-state.json via common/tasks/cluster-state.yml. The assisted roles skip this entirely, which means the deployment tracking system has no awareness of spoke clusters. This breaks make info, lifecycle management, and any tooling that depends on cluster-vm-state.json to know what's deployed.

The assisted-spoke role must call common/tasks/cluster-state.yml (or equivalent) to register spoke VMs and their state. This is a project convention — all deployment methods must support it.
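
The call interface of that task file isn't visible in this diff, so the following is only an assumed shape (the cluster_name/cluster_state variable names are hypothetical):

- name: Record spoke cluster state (assumed interface)
  include_role:
    name: common
    tasks_from: cluster-state
  vars:
    cluster_name: "{{ spoke_cluster_name }}"   # hypothetical variable name
    cluster_state: deployed                    # hypothetical variable name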


I9. failed_when: false too broadly applied

Multiple locations silence legitimate errors:

  • cleanup.yml:3-7 — namespace delete timeout silenced entirely
  • enable-watch-all-namespaces.yml:17 — Provisioning CR patch hides API errors
  • retrieve-credentials.yml:63-78 — VM start failure hidden, triggers 2-minute blind pause

Suggestion: At minimum, log the error output. For the VM start case, if virsh start fails, don't wait 120 seconds.
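
For the namespace-delete case, something along these lines keeps the task non-fatal but visible (task and variable names are illustrative):

- name: Delete spoke namespace on cleanup (non-fatal)
  command: oc delete namespace "{{ spoke_cluster_name }}" --timeout=120s
  register: ns_delete
  ignore_errors: true

- name: Surface namespace delete errors instead of hiding them
  debug:
    msg: "Namespace delete did not complete cleanly: {{ ns_delete.stderr | default('') }}"
  when: ns_delete is failed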


I10. Wait loops provide no diagnostics on timeout

When wait for assisted-service pod or wait for agents to register times out, the user sees only the retry count. Consider adding a rescue block that dumps pod status, events, or logs.
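
For the assisted-service wait, a rescue along these lines would give the user something to act on (the namespace and label selector here are assumptions about a typical MCE install):

- name: Wait for assisted-service with diagnostics on timeout
  block:
    - name: Wait for assisted-service pod to become Ready
      command: >
        oc wait pod -n multicluster-engine -l app=assisted-service
        --for=condition=Ready --timeout=600s
      changed_when: false
  rescue:
    - name: Collect pod status and recent events for troubleshooting
      shell: |
        oc get pods -n multicluster-engine -o wide
        oc get events -n multicluster-engine --sort-by=.lastTimestamp | tail -n 20
      register: assisted_diag
      changed_when: false
      failed_when: false
    - name: Report diagnostics and fail the play
      fail:
        msg: "assisted-service never became Ready:\n{{ assisted_diag.stdout }}"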


I11. hub_auth_dir defined but never used

File: roles/assisted/assisted-spoke/vars/main.yml:21

Dead variable.


I12. storage.yml destructive rm -rf on re-run

File: roles/assisted/acm-install/tasks/storage.yml

rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*

Runs every time the role executes (not just on cleanup), wiping existing assisted-service data. This belongs in cleanup.yml, not storage.yml. Gate it with a when: force_cleanup or move to cleanup.
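
If the wipe stays in storage.yml, it could at least be made opt-in (force_cleanup is an assumed variable name, default off):

- name: Wipe existing assisted-service storage (destructive, opt-in only)
  shell: rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
  become: true
  when: force_cleanup | default(false) | bool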


I13. No README for the new roles

File: (missing — no README.md in either assisted/acm-install/ or assisted/assisted-spoke/)

Every existing role in this project has a comprehensive README (dev-scripts/install-dev/README.md, kcli/kcli-install/README.md, etc.). READMEs are mandatory for new roles — they document prerequisites, variables, usage examples, and known limitations. The assisted.yml.template comments are not a substitute.


Suggestions

S14. changed_when: true on network/VM creation masks idempotency. Use actual command output to determine change status.
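
For example, the network-definition task could derive change status from what the command actually did (spoke_network_name and the XML path are illustrative):

- name: Define spoke libvirt network only if it does not already exist
  shell: |
    if virsh net-info "{{ spoke_network_name }}" >/dev/null 2>&1; then
      echo EXISTS
    else
      virsh net-define /tmp/{{ spoke_network_name }}.xml && echo CREATED
    fi
  register: spoke_net_define
  changed_when: "'CREATED' in spoke_net_define.stdout"
  become: true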

S15. setup-ksushy.yml assertion message assumes 2 hub nodes. Hubs can have 3+ nodes (e.g., 3-node HA).

S16. Template should document DHCP range constraints — VIPs must be outside .50-.150 range.

S17. Missing hub_network_cidr in template — if hub network differs from 192.168.111.0/24, nftables rules silently fail.

S18. Nearly all new files lack a trailing newline. Minor but should be fixed for consistency.


Strengths

  • Architecture is sound — clean separation between acm-install and assisted-spoke with the playbook orchestrating both
  • Good reuse of existing roles — proxy-setup and common/update-cluster-inventory in post_tasks
  • RHCOS ISO auto-detection with fallback is genuinely useful
  • Operational comments are excellent (e.g., "Requires both a ConfigMap AND an annotation on AgentServiceConfig")
  • assisted.yml.template is well-structured as user-facing documentation
  • BMH creation with fencing credentials correctly uses bmac.agent-install.openshift.io/fencing-credentials-secret-name
  • Makefile integration correctly chains fencing-ipi before fencing-assisted

- C1: SSH key detection now tries ed25519 first, falls back to rsa/ecdsa
- C2: Deduplicate pull_secret_path and hub_kubeconfig to playbook level
- C3: Move hub release image extraction to playbook pre_tasks (run once)
- C4: RHCOS ISO extraction checks rc and catches empty string failures
- C5: Explicit disk cleanup prevents stale qcow2 reuse on re-deploy
- Remove unused hub_auth_dir variable (I11)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci bot commented Feb 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gamado
Once this PR has been reviewed and has the lgtm label, please ask for approval from fonta-rh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

gamado commented Feb 18, 2026

Both clusters are healthy:

Spoke cluster (spoke-tnf):

  • 2 nodes: spoke-tnf-master-0 and spoke-tnf-master-1, both Ready
  • OCP 4.22.0-0.nightly, all cluster operators Available=True, Degraded=False
  • etcd operator healthy

Hub cluster:

  • 2 nodes: master-0 and master-1, both Ready
  • Pacemaker cluster TNF running with quorum
  • All resources started: kubelet-clone, STONITH fencing (fence_redfish), etcd-clone
  • Corosync, Pacemaker, pcsd all active/enabled

The critical fixes (C1-C5) are fully validated. Ready to start on the important issues whenever you are.

- I6: Parse spoke_cluster_name from vars/assisted.yml in deploy script
  instead of hardcoding ~/spoke-tnf/auth/kubeconfig
- I8: Add cluster state tracking (deploying/deployed) to assisted-install.yml
  using common/cluster-state.yml, consistent with dev-scripts and kcli methods
- I9: Replace blanket failed_when:false with ignore_errors:true in cleanup,
  conditional error checking in enable-watch-all-namespaces, and remove
  failed_when:false from virsh start in retrieve-credentials
- I10: Add block/rescue diagnostic handlers to all 9 wait loops across
  install-operator, agent-service-config, wait-for-install, and
  retrieve-credentials, dumping relevant status on timeout
- I13: Add README.md for acm-install and assisted-spoke roles
- S16: Document DHCP range constraints in assisted.yml.template
- S17: Expose hub_network_cidr in assisted.yml.template
- S18: Add trailing newlines to all 22 new files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor Author

gamado commented Feb 18, 2026

Review Fixes Status

Two commits pushed addressing the review findings:

Commit 1: 3a61c01 — Critical Fixes (C1-C5)

| Issue | Status | Description |
| --- | --- | --- |
| C1 | Fixed | SSH key detection: multi-key loop (ed25519, rsa, ecdsa) instead of hardcoded id_rsa.pub |
| C2 | Fixed | pull_secret_path and hub_kubeconfig moved to playbook-level vars, removed from both roles |
| C3 | Fixed | Hub release image extraction moved to playbook pre_tasks, removed duplication from both roles |
| C4 | Fixed | RHCOS ISO extraction: added rc check on primary, empty string check on set_fact |
| C5 | Fixed | Explicit disk image removal after VM undefine to prevent stale disk reuse |

Commit 2: 38bfca3 — Important Fixes + Suggestions (I6, I8, I9, I10, I13, S16-S18)

| Issue | Status | Description |
| --- | --- | --- |
| I6 | Fixed | Deploy script parses spoke_cluster_name from vars/assisted.yml instead of hardcoding path |
| I7 | Skipped | Input validation for spoke variables — deferred |
| I8 | Fixed | Cluster state tracking added to assisted-install.yml (deploying/deployed phases via common/cluster-state.yml) |
| I9 | Fixed | failed_when: false replaced: ignore_errors: true in cleanup, conditional check in provisioning patch, removed from virsh start |
| I10 | Fixed | Block/rescue diagnostic handlers added to all 9 wait loops across 4 files |
| I11 | Fixed | hub_auth_dir removed (was fixed in C2 commit) |
| I12 | Skipped | Destructive rm -rf in storage.yml — deferred |
| I13 | Fixed | READMEs added for acm-install and assisted-spoke roles |
| S14 | Skipped | changed_when: true idempotency — deferred |
| S15 | Skipped | setup-ksushy.yml assertion message — deferred |
| S16 | Fixed | DHCP range constraints documented in assisted.yml.template |
| S17 | Fixed | hub_network_cidr exposed in assisted.yml.template |
| S18 | Fixed | Trailing newlines added to all 22 new files |

Summary

  • 14 of 18 review findings addressed
  • 4 deferred (I7, I12, S14, S15 — low risk, can be follow-up)
  • End-to-end deployment validated: make deploy fencing-assisted completed successfully (ok=155, changed=48, failed=0)
  • Both hub and spoke clusters verified healthy (nodes Ready, all COs available, Pacemaker fencing operational)

@gamado gamado requested a review from fonta-rh February 18, 2026 18:09
Contributor Author

gamado commented Feb 19, 2026

Environment

  • Platform: AWS c5.metal (RHEL 9.6)
  • OCP Version: 4.22.0-0.nightly-2026-02-17-112443
  • Deployment method: make deploy fencing-assisted
  • Date: 2026-02-19

Deployment Summary

Full end-to-end deployment completed successfully:

PLAY RECAP *********************************************************************
ec2-user@51.85.92.84  : ok=163  changed=50  unreachable=0  failed=0  skipped=18  rescued=0  ignored=0

Hub Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (master-0, master-1) |
| Cluster Operators | All healthy (Available=True, Degraded=False) |
| Pacemaker | TNF cluster running, partition with quorum |
| STONITH | fence_redfish started on both nodes |
| etcd | etcd-clone started on both nodes |
| kubelet | kubelet-clone started on both nodes |
| Corosync/Pacemaker/pcsd | active/enabled |

Spoke Cluster Status

| Component | Status |
| --- | --- |
| Nodes | 2/2 Ready (spoke-tnf-master-0, spoke-tnf-master-1) |
| Cluster Operators | 34/34 healthy (Available=True, Degraded=False) |
| OCP Version | 4.22.0-0.nightly-2026-02-17-112443 |

Fixes Validated

All fixes from the two review-fix commits were validated in this run:

| Fix | Validated |
| --- | --- |
| C1 - SSH key multi-key detection | Yes — key detected and used |
| C2 - Playbook-level pull_secret_path / hub_kubeconfig | Yes — no duplication errors |
| C3 - Hub release extraction in pre_tasks | Yes — version extracted once, passed to both roles |
| C4 - RHCOS ISO fallback error handling | Yes — ISO URL resolved |
| C5 - Explicit disk cleanup | Yes — clean VM creation |
| I6 - Dynamic spoke path in deploy script | Yes — output shows correct ~/spoke-tnf/auth/kubeconfig |
| I8 - Cluster state tracking | Yes — DEPLOYING/DEPLOYED phases logged for ASSISTED method |
| I9 - failed_when: false replaced | Yes — no silent failures, rescued=0 |
| I10 - Wait loop diagnostics | Yes — all waits passed cleanly, rescue blocks available but not triggered |
| I13 - Role READMEs | Yes — present in repo |
| S16 - DHCP range documented | Yes — template updated |
| S17 - hub_network_cidr in template | Yes — template updated |
| S18 - Trailing newlines | Yes — all files fixed |
