Add assisted installer deployment method for spoke TNF clusters#51
gamado wants to merge 4 commits into `openshift-eng:main`
Conversation
Deploy a spoke TNF cluster via ACM/MCE assisted installer on an existing hub cluster. This adds a third deployment path alongside dev-scripts (IPI) and kcli methods.

New Ansible roles:
- `assisted/acm-install`: installs the ACM operator, MultiClusterHub, and AgentServiceConfig with an auto-detected RHCOS ISO, enables TNF support, and configures provisioning for external BMH management
- `assisted/assisted-spoke`: creates the spoke libvirt network and VMs, deploys cluster resources (ClusterDeployment, AgentClusterInstall, InfraEnv, BareMetalHost with fencing credentials), monitors installation, and extracts spoke credentials

Usage: `make deploy fencing-assisted`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi @gamado. Thanks for your PR. I'm waiting for an openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
fonta-rh
left a comment
Review: Add Assisted Installer Deployment for Spoke TNF Clusters
Solid architecture — clean separation between acm-install (hub-side) and assisted-spoke (spoke-side) with good reuse of existing roles (proxy-setup, common/update-cluster-inventory). The RHCOS ISO auto-detection with fallback and the BMH fencing credential setup are well done.
Found 5 critical issues that should be fixed before merge, 8 important issues for follow-up, and 5 minor suggestions. Details below.
Critical Issues
C1. SSH key hardcodes id_rsa.pub — will fail on ed25519-only hosts
File: roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml
The task uses cat ~/.ssh/id_rsa.pub but the project convention is ~/.ssh/id_ed25519.pub (see CLAUDE.md, kcli role's prepare.yml). Modern RHEL 9 defaults to ed25519. If the hypervisor was provisioned with ed25519 only, this task silently fails.
Suggestion: Use the same multi-key detection pattern from the kcli role or common/:
```yaml
- name: Get SSH public key
  shell: |
    for key in ~/.ssh/id_ed25519.pub ~/.ssh/id_rsa.pub ~/.ssh/id_ecdsa.pub; do
      [ -f "$key" ] && cat "$key" && exit 0
    done
    echo "ERROR: No SSH public key found" >&2 && exit 1
  register: ssh_pub_key
  changed_when: false
```

C2. pull_secret_path duplicated across both roles
Files:
- `roles/assisted/acm-install/vars/main.yml:23`
- `roles/assisted/assisted-spoke/vars/main.yml:13`
Both independently define pull_secret_path: /opt/dev-scripts/pull_secret.json. If the path changes, both must be updated independently.
Suggestion: Define once as a playbook-level variable in assisted-install.yml (or in the vars file loaded by vars_files), remove from both role vars/main.yml.
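A minimal sketch of the deduplication, assuming the playbook is `assisted-install.yml` (the `hub_kubeconfig` value shown is an assumed path, not taken from the PR):

```yaml
# assisted-install.yml — hypothetical sketch: define shared paths once at
# playbook scope so both roles inherit them, then delete the per-role copies.
- name: Deploy spoke TNF cluster via assisted installer
  hosts: hypervisor
  vars:
    pull_secret_path: /opt/dev-scripts/pull_secret.json
    hub_kubeconfig: /opt/dev-scripts/ocp/ostest/auth/kubeconfig  # assumed location
  roles:
    - assisted/acm-install
    - assisted/assisted-spoke
```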
C3. Hub release image extraction duplicated between roles
Files:
- `roles/assisted/acm-install/tasks/agent-service-config.yml:1-14` — fetches `hub_release_image` and `hub_ocp_version`
- `roles/assisted/assisted-spoke/tasks/create-cluster-resources.yml:1-20` — fetches the same data

Same `oc get clusterversion` calls, same parsing, same variable names. Runs twice during a single deployment.
Suggestion: Move to a pre_tasks block in assisted-install.yml and pass as facts, or create a shared task file under roles/assisted/common/.
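A hedged sketch of the `pre_tasks` variant — the exact `jsonpath` expression is an assumption, but `.status.desired.image`/`.status.desired.version` are the standard ClusterVersion fields:

```yaml
# Hypothetical pre_tasks for assisted-install.yml: fetch hub release data once
# and expose it to both roles as facts.
pre_tasks:
  - name: Get hub release image and OCP version
    command: >-
      oc --kubeconfig {{ hub_kubeconfig }} get clusterversion version
      -o jsonpath={.status.desired.image}{" "}{.status.desired.version}
    register: hub_cv
    changed_when: false

  - name: Share hub release facts with both roles
    set_fact:
      hub_release_image: "{{ hub_cv.stdout.split(' ')[0] }}"
      hub_ocp_version: "{{ hub_cv.stdout.split(' ')[1] }}"
```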
C4. RHCOS ISO fallback logic has edge cases that pass silently
File: roles/assisted/acm-install/tasks/agent-service-config.yml
Two issues:
- The primary extraction pipes `oc image info -o json 2>/dev/null` to Python. If `oc image info` fails, Python gets empty stdin → `JSONDecodeError` → rc=1. But `failed_when` only checks `'FAILED' in rhcos_iso_extraction.stdout`, not `rc`. The task succeeds with empty stdout.
- The `set_fact` uses `rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout)`. `default()` only triggers when the variable is undefined (i.e., fallback was skipped). If fallback runs but produces empty output, `rhcos_iso_url` becomes `""` — the `failed_when` checks for `'FAILED'` and `'NEEDS_FALLBACK'` but not empty string.
Suggestion:
```yaml
- name: Set RHCOS ISO URL fact
  set_fact:
    rhcos_iso_url: >-
      {{ (rhcos_iso_fallback.stdout | default(rhcos_iso_extraction.stdout)) | trim }}
  failed_when: >-
    rhcos_iso_url == 'FAILED' or
    rhcos_iso_url == 'NEEDS_FALLBACK' or
    rhcos_iso_url == ''
```

And add `failed_when: rhcos_iso_extraction.rc != 0 or 'FAILED' in rhcos_iso_extraction.stdout` to the primary extraction task.
C5. VM cleanup can cause stale disk reuse
File: roles/assisted/assisted-spoke/tasks/cleanup.yml
`virsh undefine --remove-all-storage` is wrapped in `2>/dev/null || true` plus `failed_when: false`. If storage removal fails (locked disk, permission denied), the VM definition is removed but the disk persists. On re-deployment, the `qemu-img create` task is guarded by `creates:` and skips disk creation — the new VM boots from the OLD disk.
Suggestion: Separate the storage cleanup from VM undefine, or explicitly delete the qcow2 files in the cleanup task:
```yaml
- name: Remove spoke VM disk images
  file:
    path: "{{ spoke_vm_image_dir }}/{{ spoke_cluster_name }}-master-{{ item }}.qcow2"
    state: absent
  loop: "{{ range(spoke_ctlplanes) | list }}"
  become: true
```

Important Issues
I6. Deploy script hardcodes spoke path
File: scripts/deploy-fencing-assisted.sh:42
Success message says KUBECONFIG=~/spoke-tnf/auth/kubeconfig but spoke_cluster_name is configurable in vars/assisted.yml. A user with spoke_cluster_name: my-cluster gets the wrong path in the output.
Suggestion: Either source the vars file to get the actual name, or note that the path depends on the configured spoke_cluster_name.
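One hedged way to do the parsing — the variable name comes from the review; the `awk` one-liner and fallback are illustrative only (a quoted or commented value would need more handling):

```shell
#!/bin/sh
# Hypothetical sketch for scripts/deploy-fencing-assisted.sh: derive the
# kubeconfig path from the configured spoke_cluster_name instead of hardcoding.
VARS_FILE="vars/assisted.yml"   # assumed location of the user vars file
SPOKE_CLUSTER_NAME=$(awk -F': *' '/^spoke_cluster_name:/ {print $2; exit}' "$VARS_FILE" 2>/dev/null)
SPOKE_CLUSTER_NAME=${SPOKE_CLUSTER_NAME:-spoke-tnf}   # fall back to the default name
echo "Spoke kubeconfig: KUBECONFIG=~/${SPOKE_CLUSTER_NAME}/auth/kubeconfig"
```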
I7. No input validation for spoke-specific variables
File: (missing — no validate.yml in assisted-spoke)
Both dev-scripts/install-dev and kcli/kcli-install have dedicated validate.yml files. The assisted roles validate hub health but not spoke inputs (CIDR format, VIP within CIDR, non-empty cluster name, valid BMC credentials format).
Suggestion: Add a validate.yml to assisted-spoke that checks at minimum:
- `spoke_api_vip` and `spoke_ingress_vip` are within `spoke_network_cidr`
- `spoke_cluster_name` is non-empty and DNS-safe
- `spoke_ctlplanes >= 2`
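A hypothetical `validate.yml` sketch mirroring the convention of the dev-scripts and kcli roles — the `ansible.utils.ipaddr` filter requires the `ansible.utils` collection, and the variable names are taken from the review text:

```yaml
# roles/assisted/assisted-spoke/tasks/validate.yml — illustrative only
- name: Validate spoke deployment inputs
  assert:
    that:
      - spoke_cluster_name is defined
      - spoke_cluster_name | length > 0
      - spoke_cluster_name is match('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')  # DNS-safe
      - spoke_ctlplanes | int >= 2
      - spoke_api_vip | ansible.utils.ipaddr(spoke_network_cidr)        # VIP inside CIDR
      - spoke_ingress_vip | ansible.utils.ipaddr(spoke_network_cidr)
    fail_msg: "Invalid spoke configuration — check vars/assisted.yml"
```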
I8. No cluster state tracking — must integrate with existing system
Existing roles update cluster-vm-state.json via common/tasks/cluster-state.yml. The assisted roles skip this entirely, which means the deployment tracking system has no awareness of spoke clusters. This breaks make info, lifecycle management, and any tooling that depends on cluster-vm-state.json to know what's deployed.
The assisted-spoke role must call common/tasks/cluster-state.yml (or equivalent) to register spoke VMs and their state. This is a project convention — all deployment methods must support it.
I9. failed_when: false too broadly applied
Multiple locations silence legitimate errors:
- `cleanup.yml:3-7` — namespace delete timeout silenced entirely
- `enable-watch-all-namespaces.yml:17` — Provisioning CR patch hides API errors
- `retrieve-credentials.yml:63-78` — VM start failure hidden, triggers a 2-minute blind `pause`
Suggestion: At minimum, log the error output. For the VM start case, if virsh start fails, don't wait 120 seconds.
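For the namespace-delete case, a minimal "tolerate but log" sketch (task names and the 120s timeout are illustrative):

```yaml
# Hypothetical sketch: tolerate a namespace-delete timeout during cleanup,
# but surface the error output instead of silencing it entirely.
- name: Delete spoke namespace
  command: oc delete namespace {{ spoke_cluster_name }} --timeout=120s
  register: ns_delete
  failed_when: false

- name: Report namespace deletion problems
  debug:
    msg: "Namespace delete did not complete cleanly: {{ ns_delete.stderr }}"
  when: ns_delete.rc != 0
```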
I10. Wait loops provide no diagnostics on timeout
When wait for assisted-service pod or wait for agents to register times out, the user sees only the retry count. Consider adding a rescue block that dumps pod status, events, or logs.
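A sketch of one such wait loop with a `rescue` handler — the `app=assisted-service` label selector and namespace are assumptions, not taken from the PR:

```yaml
# Hypothetical block/rescue pattern for a wait loop: on timeout, dump pod
# status and recent events before failing, instead of just the retry count.
- name: Wait for assisted-service with diagnostics on timeout
  block:
    - name: Wait for assisted-service pod to be Ready
      command: >-
        oc -n multicluster-engine wait pod -l app=assisted-service
        --for=condition=Ready --timeout=30s
      register: wait_pod
      retries: 20
      delay: 15
      until: wait_pod.rc == 0
  rescue:
    - name: Collect pod status and recent events
      shell: |
        oc -n multicluster-engine get pods -l app=assisted-service -o wide
        oc -n multicluster-engine get events --sort-by=.lastTimestamp | tail -20
      register: diag
      failed_when: false

    - name: Fail with diagnostics
      fail:
        msg: "assisted-service never became Ready:\n{{ diag.stdout }}"
```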
I11. hub_auth_dir defined but never used
File: roles/assisted/assisted-spoke/vars/main.yml:21
Dead variable.
I12. storage.yml destructive rm -rf on re-run
File: roles/assisted/acm-install/tasks/storage.yml
rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
Runs every time the role executes (not just on cleanup), wiping existing assisted-service data. This belongs in cleanup.yml, not storage.yml. Gate it with a when: force_cleanup or move to cleanup.
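The gating suggested above could look like this (the `force_cleanup` flag name follows the review's wording; treat it as illustrative):

```yaml
# Hypothetical sketch: make the wipe opt-in so a routine re-run of the role
# never destroys existing assisted-service images and database data.
- name: Wipe existing assisted-service data (opt-in only)
  shell: rm -rf {{ assisted_images_path }}/* {{ assisted_db_path }}/*
  when: force_cleanup | default(false) | bool
```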
I13. No README for the new roles
File: (missing — no README.md in either assisted/acm-install/ or assisted/assisted-spoke/)
Every existing role in this project has a comprehensive README (dev-scripts/install-dev/README.md, kcli/kcli-install/README.md, etc.). READMEs are mandatory for new roles — they document prerequisites, variables, usage examples, and known limitations. The assisted.yml.template comments are not a substitute.
Suggestions
S14. changed_when: true on network/VM creation masks idempotency. Use actual command output to determine change status.
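One hedged way to derive change status from actual command output (the network-creation flow and file path are illustrative):

```yaml
# Hypothetical sketch: report "changed" only when virsh actually created the
# network, so repeated runs stay idempotent in Ansible's change reporting.
- name: Create spoke libvirt network
  shell: |
    if virsh net-info {{ spoke_network_name }} >/dev/null 2>&1; then
      echo EXISTS
    else
      virsh net-define /tmp/{{ spoke_network_name }}.xml
      virsh net-start {{ spoke_network_name }}
      virsh net-autostart {{ spoke_network_name }}
      echo CREATED
    fi
  register: net_create
  changed_when: "'CREATED' in net_create.stdout"
```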
S15. setup-ksushy.yml assertion message assumes 2 hub nodes. Hubs can have 3+ nodes (e.g., 3-node HA).
S16. Template should document DHCP range constraints — VIPs must be outside .50-.150 range.
S17. Missing hub_network_cidr in template — if hub network differs from 192.168.111.0/24, nftables rules silently fail.
S18. Nearly all new files lack a trailing newline. Minor but should be fixed for consistency.
Strengths
- Architecture is sound — clean separation between `acm-install` and `assisted-spoke` with the playbook orchestrating both
- Good reuse of existing roles — `proxy-setup` and `common/update-cluster-inventory` in `post_tasks`
- RHCOS ISO auto-detection with fallback is genuinely useful
- Operational comments are excellent (e.g., "Requires both a ConfigMap AND an annotation on AgentServiceConfig")
- `assisted.yml.template` is well-structured as user-facing documentation
- BMH creation with fencing credentials correctly uses `bmac.agent-install.openshift.io/fencing-credentials-secret-name`
- Makefile integration correctly chains `fencing-ipi` before `fencing-assisted`
- C1: SSH key detection now tries ed25519 first, falls back to rsa/ecdsa
- C2: Deduplicate pull_secret_path and hub_kubeconfig to playbook level
- C3: Move hub release image extraction to playbook pre_tasks (run once)
- C4: RHCOS ISO extraction checks rc and catches empty string failures
- C5: Explicit disk cleanup prevents stale qcow2 reuse on re-deploy
- Remove unused hub_auth_dir variable (I11)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: gamado. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Both clusters are healthy: Spoke cluster (spoke-tnf):
Hub cluster:
The critical fixes (C1-C5) are fully validated. Ready to start on the important issues whenever you are.
- I6: Parse spoke_cluster_name from vars/assisted.yml in deploy script instead of hardcoding ~/spoke-tnf/auth/kubeconfig
- I8: Add cluster state tracking (deploying/deployed) to assisted-install.yml using common/cluster-state.yml, consistent with dev-scripts and kcli methods
- I9: Replace blanket failed_when: false with ignore_errors: true in cleanup, conditional error checking in enable-watch-all-namespaces, and remove failed_when: false from virsh start in retrieve-credentials
- I10: Add block/rescue diagnostic handlers to all 9 wait loops across install-operator, agent-service-config, wait-for-install, and retrieve-credentials, dumping relevant status on timeout
- I13: Add README.md for acm-install and assisted-spoke roles
- S16: Document DHCP range constraints in assisted.yml.template
- S17: Expose hub_network_cidr in assisted.yml.template
- S18: Add trailing newlines to all 22 new files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review Fixes Status

Two commits pushed addressing the review findings:

Commit 1:
| Issue | Status | Description |
|---|---|---|
| C1 | Fixed | SSH key detection: multi-key loop (ed25519, rsa, ecdsa) instead of hardcoded id_rsa.pub |
| C2 | Fixed | pull_secret_path and hub_kubeconfig moved to playbook-level vars, removed from both roles |
| C3 | Fixed | Hub release image extraction moved to playbook pre_tasks, removed duplication from both roles |
| C4 | Fixed | RHCOS ISO extraction: added rc check on primary, empty string check on set_fact |
| C5 | Fixed | Explicit disk image removal after VM undefine to prevent stale disk reuse |
Commit 2: 38bfca3 — Important Fixes + Suggestions (I6, I8, I9, I10, I13, S16-S18)
| Issue | Status | Description |
|---|---|---|
| I6 | Fixed | Deploy script parses spoke_cluster_name from vars/assisted.yml instead of hardcoding path |
| I7 | Skipped | Input validation for spoke variables — deferred |
| I8 | Fixed | Cluster state tracking added to assisted-install.yml (deploying/deployed phases via common/cluster-state.yml) |
| I9 | Fixed | failed_when: false replaced: ignore_errors: true in cleanup, conditional check in provisioning patch, removed from virsh start |
| I10 | Fixed | Block/rescue diagnostic handlers added to all 9 wait loops across 4 files |
| I11 | Fixed | hub_auth_dir removed (was fixed in C2 commit) |
| I12 | Skipped | Destructive rm -rf in storage.yml — deferred |
| I13 | Fixed | READMEs added for acm-install and assisted-spoke roles |
| S14 | Skipped | changed_when: true idempotency — deferred |
| S15 | Skipped | setup-ksushy.yml assertion message — deferred |
| S16 | Fixed | DHCP range constraints documented in assisted.yml.template |
| S17 | Fixed | hub_network_cidr exposed in assisted.yml.template |
| S18 | Fixed | Trailing newlines added to all 22 new files |
Summary
- 14 of 18 review findings addressed
- 4 deferred (I7, I12, S14, S15 — low risk, can be follow-up)
- End-to-end deployment validated: `make deploy fencing-assisted` completed successfully (ok=155, changed=48, failed=0)
- Both hub and spoke clusters verified healthy (nodes Ready, all COs available, Pacemaker fencing operational)
Environment
Deployment Summary

Full end-to-end deployment completed successfully:

Hub Cluster Status
Spoke Cluster Status
Fixes Validated

All fixes from the two review-fix commits were validated in this run:
Verified by:
Spoke Cluster Verification
Nodes:
NAME STATUS ROLES AGE VERSION
spoke-tnf-master-0 Ready control-plane,master,worker 46m v1.34.2
spoke-tnf-master-1 Ready control-plane,master,worker 27m v1.34.2
Cluster Operators: 34/34 Available, 0 Progressing, 0 Degraded
Full List of Resources:
* Clone Set: kubelet-clone [kubelet]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]
* spoke-tnf-master-0_redfish (stonith:fence_redfish): Started
* spoke-tnf-master-1_redfish (stonith:fence_redfish): Started
* Clone Set: etcd-clone [etcd]:
* Started: [ spoke-tnf-master-0 spoke-tnf-master-1 ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Result: PASS