
Add self-hosted operations guide#17872

Open
wlami wants to merge 5 commits into master from
wladi/self-hosted-operations

@wlami
Member

@wlami wlami commented Mar 9, 2026

Summary

  • Adds a new Operations section under self-hosting docs with 10 pages covering architecture, database HA, compute sizing, object storage, networking, monitoring, backup/recovery, upgrades, and security hardening
  • Adds cross-references from existing components/api.md and network.md pages to the new operations guide
  • Updates self-hosting landing page with operations quick-link cards

Test plan

  • make lint passes (0 errors)
  • make build succeeds
  • Verify operations landing page renders correctly (no checkboxes, clean bullet lists)
  • Verify all 10 sub-pages appear in left nav under Operations
  • Verify cross-links from components/api.md and network.md resolve correctly
  • Verify compute-sizing has a single consolidated table
  • Verify object-storage references components/api.md for env var details instead of duplicating them

🤖 Generated with Claude Code

…tices

Add a new operations/ section under self-hosting docs covering architecture,
database HA, compute sizing, object storage, networking, monitoring, backup
and recovery, upgrades, and security hardening. Add cross-references from
existing components/api.md and network.md pages. Update self-hosting landing
page with operations cards and bump changelog menu weight.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Mar 9, 2026

Docs review

This is a well-structured, comprehensive set of operations documentation for self-hosted Pulumi Cloud. The content is organized logically, cross-references are consistent, and frontmatter is complete across all 10 new pages. A few items to address:


Issues

1. En dash in numeric ranges (style guide: use – not - for ranges)

content/docs/administration/self-hosting/operations/backup-recovery.md, line 285:

| Full region failure | Restore DB from cross-region backup, fail over storage | 1–4 hours |

content/docs/administration/self-hosting/operations/networking.md, line 555:

- **Target**: 50–60% average CPU utilization

2. Vague qualifier — Per the style guide, avoid vague qualifiers like "reasonable."

content/docs/administration/self-hosting/operations/compute-sizing.md, line 329:

For production, 2 vCPU / 1 GB RAM per API instance is a good starting point. Scale horizontally (more instances) rather than vertically for the API service, since it is stateless and benefits from running behind a load balancer across multiple AZs.

3. Config key name mismatch with described service

content/docs/administration/self-hosting/operations/security-hardening.md, lines 698–701:

The section heading says "CAPTCHA and bot protection" and references Cloudflare Turnstile, but the config keys are named recaptchaSiteKey / recaptchaSecretKey. Readers will be confused about whether these keys accept reCAPTCHA or Turnstile values. Consider clarifying:

## CAPTCHA and bot protection

Configure Cloudflare Turnstile for signup protection. Despite the `recaptcha` naming, these config keys accept Cloudflare Turnstile credentials:

- Set `recaptchaSiteKey` (Turnstile site key)
- Set `recaptchaSecretKey` (Turnstile secret key)

4. Magic number without explanation

content/docs/administration/self-hosting/operations/networking.md, line 565:

terminationGracePeriodSeconds: 130 is presented without explanation. Consider noting why it is set above the 120-second stop timeout:

- **Kubernetes**: Set `terminationGracePeriodSeconds: 130` on the API pod spec (slightly above the 120-second stop timeout to allow clean shutdown before Kubernetes force-kills the pod).
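
The relationship between the two timeouts can be sketched as a pod spec fragment. This is a minimal illustration, not the installer's actual manifest: the Deployment name, labels, container name, replica count, and image tag are hypothetical placeholders. Only `terminationGracePeriodSeconds: 130` and the 120-second stop timeout come from the review above.

```yaml
# Hypothetical Deployment fragment; names, labels, and the image tag are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pulumi-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pulumi-api
  template:
    metadata:
      labels:
        app: pulumi-api
    spec:
      # 130 s = the 120 s stop timeout plus a 10 s margin, so the API process
      # can finish shutting down before Kubernetes sends SIGKILL.
      terminationGracePeriodSeconds: 130
      containers:
        - name: api
          image: pulumi/service:latest  # tag is an assumption; pin a real version
```

Kubernetes sends SIGTERM when a pod is deleted and force-kills any remaining containers once the grace period expires, so a grace period shorter than the application's own stop timeout would cut the shutdown short.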

Minor observations

  • The landing page (_index.md) cards surface 4 of the 10 sub-pages. This appears intentional and works well as a curated entry point.
  • External link to https://hub.docker.com/r/pulumi/service/tags in upgrades.md looks correct.
  • All ordered list items correctly use 1.
  • All H2+ headings are sentence case ✓
  • All new pages have complete frontmatter including meta_desc, title_tag, and meta_image

Mention @claude if you'd like additional review after addressing these items.

@pulumi-bot
Collaborator

- Use en dashes for numeric ranges (1–4 hours, 50–60%)
- Replace vague qualifier "reasonable" with "good"
- Clarify that recaptcha config keys accept Cloudflare Turnstile credentials
- Explain why terminationGracePeriodSeconds is set to 130

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pulumi-bot
Collaborator

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@CamSoper CamSoper left a comment

LGTM! Great addition of the self-hosted operations guide. I pushed a small fix for Title Case on the h1/title frontmatter fields per our style guide. Everything else looks clean — well-structured content, correct nav hierarchy, and good cross-references.

@pulumi-bot
Collaborator

| :-- | :-- |
| Object storage | Blob storage for checkpoint (state) files and policy packs. Supported: S3 and compatible implementations, Azure Blob Storage, Google Cloud Storage |
| Search (optional) | OpenSearch 2.x or Elasticsearch 7.x for resource search and AI features |
| Cache (optional) | Redis 6.2 or later for session caching and performance |
Member

I'm not aware of any customer who uses Redis or any sort of caching layer - I don't believe that even BMW who were our most active self-hosted user had this so you can probably remove references to this

Member Author

Will remove

| Component | Description |
| :-- | :-- |
| Object storage | Blob storage for checkpoint (state) files and policy packs. Supported: S3 and compatible implementations, Azure Blob Storage, Google Cloud Storage |
| Search (optional) | OpenSearch 2.x or Elasticsearch 7.x for resource search and AI features |
Member

OpenSearch/ElasticSearch is only optional if you don't want any sort of resource search, so I wouldn't have this as optional even if it technically is


### Connection pooling

For AWS deployments with many concurrent users, consider placing [Amazon RDS Proxy](https://aws.amazon.com/rds/proxy/) in front of your Aurora or RDS instance. RDS Proxy pools and shares database connections, reducing connection overhead and improving failover times.
Member

Not sure anyone will actually need this.

Member Author

Will remove

@@ -0,0 +1,69 @@
---
title_tag: "Monitoring and Alerting | Self-Hosting Pulumi"
Member

Prometheus metrics for the Go application are available so you should probably mention these: /docs/administration/self-hosting/components/api/#opentelemetry

- Password reset emails
- Organization notifications

SMTP is optional for initial testing but required for production use. See the [API component reference](/docs/administration/self-hosting/components/api/) for SMTP environment variables.
Member

It's not required for production use. Since most people will use SAML, forgotten password flow isn't used. I think the only thing that will use SMTP is email notifications.

- API service and console deployed with 2+ replicas
- Database migrations run successfully
- DNS records configured for both API and console domains
- TLS termination configured on load balancer
Member

Suggested change
- TLS termination configured on load balancer

This may not happen, depending on the setup

| Installer | Default instance type | Notes |
| :-- | :-- | :-- |
| ECS | db.t3.medium (4 GB RAM) | Suitable for small-to-medium workloads |
| EKS | db.r5.large (16 GB RAM) | Memory-optimized, better for production |
Member

I suspect that these two (ECS/EKS) can be the same and are just different because different people updated the installers. I don't see that there is a difference in size because of the compute platform

| :-- | :-- | :-- |
| ECS | db.t3.medium (4 GB RAM) | Suitable for small-to-medium workloads |
| EKS | db.r5.large (16 GB RAM) | Memory-optimized, better for production |
| GKE | db-g1-small (1.7 GB RAM) | Minimal; upgrade for production use |
Member

Can you add in Azure as well?

Member Author

added

Member

There is a list of port requirements here: /docs/administration/self-hosting/network/ (although it is slightly out of date and doesn't include the elasticsearch/opensearch ports)

Member Author

added elasticsearch http port


Pulumi Cloud uses object storage for checkpoint (state) files, policy packs, and other data. This page covers the storage architecture and best practices for production deployments.

## Storage architecture
Member

You can also specify a file path if you want to use a filesystem for storage: /docs/administration/self-hosting/components/api/#local-storage

Member Author

Added this hint.

Co-authored-by: Piers Karsenbarg <piers@pulumi.com>
@pulumi-bot
Collaborator

Member Author

@wlami wlami left a comment

Thanks for the valuable feedback, @pierskarsenbarg!

- Remove Redis/cache from architecture (unused by customers)
- Mark OpenSearch as required, not optional
- Normalize DB instance sizing by cloud provider, add Azure
- Remove RDS Proxy section
- Add Prometheus/OpenTelemetry reference to monitoring
- Correct SMTP guidance: optional with SAML SSO
- Reframe installer update guidance as reference architecture
- Rename to "Pulumi Cloud license key" to avoid OSS confusion
- Remove TLS cert and DB migrations checklist items
- Add local filesystem storage option to object-storage
- Add OpenSearch port 9200 to network requirements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pulumi-bot
Collaborator
