
rfc: forge test network #82

Open
alanshaw wants to merge 1 commit into main from rfc/forge-test-network

Conversation

@alanshaw
Member

@volmedo volmedo left a comment


I like this.

Something that might be worth mentioning explicitly is that the test network will use Filecoin's calibration network.

About signalling deploys to the test network, if we want it to be updated at the same time, or even before, the production network, I think we should deploy on new releases (changes to version.json) as we do today for prod. We can then schedule an automated deploy to prod 24 hours later. I found this GitHub action (https://github.com/austenstone/schedule) that could be used to implement such a mechanism. I still don't know, however, if deploys to prod scheduled that way can be stopped if we detect issues in the test network.
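A minimal sketch of what that pipeline could look like. All names here (the workflow file, `scripts/deploy.sh`) are assumptions for illustration; rather than the third-party action, this sketch uses GitHub's built-in environment protection rules, where a wait timer and required reviewers configured on the `production` environment in repo settings would provide both the delay and a way for a human to reject the prod deploy if the test network surfaces issues:

```yaml
# Hypothetical workflow: deploy to the test network when version.json
# changes on main, then gate the prod deploy behind the "production"
# environment. The 24-hour delay and the ability to cancel come from
# the environment's protection rules (wait timer, required reviewers),
# which are configured in the repository settings, not in this file.
name: deploy
on:
  push:
    branches: [main]
    paths: ['version.json']
jobs:
  deploy-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh test   # hypothetical deploy script
  deploy-prod:
    needs: deploy-test
    runs-on: ubuntu-latest
    environment: production  # wait timer + reviewers set in settings
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh prod   # hypothetical deploy script
```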

Alternatively, we could deploy to test on new releases and then do manual deploys to prod after a while, but we will likely forget to do so.

Member

@frrist frrist left a comment


I think the need here is real, and I'm glad this is being written up. A few thoughts.

On purpose: Reading this, I keep coming back to who the network is for. If the primary goal is to let prospective customers experience the product without financial commitment, a stable mirror of production, operated entirely by us, same contracts, same APIs, same flow, then I think that's clearly valuable and relatively straightforward. Three nodes, limited capacity, monthly resets, different locations. We control every variable. That makes sense to me.

I want to be precise about what "close-to-production area" means here. This isn't a validation environment. We may find bugs, but the purpose isn't to catch them. warm-staging catches bugs, today at least. This network exists so a customer can see the product work. I think it could be used as a pre-production gate, but it's not the primary signal for deploying to prod.

On scope: The RFC also positions this as a sandbox for prospective node operators. I wonder if that's asking one environment to serve two masters? Operators learning the system will misconfigure nodes, run out of disk, trip over lotus, and go offline — which is fine, that's how learning works. It is less fine if a customer is watching.
Do these audiences need to share a network, or would operator vetting be better served by its own process with its own criteria for graduating to production? Genuinely unsure, though I lean towards wanting something separate. Curious what others think.

On release cadence: The RFC recommends deploying 24 hours ahead of production. Some lead time is reasonable, deploying ahead of production gives us a window to catch integration issues that warm-staging might miss. I'm less sure 24 hours is the right number. I'd suggest we define a minimum rather than a fixed interval, and adjust based on what we learn. Deploying from an RC tag seems fine, but the RFC should clarify which components that covers. Piri, the upload service, the indexer, IPNI, these don't necessarily release in lockstep, and "an RC" means different things depending on which system we're talking about. An RC from any of them should probably trigger a test network deploy, but that's worth stating explicitly.

@alanshaw
Member Author

Something that might be worth mentioning explicitly is that the test network will use Filecoin's calibration network.

👍

About signalling deploys to the test network, if we want it to be updated at the same time, or even before, the production network, I think we should deploy on new releases (changes to version.json) as we do today for prod. We can then schedule an automated deploy to prod 24 hours later. I found this GitHub action (https://github.com/austenstone/schedule) that could be used to implement such a mechanism. I still don't know, however, if deploys to prod scheduled that way can be stopped if we detect issues in the test network.

Alternatively, we could deploy to test on new releases and then do manual deploys to prod after a while, but we will likely forget to do so.

I am less worried about this. I worry more about deploying to test, then finding a problem and forgetting to cancel the scheduled deploy. I am strongly in favour of humans triggering deploys to production.

@alanshaw
Member Author

On scope: The RFC also positions this as a sandbox for prospective node operators. I wonder if that's asking one environment to serve two masters? Operators learning the system will misconfigure nodes, run out of disk, trip over lotus, and go offline — which is fine, that's how learning works. It is less fine if a customer is watching. Do these audiences need to share a network, or would operator vetting be better served by its own process with its own criteria for graduating to production? Genuinely unsure, though I lean towards wanting something separate. Curious what others think.

So, I think you're right. My concern is the overhead of operating an additional network, both in maintenance and cost, but mostly because I feel as though the network SHOULD be resilient to node operator mess-ups in a way that does not affect customers who are uploading data. We are perhaps not there yet, so I'd be willing to concede and suggest we onboard node providers to warm-staging until we are more confident in resiliency?

On release cadence: The RFC recommends deploying 24 hours ahead of production. Some lead time is reasonable, deploying ahead of production gives us a window to catch integration issues that warm-staging might miss. I'm less sure 24 hours is the right number. I'd suggest we define a minimum rather than a fixed interval, and adjust based on what we learn.

I was trying to convey that the 24 hours be a minimum - but I also do not want to prevent pushing to prod sooner if necessary. Apologies if not clear in the wording.

Deploying from an RC tag seems fine, but the RFC should clarify which components that covers.

I was thinking an RC tag would be per component, similar to how a regular release tag is currently per component.

We should never be in a situation where 2 or more components need to be released at exactly the same time, but there may be an order that releases need to happen in. IMO we have few enough components that a human can manage releasing them in order, and automation for this would be more work than it's worth.

An RC from any of them should probably trigger a test network deploy, but that's worth stating explicitly.

👍

Member

@Peeja Peeja left a comment


This sounds great.

I'm trying to think through whether there's a consistency issue with clearing it out periodically. I think it's okay, at least so far. Once we (I expect) make re-replication automatic, we might have some issues, and it'll be good to see that reflected here. I'd love to be able to simulate a node going down or losing some data and to be able to see the network (and my data) self-heal.

But I'm just dreaming about the future. This is an awesome step forward.

@Peeja
Member

Peeja commented Feb 23, 2026

On purpose: Reading this, I keep coming back to who the network is for. If the primary goal is to let prospective customers experience the product without financial commitment, a stable mirror of production, operated entirely by us, same contracts, same APIs, same flow, then I think that's clearly valuable and relatively straightforward. Three nodes, limited capacity, monthly resets, different locations. We control every variable. That makes sense to me.

Yeah, important question: is this a demo network, for prospective customers to try it out, or is it a test network, for onboarding customers to work with a production-like network as they get up and running, especially for more sophisticated uses (i.e., not using guppy upload, but working directly with the client API to do even more interesting things)?

It can be both, but whichever it is, we should be clear about the answer.

@alanshaw
Member Author

Rename to demo/playground network

@alanshaw
Member Author

Load with test data?

Be more specific about purpose.

@alanshaw
Member Author

Deploy at the same time as prod - treat it as a prod network. Do internal testing on staging. Also add a reset to staging?

@alanshaw
Member Author

Delegations should expire at the end of the month so clients are also "reset".

@volmedo
Member

volmedo commented Feb 25, 2026

Rename to demo/playground network

"play" works as well if "playground" is too verbose 😄. I agree "demo" sounds more enterprisy though.
