rfc: failure cases #73
# RFC: What to do in the failure cases?

Ideas are needed here.
The Storacha Network is not set up well to deal with failure cases. We have implemented the happy path, but there are various things that can go wrong that we currently have no story for. This RFC attempts to enumerate a bunch of them that currently live in my head, with the hope that we can gain consensus on what _should_ happen and allow further RFC(s) to be opened and/or specs created/amended to deal with them.

* Replication failure
  * What happens when a replication does not succeed? Currently nothing.
|
> **Comment on lines +7 to +8 (Member):** Ideally, I'd like to handle this with the same mechanism as when a node leaves: the number of actual replicas is not equal to the requested number, so we attempt to replicate on nodes until it is.
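The idempotent "replicate until actual equals requested" approach described above could be sketched roughly as follows. All names here are hypothetical, invented for illustration; nothing below is an existing Storacha API:

```go
package main

import "fmt"

// replicationDeficit reports how many new replicas must be scheduled so the
// actual replica count meets the requested target (zero if already met).
func replicationDeficit(target, actual int) int {
	if actual >= target {
		return 0
	}
	return target - actual
}

// pickRepairNodes chooses up to n healthy nodes that do not already hold the
// blob, as candidates for the repair replications.
func pickRepairNodes(healthy []string, holders map[string]bool, n int) []string {
	var picked []string
	for _, node := range healthy {
		if len(picked) == n {
			break
		}
		if !holders[node] {
			picked = append(picked, node)
		}
	}
	return picked
}

func main() {
	holders := map[string]bool{"node-a": true}
	need := replicationDeficit(3, len(holders))
	// 2 more replicas needed; node-a is excluded as an existing holder.
	fmt.Println(need, pickRepairNodes([]string{"node-a", "node-b", "node-c", "node-d"}, holders, need))
}
```

Because the repair step only computes a deficit against the target, running it after a failed replication, a data-loss event, or a node departure all reduce to the same operation.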
* Data deletion
|
> **Comment (Member):** What if we handle it in two phases:
> Regarding the PDP root aggregates, would it be possible to make the root aggregate expire every X? With that, we can check whether a blob is no longer referenced by any currently provable root and do the byte clean-up. The deletion won't happen immediately, but within a time window we define. For Filecoin, is it required to store on Filecoin by default? Like, can we make Filecoin opt-in, not default durability?
  * Long-standing problem. We have a better story for this on storage nodes but no implementation. An interesting/difficult problem is how to deal with deletion in the context of our PDP root aggregates as well as our Filecoin deal aggregations. There is also data stored on the IPNI chain on the indexer that needs to be cleaned up. Or consider implementing [RFC #52](https://github.com/storacha/RFC/pull/52).
|
> **Comment on lines +9 to +10 (Member):** Hmm. This is interesting. Leaving Filecoin aside, is this a signal that the strategy of aggregating blobs for PDP is flawed? The cost of not aggregating is the increased gas/storage ratio for many small blobs. I'm spitballing, but could there be a way to carry a similar cost ratio all the way to the Storacha customer, to discourage that behavior, or to compensate for it? Or would that just be too much complexity for too little value?
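One very rough way to carry that cost ratio to the customer, as the comment spitballs, is a fixed per-blob fee (modelling per-root on-chain overhead) plus a per-byte rate, so many small blobs cost more than one large blob of the same total size. The fee values and function below are entirely made up for illustration:

```go
package main

import "fmt"

// price charges a hypothetical fixed per-blob overhead plus a per-byte rate.
// The per-blob fee stands in for the amortized gas cost of anchoring a root.
func price(sizeBytes int64, perBlobFee, perByteFee float64) float64 {
	return perBlobFee + float64(sizeBytes)*perByteFee
}

func main() {
	const perBlob, perByte = 0.01, 0.000000001
	oneBig := price(1_000_000_000, perBlob, perByte) // one 1 GB blob
	var thousandSmall float64
	for i := 0; i < 1000; i++ { // a thousand 1 MB blobs, same total bytes
		thousandSmall += price(1_000_000, perBlob, perByte)
	}
	fmt.Printf("1x1GB: %.2f  1000x1MB: %.2f\n", oneBig, thousandSmall)
}
```

The per-blob term is what discourages unaggregated tiny uploads; whether that pricing complexity is worth it is exactly the open question above.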
* Data loss
|
> **Comment (Member):** I guess we will need to continuously check data possession and content retrieval, then if any failures are found, we need to report them as data/replica loss, so we can kick off some sort of repair process to restore the required replication level. Do we have a global map of the replicas?
  * If a node loses some data, what happens? How do we know? We should probably be less inclined to use them for storing data. How do we ensure minimum replicas are maintained? This is linked to proving failure.
|
> **Comment on lines +11 to +12 (Member):** Again, I'd like to handle this idempotently, like when a replication fails or a node leaves: if data is lost, that replica no longer exists, so the number of replicas is too low, so we replicate. (That doesn't address how to weight that node lower, though.)
* Proving failure
  * If a node falls below the successful proving threshold, what happens? We probably should NOT attempt to store _more_ data on that node.
|
> **Comment on lines +13 to +14 (Member):** I would expect some sort of softer version of the data loss case. The word I want to use here is "slashing", but I think that usually implies something harsher than what we probably want here, at least at first. If bad behavior continues, it should probably escalate into something that feels more like "slashing", though.
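The "softer slashing with escalation" idea above could be sketched as a selection weight that decays with consecutive failed proving periods and drops to zero past a cutoff. The halving factor and cutoff are arbitrary placeholders, not a proposed policy:

```go
package main

import "fmt"

// provingWeight halves a node's selection weight for each consecutive failed
// proving period; past the cutoff the weight becomes zero, i.e. the node is
// effectively excluded from receiving any new data.
func provingWeight(base float64, consecutiveFailures, cutoff int) float64 {
	if consecutiveFailures >= cutoff {
		return 0 // escalated: the harsher, "slashing"-like outcome
	}
	w := base
	for i := 0; i < consecutiveFailures; i++ {
		w *= 0.5
	}
	return w
}

func main() {
	for f := 0; f <= 4; f++ {
		fmt.Printf("failures=%d weight=%.3f\n", f, provingWeight(1.0, f, 4))
	}
}
```

A successful proving period would presumably reset (or partially restore) the counter, keeping the penalty "soft" for transient issues.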
* Allocation failure
  * If we cannot allocate on a node, we should try a _different_ node. We should consider that an allocation failure could be temporary, caused by a full disk or adverse network conditions. Aside: we should consider increasing the probability of allocating on nodes that are close to the uploading client, to mitigate firewalls (inability to reach the node) and allow for faster upload times.
|
> **Comment on lines +15 to +16 (Member):** ➕
* Accept failure
  * After uploading a blob, the upload service should at least retry a blob/accept invocation to a storage node. If the node responds with an error receipt, what is the implication? Has the node failed to store the data? How should it affect our desire to store _more_ data on the node?
|
> **Comment on lines +17 to +18 (Member):** If the node "successfully" returns an error receipt, we can be confident (I think) that it knows that it failed, so we can call that a failure to store. We should probably choose a different node and try again. It should probably also play into the "reputation"/weighting system, yeah. But maybe with a lower impact than other things, since it's not terrible to run into this and have to try another node, unless it happens a lot, at which point the effect would add up. If the node fails to return a receipt, I'm not exactly sure. Can we follow up by asking for the receipt, and ideally get a success receipt, a fail receipt, or an affirmative 404 that the receipt doesn't exist and (presumably) the storage didn't happen?
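The three-way outcome of the follow-up receipt query described above could be mapped to actions like this. Status and action names are invented for illustration; only the decision structure is taken from the comment:

```go
package main

import "fmt"

// receiptStatus is a hypothetical result of querying a node for a receipt.
type receiptStatus int

const (
	receiptOK      receiptStatus = iota // success receipt: blob was stored
	receiptError                        // error receipt: node knows it failed
	receiptMissing                      // affirmative 404: storage presumably never happened
)

// nextAction maps the receipt query result to what the upload service should
// do next, and whether the event should count against the node's weighting.
func nextAction(s receiptStatus) (action string, penalize bool) {
	switch s {
	case receiptOK:
		return "done", false
	case receiptError:
		// Confident failure to store: retry elsewhere, small reputation hit.
		return "retry-on-different-node", true
	default:
		// No receipt at all: treat as failed and reallocate.
		return "reallocate-and-retry", true
	}
}

func main() {
	for _, s := range []receiptStatus{receiptOK, receiptError, receiptMissing} {
		a, p := nextAction(s)
		fmt.Println(a, p)
	}
}
```

As the comment suggests, the per-event penalty would be small, mattering only when a node accumulates many of them.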
* Retrieval failure
  * How do we observe and verify failures to retrieve? How do we prevent storage nodes from accepting retrieval invocations, claiming the egress and not sending the data?
|
> **Comment on lines +19 to +20 (Member):** In AWS or Cloudflare, you pretty much have to trust the host to tell the truth about what it sent your clients. The best assurance you have is logging: if the logs don't match the bill, there's an obvious problem, and if the logs don't match what the clients receive, you have a less obvious/certain problem, but still something that hopefully is evident enough to be caught. Can we make similar logging available enough to customers in real-ish time so that they can keep the nodes honest by reporting issues to us? We'd still have to trust the nodes a fair bit, but maybe issues become manual investigations with a penalty of getting kicked off the network for any fraud?
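The logging cross-check in the comment above amounts to reconciling node-claimed egress against client-reported bytes and flagging large discrepancies for manual investigation. A sketch, with invented names and a made-up tolerance:

```go
package main

import (
	"fmt"
	"sort"
)

// flagEgressDiscrepancies compares the egress bytes a node claims per blob
// against what clients report receiving, and flags any blob whose shortfall
// exceeds the tolerance (as a fraction of the claim) for investigation.
func flagEgressDiscrepancies(claimed, clientReported map[string]int64, tolerance float64) []string {
	var flagged []string
	for blob, c := range claimed {
		if c == 0 {
			continue
		}
		r := clientReported[blob] // a blob with no client reports counts as 0 received
		if float64(c-r)/float64(c) > tolerance {
			flagged = append(flagged, blob)
		}
	}
	sort.Strings(flagged) // deterministic output for reporting
	return flagged
}

func main() {
	claimed := map[string]int64{"bafy-a": 1000, "bafy-b": 1000}
	reported := map[string]int64{"bafy-a": 990, "bafy-b": 100} // bafy-b looks fraudulent
	fmt.Println(flagEgressDiscrepancies(claimed, reported, 0.05))
}
```

This still trusts clients to report honestly, but as the comment notes, the goal is surfacing candidates for manual investigation, not automated proof.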
* Node leaving the network
  * How do we detect this and clean up? How do we repair all the data it was storing? We should stop considering that node as a candidate for uploads.
|
> **Comment on lines +21 to +22 (Member):** Do we currently store the number of intended replicas of a blob as state somewhere? I think ideally we automatically attempt to re-establish the requested number of replicas any time the actual number drops, which I think is exactly when a node leaves (voluntarily or involuntarily). I'm curious what we want to do in the case where the node held the only copy. Do we simply lose the data, and call that the risk of a single replica? That seems pretty reasonable to me, as replicated data should be considered the norm, but we should probably make sure we're comfortable being explicit about that.
A lot of these failure cases can boil down to us maintaining node weights, and how each one of these situations affects a weighting. That said, it would be good to find some time to consider whether weights are the right solution or if there is a better alternative, like a broader reputation system, that is maybe provable and observable. There are some much harder problems here, though, and it would be good to find some time to consider what we could put in place to either solve them or align incentives in a way that makes bad behavior undesirable.
> **Comment (Member):** I think it's enough to sign an audit log of our reasoning for altering weights. I don't think we need to "prove" our justifications with unimpeachable evidence, such that (e.g.) a smart contract could come to the same conclusions. I think we make assertions based on our own policies and judgements, make those decisions transparent, and then stand behind them. Ultimately, we're a central-but-replaceable service holding this routing table. If people don't like our management of it, and if we do our job right, they could run their own on top of the same network. This portion is akin to Bluesky: someone else can build their own Bluesky clone on atproto if they don't like how the company is running the app.
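A signed audit log of weight changes could be as simple as a tamper-evident record per adjustment. The sketch below uses an HMAC purely as a stand-in for whatever real signature scheme would be used; the entry shape and key are invented:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// auditEntry records the reasoning behind a weight change, so the decision
// is transparent without needing on-chain "proof" of the justification.
type auditEntry struct {
	Node   string
	Delta  float64
	Reason string
}

// signEntry produces a tamper-evident tag over the entry's fields.
func signEntry(key []byte, e auditEntry) string {
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s|%f|%s", e.Node, e.Delta, e.Reason)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyEntry checks that an entry has not been altered since signing.
func verifyEntry(key []byte, e auditEntry, sig string) bool {
	return hmac.Equal([]byte(signEntry(key, e)), []byte(sig))
}

func main() {
	key := []byte("audit-log-signing-key") // placeholder key material
	e := auditEntry{Node: "node-a", Delta: -0.25, Reason: "3 consecutive proving failures"}
	sig := signEntry(key, e)
	fmt.Println(verifyEntry(key, e, sig)) // true: entry intact
	e.Reason = "tampered"
	fmt.Println(verifyEntry(key, e, sig)) // false: any edit invalidates the tag
}
```

Publishing such entries lets anyone audit the operator's reasoning after the fact, which matches the "assert, be transparent, stand behind it" stance above.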
> **Comment (Member):** Considering that replication is our durability contract with the user, I believe we should keep retrying until the minimum replication factor is reached; otherwise, we need to mark that blob as "at_risk" or something so we don't silently fail to replicate and it becomes data loss in the future when nodes are not available. Perhaps we could have some sort of control loop that periodically checks whether a blob's replica count is below the target threshold and, if so, creates new replicas on healthy nodes.
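One tick of the control loop proposed in that last comment could look like this: scan a (hypothetical) replica map, return the blobs that need repair, and surface blobs with zero live replicas as "at_risk" rather than failing silently. All names are illustrative:

```go
package main

import "fmt"

// blobReplicas is a hypothetical per-blob record in a global replica map.
type blobReplicas struct {
	target int // requested replication factor (the durability contract)
	live   int // replicas currently confirmed on healthy nodes
}

// repairTick classifies every blob: below-target blobs are queued for repair
// replication; blobs with no live replica at all are marked at-risk so the
// failure is visible instead of silent.
func repairTick(blobs map[string]blobReplicas) (needRepair, atRisk []string) {
	for cid, b := range blobs {
		if b.live == 0 {
			atRisk = append(atRisk, cid)
		} else if b.live < b.target {
			needRepair = append(needRepair, cid)
		}
	}
	return needRepair, atRisk
}

func main() {
	blobs := map[string]blobReplicas{
		"bafy-ok":     {target: 3, live: 3},
		"bafy-under":  {target: 3, live: 1},
		"bafy-atrisk": {target: 3, live: 0},
	}
	needRepair, atRisk := repairTick(blobs)
	fmt.Println(needRepair, atRisk)
}
```

Run periodically, this one loop also subsumes the replication-failure, data-loss, and node-leaving cases above: each simply changes the live count, and the next tick reacts.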