Conversation
Some data retrieval patterns to consider that will absolutely come up:
Broadly, I think range coalescing is the right move here, and I'm glad you landed here as well. I'd lean away from multipart stuff, even if that adds a minor egress overage for CID bytes. Also, Harvard data is onboarded as is, so we need to deal with CARs. But we should think now about what we can make easier by improving our data preparation strategies:
hannahhoward
left a comment
net net I'm lgtm on starting with a range coalesce strategy.
Another thing to consider if we switch to Filepack -- should we be strategic about how we pack files? IOW, pretty easy to use shard-optimistic strategies if... the shard is the file. :P
Yeah, I think Filepack is the best choice here, and I don't think there's any reason not to, right? We'll need to support CARs when we already have them, but if we're packing the shards, I think it may as well be the default. I also think either the virtual block RFC or just packing metadata into separate shards (or some combination) could work really well, as could, as you say, making the shard the file (when possible; otherwise several concatenated small files, or a contiguous chunk of a big one).
My mind also keeps coming back to some kind of "hinting" early in the process, such as in the index. I can't tell whether that ultimately amounts to the same thing as virtual blocks of UnixFS metadata. I think the foothold on this thing has something to do with the fact that we almost exclusively retrieve entire subdags. That is, once you've got a root block, you want everything under it. Could the index give you that in one go, somehow? Is that just equivalent to virtual metadata blocks?
> * **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
>   * Only one request is made per involved shard, but we may fetch (and egress) more data than required.
>   * We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Maybe worth mentioning that if you don't know you're going to use all the data in the shard, this is not friendly to the user who stored the data, i.e. you have made them pay for egress of data you're not using.
> * **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
>   * Only one request is made per involved shard, but we may fetch (and egress) more data than required.
>   * We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Due to the way DAGs are constructed, the root block is the last block in the file, so by requesting the entire shard, the time it takes to start streaming the data is a lot longer than with a naive or range-coalescing strategy. For a CLI tool that maybe doesn't matter all that much, but in a gateway situation, e.g. retrieving a video, it affects your TTFB.
> * **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them), so we count blocks that are "close enough" as adjacent.
>   * If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
>   * Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
This is exactly why Filepack exists :)
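For what it's worth, the coalescing step itself is small. A hedged Go sketch of merging "close enough" ranges (the names and the `maxGap` threshold are illustrative, not Guppy's actual API):

```go
package main

import (
	"fmt"
	"sort"
)

// byteRange is a half-open [start, end) span within a shard.
type byteRange struct{ start, end int64 }

// coalesce merges ranges whose gap is at most maxGap bytes, so that
// the CID+length framing between adjacent CAR blocks doesn't force
// separate requests. The caller tunes maxGap against egress overhead.
func coalesce(ranges []byteRange, maxGap int64) []byteRange {
	if len(ranges) == 0 {
		return nil
	}
	sort.Slice(ranges, func(i, j int) bool { return ranges[i].start < ranges[j].start })
	out := []byteRange{ranges[0]}
	for _, r := range ranges[1:] {
		last := &out[len(out)-1]
		if r.start-last.end <= maxGap {
			// Close enough: extend the previous range instead of
			// issuing a new request (egressing the gap bytes).
			if r.end > last.end {
				last.end = r.end
			}
		} else {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	blocks := []byteRange{{0, 100}, {140, 300}, {5000, 5100}}
	fmt.Println(coalesce(blocks, 64)) // [{0 300} {5000 5100}]
}
```

Each resulting range would then become one HTTP `Range: bytes=start-end` request against the shard blob.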
> * **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
>   * Only one request is made per involved shard, but we may fetch (and egress) more data than required.
>   * We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
From a gateway perspective, as soon as you start getting requests for files in a larger DAG this approach doesn't really hold up:
- The TTFB will be too slow
- The disk cache would have to be massive and is very wasteful
- There's a big cost to users for wasted bandwidth
> * **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them), so we count blocks that are "close enough" as adjacent.
>   * If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
>   * Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
>   * Startup is slow, because only the root block can be fetched on the first request. Efficiency on further rounds is best on wide DAGs and worse on deep DAGs.
I'm not certain this is correct, and I don't know if startup is the right word here. With Shard-Optimistic you'd certainly be making fewer requests. However, depending on the shard size, you can potentially get to streaming the data faster using range coalescing, since you don't have to download an entire shard before exporting the data from the DAG (per the aforementioned root-as-last-block problem).
#66 would resolve the deep DAGs problem.
> * There is also a multipart version of this approach which can make a single request for multiple ranges at once, but server support for this is spotty, and notably lacking in Go. It also incurs overhead in the response which may negate any benefits.
> * Currently, Freeway implements a Range-Coalescing approach, so we have some evidence it works decently.
> * **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
Feels like an optimization rather than a strategy.
> * **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
>   * Like Range-Coalescing, this strikes a balance between Naive and Shard-Optimistic.
>   * Unlike Range-Coalescing, startup can include multiple blocks.
>   * Unlike other approaches, the ranges may not be on block borders, because the borders of blocks are unknown until we look up the block in the index. That may make cached data difficult to manage.
> because the borders of blocks are unknown until we look up the block in the index

The only way you know a block is in a shard is because you have an index, so you should know the block boundaries, no?
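Agreed that the index ought to give us boundaries. A sketch of what snapping chunk edges to block borders could look like, assuming a per-shard index of offsets and lengths (the `indexEntry` shape and `snapChunk` helper are hypothetical, not the real index type):

```go
package main

import "fmt"

// indexEntry records where one block lives in a shard: a byte offset
// and a length. This is an assumed shape for illustration.
type indexEntry struct {
	offset, length int64
}

// snapChunk expands a fixed-size chunk [start, start+size) outward to
// the nearest block borders, so cached chunks always hold whole blocks.
func snapChunk(start, size int64, index []indexEntry) (int64, int64) {
	end := start + size
	lo, hi := start, end
	for _, e := range index {
		blockEnd := e.offset + e.length
		if e.offset < start && blockEnd > start {
			lo = e.offset // chunk starts mid-block: pull the start back
		}
		if e.offset < end && blockEnd > end {
			hi = blockEnd // chunk ends mid-block: push the end forward
		}
	}
	return lo, hi
}

func main() {
	// Three 150-byte blocks; a raw 100-byte chunk at offset 100
	// straddles the first two, so it snaps out to cover both.
	index := []indexEntry{{0, 150}, {150, 150}, {300, 150}}
	lo, hi := snapChunk(100, 100, index)
	fmt.Println(lo, hi) // 0 300
}
```

With something like this, Chunk-Optimistic wouldn't need to cache partial blocks at all, which would sidestep the cache-management concern.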
### Thoughts
> * For large data, the startup cost of Range-Coalescing is much less significant. Large data is also (warning: speculation) more likely to be wide than deep. The only way to make an especially deep UnixFS DAG would be to start with a very deep directory tree.
Ya, I think there's no way to know, and no one size fits all.
If you find the requested CID is the root CID of the DAG the index describes (i.e. request CID == index.content) and the shard is within some acceptable size bound, you could use Shard-Optimistic, but otherwise I think Range-Coalescing is the best way.
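That heuristic could be as small as the following sketch (the field names, threshold, and function shape are placeholders for discussion, not an agreed design):

```go
package main

import "fmt"

type strategy int

const (
	rangeCoalescing strategy = iota
	shardOptimistic
)

// chooseStrategy picks Shard-Optimistic only when the requested CID is
// the root the index describes (so we expect to need every block) AND
// the shard is small enough to fetch and cache cheaply. Everything
// else falls back to Range-Coalescing.
func chooseStrategy(requestedCID, indexContentCID string, shardSize, maxShardBytes int64) strategy {
	if requestedCID == indexContentCID && shardSize <= maxShardBytes {
		return shardOptimistic
	}
	return rangeCoalescing
}

func main() {
	// Whole-DAG read of a small (1 MiB) shard, with an 8 MiB bound:
	fmt.Println(chooseStrategy("bafy...root", "bafy...root", 1<<20, 8<<20))
	// Partial read, or a big (1 GiB) shard: coalesce ranges instead.
	fmt.Println(chooseStrategy("bafy...leaf", "bafy...root", 1<<30, 8<<20))
}
```

The size bound is exactly the kind of knob the metrics point below would help tune.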
> * Metrics are probably the key to tuning.
> * Any time we egress data that's ultimately discarded, we should have a pretty strong argument for why, since egress is charged to the customer.
📖 Preview
A rough exploration of possible approaches to making retrieval requests in Guppy. Thoughts on the given ideas and additional ideas are welcome.
It's been useful to me to write this all out, but so far it's just convinced me that Freeway's approach is the best one and we should just reimplement that. Looking forward to hearing if others disagree, though.