
rfc: Guppy Retrieval Strategy#77

Open
Peeja wants to merge 1 commit into main from guppy-retrieval-strategy

Conversation

@Peeja (Member) commented Jan 5, 2026

📖 Preview

A rough exploration of possible approaches to making retrieval requests in Guppy. Thoughts on the given ideas and additional ideas are welcome.

It's been useful to me to write this all out, but so far it's just convinced me that Freeway's approach is the best one and we should just reimplement that. Looking forward to hearing if others disagree, though.

@hannahhoward (Member)

Some data retrieval patterns to consider, that will absolutely come up:

  1. Small file extraction from deep in a very large directory -- i.e. /a/b/c/d/e/f/smallfile.txt -- this is likely to happen and I think is particularly bad for the shard-optimistic approach. Though no approach is great, because of the required sequence of round trips.
  2. Non-UnixFS DAGs where we don't control DAG shapes -- this is explicitly going to be a thing for Triton.

Broadly, I think range coalescing is the right move here and am glad you landed here as well, and I'd lean away from multipart stuff, even if that adds a minor egress overage for CID bytes.

Also, Harvard data is onboarded as is, so we need to deal with CARs. But we should think now about what we can make easier by improving our data preparation strategies:

  1. Guppy's Filepack support works. Maybe we make it default.
  2. What about feat(rfc): sharded dag with virtual blocks RFC #66 -- maybe this merits exploration BEFORE we do additional uploads. Especially since it is well suited for scenario 1.

@hannahhoward (Member) left a comment

net net I'm lgtm on starting with a range coalesce strategy.

@hannahhoward (Member)

Another thing to consider if we switch to Filepack -- should we be strategic about how we pack files? IOW, pretty easy to use shard-optimistic strategies if... the shard is the file. :P

@Peeja (Member, Author) commented Jan 6, 2026

Yeah, I think Filepack is the best choice here, and I don't think there's any reason not to, right? We'll need to support CARs when we already have them, but if we're packing the shards, I think it may as well be the default.

I also think either the virtual block RFC or just packing metadata into separate shards (or some combination) could work really well, as well as, as you say, making the shard the file (when possible, and otherwise several concatenated small files, or a contiguous chunk of a big one).

@Peeja (Member, Author) commented Jan 6, 2026

My mind also keeps coming back to some kind of "hinting" early in the process, such as in the index. I can't tell whether that ultimately amounts to the same thing as virtual blocks of UnixFS metadata.

I think the foothold on this thing has something to do with the fact that we almost exclusively retrieve entire subdags. That is, once you've got a root block, you want everything under it. Could the index give you that in one go, somehow? Is that just equivalent to virtual metadata blocks?


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
@alanshaw (Member) commented Jan 7, 2026

Maybe worth mentioning that if you don't know you're going to use all the data in the shard then this is not friendly to the user that has stored the data i.e. you have made them pay for egress of data you're not using.


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Member

Due to the way DAGs are constructed, the root block is the last block in the file, so by requesting the entire shard the time it takes to start streaming the data is much longer than with the naive or range-coalescing strategy. From a CLI tool that maybe doesn't matter all that much, but in a gateway situation, e.g. retrieving a video, it affects your TTFB.


* **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them); we therefore count blocks that are "close enough" as adjacent.
* If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
* Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
Member

This is exactly why Filepack exists :)
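For concreteness, the coalescing step described in the quoted text could be sketched roughly like this in Go. This is a minimal illustration, not Guppy's or Freeway's actual implementation: `BlockRange`, `Coalesce`, and the gap tolerance are all assumed names and values.

```go
// Hypothetical sketch of range coalescing with a gap tolerance.
package main

import (
	"fmt"
	"sort"
)

// BlockRange is the byte range of one block within a shard: [Start, End).
type BlockRange struct{ Start, End int64 }

// Coalesce merges block ranges separated by at most maxGap bytes
// (e.g. when only a CID and a varint length sit between them), so each
// merged range can become a single HTTP range request.
func Coalesce(blocks []BlockRange, maxGap int64) []BlockRange {
	if len(blocks) == 0 {
		return nil
	}
	sorted := append([]BlockRange(nil), blocks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
	out := []BlockRange{sorted[0]}
	for _, b := range sorted[1:] {
		last := &out[len(out)-1]
		if b.Start-last.End <= maxGap {
			// "Close enough": extend the current range instead of
			// starting a new request.
			if b.End > last.End {
				last.End = b.End
			}
		} else {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Two blocks 40 bytes apart merge; a distant block stays separate.
	fmt.Println(Coalesce([]BlockRange{{0, 100}, {140, 200}, {1000, 1100}}, 64))
}
```

The egress trade-off the thread discusses lives in `maxGap`: the larger it is, the fewer requests are made, but the more inter-block CID/length bytes (and unrequested blocks) get fetched and discarded.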


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Member

From a gateway perspective, as soon as you start getting requests for files in a larger DAG this approach doesn't really hold up:

  • The TTFB will be too slow
  • The disk cache would have to be massive and is very wasteful
  • There's a big cost to users for wasted bandwidth

* **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them); we therefore count blocks that are "close enough" as adjacent.
* If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
* Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
* Startup is slow, because only the root block can be fetched on the first request. Efficiency on further rounds is better on wide DAGs and worse on deep DAGs.
Member

I'm not certain this is correct, and I don't know if startup is the right word here. With shard-optimistic you'd certainly be making fewer requests. However, depending on the shard size, you can potentially get to streaming the data faster using range coalescing, since you don't have to download an entire shard before exporting the data from the DAG (per the aforementioned root-as-last-block problem).

#66 would resolve the deep DAGs problem.

* There is also a multipart version of this approach which can make a single request for multiple ranges at once, but server support for this is spotty, and notably lacking in Go. It also incurs overhead in the response which may negate any benefits.
* Currently, Freeway implements a Range-Coalescing approach, so we have some evidence it works decently.
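On the wire, the multipart variant is just one GET whose Range header lists several byte ranges; a supporting server answers 206 with a multipart/byteranges body, while others may ignore the extra ranges or return 200 with the whole body. A sketch, assuming a hypothetical `NewMultiRangeRequest` helper (not a real Guppy API):

```go
// Sketch of a single multi-range request (RFC 9110 Range header).
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// NewMultiRangeRequest builds one GET whose Range header asks for several
// byte ranges at once. Ranges are inclusive [start, end] pairs, per the
// HTTP range-spec grammar. Callers must be prepared for three outcomes:
// 200 (full body), 206 with a single range, or 206 multipart/byteranges.
func NewMultiRangeRequest(url string, ranges [][2]int64) (*http.Request, error) {
	specs := make([]string, len(ranges))
	for i, r := range ranges {
		specs[i] = fmt.Sprintf("%d-%d", r[0], r[1])
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", "bytes="+strings.Join(specs, ","))
	return req, nil
}

func main() {
	req, _ := NewMultiRangeRequest("https://example.com/shard.car", [][2]int64{{0, 99}, {1000, 1099}})
	fmt.Println(req.Header.Get("Range"))
}
```

The response-side overhead the RFC mentions comes from the multipart body itself: each part carries its own boundary delimiter and Content-Range headers, which is part of why multiple plain range requests may come out ahead.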

* **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
Member

Feels like an optimization rather than a strategy.

* **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
* Like Range-Coalescing, this strikes a balance between Naive and Shard-Optimistic.
* Unlike Range-Coalescing, startup can include multiple blocks.
* Unlike other approaches, the ranges may not be on block borders, because the borders of blocks are unknown until we look up the block in the index. That may make cached data difficult to manage.
Member

because the borders of blocks are unknown until we look up the block in the index

The only way you know a block is in a shard is because you have an index, so you should know the block boundaries, no?


### Thoughts

* For large data, the startup cost of Range-Coalescing is much less significant. Large data is also (warning: speculation) more likely to be wide than deep. The only way to make an especially deep UnixFS DAG would be to start with a very deep directory tree.
Member

Yeah, I think there's no way to know, and no one size fits all.

If you find the CID requested is the root CID of the DAG the index describes (i.e. request CID == index.content) and the shard is within some acceptable size bound, you could use Shard-Optimistic, but otherwise I think Range-Coalescing is the best way.
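That decision rule is small enough to sketch. Everything here is hypothetical: the `Choose` signature and the 32 MiB bound are placeholders for whatever tuning metrics eventually justify.

```go
// Hypothetical strategy chooser: shard-optimistic only for a root-CID
// request against an acceptably small shard, otherwise range coalescing.
package main

import "fmt"

// Strategy is a retrieval strategy choice.
type Strategy string

const (
	RangeCoalescing Strategy = "range-coalescing"
	ShardOptimistic Strategy = "shard-optimistic"
)

// maxShardOptimisticSize is an illustrative tuning bound, not a real limit.
const maxShardOptimisticSize = 32 << 20 // 32 MiB

// Choose picks a strategy for one request against one indexed shard.
// Fetching the whole shard is only defensible when we expect to use
// essentially all of it (the request is for the DAG root the index
// describes) and the egress cost is bounded.
func Choose(requestCID, indexContentCID string, shardSize int64) Strategy {
	if requestCID == indexContentCID && shardSize <= maxShardOptimisticSize {
		return ShardOptimistic
	}
	return RangeCoalescing
}

func main() {
	fmt.Println(Choose("rootCID", "rootCID", 8<<20)) // small root request
	fmt.Println(Choose("leafCID", "rootCID", 8<<20)) // partial retrieval
}
```

This also lines up with the egress concern raised earlier: the whole-shard path is only taken when the discarded-bytes risk is low.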


* Metrics are probably the key to tuning.

* Any time we egress data that's ultimately discarded, we should have a pretty strong argument for why, since egress is charged to the customer.
