
rfc: Guppy Retrieval Strategy#77

Open
Peeja wants to merge 1 commit into main from guppy-retrieval-strategy

Conversation

@Peeja (Member) commented Jan 5, 2026

📖 Preview

A rough exploration of possible approaches to making retrieval requests in Guppy. Thoughts on the given ideas and additional ideas are welcome.

It's been useful to me to write this all out, but so far it's just convinced me that Freeway's approach is the best one and we should just reimplement that. Looking forward to hearing if others disagree, though.

@hannahhoward (Member)

Some data retrieval patterns to consider, that will absolutely come up:

  1. Small file extraction from deep in a very large directory -- i.e. /a/b/c/d/e/f/smallfile.txt -- this is likely to happen and I think is particularly bad for the shard-optimistic approach. Though no approach is great, because of the required sequence of round trips.
  2. Non-UnixFS DAGs where we don't control DAG shapes -- this is explicitly going to be a thing for Triton.

Broadly, I think range coalescing is the right move here and am glad you landed here as well, and I'd lean away from multipart stuff, even if that adds a minor egress overage for CID bytes.

Also, Harvard data is onboarded as is, so we need to deal with CARs. But we should think now about what we can make easier by improving our data preparation strategies:

  1. Guppy's Filepack support works. Maybe we make it default.
  2. What about feat(rfc): sharded dag with virtual blocks RFC #66 -- maybe this merits exploration BEFORE we do additional uploads. Especially since it is well suited for scenario 1.

@hannahhoward (Member) left a comment

net net I'm lgtm on starting with a range coalesce strategy.

@hannahhoward (Member)

Another thing to consider if we switch to Filepack -- should we be strategic about how we pack files? IOW, pretty easy to use shard-optimistic strategies if... the shard is the file. :P

@Peeja (Member, Author) commented Jan 6, 2026

Yeah, I think Filepack is the best choice here, and I don't think there's any reason not to, right? We'll need to support CARs when we already have them, but if we're packing the shards, I think it may as well be the default.

I also think either the virtual block RFC or just packing metadata into separate shards (or some combination) could work really well, as well as, as you say, making the shard the file (when possible, and otherwise several concatenated small files, or a contiguous chunk of a big one).

@Peeja (Member, Author) commented Jan 6, 2026

My mind also keeps coming back to some kind of "hinting" early in the process, such as in the index. I can't tell whether that ultimately amounts to the same thing as virtual blocks of UnixFS metadata.

I think the foothold on this thing has something to do with the fact that we almost exclusively retrieve entire subdags. That is, once you've got a root block, you want everything under it. Could the index give you that in one go, somehow? Is that just equivalent to virtual metadata blocks?


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
@alanshaw (Member) commented Jan 7, 2026

Maybe worth mentioning that if you don't know you're going to use all the data in the shard then this is not friendly to the user that has stored the data i.e. you have made them pay for egress of data you're not using.


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Member

Due to the way DAGs are constructed, the root block is the last block in the file, so by requesting the entire shard the time it takes to start streaming the data is much longer than with the naive or range-coalescing strategy. From a CLI tool that maybe doesn't matter all that much, but in a gateway situation, e.g. retrieving a video, it affects your TTFB.


* **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them); we therefore count blocks that are "close enough" as adjacent.
* If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
* Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
Member

This is exactly why Filepack exists :)
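For concreteness, the coalescing step described in the quoted text could be sketched roughly like this in Go. This is a minimal illustration, not Guppy's or Freeway's actual implementation: `BlockRange`, `Coalesce`, and the gap tolerance are all assumed names and values.

```go
// Hypothetical sketch of range coalescing with a gap tolerance.
package main

import (
	"fmt"
	"sort"
)

// BlockRange is the byte range of one block within a shard: [Start, End).
type BlockRange struct{ Start, End int64 }

// Coalesce merges block ranges separated by at most maxGap bytes
// (e.g. when only a CID and a varint length sit between them), so each
// merged range can become a single HTTP range request.
func Coalesce(blocks []BlockRange, maxGap int64) []BlockRange {
	if len(blocks) == 0 {
		return nil
	}
	sorted := append([]BlockRange(nil), blocks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
	out := []BlockRange{sorted[0]}
	for _, b := range sorted[1:] {
		last := &out[len(out)-1]
		if b.Start-last.End <= maxGap {
			// "Close enough": extend the current range instead of
			// starting a new request.
			if b.End > last.End {
				last.End = b.End
			}
		} else {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Two blocks 40 bytes apart merge; a distant block stays separate.
	fmt.Println(Coalesce([]BlockRange{{0, 100}, {140, 200}, {1000, 1100}}, 64))
}
```

The egress trade-off the thread discusses lives in `maxGap`: the larger it is, the fewer requests are made, but the more inter-block CID/length bytes (and unrequested blocks) get fetched and discarded.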


* **Shard-Optimistic:** When the first block from a shard is requested, fetch the entire shard blob and cache it.
* Only one request is made per involved shard, but we may fetch (and egress) more data than required.
* We also need to hold onto the cached shards until we're done using them, and those shards potentially take up much more space than the actual target data.
Member

From a gateway perspective, as soon as you start getting requests for files in a larger DAG this approach doesn't really hold up:

  • The TTFB will be too slow
  • The disk cache would have to be massive and is very wasteful
  • There's a big cost to users for wasted bandwidth

* **Range-Coalescing:** When a block is requested, place it in a queue. Periodically, for each shard with blocks in the queue, coalesce the ranges of those blocks, and then request each range. The shards are (currently) CAR files, so adjacent blocks are not literally adjacent (there's a CID and length between them); we therefore count blocks that are "close enough" as adjacent.
* If the requested data involves many contiguous blocks, this involves many fewer requests than the Naive approach, but not as few as the Shard-Optimistic approach.
* Like the Naive approach, it retrieves the minimal blocks, though unlike the Naive approach it must egress the CID and length data between blocks, which is then ignored.
* Startup is slow, because only the root block can be fetched on the first request. Efficiency on further rounds is better on wide DAGs and worse on deep DAGs.
Member

I'm not certain this is correct, and I don't know if startup is the right word here. With shard-optimistic you'd certainly be making fewer requests. However, depending on the shard size, you can potentially get to streaming the data faster using range coalescing, since you don't have to download an entire shard before exporting the data from the DAG (per the aforementioned root-as-last-block problem).

#66 would resolve the deep DAGs problem.

* There is also a multipart version of this approach which can make a single request for multiple ranges at once, but server support for this is spotty, and notably lacking in Go. It also incurs overhead in the response which may negate any benefits.
* Currently, Freeway implements a Range-Coalescing approach, so we have some evidence it works decently.
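On the wire, the multipart variant is just one GET whose Range header lists several byte ranges; a supporting server answers 206 with a multipart/byteranges body, while others may ignore the extra ranges or return 200 with the whole body. A sketch, assuming a hypothetical `NewMultiRangeRequest` helper (not a real Guppy API):

```go
// Sketch of a single multi-range request (RFC 9110 Range header).
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// NewMultiRangeRequest builds one GET whose Range header asks for several
// byte ranges at once. Ranges are inclusive [start, end] pairs, per the
// HTTP range-spec grammar. Callers must be prepared for three outcomes:
// 200 (full body), 206 with a single range, or 206 multipart/byteranges.
func NewMultiRangeRequest(url string, ranges [][2]int64) (*http.Request, error) {
	specs := make([]string, len(ranges))
	for i, r := range ranges {
		specs[i] = fmt.Sprintf("%d-%d", r[0], r[1])
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", "bytes="+strings.Join(specs, ","))
	return req, nil
}

func main() {
	req, _ := NewMultiRangeRequest("https://example.com/shard.car", [][2]int64{{0, 99}, {1000, 1099}})
	fmt.Println(req.Header.Get("Range"))
}
```

The response-side overhead the RFC mentions comes from the multipart body itself: each part carries its own boundary delimiter and Content-Range headers, which is part of why multiple plain range requests may come out ahead.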

* **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
Member

Feels like an optimization rather than a strategy.

* **Chunk-Optimistic:** This is the same as Shard-Optimistic, but divides shards into smaller chunks. When the first block is requested, a range of nearby blocks is fetched along with it.
* Like Range-Coalescing, this strikes a balance between Naive and Shard-Optimistic.
* Unlike Range-Coalescing, startup can include multiple blocks.
* Unlike other approaches, the ranges may not be on block borders, because the borders of blocks are unknown until we look up the block in the index. That may make cached data difficult to manage.
Member

because the borders of blocks are unknown until we look up the block in the index

The only way you know a block is in a shard is because you have an index, so you should know the block boundaries, no?


### Thoughts

* For large data, the startup cost of Range-Coalescing is much less significant. Large data is also (warning: speculation) more likely to be wide than deep. The only way to make an especially deep UnixFS DAG would be to start with a very deep directory tree.
Member

Yeah, I think there's no way to know, and no one size fits all.

If you find the CID requested is the root CID of the DAG the index describes (i.e. request CID == index.content) and the shard is within some acceptable size bound, you could use Shard-Optimistic, but otherwise I think Range-Coalescing is the best way.
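That decision rule is small enough to sketch. Everything here is hypothetical: the `Choose` signature and the 32 MiB bound are placeholders for whatever tuning metrics eventually justify.

```go
// Hypothetical strategy chooser: shard-optimistic only for a root-CID
// request against an acceptably small shard, otherwise range coalescing.
package main

import "fmt"

// Strategy is a retrieval strategy choice.
type Strategy string

const (
	RangeCoalescing Strategy = "range-coalescing"
	ShardOptimistic Strategy = "shard-optimistic"
)

// maxShardOptimisticSize is an illustrative tuning bound, not a real limit.
const maxShardOptimisticSize = 32 << 20 // 32 MiB

// Choose picks a strategy for one request against one indexed shard.
// Fetching the whole shard is only defensible when we expect to use
// essentially all of it (the request is for the DAG root the index
// describes) and the egress cost is bounded.
func Choose(requestCID, indexContentCID string, shardSize int64) Strategy {
	if requestCID == indexContentCID && shardSize <= maxShardOptimisticSize {
		return ShardOptimistic
	}
	return RangeCoalescing
}

func main() {
	fmt.Println(Choose("rootCID", "rootCID", 8<<20)) // small root request
	fmt.Println(Choose("leafCID", "rootCID", 8<<20)) // partial retrieval
}
```

This also lines up with the egress concern raised earlier: the whole-shard path is only taken when the discarded-bytes risk is low.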


* Metrics are probably the key to tuning.

* Any time we egress data that's ultimately discarded, we should have a pretty strong argument for why, since egress is charged to the customer.
