fix: stream multipart uploads to avoid OOM on large files #418

Open
Acelogic wants to merge 3 commits into googleworkspace:main from Acelogic:fix/streaming-multipart-upload

Conversation

@Acelogic

Summary

Fixes #244 — uploading large files via --upload causes an out-of-memory crash because the entire file is read into memory twice (once by tokio::fs::read, then copied into a second Vec by build_multipart_body). A 5 GB file requests ~20 GB of contiguous RAM.

This replaces the buffered approach with a streaming multipart/related body:

  • build_multipart_stream yields the body in three phases via futures_util::stream::unfold: preamble (boundary + JSON metadata), file chunks (64 KB reads from disk), and postamble (closing boundary)
  • Content-Length is computed from tokio::fs::metadata so Google APIs still receive the correct header without buffering the file
  • Memory usage is now O(1) (~64 KB buffer) regardless of file size

The old build_multipart_body is retained under #[cfg(test)] for the existing unit tests.

Test plan

  • cargo clippy -- -D warnings passes
  • cargo test — 552/552 pass (2 new tests added)
  • test_build_multipart_stream_content_length — verifies declared Content-Length matches preamble + file size + postamble
  • test_build_multipart_stream_large_file — 256 KB file (larger than 64 KB chunk size) verifies multi-chunk streaming arithmetic
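The declared-length arithmetic those tests verify can be sketched as follows (the boundary, metadata, and helper name here are illustrative, not the crate's API):

```rust
// Sketch of the Content-Length arithmetic: the framing strings are built
// once, and the file contributes only its metadata-reported size; the file
// bytes themselves are never held in memory.
fn declared_content_length(
    boundary: &str,
    metadata_json: &str,
    media_mime: &str,
    file_size: u64,
) -> u64 {
    let preamble = format!(
        "--{boundary}\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n\
         {metadata_json}\r\n--{boundary}\r\nContent-Type: {media_mime}\r\n\r\n"
    );
    let postamble = format!("\r\n--{boundary}--\r\n");
    preamble.len() as u64 + file_size + postamble.len() as u64
}

fn main() {
    let file_size: u64 = 5 * 1024 * 1024 * 1024; // 5 GB upload
    let len = declared_content_length(
        "gws_boundary_0000000000000000",
        r#"{"name":"big.bin"}"#,
        "application/octet-stream",
        file_size,
    );
    // Overhead is just the fixed framing, independent of file size.
    assert!(len > file_size && len - file_size < 1024);
    println!("declared Content-Length: {len}");
}
```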

…space#244)

Replace the buffered file read + body copy with a streaming
multipart/related body that reads the file in 64 KB chunks.

Previously, uploading a file required ~4x the file size in RAM
(tokio::fs::read allocates a Vec, then build_multipart_body copies
it into a second growing Vec). A 5 GB upload would request ~20 GB
of contiguous memory, crashing the process.

The new build_multipart_stream function yields preamble, file
chunks, and postamble through a futures_util::stream::unfold
state machine, keeping memory usage constant regardless of file
size. Content-Length is computed from file metadata so Google APIs
still receive the correct header.

Closes googleworkspace#244
@Acelogic Acelogic requested a review from jpoehnelt as a code owner March 11, 2026 20:32
@changeset-bot

changeset-bot bot commented Mar 11, 2026

🦋 Changeset detected

Latest commit: 813dd28

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name                  Type
@googleworkspace/cli  Patch


@google-cla

google-cla bot commented Mar 11, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical out-of-memory issue that occurred when uploading large files by refactoring the upload mechanism to use a streaming approach. This change significantly reduces memory consumption and improves the robustness of the application when dealing with substantial data transfers, ensuring a more stable and efficient user experience.

Highlights

  • Streaming Multipart Uploads: Implemented build_multipart_stream to handle file uploads in a streaming fashion, avoiding reading entire files into memory.
  • Memory Optimization: Memory usage for file uploads is now O(1) (approximately 64 KB buffer) regardless of the file size, preventing out-of-memory crashes on large files.
  • Content-Length Calculation: The Content-Length header is now computed accurately from file metadata, ensuring correct communication with Google APIs without buffering the entire file.
  • Legacy Function Retention: The original build_multipart_body function has been retained under #[cfg(test)] for existing unit tests.
Changelog
  • .changeset/stream-multipart-uploads.md
    • Added a new changeset file documenting the streaming multipart upload fix.
  • src/executor.rs
    • Imported AsyncReadExt for asynchronous file reading operations.
    • Modified the build_http_request function to utilize the new streaming multipart upload logic, calculating Content-Length from file metadata and setting the request body with the stream.
    • Annotated the build_multipart_body function with #[cfg(test)] to restrict its compilation and use to test environments.
    • Introduced the build_multipart_stream function, which constructs a streaming multipart/related body by reading file contents in 64 KB chunks.
    • Added two new asynchronous tests, test_build_multipart_stream_content_length and test_build_multipart_stream_large_file, to validate the content length calculation and multi-chunk streaming behavior of the new multipart stream.
Activity
  • All cargo clippy -- -D warnings checks passed.
  • All cargo test suite tests passed, including 2 new tests introduced in this PR.
  • Added test_build_multipart_stream_content_length to verify the declared Content-Length matches the combined size of preamble, file, and postamble.
  • Added test_build_multipart_stream_large_file to confirm correct multi-chunk streaming arithmetic for files larger than the 64 KB chunk size.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a great improvement, fixing a critical out-of-memory issue by switching to streaming uploads for large files. The implementation is mostly solid, and the addition of tests for the new streaming logic is commendable. I've identified a critical security vulnerability related to path traversal and a high-severity issue regarding the stream implementation's efficiency and maintainability. Addressing these will make the solution more robust and secure.

Comment on lines +185 to +193
let file_size = tokio::fs::metadata(upload_path)
    .await
    .map_err(|e| {
        GwsError::Validation(format!(
            "Failed to read upload file '{}': {}",
            upload_path, e
        ))
    })?
    .len();

Severity: critical (security)

The upload_path is used to access the filesystem without prior validation. This could lead to a path traversal vulnerability, allowing an attacker to read arbitrary files by providing a path like ../../etc/passwd. Given that this tool might be used by automated agents (as hinted at in src/validate.rs), this poses a significant security risk.

Before accessing the file system with tokio::fs::metadata, you should validate upload_path to ensure it's a relative path that resolves to a location within the current working directory. You can implement a new validation function similar to validate_safe_dir_path in src/validate.rs for this purpose.
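A minimal sketch of such a pre-check, assuming a simple reject-absolute-and-parent-components policy (the function name is hypothetical, not the repo's existing `validate_safe_dir_path`):

```rust
use std::path::{Component, Path};

// Hypothetical pre-check: accept only relative paths with no `..` components,
// so the resolved file stays under the current working directory. A real
// implementation should also canonicalize the path to handle symlinks.
fn is_safe_upload_path(path: &str) -> bool {
    let p = Path::new(path);
    !p.is_absolute() && p.components().all(|c| !matches!(c, Component::ParentDir))
}

fn main() {
    assert!(is_safe_upload_path("data/report.pdf"));
    assert!(!is_safe_upload_path("../../etc/passwd"));
    assert!(!is_safe_upload_path("/etc/passwd"));
    println!("path checks ok");
}
```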

src/executor.rs Outdated
Comment on lines +813 to +859
// State machine for the streaming body: preamble -> file chunks -> postamble
enum State {
    Preamble {
        preamble: Vec<u8>,
        file_path: String,
        postamble: Vec<u8>,
    },
    Streaming {
        file: tokio::fs::File,
        postamble: Vec<u8>,
    },
    Done,
}

let initial = State::Preamble {
    preamble: preamble.into_bytes(),
    file_path: file_path.to_owned(),
    postamble: postamble.into_bytes(),
};

let stream = futures_util::stream::unfold(initial, |state| async move {
    match state {
        State::Preamble {
            preamble,
            file_path,
            postamble,
        } => match tokio::fs::File::open(&file_path).await {
            Ok(file) => Some((Ok(preamble), State::Streaming { file, postamble })),
            Err(e) => Some((Err(e), State::Done)),
        },
        State::Streaming {
            mut file,
            postamble,
        } => {
            let mut buf = vec![0u8; 64 * 1024];
            match file.read(&mut buf).await {
                Ok(0) => Some((Ok(postamble), State::Done)),
                Ok(n) => {
                    buf.truncate(n);
                    Some((Ok(buf), State::Streaming { file, postamble }))
                }
                Err(e) => Some((Err(e), State::Done)),
            }
        }
        State::Done => None,
    }
});

Severity: high

The current implementation of the streaming body using futures_util::stream::unfold and a manual state machine is quite complex and has a performance drawback: it re-allocates a 64KB buffer for every chunk read from the file. For large files, this results in many allocations, impacting performance.

A more idiomatic and efficient approach in the Tokio ecosystem is to use tokio_util::io::ReaderStream and chain it with streams for the preamble and postamble. This would simplify the code, improve readability, and enhance performance by avoiding repeated buffer allocations. This may require adding tokio-util with the io feature if it's not already a dependency.

Here is a suggested replacement for the state machine and unfold logic:

    let stream = {
        use futures_util::stream::{self, TryStreamExt};
        use tokio_util::io::ReaderStream;

        let file_path = file_path.to_owned();
        let preamble_bytes: bytes::Bytes = preamble.into_bytes().into();
        let postamble_bytes: bytes::Bytes = postamble.into_bytes().into();

        let file_stream = stream::once(async move { tokio::fs::File::open(file_path).await })
            .map_ok(ReaderStream::new)
            .try_flatten_stream();

        stream::once(async { Ok(preamble_bytes) })
            .chain(file_stream)
            .chain(stream::once(async { Ok(postamble_bytes) }))
    };

Address review feedback: replace the hand-rolled futures_util::stream::unfold
state machine with tokio_util::io::ReaderStream chained between preamble and
postamble streams. This is more idiomatic, avoids per-chunk buffer
re-allocation, and is easier to read.
@googleworkspace-bot

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses an out-of-memory issue with large file uploads by switching from a buffered to a streaming approach. The implementation is solid and introduces a new build_multipart_stream function to handle this. I've included one suggestion to further optimize memory usage within the new streaming logic by avoiding an unnecessary data copy for each file chunk.

Comment on lines +813 to +828
// Chain: preamble bytes -> file chunks (via ReaderStream) -> postamble bytes
let file_path = file_path.to_owned();
let preamble_bytes = preamble.into_bytes();
let postamble_bytes = postamble.into_bytes();

let file_stream =
    futures_util::stream::once(async move { tokio::fs::File::open(file_path).await })
        .map_ok(|f| tokio_util::io::ReaderStream::new(f).map_ok(|b| b.to_vec()))
        .try_flatten();

let stream =
    futures_util::stream::once(async { Ok::<_, std::io::Error>(preamble_bytes) })
        .chain(file_stream)
        .chain(futures_util::stream::once(async {
            Ok::<_, std::io::Error>(postamble_bytes)
        }));

Severity: high

While this implementation correctly streams the file, it performs an unnecessary allocation and copy for each chunk of the file by converting Bytes to Vec<u8> with .map_ok(|b| b.to_vec()).

Given that the goal of this change is to optimize memory usage for large files, we can further improve efficiency by avoiding this copy. By using reqwest::bytes::Bytes for all parts of the stream, you can achieve a zero-copy implementation for the file chunks. This makes the streaming even more memory-efficient.

Suggested change
Current:

    // Chain: preamble bytes -> file chunks (via ReaderStream) -> postamble bytes
    let file_path = file_path.to_owned();
    let preamble_bytes = preamble.into_bytes();
    let postamble_bytes = postamble.into_bytes();

    let file_stream =
        futures_util::stream::once(async move { tokio::fs::File::open(file_path).await })
            .map_ok(|f| tokio_util::io::ReaderStream::new(f).map_ok(|b| b.to_vec()))
            .try_flatten();

    let stream =
        futures_util::stream::once(async { Ok::<_, std::io::Error>(preamble_bytes) })
            .chain(file_stream)
            .chain(futures_util::stream::once(async {
                Ok::<_, std::io::Error>(postamble_bytes)
            }));

Suggested:

    // Chain: preamble bytes -> file chunks (via ReaderStream) -> postamble bytes
    let file_path = file_path.to_owned();
    let preamble_bytes = reqwest::bytes::Bytes::from(preamble.into_bytes());
    let postamble_bytes = reqwest::bytes::Bytes::from(postamble.into_bytes());

    let file_stream =
        futures_util::stream::once(async move { tokio::fs::File::open(file_path).await })
            .map_ok(|f| tokio_util::io::ReaderStream::new(f))
            .try_flatten();

    let stream =
        futures_util::stream::once(async { Ok::<_, std::io::Error>(preamble_bytes) })
            .chain(file_stream)
            .chain(futures_util::stream::once(async {
                Ok::<_, std::io::Error>(postamble_bytes)
            }));

Remove the per-chunk .to_vec() copy by using bytes::Bytes throughout
the stream chain. ReaderStream already yields Bytes, so the preamble
and postamble just need to match the type.
@googleworkspace-bot

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a streaming approach for multipart file uploads to fix out-of-memory issues with large files. The implementation is solid and correctly uses streams to avoid buffering the entire file. I've added one high-severity comment regarding error handling during metadata serialization, which could lead to silent failures. Propagating the error would make the implementation more robust.

Comment on lines +782 to +836
fn build_multipart_stream(
    metadata: &Option<Value>,
    file_path: &str,
    file_size: u64,
) -> (reqwest::Body, String, u64) {
    let boundary = format!("gws_boundary_{:016x}", rand::random::<u64>());

    let media_mime = metadata
        .as_ref()
        .and_then(|m| m.get("mimeType"))
        .and_then(|v| v.as_str())
        .unwrap_or("application/octet-stream")
        .to_string();

    let metadata_json = metadata
        .as_ref()
        .map(|m| serde_json::to_string(m).unwrap_or_else(|_| "{}".to_string()))
        .unwrap_or_else(|| "{}".to_string());

    let preamble = format!(
        "--{boundary}\r\n\
         Content-Type: application/json; charset=UTF-8\r\n\r\n\
         {metadata_json}\r\n\
         --{boundary}\r\n\
         Content-Type: {media_mime}\r\n\r\n"
    );
    let postamble = format!("\r\n--{boundary}--\r\n");

    let content_length = preamble.len() as u64 + file_size + postamble.len() as u64;
    let content_type = format!("multipart/related; boundary={boundary}");

    // Chain: preamble bytes -> file chunks (via ReaderStream) -> postamble bytes
    // All parts use bytes::Bytes for zero-copy streaming.
    let file_path = file_path.to_owned();
    let preamble_bytes = bytes::Bytes::from(preamble.into_bytes());
    let postamble_bytes = bytes::Bytes::from(postamble.into_bytes());

    let file_stream =
        futures_util::stream::once(async move { tokio::fs::File::open(file_path).await })
            .map_ok(tokio_util::io::ReaderStream::new)
            .try_flatten();

    let stream =
        futures_util::stream::once(async { Ok::<_, std::io::Error>(preamble_bytes) })
            .chain(file_stream)
            .chain(futures_util::stream::once(async {
                Ok::<_, std::io::Error>(postamble_bytes)
            }));

    (
        reqwest::Body::wrap_stream(stream),
        content_type,
        content_length,
    )
}

Severity: high

The use of unwrap_or_else on lines 798 and 799 can hide a JSON serialization error, causing the upload to proceed with empty metadata ({}). This could lead to silent failures or hard-to-debug server-side errors.

To make this more robust, I recommend changing build_multipart_stream to return a Result and propagating any serialization errors. This aligns with the error handling pattern used elsewhere in the file.

Here's how you could refactor the function:

fn build_multipart_stream(
    metadata: &Option<Value>,
    file_path: &str,
    file_size: u64,
) -> Result<(reqwest::Body, String, u64), GwsError> {
    let boundary = format!("gws_boundary_{:016x}", rand::random::<u64>());

    let media_mime = metadata
        .as_ref()
        .and_then(|m| m.get("mimeType"))
        .and_then(|v| v.as_str())
        .unwrap_or("application/octet-stream")
        .to_string();

    let metadata_json = match metadata {
        Some(m) => serde_json::to_string(m).map_err(|e| {
            GwsError::Validation(format!("Failed to serialize upload metadata: {e}"))
        })?,
        None => "{}".to_string(),
    };

    let preamble = format!(
        "--{boundary}\r\n\
         Content-Type: application/json; charset=UTF-8\r\n\r\n\
         {metadata_json}\r\n\
         --{boundary}\r\n\
         Content-Type: {media_mime}\r\n\r\n"
    );
    let postamble = format!("\r\n--{boundary}--\r\n");

    let content_length = preamble.len() as u64 + file_size + postamble.len() as u64;
    let content_type = format!("multipart/related; boundary={boundary}");

    // ... (rest of the function is fine)

    let file_path = file_path.to_owned();
    let preamble_bytes = bytes::Bytes::from(preamble.into_bytes());
    let postamble_bytes = bytes::Bytes::from(postamble.into_bytes());

    let file_stream =
        futures_util::stream::once(async move { tokio::fs::File::open(file_path).await })
            .map_ok(tokio_util::io::ReaderStream::new)
            .try_flatten();

    let stream =
        futures_util::stream::once(async { Ok::<_, std::io::Error>(preamble_bytes) })
            .chain(file_stream)
            .chain(futures_util::stream::once(async {
                Ok::<_, std::io::Error>(postamble_bytes)
            }));

    Ok((
        reqwest::Body::wrap_stream(stream),
        content_type,
        content_length,
    ))
}

You would also need to update the call site in build_http_request to use the ? operator:

// line 196
let (body, content_type, content_length) =
    build_multipart_stream(&input.body, upload_path, file_size)?;

@Acelogic
Author

I have signed the Google CLA. Please re-check.



Development

Successfully merging this pull request may close these issues.

Bug: Out-Of-Memory (OOM) Crash on Large File Uploads (Google Drive/YouTube)
