Skip to content

Conversation

@EddieLF
Copy link
Contributor

@EddieLF EddieLF commented Aug 21, 2024

Currently, we can have multiple files in a Source, so long as the transfer_cmd is gcs_rsync or gcs_cp_r.

This PR adds a way to handle multiple files in a source when the transfer_cmd is curl, by iterating through the source.files dictionary values and constructing URLs for each file source and destination to pipe into.

In response to this PR and this workflow run which did not handle multiple source files with curl well at all.

@EddieLF EddieLF requested a review from illusional August 21, 2024 01:51
)
print(cmd)
subprocess.run(cmd, shell=True)
if source.files and source.transfer_cmd in (curl, curl_with_user_agent):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I'd probably tweak this so the transfer_cmd returns a list of commands (or a single string), and then run multiple if it's a list. So we keep how it's run closer to the implementation.

Are there type annotations that are wrong here, or is files not used here, I'm a bit confused what the behaviour is supposed to be here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is files not used here

I think files is not used here. There are two types of Source objects in references.py - sources containing a single reference file, for which only src and dst are defined. Then there are sources containing multiple reference files. For these sources, the attribute files is defined as a dict.

When the transfer_cmd is gsutil rsync, we copy the entire src directory to the dst destination, meaning we get what's in the files dict with the correct directory structure (alongside whatever else the rsync picks up from the source), so there's no need to actually look at the files dict.

Whereas for a curl, we need to actually look at all the files in the dict and execute curl command for each of them.

IMO, I'd probably tweak this so the transfer_cmd returns a list of commands (or a single string), and then run multiple if it's a list. So we keep how it's run closer to the implementation.

Can you elaborate? Are you suggesting something like updating this line to

transfer_cmd = [curl_with_user_agent, curl_with_user_agent, curl_with_user_agent]

And then updating the transfer script code to do something like

if source.files and isinstance(source.transfer_cmd, list):
  for file, command in zip(source.files.values(), transfer_cmd):
    cmd = command(...)
    print(command)
    subprocess.run(command, shell=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I think this behaviour should be decided by the transfer_cmd implementation. It probably makes more sense to pass the Source, and change the transfer_cmd to take the Source object.

RE the list of commands, I mean something like:

# this is a garbage implementation, it's just trying to illustrate the point
def my_transfer_command(source: Source) -> list[str] | str:
    if many_files or a_directory:
        return ["cp {f} ." for f in many_files]
    return "cp one.file ."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants