feat: Honor compression settings for metadata.json on write#1876
feat: Honor compression settings for metadata.json on write#1876emkornfield wants to merge 63 commits intoapache:mainfrom
Conversation
Previously these properties where not honored on tabel properties. - Adds table properties for these values. - Plumbs them through for writers.
| key: &str, | ||
| default: T, | ||
| ) -> Result<T, anyhow::Error> | ||
| ) -> crate::error::Result<T> |
There was a problem hiding this comment.
| ) -> crate::error::Result<T> | |
| ) -> Result<T> |
Unnecessary fully qualified name.
| /// The target file size for files. | ||
| pub write_target_file_size_bytes: usize, | ||
| /// Compression codec for metadata files (JSON) | ||
| pub metadata_compression_codec: String, |
There was a problem hiding this comment.
This should be sth like Option<String>, none should be None in memory.
|
|
||
| // Modify filename to add .gz before .metadata.json | ||
| let location = metadata_location.as_ref(); | ||
| let new_location = if location.ends_with(".metadata.json") { |
There was a problem hiding this comment.
I don't think we should simply replace the suffix with this approach. Currently all metadata location generated logic are in MetadataLocation, we should add a new method in like following and deprecate old method as following:
impl MetadataLocation {
#[deprecate("Use new_with_table instead.")]
pub fn new_with_table_location(table_location: impl ToString) -> Self {
}
pub fn new_with_table(table: &Table) -> Self {
....
}There was a problem hiding this comment.
The table option seemed a little awkward with the control flow on how it is used istead I added a method that takes table properties. Also added a similar method for incrementing the number.
I modified this logic a bit and kept. My concern here is if we only deprecate the new_with_table_location then users can write out metadata.json files in a format that would be unreadable by other readers.
kevinjqliu
left a comment
There was a problem hiding this comment.
Generally LGTM, a few nit comments.
would be good to get a second pair of 👀
|
|
||
| /// Creates a completely new metadata location starting at version 0, | ||
| /// with compression settings from the table's properties. | ||
| /// Only used for creating a new table. For updates, see `with_next_version`. |
There was a problem hiding this comment.
| /// Only used for creating a new table. For updates, see `with_next_version`. | |
| /// Only used for creating a new table. For updates, see `with_next_version_and_properties`. |
since with_next_version is also deprecated
There was a problem hiding this comment.
I'll wait till I understand @liurenjie1024 concerns about with_next_version_and_properties to update this.
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @emkornfield for this pr, looks much better now, still need some refinements.
| /// Creates a new metadata location for an updated metadata file. | ||
| /// Takes table properties to determine compression settings, which may have changed | ||
| /// from the previous version. | ||
| pub fn with_next_version_and_properties(&self, properties: &HashMap<String, String>) -> Self { |
There was a problem hiding this comment.
Can you clarify what you mean by we don't need it? I updated crates/iceberg/src/catalog/mod.rs to use it, if new properties change compression settings, I'm not sure how this would be reflected otherwise? I might have missed it but I don't think there is anything in the spec that prevents users from changing compression properties.
There was a problem hiding this comment.
If properties has been changes, we should build a new instance of this MetadataLocation struct.
There was a problem hiding this comment.
If properties has been changes, we should build a new instance of this MetadataLocation struct.
My main concern here, is I don't think this is something that clients of MetadataLocation would actually understand and check for a new version of the metadata (i.e. the path of least resistance is to call with_next_version on whatever metadata they used for creation. As a solution, I've created a static factory next_version which combines the two steps for the only call-site that exists for with_next_version. If you think it is better to not deprecated with_next_version I can also revert this change.
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
…nfield/iceberg-rust into feat/metadata-compression
) ## Which issue does this PR close? As discussed in the PR for [allowing compressed metadata.json](#1876 (comment)) writes we want CompressionCodec in a central place so it can be used in puffin and for other compression use-cases. Happy to move it to `spec/` as suggested in the original comment but this seemed more natural. ## What changes are included in this PR? This moves compression.rs to a top level (seems like the best location for common code but please let me know if there is a better place. - It adds Gzip as a compression option and replaces current direct use the GzEncoder package. - It adds validation to puffin that Gzip is not currently supported (per spec) ## Are these changes tested? Added unit tests to cover additions and existing tests cover the rest.
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @emkornfield for this pr, we are quite close !
|
@blackmwk thanks for the review I think we need to close on a direct for this: https://github.com/apache/iceberg-rust/pull/1876/files#r2752454670
If we can make breaking changes then:
|
|
Hi, @emkornfield
I think this is the right direction to go.
This is the place where we have to make a breaking change, so I think it's fine. |
|
|
||
| #[test] | ||
| fn test_parse_metadata_file_compression_valid() { | ||
| use super::parse_metadata_file_compression; |
There was a problem hiding this comment.
need to move these imports out and clean up this code a little bit.
|
@blackmwk thanks for the review, I believe I addressed comments. |
|
@blackmwk just wanted to ping on this? Thank you for your time. |
Which issue does this PR close?
Split off from #1851
What changes are included in this PR?
This change honors the compression setting for metadata.json file (
write.metadata.compression-codec).Are these changes tested?
Add unit test to verify files are gzipped when the flag is enabled.
BREAKING CHANGE: Make
write_totakeMetadataLocation