feat: Add Geometry & Geography Types #2859
Open. SaymV wants to merge 4 commits into apache:main from SaymV:feat/geospatial-types
+1,060
−1
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| # RFC: Iceberg v3 Geospatial Primitive Types | ||
|
|
||
| ## Motivation | ||
|
|
||
| Apache Iceberg v3 introduces native geospatial types (`geometry` and `geography`) to support spatial data workloads. These types enable: | ||
|
|
||
| 1. **Interoperability**: Consistent spatial data representation across Iceberg implementations | ||
| 2. **Query optimization**: Future support for spatial predicate pushdown | ||
| 3. **Standards compliance**: Alignment with OGC and ISO spatial data standards | ||
|
|
||
| This RFC describes the design and implementation of these types in PyIceberg. | ||
|
|
||
| ## Scope | ||
|
|
||
| **In scope:** | ||
|
|
||
| - `geometry(C)` and `geography(C, A)` primitive type definitions | ||
| - Type parsing and serialization (round-trip support) | ||
| - Avro mapping (WKB bytes) | ||
| - PyArrow/Parquet conversion (with version-aware fallback) | ||
| - Format version enforcement (v3 required) | ||
|
|
||
| **Out of scope (future work):** | ||
|
|
||
| - Spatial predicate pushdown (e.g., ST_Contains, ST_Intersects) | ||
| - WKB/WKT conversion (requires external dependencies) | ||
| - Geometry/geography bounds metrics | ||
| - Spatial indexing | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| - Adding heavy dependencies like Shapely, GEOS, or GeoPandas | ||
| - Implementing spatial operations or computations | ||
| - Supporting format versions < 3 | ||
|
|
||
| ## Design | ||
|
|
||
| ### Type Parameters | ||
|
|
||
| **GeometryType:** | ||
|
|
||
| - `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"` | ||
|
|
||
| **GeographyType:** | ||
|
|
||
| - `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"` | ||
| - `algorithm` (string): Geographic algorithm, defaults to `"spherical"` | ||
|
|
||
| ### Type String Format | ||
|
|
||
| ```python | ||
| # Default parameters | ||
| "geometry" | ||
| "geography" | ||
|
|
||
| # With custom CRS | ||
| "geometry('EPSG:4326')" | ||
| "geography('EPSG:4326')" | ||
|
|
||
| # With custom CRS and algorithm | ||
| "geography('EPSG:4326', 'planar')" | ||
| ``` | ||
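The type strings above can be parsed and re-serialized with very little machinery. The following is a minimal sketch of that round trip, not the PR's actual parser: `GeoTypeSpec`, `parse_geo_type`, `format_geo_type`, and the regular expression are hypothetical names chosen for illustration, and the defaults are the ones listed under Type Parameters.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_CRS = "OGC:CRS84"
DEFAULT_ALGORITHM = "spherical"

# Hypothetical pattern: a type name, optionally followed by one or two
# single-quoted parameters in parentheses.
_TYPE_RE = re.compile(
    r"^(geometry|geography)"
    r"(?:\(\s*'([^']+)'\s*(?:,\s*'([^']+)'\s*)?\))?$"
)


class GeoTypeSpec(NamedTuple):
    name: str
    crs: str
    algorithm: Optional[str]  # always None for geometry


def parse_geo_type(type_string: str) -> GeoTypeSpec:
    """Parse a type string, filling in defaults for omitted parameters."""
    match = _TYPE_RE.match(type_string.strip())
    if match is None:
        raise ValueError(f"Not a geospatial type string: {type_string!r}")
    name, crs, algorithm = match.groups()
    if name == "geometry":
        if algorithm is not None:
            raise ValueError("geometry accepts a CRS parameter only")
        return GeoTypeSpec(name, crs or DEFAULT_CRS, None)
    return GeoTypeSpec(name, crs or DEFAULT_CRS, algorithm or DEFAULT_ALGORITHM)


def format_geo_type(spec: GeoTypeSpec) -> str:
    """Serialize back to the shortest type string that preserves the parameters."""
    if spec.name == "geography" and spec.algorithm != DEFAULT_ALGORITHM:
        return f"geography('{spec.crs}', '{spec.algorithm}')"
    if spec.crs != DEFAULT_CRS:
        return f"{spec.name}('{spec.crs}')"
    return spec.name
```

Under these assumptions, `format_geo_type(parse_geo_type(s))` returns `s` unchanged for each of the five strings listed above, which is the round-trip property the Scope section asks for.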
|
|
||
| ### Runtime Representation | ||
|
|
||
| Values are stored as WKB (Well-Known Binary) bytes at runtime. This matches the Avro and Parquet physical representation per the Iceberg spec. | ||
|
|
||
| ### JSON Single-Value Serialization | ||
|
|
||
| Per the Iceberg spec, geometry/geography values should be serialized as WKT (Well-Known Text) strings in JSON. However, since we represent values as WKB bytes at runtime, conversion between WKB and WKT would require external dependencies. | ||
|
|
||
| **Current behavior:** `NotImplementedError` is raised for JSON serialization/deserialization until a conversion strategy is established. | ||
|
|
||
| ### Avro Mapping | ||
|
|
||
| Both geometry and geography types map to Avro `bytes` type, consistent with `BinaryType` handling. | ||
|
|
||
| ### PyArrow/Parquet Mapping | ||
|
|
||
| **With geoarrow-pyarrow installed:** | ||
|
|
||
| - Geometry types convert to GeoArrow WKB extension type with CRS metadata | ||
| - Geography types convert to GeoArrow WKB extension type with CRS and edge type metadata | ||
| - Uses `geoarrow.pyarrow.wkb().with_crs()` and `.with_edge_type()` for full GeoArrow compatibility | ||
|
|
||
| **Without geoarrow-pyarrow:** | ||
|
|
||
| - Geometry and geography types fall back to `pa.large_binary()` | ||
| - This provides WKB storage without GEO logical type metadata | ||
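The choice between the two mappings comes down to detecting an optional package at conversion time. The sketch below shows one way to do that detection using only the standard library. `geoarrow_available` and `arrow_mapping_for` are hypothetical helpers, and the returned strings are illustrative labels rather than real Arrow type objects; the actual conversion in `pyiceberg/io/pyarrow.py` may be structured differently.

```python
import importlib.util


def geoarrow_available() -> bool:
    """Return True if the optional geoarrow-pyarrow package can be located."""
    try:
        return importlib.util.find_spec("geoarrow.pyarrow") is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError (an ImportError) when the
        # parent package `geoarrow` is not installed at all.
        return False


def arrow_mapping_for(iceberg_type: str) -> str:
    """Name the Arrow representation that would be chosen for a geospatial type."""
    if iceberg_type not in ("geometry", "geography"):
        raise ValueError(f"Not a geospatial type: {iceberg_type!r}")
    if not geoarrow_available():
        # Fallback described above: plain WKB bytes, no extension metadata.
        return "large_binary"
    # With geoarrow-pyarrow: WKB extension type carrying CRS (and, for
    # geography, edge type) metadata.
    return "geoarrow.wkb"
```

Checking with `find_spec` avoids importing the package just to test for its presence, so the fallback path stays cheap when the extra is not installed.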
|
|
||
| ## Compatibility | ||
|
|
||
| ### Format Version | ||
|
|
||
| Geometry and geography types require Iceberg format version 3. Attempting to use them with format version 1 or 2 will raise a validation error via `Schema.check_format_version_compatibility()`. | ||
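As a rough illustration of what such a check does, the sketch below validates a mapping of column names to type names against a table's format version. It is a stand-in under stated assumptions, not the body of `Schema.check_format_version_compatibility()`: `MIN_FORMAT_VERSION` and `check_format_version` are hypothetical names.

```python
from typing import Mapping

# Hypothetical lookup: types not listed here are assumed valid from version 1.
MIN_FORMAT_VERSION = {"geometry": 3, "geography": 3}


def check_format_version(columns: Mapping[str, str], format_version: int) -> None:
    """Raise ValueError if any column type needs a newer format version."""
    for name, type_name in columns.items():
        required = MIN_FORMAT_VERSION.get(type_name, 1)
        if format_version < required:
            raise ValueError(
                f"Column {name!r} has type {type_name}, which requires format "
                f"version {required}; the table uses version {format_version}"
            )
```

Calling `check_format_version({"id": "int", "location": "geometry"}, 2)` raises, while the same columns pass with `format_version=3`.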
|
|
||
| ### geoarrow-pyarrow | ||
|
|
||
| - **Optional dependency**: Install with `pip install pyiceberg[geoarrow]` | ||
| - **Without geoarrow**: Geometry/geography stored as binary columns (WKB) | ||
| - **With geoarrow**: Full GeoArrow extension type support with CRS/edge metadata | ||
|
|
||
| ### Breaking Changes | ||
|
|
||
| None. These are new types that do not affect existing functionality. | ||
|
|
||
| ## Dependency/Versioning | ||
|
|
||
| **Required:** | ||
|
|
||
| - PyIceberg core (no new dependencies) | ||
|
|
||
| **Optional for full functionality:** | ||
|
|
||
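| - PyArrow 21.0.0+ for native Parquet GEO logical types | ||
| - `geoarrow-pyarrow` for GeoArrow extension types with CRS/edge metadata (`pip install pyiceberg[geoarrow]`) | ||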
|
|
||
| ## Testing Strategy | ||
|
|
||
| 1. **Unit tests** (`test_types.py`): | ||
| - Type creation with default/custom parameters | ||
| - `__str__` and `__repr__` methods | ||
| - JSON serialization/deserialization round-trip of the type itself (single-value JSON serialization raises `NotImplementedError`) | ||
| - Equality, hashing, and pickling | ||
| - `minimum_format_version()` enforcement | ||
|
|
||
| 2. **Integration tests** (future): | ||
| - End-to-end table creation with geometry/geography columns | ||
| - Parquet file round-trip with PyArrow | ||
|
|
||
| ## Known Limitations | ||
|
|
||
| 1. **No WKB/WKT conversion**: JSON single-value serialization raises `NotImplementedError` | ||
| 2. **No bounds metrics**: Cannot extract bounds from WKB without parsing | ||
| 3. **No spatial predicates**: Query optimization for spatial filters not yet implemented | ||
| 4. **PyArrow < 21.0.0**: Falls back to binary type without GEO metadata | ||
| 5. **Reverse conversion from Parquet**: Binary columns cannot be distinguished from geometry/geography without Iceberg schema metadata | ||
|
|
||
| ## File Locations | ||
|
|
||
| | Component | File | | ||
| |-----------|------| | ||
| | Type definitions | `pyiceberg/types.py` | | ||
| | Conversions | `pyiceberg/conversions.py` | | ||
| | Schema visitors | `pyiceberg/schema.py` | | ||
| | Avro conversion | `pyiceberg/utils/schema_conversion.py` | | ||
| | PyArrow conversion | `pyiceberg/io/pyarrow.py` | | ||
| | Unit tests | `tests/test_types.py` | | ||
|
|
||
| ## References | ||
|
|
||
| - [Iceberg v3 Type Specification](https://iceberg.apache.org/spec/#schemas-and-data-types) | ||
| - [Arrow GEO Proposal](https://arrow.apache.org/docs/format/GeoArrow.html) | ||
| - [Arrow PR #45459](https://github.com/apache/arrow/pull/45459) |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,110 @@ | ||||||
| # Geospatial Types | ||||||
|
|
||||||
| PyIceberg supports Iceberg v3 geospatial primitive types: `geometry` and `geography`. | ||||||
|
|
||||||
| ## Overview | ||||||
|
|
||||||
| Iceberg v3 introduces native support for spatial data types: | ||||||
|
|
||||||
| - **`geometry(C)`**: Represents geometric shapes in a coordinate reference system (CRS) | ||||||
| - **`geography(C, A)`**: Represents geographic shapes with CRS and calculation algorithm | ||||||
|
|
||||||
| Both types store values as WKB (Well-Known Binary) bytes. | ||||||
|
|
||||||
| ## Requirements | ||||||
|
|
||||||
| - Iceberg format version 3 or higher | ||||||
| - `geoarrow-pyarrow` for full GeoArrow extension type support (optional: `pip install pyiceberg[geoarrow]`) | ||||||
|
Contributor
I'm a little confused by this sentence. Is
Contributor
Suggested change
|
||||||
|
|
||||||
| ## Usage | ||||||
|
|
||||||
| ### Declaring Columns | ||||||
|
|
||||||
| ```python | ||||||
| from pyiceberg.schema import Schema | ||||||
| from pyiceberg.types import GeographyType, GeometryType, IntegerType, NestedField | ||||||
|
|
||||||
| # Schema with geometry and geography columns | ||||||
| schema = Schema( | ||||||
| NestedField(1, "id", IntegerType(), required=True), | ||||||
| NestedField(2, "location", GeometryType(), required=True), | ||||||
| NestedField(3, "boundary", GeographyType(), required=False), | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| ### Type Parameters | ||||||
|
|
||||||
| #### GeometryType | ||||||
|
|
||||||
| ```python | ||||||
| # Default CRS (OGC:CRS84) | ||||||
| GeometryType() | ||||||
|
|
||||||
| # Custom CRS | ||||||
| GeometryType("EPSG:4326") | ||||||
| ``` | ||||||
|
|
||||||
| #### GeographyType | ||||||
|
|
||||||
| ```python | ||||||
| # Default CRS (OGC:CRS84) and algorithm (spherical) | ||||||
| GeographyType() | ||||||
|
|
||||||
| # Custom CRS | ||||||
| GeographyType("EPSG:4326") | ||||||
|
|
||||||
| # Custom CRS and algorithm | ||||||
| GeographyType("EPSG:4326", "planar") | ||||||
| ``` | ||||||
|
|
||||||
| ### String Type Syntax | ||||||
|
|
||||||
| Types can also be specified as strings in schema definitions: | ||||||
|
|
||||||
| ```python | ||||||
| # Using string type names | ||||||
| NestedField(1, "point", "geometry", required=True) | ||||||
| NestedField(2, "region", "geography", required=True) | ||||||
|
|
||||||
| # With parameters | ||||||
| NestedField(3, "location", "geometry('EPSG:4326')", required=True) | ||||||
| NestedField(4, "boundary", "geography('EPSG:4326', 'planar')", required=True) | ||||||
| ``` | ||||||
|
|
||||||
| ## Data Representation | ||||||
|
|
||||||
| Values are represented as WKB (Well-Known Binary) bytes at runtime: | ||||||
|
|
||||||
| ```python | ||||||
| # Example: Point(0, 0) in WKB format | ||||||
| point_wkb = bytes.fromhex("0101000000000000000000000000000000000000") | ||||||
| ``` | ||||||
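Because PyIceberg does not bundle a WKB library, it can help to see how small a point value is on the wire. The sketch below builds and reads a 2D point using only the standard `struct` module. `encode_point_wkb` and `decode_point_wkb` are hypothetical helpers for illustration, not part of PyIceberg, and they handle only the plain 2D `Point` geometry (WKB type code 1).

```python
import struct
from typing import Tuple


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte order flag (1 = little-endian), a 4-byte geometry type
    (1 = Point), then two 8-byte IEEE 754 doubles for x and y.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_point_wkb(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D point from WKB, honoring the byte order flag."""
    if len(wkb) != 21:
        raise ValueError(f"A 2D WKB point is 21 bytes, got {len(wkb)}")
    if wkb[0] not in (0, 1):
        raise ValueError(f"Unknown byte order flag: {wkb[0]}")
    endian = "<" if wkb[0] == 1 else ">"
    (geometry_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geometry_type != 1:
        raise ValueError(f"Not a 2D point, WKB type code {geometry_type}")
    return struct.unpack_from(endian + "dd", wkb, 5)
```

For anything beyond points (linestrings, polygons, Z/M coordinates), a dedicated library such as Shapely is the more realistic choice; this sketch is only meant to make the byte layout concrete.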
|
|
||||||
| ## Current Limitations | ||||||
|
|
||||||
| 1. **WKB/WKT Conversion**: Converting between WKB bytes and WKT strings requires external libraries (like Shapely). PyIceberg does not include this conversion to avoid heavy dependencies. | ||||||
|
|
||||||
| 2. **Spatial Predicates**: Spatial filtering (e.g., ST_Contains, ST_Intersects) is not yet supported for query pushdown. | ||||||
|
|
||||||
| 3. **Bounds Metrics**: Geometry/geography columns do not currently contribute to data file bounds metrics. | ||||||
|
|
||||||
| 4. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support. | ||||||
|
|
||||||
| ## Format Version | ||||||
|
|
||||||
| Geometry and geography types require Iceberg format version 3: | ||||||
|
|
||||||
| ```python | ||||||
| from pyiceberg.table import TableProperties | ||||||
|
|
||||||
| # Creating a v3 table (assumes `catalog` is an already-loaded catalog and `schema` is the schema from Declaring Columns) | ||||||
| table = catalog.create_table( | ||||||
| identifier="db.spatial_table", | ||||||
| schema=schema, | ||||||
| properties={ | ||||||
| TableProperties.FORMAT_VERSION: "3" | ||||||
| } | ||||||
| ) | ||||||
| ``` | ||||||
|
|
||||||
| Attempting to use these types with format version 1 or 2 will raise a validation error. | ||||||
Having type information in the docs is really useful!