Add support for scalar values with extension types #1301

@timsaucer

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Suppose I have a pyarrow scalar value whose type is an extension type. If I turn that into a literal expression in datafusion, the associated extension metadata should come along transparently, with no extra work from the user.

Consider this minimal example:

import pyarrow as pa
import uuid
from datafusion import lit

value = pa.scalar(uuid.uuid4().bytes, pa.uuid())

print(lit(value))

This currently fails with ArrowTypeError: Expected bytes, got a 'UUID' object. That error can be worked around with this simple patch:

--- a/src/pyarrow_util.rs
+++ b/src/pyarrow_util.rs
@@ -30,7 +30,11 @@ impl FromPyArrow for PyScalarValue {
     fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
         let py = value.py();
         let typ = value.getattr("type")?;
-        let val = value.call_method0("as_py")?;
+        let val = if value.hasattr("value")? {
+            value.getattr("value")?
+        } else {
+            value.call_method0("as_py")?
+        };

But then we still don't have the metadata. It is lost, and we get a bare fixed-size binary.

Describe the solution you'd like

The above code should just work. I have done a little investigation, and using the Arrow PyCapsule interface we can get the schema of the array we generate inside PyScalarValue::from_pyarrow_bound. We can then plumb this through when calling lit().

Ideally we would take this opportunity to ensure that PyScalarValue::from_pyarrow_bound also supports libraries other than pyarrow. There have been a few complaints that we are too tightly coupled to pyarrow. In particular, it would be good to demonstrate that converting a Python scalar value works for:

  • pyarrow
  • nanoarrow
  • arro3
  • polars

I don't think we necessarily need to support pandas, since it is not an Arrow library.
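
One library-agnostic way to approach this is to detect the Arrow PyCapsule interface by duck typing instead of importing pyarrow. The following is a minimal sketch of that idea, not the datafusion implementation. The dunder names come from the Arrow PyCapsule interface; the helper `arrow_capsule_methods` and the class `FakeScalar` are hypothetical names used only for illustration. Whether each library listed above exposes these methods on its scalar objects is an assumption that would need to be checked per library.

```python
# Dunder methods defined by the Arrow PyCapsule interface.
ARROW_CAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def arrow_capsule_methods(obj):
    """Return the PyCapsule-interface methods that obj implements (hypothetical helper)."""
    return [
        name
        for name in ARROW_CAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class FakeScalar:
    """Hypothetical stand-in for a scalar object from any Arrow library."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real library would return a (schema capsule, array capsule) pair.
        raise NotImplementedError("illustration only")


# An object exposing the interface is detected without importing its library.
print(arrow_capsule_methods(FakeScalar()))
# A plain Python object implements none of the methods.
print(arrow_capsule_methods(42))
```

A check like this would let the conversion path branch on capability rather than on the concrete library, which is the decoupling from pyarrow described above.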

Describe alternatives you've considered

Alternatively the user can manually turn their data into the underlying storage and then attach the metadata from their extension type. This feels like a poor user experience.

Additional context

This came up during a different investigation:

Also worth evaluating while we're doing this: For scalar values, is it possible for them to contain metadata? If I do pa.scalar(uuid.uuid4().bytes, type=pa.uuid()) and I check the type I should have the extension data. Maybe this is already supported, but as part of this PR I want to evaluate that as well.

Originally posted by @timsaucer in #1299 (comment)

Labels: enhancement (New feature or request)
