feat: add lambda expression support#710
Conversation
benbellick
left a comment
There was a problem hiding this comment.
Only just started looking at this but flushing a few comments just to get the ball rolling. On the whole this is looking good though, nice work!
| private final Type.Struct rootType; | ||
| private final ProtoTypeConverter protoTypeConverter; | ||
| private final ProtoRelConverter protoRelConverter; | ||
| private final List<Type.Struct> lambdaParameterStack = new ArrayList<>(); |
There was a problem hiding this comment.
I need to access lambdaIndex = lambdaParameterStack.size() - 1 - stepsOut here: lambdaParameters = lambdaParameterStack.get(lambdaIndex); when building the Lambda parametre ref that's why I didn't use a stack
There was a problem hiding this comment.
That's fair. Java's stack does extend Vector though so you still get the get method. I'm not so particular on this one.
The only thing I will say is I do think that the whole stack parameter stuff can be confusing for people without a little bit of PL experience, which is why I pushed @Slimsammylim to encapsulate some of that logic and put it in docstring comments. I could see a case for encapsulating this logic in a local private class, though it probably makes more sense to do this only if this logic comes up again.
I expected to see something like this logic elsewhere, as you need to do it to do validation when building lambdas. Though I didn't find anything like this. Does the builder for lambdas in this PR actually validate that the lambda is semantically correct?
There was a problem hiding this comment.
I agree that it would be nice for clarity to encapsulate this quite specific behaviour in a class (single responsibility principle). It doesn't need to extend one of the full collection interfaces, just provide the methods you need:
void push(Type.Struct value)Type.Struct pop()(throw runtime NoSuchElementException if empty)Type.Struct get(int stepsOut)(throw runtime IndexOutOfBoundsException on invalid stepsOut)
Or make it generic if you will reuse it for other types. The caller doesn't need to worry about clever indexing logic or what collection implementation is used internally; just make simple method calls that clearly describe the intent.
There was a problem hiding this comment.
Hey @benbellick, for your last point I don't quite understand what validation I should add, doesn't the validation happen when building the LAMBDA_PARAMETER_REFERENCE?
core/src/main/java/io/substrait/expression/proto/ProtoExpressionConverter.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/substrait/expression/proto/ProtoExpressionConverter.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/substrait/expression/proto/ProtoExpressionConverter.java
Outdated
Show resolved
Hide resolved
|
|
||
| Type.Struct lambdaParameters = lambdaParameterStack.get(lambdaIndex); | ||
| return FieldReference.newLambdaParameterReference( | ||
| reference.getDirectReference().getStructField().getField(), |
There was a problem hiding this comment.
Something I'm curious about is why we use reference.getDirectReference().getStructField().getField() here instead of getDirectReferenceSegments(reference.getDirectReference()) like the ROOT_REFERENCE case does.
I'm unfamiliar with substrait-java, but does this current implementation support nested field access through lambda parameters?
*This may be an unimportant issue especially if it's not common to access nested fields in lambdas
There was a problem hiding this comment.
because the ofRoot method takes the struct itself while the newLambdaParameterReference takes the index, like the newRootStructOuterReference methode
There was a problem hiding this comment.
Ah, I see. I am still confused as to whether this supports nested field access through lambda parameters. I messed around w some tests through claude, and I believe it is unsupported. However, this is a nit and I'm not sure it will come up often in practice, so we can come back to this later if it ends up being an issue!
There was a problem hiding this comment.
I am still confused as to whether this supports nested field access through lambda parameters.
If by nested access you mean accessing nested fields of a lambda parameter inside the body, for example
($0) -> $0.a.[0].hello then I think that is broken here actually. That might be fine for now and we can treat it is a follow-up item, but we should detect if we have a nested field lookup happening by checking to to content of reference.getDirectReference().getStructField().getChild(); If it's empty, there is only a direct reference to the lambda param. If it's not empty we can blow up.
To @Slimsammylim's we point, to handle this correctly we would need to use getDirectReferenceSegments(reference.getDirectReference()) to create the list of reference segments and then re-use some of the other constructors.
There was a problem hiding this comment.
ok I will add a check and go back to it in a follow up pr.
core/src/main/java/io/substrait/expression/proto/ProtoExpressionConverter.java
Show resolved
Hide resolved
|
So I was trying to test this by reading in the test plans from https://github.com/substrait-io/substrait-go/tree/main/expr/testdata/lambda, but then I realized that we were not loading in the functions_list.yaml extensions, and then I realize that we cannot load the extensions because the ANTLR parser hasn't been updated. I made a PR to do just that, because it's very out of date and required a little extra care #728. Once that's in, I can use the test vectors to help verify this work. We also need that PR anyways, because we can't load any of the functions that use lambdas without it 😓 |
Note that these had to be tweaked slightly because they were not entirely valid. This will be fixed in substrait-go.
|
I made a PR into this PR as part of understanding how it works, and fixing a small gap when loading Substrait plans with lambdas: limameml#1 |
test: add lambda plan roundtrip tests
vbarua
left a comment
There was a problem hiding this comment.
Started taking a more thorough look and found a couple of others things. I'll continue reviwing this tomorrow.
| // TO DO: fix Lambda return type once this issue | ||
| // https://github.com/substrait-io/substrait/issues/976 is resolved |
There was a problem hiding this comment.
| // TO DO: fix Lambda return type once this issue | |
| // https://github.com/substrait-io/substrait/issues/976 is resolved | |
| // TODO: Type.Func nullability is hardcoded to false here because the spec does not allow for | |
| // declaring otherwise. | |
| // See: https://github.com/substrait-io/substrait/issues/976 |
Being a bit more precise about what the issue is, and what the implications are in the code.
| return type.parameterTypes().stream().mapToInt(p -> p.accept(this)).sum() | ||
| + type.returnType().accept(this); |
There was a problem hiding this comment.
This visitor is used to determine how many names are needed for a ReadRel based on the schema.
For example, a table with row type struct<i8, i16, i32> needs 3 names. A table like struct<i8, struct<i16, i32>, i64> needs 5 names, because in Substrait we name every struct field. For full details see https://substrait.io/types/named_structs/#determining-names
Now, given something like struct<i8, func<...>, i32> we don't need to look inside the func for names. The type of the column is just func, and none of the parameters to the func can be reference by name.
All of this is to say that this can just be 0.
| return type.parameterTypes().stream().mapToInt(p -> p.accept(this)).sum() | |
| + type.returnType().accept(this); | |
| return 0; |
| @Override | ||
| public String visit(Expression.Lambda expr, EmptyVisitationContext context) | ||
| throws RuntimeException { | ||
| return "<Lambda>"; |
There was a problem hiding this comment.
minor: extract for consistency with the other methods
| return "<Lambda>"; | |
| return "<Lambda >"; |
|
|
||
| Type.Struct lambdaParameters = lambdaParameterStack.get(lambdaIndex); | ||
| return FieldReference.newLambdaParameterReference( | ||
| reference.getDirectReference().getStructField().getField(), |
There was a problem hiding this comment.
I am still confused as to whether this supports nested field access through lambda parameters.
If by nested access you mean accessing nested fields of a lambda parameter inside the body, for example
($0) -> $0.a.[0].hello then I think that is broken here actually. That might be fine for now and we can treat it is a follow-up item, but we should detect if we have a nested field lookup happening by checking to to content of reference.getDirectReference().getStructField().getChild(); If it's empty, there is only a direct reference to the lambda param. If it's not empty we can blow up.
To @Slimsammylim's we point, to handle this correctly we would need to use getDirectReferenceSegments(reference.getDirectReference()) to create the list of reference segments and then re-use some of the other constructors.
vbarua
left a comment
There was a problem hiding this comment.
I did leave a couple of very minor suggestions, and I think that there is one check worth enforcing in #710 (comment).
This is a solid pass at integrating lambdas end-to-end, thanks for working on in @limameml. If you can take a look at the above I think we can get this merged this week.
cc: @bestbeforetoday is there anything else you'd like to see in this PR
| */ | ||
| class LambdaExpressionRoundtripTest extends TestBase { | ||
|
|
||
| /** Test that lambdas with no parameters are valid. Building: () -> i32(42) : func<() -> i32> */ |
There was a problem hiding this comment.
nit: The parenthesis make it look like a cast to me 🤷
| /** Test that lambdas with no parameters are valid. Building: () -> i32(42) : func<() -> i32> */ | |
| /** Test that lambdas with no parameters are valid. Building: () -> 42 : func<() -> i32> */ |
| FieldReference invalidRef = FieldReference.newLambdaParameterReference(0, params, 1); | ||
|
|
||
| Expression.Lambda lambda = | ||
| Expression.Lambda.builder().parameters(params).body(invalidRef).build(); |
There was a problem hiding this comment.
Shouldn't this throw an exception? Perhaps I am misunderstanding what the builder does, but don't we expect the builder to validate that we built a correct lambda? Why are we able to build it, and convert it to a proto, but not convert it back?
There was a problem hiding this comment.
IMO this issue will need to be resolved before merging. Sorry to be a blocker! It doesn't make sense to me that we would be able to make an invalid proto and only validate on the way back, unless I am misunderstanding something.
There was a problem hiding this comment.
^ You can see an example here of how in substrait-go the builder throws an error
There was a problem hiding this comment.
Good point Ben. It would make sense for these to fail, especially given that we're already claiming them as validation errors that should fail.
| Expression.Lambda.builder().parameters(outerParams).body(innerLambda).build(); | ||
|
|
||
| // Convert to proto | ||
| io.substrait.proto.Expression protoExpression = expressionProtoConverter.toProto(outerLambda); |
There was a problem hiding this comment.
Same issue here as above I think.
benbellick
left a comment
There was a problem hiding this comment.
Sorry to block this, I know that you have already gotten an approval. I just want to make sure that this is addressed before merging. Thanks!
| FieldReference invalidRef = FieldReference.newLambdaParameterReference(0, params, 1); | ||
|
|
||
| Expression.Lambda lambda = | ||
| Expression.Lambda.builder().parameters(params).body(invalidRef).build(); |
There was a problem hiding this comment.
IMO this issue will need to be resolved before merging. Sorry to be a blocker! It doesn't make sense to me that we would be able to make an invalid proto and only validate on the way back, unless I am misunderstanding something.
This PR is to add support for the Lambda Expression
Summary
Test plan
closes #687