get all composite key candidates function added#58
get all composite key candidates function added#58souvik-databricks wants to merge 1 commit intomainfrom
Conversation
| if exclude_cols is None: | ||
| exclude_cols = [] | ||
| df_col_excluded = df.drop(*exclude_cols) | ||
| col_select_condition = df_col_excluded.distinct().count() |
There was a problem hiding this comment.
@souvik-databricks why does it need to distinct? If .distinct().count() < .count(), there is no unique combination, or?
There was a problem hiding this comment.
@robertkossendey .distinct() is required because if your data has duplicates in them and then the collect_set() in line 401 will matchup and we have only unique combination of values for the columns.
There was a problem hiding this comment.
@souvik-databricks still struggling to understand, could you add a test with a completely duplicate line? Because that should return no combination at all right?
There was a problem hiding this comment.
@robertkossendey So in that scenario if you have a case of duplicate record then that would indicate that from the columns that you have included to search in, you don't have valid composite key combos. You would have to include more columns to act as part of the composite key.
If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.
And I will add a test with duplicate record with the explanation of the output for that test case in the readme.
There was a problem hiding this comment.
If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.
@souvik-databricks but isn't that the information that there is no valid key candidate in the provided dataset?
There was a problem hiding this comment.
I was OOO for last week. @robertkossendey I am back now, I will resume the work now.
|
This looks like a great start. Can you add more tests for the edge cases like when the result is empty, when the DataFrame contains null values, etc? |
| column for column in df_col_excluded.columns if | ||
| df_col_excluded.select(column).collect()[0][0] == col_select_condition | ||
| ] | ||
| df_col_excluded.select("column_combos ->", *columns).show(truncate=False) |
There was a problem hiding this comment.
@souvik-databricks we can remove that show, right?
There was a problem hiding this comment.
Yes we can remove the show
|
@souvik-databricks - let me know when you've addressed the feedback from @robertkossendey and this is ready for another review, thank you! |
|
@souvik-databricks - following up on this one. Are you planning to address this? |
|
@MrPowers Let's go through this details over our sync. |
|
@souvik-databricks - anything else we need to push this over the finish line? |
Added the function to get all of the possible composite key candidates for a delta table