[ENH] V1 → V2 API Migration - datasets #1608
Conversation
Codecov Report
❌ Patch coverage is

```diff
@@            Coverage Diff            @@
##             main    #1608     +/-  ##
==========================================
- Coverage   52.75%   43.51%   -9.25%
==========================================
  Files          36       47      +11
  Lines        4333     4902     +569
==========================================
- Hits         2286     2133     -153
- Misses       2047     2769     +722
==========================================
```
FYI @geetu040 Currently the
Issues:
Example (current):

```python
def _get_dataset_features_file(did_cache_dir: str | Path | None, dataset_id: int) -> dict[int, OpenMLDataFeature]:
    ...
    return _features
```

Or by updating the Dataset class to use the underlying interface method from api_context directly:

```python
def _load_features(self) -> None:
    ...
    self._features = api_context.backend.datasets.get_features(self.dataset_id)
```

Another option is to add
Left an intermediate review. This is solid work and well done overall. Nice job. I'll look into the download part now.
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame: ...
```
Can we not have the same signature for all 3 methods: DatasetsAPI.list, DatasetsV1.list, DatasetsV2.list? Does it raise pre-commit failures, since a few parameters might not be used?
Oh, that v2 signature was experimental; not sure how the pre-commit hooks did not catch that. Will make them the same.
Is mypy supposed to catch that?
Yes, unused parameters are caught under ARG001, as seen with the cache_directory params.
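For reference, a unified signature shared by the abstract base and both implementations could look like the sketch below; the class layout here is an assumption, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Any

import pandas as pd


class DatasetsAPI(ABC):
    @abstractmethod
    def list(
        self,
        limit: int,
        offset: int,
        *,
        data_id: list[int] | None = None,
        **kwargs: Any,
    ) -> pd.DataFrame: ...


class DatasetsV1(DatasetsAPI):
    def list(
        self,
        limit: int,
        offset: int,
        *,
        data_id: list[int] | None = None,
        **kwargs: Any,
    ) -> pd.DataFrame:
        # v1 encodes the filters in the URL; body omitted here
        raise NotImplementedError
```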
```python
def list(
    self,
    limit: int,
    offset: int,
    *,
    data_id: list[int] | None = None,  # type: ignore
    **kwargs: Any,
) -> pd.DataFrame:
```
You can simplify this using private helper methods.
@geetu040 I could not find a good common block to extract into a helper, since the filters are passed via the URL in v1 and via a JSON object in v2, and both have different parsing. If you have any specific idea on that, please let me know.
Looking at DatasetsV1, I can think of these helper methods: _build_list_call, _parse_and_validate_xml, _parse_dataset.
You can do something similar for DatasetsV2, though they can be different.
> you can do something similar for DatasetsV2 though they can be different.

I see, that opens up more options.
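A minimal sketch of how DatasetsV1.list could be split along those lines; the helper bodies and the URL layout below are assumptions, not the PR's actual code:

```python
from typing import Any
import xml.etree.ElementTree as ET

import pandas as pd


class DatasetsV1:
    def __init__(self, http_client):
        self._http = http_client  # assumed to return response text from GET calls

    def list(self, limit: int, offset: int, **kwargs: Any) -> pd.DataFrame:
        url = self._build_list_call(limit, offset, **kwargs)
        root = self._parse_and_validate_xml(self._http.get(url))
        return pd.DataFrame([self._parse_dataset(node) for node in root])

    def _build_list_call(self, limit: int, offset: int, **filters: Any) -> str:
        # v1 passes filters as URL path segments
        parts = [f"data/list/limit/{limit}/offset/{offset}"]
        parts += [f"{key}/{value}" for key, value in filters.items() if value is not None]
        return "/".join(parts)

    def _parse_and_validate_xml(self, text: str) -> ET.Element:
        root = ET.fromstring(text)
        if "data" not in root.tag:
            raise ValueError("Unexpected XML returned by the v1 list endpoint")
        return root

    def _parse_dataset(self, node: ET.Element) -> dict[str, Any]:
        # strip XML namespaces from tags and keep the text content
        return {child.tag.split("}")[-1]: child.text for child in node}
```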
openml/_api/resources/datasets.py (Outdated)
```python
        bool
            True if the deletion was successful. False otherwise.
        """
        return openml.utils._delete_entity("data", dataset_id)
```
If you implement the delete logic yourself instead of using openml.utils._delete_entity, how would that look? I think it would be better.
Makes sense. It would look like a DELETE request from the client, along with exception handling.
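A minimal sketch of what that could look like, assuming the resource holds an HTTP client with requests-style verbs and reusing openml.exceptions.OpenMLServerError; the endpoint path and status-code handling here are assumptions:

```python
from openml.exceptions import OpenMLServerError


class DatasetsV2:
    def __init__(self, http_client):
        self._http = http_client  # assumed requests-style client

    def delete(self, dataset_id: int) -> bool:
        """Issue the DELETE request directly instead of going through openml.utils._delete_entity."""
        response = self._http.delete(f"datasets/{dataset_id}")
        if response.status_code == 200:
            return True
        # surface server-side failures (e.g. not the owner) as a readable error
        raise OpenMLServerError(
            f"Unable to delete dataset {dataset_id}: {response.status_code} {response.text}"
        )
```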
```python
def list(
    self,
    limit: int,
    offset: int,
    **kwargs: Any,
) -> pd.DataFrame:
```
same as above, it can use private helper methods
```python
    # Minimalistic check if the XML is useful
    if "oml:data_qualities_list" not in qualities:
        raise ValueError('Error in return XML, does not contain "oml:data_qualities_list"')

    from openml._api import api_context
```
Can't we have this import at the very top? Does it create a circular import error? If not, it should be moved to the top in all functions.
It does raise a circular import error.
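For context, the function-local import pattern in question looks like this; the wrapper function shown is illustrative only:

```python
def _load_features(dataset_id: int):
    # imported inside the function rather than at module top:
    # openml._api imports from openml.datasets at import time,
    # so a top-level import here would be circular
    from openml._api import api_context

    return api_context.backend.datasets.get_features(dataset_id)
```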
Thanks for the detailed explanation, I now have a good understanding of the download mechanism.
- minio can be handled easily, we will use a separate client along with
- these are actually different objects in both APIs; v1 uses XML and v2 keeps them in JSON
- yes, you are right, they are the same files, which do not need to be downloaded again for both versions, but isn't this true for almost all the HTTP objects? they may have different formats
- I don't understand this point
- agreed, should be handled by
- agreed, adding

In conclusion, I may ask: if we ignore the fact that it downloads the
@geetu040 making a new client for

FYI the new commit adds better handling of features and qualities in the OpenMLDataset class, moving the v1-specific parsing logic to the interface. So the only part left is to handle
From the standup discussion and earlier conversations, I think we can agree on a few points:
Consider this a green light to experiment with the client design. Try an approach, use whatever caching strategy you think fits best, and aim for a clean, sensible design. Feel free to ask for suggestions or reviews along the way; I'll review it in code. Just make sure this doesn't significantly impact the base classes or other stacked PRs.
The points do make sense to me. I will propose the design along with how it would be used in the resource.
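One possible shape for such a client, as a minimal sketch assuming a plain requests-based downloader with a directory cache; all names here are hypothetical:

```python
from pathlib import Path

import requests


class FileDownloadClient:
    """Hypothetical standalone download client with a simple directory cache."""

    def __init__(self, base_url: str, cache_dir: Path):
        self.base_url = base_url.rstrip("/")
        self.cache_dir = cache_dir

    def download_file(self, path: str) -> Path:
        """Download base_url/path into the cache, skipping the request on a cache hit."""
        target = self.cache_dir / path
        if target.exists():
            return target
        target.parent.mkdir(parents=True, exist_ok=True)
        response = requests.get(f"{self.base_url}/{path}", timeout=60)
        response.raise_for_status()
        target.write_bytes(response.content)
        return target
```

Returning a Path from the client would keep the resource-level methods (like download_features_file, quoted further below) thin wrappers that only build the path.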
@geetu040 I have a design implemented which needs review.
Question:
I have taken a quick look; the design looks really good, though I have some suggestions/questions in the code, which I will leave in a detailed review. But this in general fixes all our blockers without damaging the original design.
Is it provided by the user? I don't think so. In that case, how does it affect the users? From looking at the code, this cache directory is generated programmatically inside the functions; we can completely remove these lines and always rely on the
This makes sense now. Having an independent download method as I have set up is better than updating requests/caching to return a path, right?
Yes, that would work, but the function definitions would be changed, i.e. the tests etc. corresponding to them.
I am not sure about that, it would require a detailed review.
Yes, that is expected.
```python
        path = f"data/features/{dataset_id}"
        return self.download_file(path)

    def download_qualities_file(self, dataset_id: int) -> Path:
```
May I know why this function (and similar functions), which are used only in DatasetsV1 and V2 and not called inside functions.py, are not private? It becomes hard to distinguish what is actually being called by the SDK and what is being abstracted away.
It is called in functions.py; I haven't updated it yet, I was still working on it.
openml/_api/http/client.py (Outdated)
```python
        parsed_url = urllib.parse.urlparse(source)

        # expect path format: /BUCKET/path/to/file.ext
        _, bucket, *prefixes, _file = parsed_url.path.split("/")
```
_file should be renamed to _ given it is never used; surprised ruff does not call it out.
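The suggested rename would be the one-liner below. (ruff likely skips this because underscore-prefixed names match its default dummy-variable regex and tuple-unpacking targets are not flagged by the unused-variable rule by default; this reasoning is an assumption, not something confirmed in the thread.)

```python
# the filename component is never used, so name it "_"
_, bucket, *prefixes, _ = parsed_url.path.split("/")
```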
openml/datasets/functions.py (Outdated)
```diff
     Parameters
     ----------
-    data_id : list, optional
+    dataset_id : list, optional
```
Should be data_id
```python
    -------
    Dataset id
    """
    if not isinstance(data_id, int):
```
2 instances of this are being removed, why?
```diff
 def _get_dataset_parquet(
     description: dict | OpenMLDataset,
-    cache_directory: Path | None = None,
+    cache_directory: Path | None = None,  # noqa: ARG001
```
Why the ARG001 addition? I assume ruff checks were passing before this was added too.
Edit 1: Just saw that this is not being used, why not remove it then?
Edit 2: This is happening in more than one place; I won't mention it in all of them, so we can just discuss it here.
Still under work. This method is now updated to not use the param; I'm working on removing it here and updating the corresponding test when I start writing tests for this PR, as mentioned in the meet.
```python
    return api_context.backend.datasets.download_dataset_parquet(description, download_all_files)


def _get_dataset_arff(
```
@geetu040 This function is no longer used but is still present, because it is used in tests. Should the tests involving it be removed, or should this function be kept? (The same scenario exists in the tasks migration too.)
Edit: I see this happening in multiple places (_get_dataset_features_file and _get_dataset_qualities_file).
```python
    else:
        raise TypeError("`description` should be either OpenMLDataset or Dict.")

    save_cache_directory = (
```
I assume cache paths for V1 and V2 are different, hence this code is being removed?
```python
        bool
            True if the deletion was successful. False otherwise.
        """
        return openml.utils._delete_entity("data", dataset_id)
```
@geetu040 @JATAYU000 I remember being on a call and discussing this, didn't we decide to implement this in the base PR (core) itself and leave it out of our migration implementations? If not, I will need to change tasks.
Yes, it was; since it is not yet added there, I haven't removed this yet.
satvshr left a comment
Left a few comments, will look at this again once it is actually ready for review with all implementations complete.
Is there an API_KEY that you are using to test the endpoints? I was going to run a test script for your code, but I could not, given I have no API_KEY for it.
If you are talking about the invalid api_key regex match in v2, you can set this in the config.
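For example, via the standard config attribute; the key value below is a placeholder, not a working key:

```python
import openml

# placeholder value; use a real API key from your OpenML account page
openml.config.apikey = "0123456789abcdef0123456789abcdef"
```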
Metadata
Reference Issue: [ENH] V1 → V2 API Migration - datasets #1592
Depends on: [ENH] V1 → V2 API Migration - core structure #1576
Change Log Entry: This PR implements the Datasets resource and refactors its existing functions.