Skip to content

Make extracted raw data location configurable#378

Open
zmaalick wants to merge 11 commits intomainfrom
156_raw_data_location_configurable
Open

Make extracted raw data location configurable#378
zmaalick wants to merge 11 commits intomainfrom
156_raw_data_location_configurable

Conversation

@zmaalick
Copy link
Collaborator

@zmaalick zmaalick commented Feb 19, 2026

Closes #156.

PR creation checklist for the developer

  • The <issue_number> above ☝️ has been replaced with the issue number.
  • main has been selected as the base branch.
  • The feature branch name follows the format <issue_number>_<short_description_of_feature>.
  • The text of the PR title exactly matches with the text (not including the issue number) of the issue title.
  • Appropriate reviewers have been added to the PR (once it is ready for review).
  • The PR has been assigned to the developer(s).
  • The same labels as on the issue (except for the good first issue label) have been added to the PR.
  • The Climate Model Evaluation Workflow (CMEW) project has been added to the PR.
  • The appropriate milestone has been added to the PR.

Definition of Done for the developer

  • The change in this PR addresses the above issue / all acceptance criteria have been met.
  • The change in this PR follows the requirements in the wiki: Developer Guide (including copyrights).
  • The GitHub Actions workflow checks pass.
  • The tests run locally and pass (Note: the tests are not run by the GitHub Actions workflow, see wiki: Run the tests locally).
  • Updating the Rose metadata (select one of the following):
    • Rose metadata related to the change has been added or updated.
    • The change does not require Rose metadata to be added or updated.
  • Rendering the Rose metadata (select one of the following):
    • The Rose GUI shows the change as expected.
    • The change in this PR does not affect the Rose GUI.
  • Updating the tests (select one of the following):
    • Tests related to the change have been added or updated.
    • The change does not require tests to be added or updated.
  • Updating the user documentation (i.e. everything in the doc directory, including the Quick Start section; select one of the following):
    • The user documentation related to the change has been updated appropriately.
    • The change in this PR does not require the user documentation to be updated.
  • Rendering the user documentation (wiki: Build the documentation locally provides instructions; select one of the following):
    • The HTML pages show the change as expected.
    • The change in this PR does not affect the HTML pages.
  • Updating the API documentation (e.g. docstrings in Python modules; select one of the following):
    • The API documentation related to the change has been updated appropriately.
    • The change in this PR does not affect the API documentation.

PR creation checklist for the reviewer

  • The <issue_number> above ☝️ has been replaced with the issue number.
  • main has been selected as the base branch.
  • The feature branch name follows the format <issue_number>_<short_description_of_feature>.
  • The text of the PR title exactly matches with the text (not including the issue number) of the issue title.
  • Appropriate reviewers have been added to the PR (once it is ready for review).
  • The PR has been assigned to the developer(s).
  • The same labels as on the issue (except for the good first issue label) have been added to the PR.
  • The Climate Model Evaluation Workflow (CMEW) project has been added to the PR.
  • The appropriate milestone has been added to the PR.

Definition of Done for the reviewer

  • The change in this PR addresses the above issue / all acceptance criteria have been met.
  • The change in this PR follows the requirements in the wiki: Developer Guide (including copyrights).
  • The GitHub Actions workflow checks pass.
  • The tests run locally and pass (Note: the tests are not run by the GitHub Actions workflow, see wiki: Run the tests locally).
  • Updating the Rose metadata (select one of the following):
    • Rose metadata related to the change has been added or updated.
    • The change does not require Rose metadata to be added or updated.
  • Rendering the Rose metadata (select one of the following):
    • The Rose GUI shows the change as expected.
    • The change in this PR does not affect the Rose GUI.
  • Updating the tests (select one of the following):
    • Tests related to the change have been added or updated.
    • The change does not require tests to be added or updated.
  • Updating the user documentation (i.e. everything in the doc directory, including the Quick Start section; select one of the following):
    • The user documentation related to the change has been updated appropriately.
    • The change in this PR does not require the user documentation to be updated.
  • Rendering the user documentation (wiki: Build the documentation locally provides instructions; select one of the following):
    • The HTML pages show the change as expected.
    • The change in this PR does not affect the HTML pages.
  • Updating the API documentation (e.g. docstrings in Python modules; select one of the following):
    • The API documentation related to the change has been updated appropriately.
    • The change in this PR does not affect the API documentation.

Important

  • Remember to re-check the Definition of Done after making changes in response to a review.
  • The developer merges the PR.
  • Remember to use the format #<pull_request_number>: <pull_request_title> when writing the merge commit message for the pull request, so the pull request number is immediately visible on GitHub, regardless of the length of the pull request title.

@zmaalick zmaalick self-assigned this Feb 19, 2026
@zmaalick zmaalick added enhancement New feature or request technical debt Technical debt in CMEW standardise Anything related to CDDS labels Feb 19, 2026
@zmaalick zmaalick added this to the v0.2.0 (multiple model runs) milestone Feb 19, 2026
@NParsonsMO NParsonsMO self-requested a review March 4, 2026 11:49
Copy link
Collaborator

@NParsonsMO NParsonsMO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like @ehogan to tell me whether this solution is OK, because it goes against what I thought was agreed in the technical meeting:

When I run the workflow specifying raw_data_dir="$SCRATCH/raw data", there are two folders that don't contain what I expect.

1 [user-specified]: MY_SCRATCH/raw_data
2 [workflow]: MY_HOME/cylc-run/CMEW/run3/share/data/cdds/raw_data
 
1 contains raw and processed data (though not restructured, I don't think).
It also contains lots of logs / empty directories and other "mess".

2 contains raw data (not processed or restructured).
It also contains empty directories and logs.

I expected 1 only to contain raw data, as the variable name implies.
I did not expect 2 to also contain raw data.

If we must keep all of it in the user-specified location, then the variable needs renaming as RAW_DATA_DIR is misleading.

@ehogan
Copy link
Member

ehogan commented Mar 6, 2026

I would like @ehogan to tell me whether this solution is OK, because it goes against what I thought was agreed in the technical meeting:

When I run the workflow specifying raw_data_dir="$SCRATCH/raw data", there are two folders that don't contain what I expect.

1 [user-specified]: MY_SCRATCH/raw_data
2 [workflow]: MY_HOME/cylc-run/CMEW/run3/share/data/cdds/raw_data

1 contains raw and processed data (though not restructured, I don't think). It also contains lots of logs / empty directories and other "mess".

2 contains raw data (not processed or restructured). It also contains empty directories and logs.

I expected 1 only to contain raw data, as the variable name implies.
I did not expect 2 to also contain raw data.

Yes, 1 should only contain raw data.

2 will contain raw data. The workflow directory should look identical to a second workflow directory created by running CMEW with no raw_data_dir defined. It is my understanding that we agreed to copy or link the raw data such that it was available via the workflow directory, so that CDDS would act on that directory exactly as it would if raw_data_dir was not defined. Does that make sense?

@NParsonsMO
Copy link
Collaborator

NParsonsMO commented Mar 6, 2026

Yes, 1 should only contain raw data.

2 will contain raw data. The workflow directory should look identical to a second workflow directory created by running CMEW with no raw_data_dir defined. It is my understanding that we agreed to copy or link the raw data such that it was available via the workflow directory, so that CDDS could act on that directory exactly as it would if raw_data_dir was not defined. Does that make sense?

I think that in a "normal" workflow [please correct me if I'm wrong] the raw data is cleaned up after CDDS runs, whereas here it was not (obviously not in the user-specified directory - we don't want that cleaned up) which is a difference within the workflow to how CMEW normally housekeeps after CDDS.

@ehogan
Copy link
Member

ehogan commented Mar 6, 2026

I think that in a "normal" workflow [please correct me if I'm wrong] the raw data is cleaned up after CDDS runs, whereas here it was not (obviously not in the user-specified directory - we don't want that cleaned up - but it's a difference within the workflow to how CMEW normally housekeeps after CDDS).

The raw data in the workflow directory should still be cleaned up (CDDS should act on the workflow directory exactly as it would if raw_data_dir was not defined), but not in the user-specified directory, as you correctly pointed out 👍

@zmaalick
Copy link
Collaborator Author

zmaalick commented Mar 6, 2026

  • after cdds extract in input dir (.share/data/..)
  • RAW_DATA_DIR: copy to raw data dir, (do not copy if RAW data dir is not empty)
  • RAW_DATA_DIR/suite_id/stream/data, if empty then copy data there and if not empty, print: log.err (dir was not empty)

@zmaalick
Copy link
Collaborator Author

zmaalick commented Mar 9, 2026

The new solution:
Copy the input data (suite_id/apm/..) into the raw_data_dir (if provided) and job fails if raw_data_dir is not empty

Also update:

  • rose meta
  • branch with main

@zmaalick zmaalick requested a review from NParsonsMO March 9, 2026 11:58
Copy link
Collaborator

@NParsonsMO NParsonsMO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Minor:
    The user-facing description / help for the new variable could be clearer.

  2. More significant but I may be wrong:
    I think that this is copying each dataset's folder within input individually, which may be nice as we could check whether each individual one exists and only copy it if it doesn't. But at the minute I think that you check whether the directory as a whole is empty (which would make sense if you were going to copy input as a whole), then copy each subdirectory individually, which seems like a wasted opportunity.

  3. Also significant but I definitely am not sure about this:
    The level of echoing etc. in the bash script made me think of the recently discussed PR where non of this was needed / desirable. Is this the same scenario?

# If RAW_DATA_DIR is configured, copy extracted raw data only when the target
# directory is empty. If it is not empty, emit a log.err message and do not copy.
if [[ -n "${RAW_DATA_DIR:-}" ]]; then
echo "[INFO] RAW_DATA_DIR is set to: ${RAW_DATA_DIR}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all this stuff, or should it be a case of using set -xeu like in https://github.com/MetOffice/CMEW/pull/392/changes#r2895732243 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e5908c5

remove two echo statements, others are error logs, should be there

@zmaalick zmaalick requested a review from NParsonsMO March 10, 2026 09:49
@zmaalick
Copy link
Collaborator Author

  1. Minor:
    The user-facing description / help for the new variable could be clearer.
  2. More significant but I may be wrong:
    I think that this is copying each dataset's folder within input individually, which may be nice as we could check whether each individual one exists and only copy it if it doesn't. But at the minute I think that you check whether the directory as a whole is empty (which would make sense if you were going to copy input as a whole), then copy each subdirectory individually, which seems like a wasted opportunity.
  3. Also significant but I definitely am not sure about this:
    The level of echoing etc. in the bash script made me think of the recently discussed PR where non of this was needed / desirable. Is this the same scenario?

for point 2 above, I would suggest to scope it out as separate issue if it is intended. We can discuss it in technical meeting, if required, we can implement it in separate issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be here @ehogan? I know it hasn't changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether GitHub is playing up, but I can't see which bit of the code this comment is for. Looking at 7ca1122, I infer that you are asking about pipefail? If so, the Developer Guide: Rose requirements explains where it should be used (it should not be used in bash scripts).

Copy link
Collaborator

@NParsonsMO NParsonsMO Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was. I wondered if it meant the whole script had been changed at some point and now needed changing back (similarly to one last week). But that isn't actually part of this review.
Image

@zmaalick zmaalick requested a review from NParsonsMO March 12, 2026 10:10
Copy link
Collaborator

@NParsonsMO NParsonsMO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous issues:

  1. This has been clarified. The capital P on path presumably indicates that there was supposed to be a full stop, but I won't refuse based on this.
Image
  1. I do think this should be discussed, I don't think that I have authority to OK new issues. So one for @ehogan probably.

  2. To the best of my ability to check, the bash script now follows the guidelines pointed out here:
    #392 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request standardise Anything related to CDDS technical debt Technical debt in CMEW

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make extracted raw data location configurable

3 participants