Make extracted raw data location configurable #378
I would like @ehogan to tell me whether this solution is OK, because it goes against what I thought was agreed in the technical meeting:
When I run the workflow specifying raw_data_dir="$SCRATCH/raw_data", there are two folders that don't contain what I expect.
1 [user-specified]: MY_SCRATCH/raw_data
2 [workflow]: MY_HOME/cylc-run/CMEW/run3/share/data/cdds/raw_data
1 contains raw and processed data (though not restructured, I don't think).
It also contains lots of logs / empty directories and other "mess".
2 contains raw data (not processed or restructured).
It also contains empty directories and logs.
I expected 1 only to contain raw data, as the variable name implies.
I did not expect 2 to also contain raw data.
If we must keep all of it in the user-specified location, then the variable needs renaming as RAW_DATA_DIR is misleading.
Yes, 1 should only contain raw data. 2 will contain raw data. The workflow directory should look identical to a second workflow directory created by running CMEW with no
I think that in a "normal" workflow (please correct me if I'm wrong) the raw data is cleaned up after CDDS runs, whereas here it was not. I don't mean the user-specified directory, which we obviously don't want cleaned up; I mean the copy in the workflow directory, which is a difference from how CMEW normally housekeeps after CDDS.
The raw data in the workflow directory should still be cleaned up (CDDS should act on the workflow directory exactly as it would if
The new solution: Also update:
- Minor: The user-facing description / help for the new variable could be clearer.
- More significant, but I may be wrong: I think that this is copying each dataset's folder within input individually, which may be nice, as we could check whether each individual one exists and only copy it if it doesn't. But at the minute I think that you check whether the directory as a whole is empty (which would make sense if you were going to copy input as a whole), then copy each subdirectory individually, which seems like a wasted opportunity.
- Also significant, but I am definitely not sure about this: The level of echoing etc. in the bash script made me think of the recently discussed PR where none of this was needed / desirable. Is this the same scenario?
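A minimal sketch of the per-dataset check suggested in the second point, not the code in this PR. The function name copy_missing_datasets and both directory arguments are hypothetical; the sketch only assumes that each dataset is a subdirectory of the source directory.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: copy each dataset directory found under src_root into
# dest_root, skipping any dataset that already exists at the destination.
copy_missing_datasets() {
    local src_root="$1"
    local dest_root="$2"
    local dataset_dir dataset_name

    mkdir -p "${dest_root}"
    for dataset_dir in "${src_root}"/*/; do
        # An unmatched glob stays as a literal pattern, so skip non-directories.
        [ -d "${dataset_dir}" ] || continue
        dataset_dir="${dataset_dir%/}"
        dataset_name="$(basename "${dataset_dir}")"
        if [ -e "${dest_root}/${dataset_name}" ]; then
            # Leave existing datasets untouched and report it on stderr.
            echo "[WARN] ${dataset_name} already exists in ${dest_root}; not copied" >&2
        else
            cp -r "${dataset_dir}" "${dest_root}/${dataset_name}"
        fi
    done
}
```

With this shape, a partially populated target directory would gain only the datasets it lacks, rather than being skipped as a whole.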
# If RAW_DATA_DIR is configured, copy extracted raw data only when the target
# directory is empty. If it is not empty, emit a log.err message and do not copy.
if [[ -n "${RAW_DATA_DIR:-}" ]]; then
    echo "[INFO] RAW_DATA_DIR is set to: ${RAW_DATA_DIR}"
Do we need all this stuff, or should it be a case of using set -xeu like in https://github.com/MetOffice/CMEW/pull/392/changes#r2895732243 ?
Two echo statements removed; the others are error logs and should stay.
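For reference, a minimal sketch of what the trimmed-down script could look like with set -xeu, where the trace output replaces the informational echo statements and only the error message remains. This is not the code in this PR: the function name copy_raw_data_if_empty and its arguments are hypothetical, and returning success after the error message is an assumption.

```shell
#!/bin/bash
# -x traces each command, -e stops on the first failure, -u rejects unset variables.
set -xeu

# Hypothetical helper: copy the contents of src_dir into raw_data_dir only when
# raw_data_dir is empty (or does not exist yet); otherwise report on stderr.
copy_raw_data_if_empty() {
    local src_dir="$1"
    local raw_data_dir="$2"

    mkdir -p "${raw_data_dir}"
    if [ -n "$(ls -A "${raw_data_dir}")" ]; then
        # Assumption: a non-empty target is reported but does not fail the task.
        echo "[ERROR] ${raw_data_dir} is not empty; raw data not copied" >&2
        return 0
    fi
    # The trailing "/." copies the directory contents, including hidden files.
    cp -r "${src_dir}/." "${raw_data_dir}/"
}
```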
For point 2 above, I would suggest scoping it out as a separate issue, if it is intended. We can discuss it in the technical meeting and, if required, implement it under a separate issue.
Should this be here @ehogan? I know it hasn't changed.
I don't know whether GitHub is playing up, but I can't see which bit of the code this comment is for. Looking at 7ca1122, I infer that you are asking about pipefail? If so, the Developer Guide: Rose requirements explains where it should be used (it should not be used in bash scripts).
Previous issues:
- This has been clarified. The capital P on "path" presumably indicates that there was supposed to be a full stop, but I won't refuse based on this.
- I do think this should be discussed; I don't think that I have the authority to OK new issues. So one for @ehogan, probably.
- To the best of my ability to check, the bash script now follows the guidelines pointed out here: #392 (comment)

Closes #156.
PR creation checklist for the developer
- <issue_number> above ☝️ has been replaced with the issue number.
- main has been selected as the base branch.
- <issue_number>_<short_description_of_feature>.
- good first issue label) have been added to the PR.
- Climate Model Evaluation Workflow (CMEW) project has been added to the PR.

Definition of Done for the developer
- doc directory, including the Quick Start section; select one of the following):

PR creation checklist for the reviewer
- <issue_number> above ☝️ has been replaced with the issue number.
- main has been selected as the base branch.
- <issue_number>_<short_description_of_feature>.
- good first issue label) have been added to the PR.
- Climate Model Evaluation Workflow (CMEW) project has been added to the PR.

Definition of Done for the reviewer
- doc directory, including the Quick Start section; select one of the following):

Important
- #<pull_request_number>: <pull_request_title> when writing the merge commit message for the pull request, so the pull request number is immediately visible on GitHub, regardless of the length of the pull request title.