Extension of ESA CCI OZONE CMORizer#4125

Open
axel-lauer wants to merge 30 commits into main from extend_esa_cci_ozone_cmorizer

@axel-lauer
Contributor

@axel-lauer axel-lauer commented Jul 28, 2025

Description

This PR extends the existing CMORizer scripts (downloading and formatting) for ESA CCI OZONE data to include the following additional dataset versions:

  • MEGRIDOP (o3)
  • IASI (o3, toz)

In addition, this PR fixes problems with the time bounds of the dataset versions included in the first version of the CMORizer, SAGE-OMPS (o3) and GTO-ECV (toz).

For automatic downloading of IASI data, support for webdav is needed (https://pypi.org/project/webdavclient/). The webdavclient package has been added to the environment files.

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

New or updated data reformatting script

@axel-lauer axel-lauer marked this pull request as ready for review July 30, 2025 06:40
@axel-lauer axel-lauer requested a review from a team as a code owner July 30, 2025 06:40
@axel-lauer axel-lauer added this to the v2.14.0 milestone Jan 12, 2026
@schlunma
Contributor

Hello, this pull request has been marked with the v2.14.0 milestone. The release of version 2.14.0 is currently scheduled for February 2026. To get this into the new release, it would be great to get this merged by the end of January.

If you won't be able to finish this in time, don't worry - just unassign the milestone v2.14.0. If you need any support, ping myself (@schlunma; the release manager for v2.14.0) or the @ESMValGroup/technical-lead-development-team. Please note that I won't be available until the beginning of February, though.

@bouweandela
Member

@valeriupredoi volunteered to do the technical review of this one.

@axel-lauer
Contributor Author

The downloader now uses webdav3 instead of webdav. I also updated the CDS requests to account for the latest changes from CDS as the attributes to request the data slightly changed. The downloader and formatter work fine with the changes.

As expected, the tests now fail because webdav3 is not in the testing environment. From my point of view, this would now be ready for merging. Thanks again @valeriupredoi for taking a look!

@valeriupredoi
Contributor

@axel-lauer good stuff, bud! Please also add webdav3 to pyproject.toml 🍺

@axel-lauer
Contributor Author

> @axel-lauer good stuff, bud! Please also add webdav3 to pyproject.toml 🍺

Thanks @valeriupredoi ! Just added webdav3 to pyproject.toml: da5fe5a

Contributor

@valeriupredoi valeriupredoi left a comment

sorry, my bad - I forgot the actual package name 😁

Contributor

@valeriupredoi valeriupredoi left a comment

code looks spiffy! Very many thanks @axel-lauer 🍺

@schlunma
Contributor

@ESMValGroup/science-reviewers Anyone available to do a quick scientific review on this one?

Contributor

@bettina-gier bettina-gier left a comment

Code looks good, I'll try and run to see the output unless you want to point me to a folder with the download and formatting logs

Co-authored-by: Bettina Gier <gier@uni-bremen.de>
@bettina-gier
Contributor

Sorry for the delay. I'm getting the following error in regards to the webdav3 client:

2026-03-04 17:41:47,480 UTC [720960] ERROR   Program terminated abnormally, see stack trace below for more information:
Traceback (most recent call last):
  File "/work/bd0854/b309137/esmval/ESMValCore/esmvalcore/_main.py", line 786, in run
    fire.Fire(ESMValTool())
    ~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ~~~~~~~~~~~~~~~~~~~^
        component,
        ^^^^^^^^^^
    ...<2 lines>...
        treatment='class' if is_class else 'routine',
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        target=component.__name__)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/work/bd0854/b309137/esmval/ESMValTool/esmvaltool/cmorizers/data/cmorizer.py", line 708, in download
    self.formatter.download(start_date, end_date, overwrite=overwrite)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/bd0854/b309137/esmval/ESMValTool/esmvaltool/cmorizers/data/cmorizer.py", line 200, in download
    self.download_dataset(
    ~~~~~~~~~~~~~~~~~~~~~^
        dataset,
        ^^^^^^^^
    ...<2 lines>...
        overwrite=overwrite,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/work/bd0854/b309137/esmval/ESMValTool/esmvaltool/cmorizers/data/cmorizer.py", line 250, in download_dataset
    downloader.download_dataset(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        original_data_dir=self.original_data_dir,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        overwrite=overwrite,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/work/bd0854/b309137/esmval/ESMValTool/esmvaltool/cmorizers/data/downloaders/datasets/esacci_ozone.py", line 157, in download_dataset
    files = wd_client.list(remotepath)
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/webdav3/client.py", line 78, in _wrapper
    res = fn(self, *args, **kw)
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/webdav3/client.py", line 295, in list
    if directory_urn.path() != Client.root and not self.check(directory_urn.path()):
                                                   ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/webdav3/client.py", line 78, in _wrapper
    res = fn(self, *args, **kw)
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/webdav3/client.py", line 344, in check
    response = self.execute_request(action="check", path=urn.quote())
  File "/home/b/b309137/.conda/envs/aival/lib/python3.13/site-packages/webdav3/client.py", line 257, in execute_request
    raise ResponseErrorCode(
    ...<3 lines>...
    )
webdav3.exceptions.ResponseErrorCode: Request to https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/2008/ failed with code 401 and message: b''

Which is a very.. telling error message. Could you double check if this is a problem on my side or in general?

@bettina-gier
Contributor

I tried to track this down, as the problem still persists for me.
Error code 401 apparently means "unauthorized". I tried going down from the base path folder by folder. The base path works but doesn't include the "guest" directory in the listing, which is the same as what I see in a browser.
Using the path 'https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/2008/' as an example and going through the folders one by one, I get different errors:
/guest and /guest/o3_cci give error 403, which is apparently an authorization error;
the others give a 401 error. I can access all these folders through a web browser. According to Google, the difference between the errors is: "401 refers to the lack of valid authentication credentials, whereas the 403 error occurs after authentication, signaling the absence of necessary permissions to access a resource", which seems weird when the browser is fine with all folders using the exact same authentication.

I'd download manually to test the actual cmorizer but the files are on a per-day basis so even just trying to test a single year would be manually downloading 365 files.
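
The folder-by-folder check described above could be scripted roughly as follows. This is only a sketch: `cumulative_urls`, `describe` and `probe` are hypothetical helper names, and it assumes that the server accepts HTTP Basic authentication with the public user name and an empty password, and that it answers HEAD requests, neither of which is confirmed in this thread.

```python
import base64
import urllib.error
import urllib.request
from http import HTTPStatus


def cumulative_urls(host: str, path: str) -> list[str]:
    """Return one URL per folder level, from the top folder down to `path`."""
    parts = [part for part in path.split("/") if part]
    return [
        host.rstrip("/") + "/" + "/".join(parts[: i + 1]) + "/"
        for i in range(len(parts))
    ]


def describe(code: int) -> str:
    """Return a status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def probe(url: str, user: str, password: str = "", timeout: float = 30) -> int:
    """Send a HEAD request with Basic credentials and return the status code.

    Assumption: Basic auth and HEAD are supported by the server.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"Authorization": f"Basic {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 401, 403 and other error statuses are raised as HTTPError
        return error.code


if __name__ == "__main__":
    path = "/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/2008/"
    for url in cumulative_urls("https://webdav.aeronomie.be", path):
        print(url, describe(probe(url, "o3_cci_public")))
```

Printing one status per folder level makes it easier to see where the response changes from 403 to 401, and to compare the result with what a browser or wget gets for the same URL.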

@bouweandela
Member

I am able to download with just wget and the command (based on this post):

wget -e robots=off -r -nH --cut-dirs=3 --no-parent --reject="index.html*"  --user=o3_cci_public --password='' https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/2008/

Since we already have the WGetDownloader, would it be possible to use that instead of adding yet another dependency?

@axel-lauer
Contributor Author

> I am able to download with just wget and the command (based on [this post]
> Since we already have the WgetDownloader, would it be possible to use that instead of adding yet another dependency?

Good idea, but for reasons I do not understand, this does not work for me from inside the downloader script. No matter what I try, I end up with an "HTTP request sent, awaiting response... 401 Unauthorized" error. Downloading the files with the WebDAV client works fine.

@bouweandela
Member

The wget downloader works for me if I make these changes:

diff --git a/esmvaltool/cmorizers/data/downloaders/datasets/esacci_ozone.py b/esmvaltool/cmorizers/data/downloaders/datasets/esacci_ozone.py
index e509faf7d..84aeb6446 100644
--- a/esmvaltool/cmorizers/data/downloaders/datasets/esacci_ozone.py
+++ b/esmvaltool/cmorizers/data/downloaders/datasets/esacci_ozone.py
@@ -9,7 +9,7 @@ from datetime import datetime
 
 import cdsapi
 from dateutil import relativedelta
-from webdav3.client import Client
+from esmvaltool.cmorizers.data.downloaders.wget import WGetDownloader
 
 logger = logging.getLogger(__name__)
 
@@ -134,42 +134,27 @@ def download_dataset(
         if end_date is None:
             end_date = datetime(2023, 12, 31)
 
-        options = {
-            "webdav_hostname": "https://webdav.aeronomie.be",
-            "webdav_login": "o3_cci_public",
-            "webdav_password": "",
-        }
-
-        wd_client = Client(options)
+        downloader = WGetDownloader(
+            original_data_dir=original_data_dir,
+            dataset=dataset,
+            dataset_info=dataset_info,
+            overwrite=overwrite,
+        )
+        wget_options = [
+            "-e robots=off",  # Ignore robots.txt
+            "--no-parent",  # Don't ascend to the parent directory
+            "--user=o3_cci_public",  # User name
+            "--password=",  # Empty password (no password needed for public access)
+        ]
 
-        basepath = "/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/"
+        basepath = "https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI"
 
         loop_date = start_date
         while loop_date <= end_date:
             year = loop_date.year
-
-            # if needed, create local output directory
-            outdir = output_folder / f"IASI_{year}"
-            os.makedirs(outdir, exist_ok=True)
-
-            # directory on WebDAV server to download
+            # directory on server to download
             remotepath = f"{basepath}/{year}"
-            files = wd_client.list(remotepath)
-            info = wd_client.info(remotepath + "/" + files[0])
-            numfiles = len(files)
-            # calculate approx. download volume in Gbytes
-            size = int(info["size"]) * numfiles // 1073741824
-            del files
-
-            loginfo = (
-                f"downloading {numfiles} files for year {year}"
-                f" (approx. {size} Gbytes)"
-            )
-            logger.info(loginfo)
-
-            # synchronize local (output) directory and WebDAV server directory
-            wd_client.pull(remote_directory=remotepath, local_directory=outdir)
-
+            downloader.download_folder(remotepath, wget_options)
             loop_date += relativedelta.relativedelta(years=1)
 
     else:

@axel-lauer
Contributor Author

axel-lauer commented Mar 12, 2026

Thanks @bouweandela ! This seems to work fine at a first look, but there are strange things happening when continuing unfinished downloads. For example, the wget command
['wget', '-e robots=off', '--no-parent', '--accept=nc', '--user=o3_cci_public', '--password=', '--no-clobber', '--directory-prefix=/work/bd0854/b380103/download/Tier2/ESACCI-OZONE/IASI_2008', '--recursive', '--no-directories', 'https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI/2008']
starts downloading files for the year 2009(!) into the folder for 2008, e.g.
Saving to: ‘/work/bd0854/b380103/download/Tier2/ESACCI-OZONE/IASI_2008/IASI_FORLI_O3_MERGED_20090112_V1.0.nc’

Here is the code I used:

        if start_date is None:
            start_date = datetime(2008, 1, 1)
        if end_date is None:
            end_date = datetime(2023, 12, 31)

        downloader = WGetDownloader(
            original_data_dir=original_data_dir,
            dataset=dataset,
            dataset_info=dataset_info,
            overwrite=overwrite,
        )

        basepath = "https://webdav.aeronomie.be/guest/o3_cci/webdata/Nadir_Profiles/L3/IASI_MG_FORLI"

        wget_options = [
            "-e robots=off",  # ignore robots.txt
            "--no-parent",    # don't ascend to the parent directory
            "--accept=nc",    # download only *.nc files
            "--user=o3_cci_public",  # user name
            "--password=",    # empty password (no password needed for public access)
        ]

        loop_date = start_date
        while loop_date <= end_date:
            year = loop_date.year

            # directory on server to download
            remotepath = f"{basepath}/{year}"
            downloader.download_folder(remotepath, wget_options, f"IASI_{year}")

            loop_date += relativedelta.relativedelta(years=1)

In order to save the output into custom subfolders, I extended download_folder in wget.py with the new option sub_folder:

    def download_folder(self, server_path, wget_options, sub_folder=""):
        """Download folder.

        Parameters
        ----------
        server_path: str
            Path to remote folder
        wget_options: list(str)
            Extra options for wget
        sub_folder : str, optional
            Name of the local subfolder to store the results in, by default ''
        """
        if self.overwrite:
            raise ValueError(
                "Overwrite does not work with downloading directories through "
                "wget. Please, remove the unwanted data manually",
            )
        output_dir = Path(self.local_folder) / sub_folder

        command = (
            ["wget"]
            + wget_options
            + [
                "--no-clobber",
                f"--directory-prefix={output_dir}",
                "--recursive",
                "--no-directories",
                f"{server_path}",
            ]
        )

        logger.debug(command)
        subprocess.check_output(command)

I do not know enough about WebDAV to understand what's going on, but simply switching from WebDAV to wget seems quite error-prone. I would therefore prefer to leave things as they are, i.e. continue using WebDAV.
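
One possible explanation for the 2009 files, not confirmed in this thread: the GNU Wget manual notes that, for HTTP and HTTPS, the trailing slash matters to `--no-parent`. Without it, wget treats the last path component as a file name, so for `.../IASI_MG_FORLI/2008` the directory it stays below is `.../IASI_MG_FORLI/`, and a recursive download may follow links into sibling year folders. The logged command has no trailing slash, whereas the plain wget command posted earlier does. The sketch below builds per-year URLs that always end in a slash. `year_urls` and `build_wget_command` are hypothetical helpers, not part of ESMValTool. The sketch also iterates over calendar years instead of adding one year to the start date, so that a final partial year (e.g. start in June 2008, end in March 2009) is not skipped.

```python
from datetime import datetime

BASE_URL = (
    "https://webdav.aeronomie.be/guest/o3_cci/webdata/"
    "Nadir_Profiles/L3/IASI_MG_FORLI"
)


def year_urls(start_date, end_date, base_url=BASE_URL):
    """Return one folder URL per calendar year, each with a trailing slash."""
    base = base_url.rstrip("/")
    return [
        f"{base}/{year}/"
        for year in range(start_date.year, end_date.year + 1)
    ]


def build_wget_command(url, output_dir, wget_options):
    """Assemble a wget call like the one logged above (hypothetical helper)."""
    if not url.endswith("/"):
        # mark the last path component as a directory for --no-parent
        url += "/"
    return [
        "wget",
        *wget_options,
        "--no-clobber",
        f"--directory-prefix={output_dir}",
        "--recursive",
        "--no-directories",
        url,
    ]


if __name__ == "__main__":
    options = ["--no-parent", "--accept=nc", "--user=o3_cci_public", "--password="]
    for url in year_urls(datetime(2008, 6, 1), datetime(2009, 3, 1)):
        print(build_wget_command(url, "download/IASI", options))
```

Whether the missing slash is really the cause here would need to be checked against the server, for example by rerunning the 2008 download with the slash added and comparing the saved file names.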

@bouweandela
Member

I'm not too keen on adding yet another dependency, since it already takes quite some work, especially for @valeriupredoi, to make sure that the tool can be installed and that the conda environment solves.

Do you really need the subfolders? The files all seem to have the date in the name anyway. If you need them, adding that functionality to the WgetDownloader or creating a customized subclass and using that seems a fine solution too.
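
Since the file names carry the date, the files could also be grouped by year after downloading without relying on subfolders. A minimal sketch, assuming that all files follow the pattern of the one file name shown in this thread (`IASI_FORLI_O3_MERGED_20090112_V1.0.nc`); `year_from_filename` and `group_by_year` are hypothetical helpers:

```python
import re
from collections import defaultdict

# Assumption: the date appears as _YYYYMMDD_ in every file name
DATE_PATTERN = re.compile(r"_(\d{4})(\d{2})(\d{2})_")


def year_from_filename(name):
    """Extract the year from a file name containing _YYYYMMDD_."""
    match = DATE_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no date found in file name: {name}")
    return int(match.group(1))


def group_by_year(names):
    """Map each year to the sorted list of file names belonging to it."""
    groups = defaultdict(list)
    for name in sorted(names):
        groups[year_from_filename(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    files = [
        "IASI_FORLI_O3_MERGED_20090112_V1.0.nc",
        "IASI_FORLI_O3_MERGED_20080101_V1.0.nc",
    ]
    print(group_by_year(files))
```

The same grouping could be used by a formatter to process one year at a time, or to move files into per-year folders after the download if directory listings become too slow.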

@axel-lauer
Contributor Author

I use subfolders because the number of files in one directory makes handling the data very slow (on Levante). Even a plain "ls" takes very long. Anyway, I will try to put everything in the same folder using wget. If that works fine, I don't care anymore.

@axel-lauer
Contributor Author

There you go. All files in the same folder, no more webdav: f720d7b

@schlunma
Contributor

Thanks for making these changes @axel-lauer! @bouweandela @bettina-gier @valeriupredoi anything you'd like to add? From my side this can be merged. This is the final feature PR for v2.14.0 🚀

@valeriupredoi
Contributor

all fine by me, cheers, folks 🍺

@bouweandela
Member

LGTM

6 participants