USE 384 - migrate libguides from OAI XML parsing to web crawl HTML parsing by ghukill · Pull Request #269 · MITLibraries/transmogrifier

ghukill · 2026-02-24T13:51:48Z

Purpose and background context

This PR migrates the libguides source from a transformer expecting OAI-PMH Dublin Core XML records to a transformer expecting JSONLines of browsertrix-harvester web crawl output.

See the "long-term" strategy in this decision doc, and the subsequent engineering plan for more information.

This PR can be broken down into two main components:

Overall, the change is pretty clear-cut. I've opted to add inline comments in the new transformer, to explain little bits and bobs and decisions along the way. The output of this transformer has been reviewed by stakeholders and is confirmed to be better records compared to the OAI XML transformer.

The input for Transmogrifier is the result of a libguides configured browsertrix-harvester crawl. This configuration was mapped out in USE-383, and is noted in the LibGuides confluence page.

Lastly, see USE-388 for background on how and where we exclude some LibGuides from TIMDEX (spoiler alert, it's here in Transmogrifier).

The point of providing this background is to contextualize that much of the mechanics and subjective work is complete and approved for LibGuides, with this PR being the implementation of those decisions.

How can a reviewer manually see the effects of these changes?

Set Dev1 credentials in terminal
Set env vars with values shared in slack

LIBGUIDES_CLIENT_ID=...
LIBGUIDES_API_TOKEN=...

Run transmog, writing to a local dataset:

pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2026-02-20-full-extracted-records-to-index.jsonl \
-o /tmp/use384

Start an ipython shell to run TDA and look at records:

pipenv run ipython

import json
from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset("/tmp/use384")

# rebuild metadata (first time dataset)
td.metadata.rebuild_dataset_metadata()

# load arbitrary record
record = next(td.read_dicts_iter(source="libguides", table="current_records"))
source_record = json.loads(record['source_record'])
transformed_record = json.loads(record['transformed_record'])

The source_record shows what browsertrix-harveseter provided, and transformed_record the TIMDEX record created.

And/or.... here they are for review as well!

Source record:

{
  "url": "https://libguides.mit.edu/bizcat",
  "status": "active",
  "cdx_warc_filename": "rec-d3274de52d0c-libguides-20260205201423787-6.warc.gz",
  "cdx_title": "Analyst reports - Business Databases by Category - LibGuides at MIT Libraries",
  "cdx_offset": "1001127",
  "cdx_length": "16655",
  "html_base64": "ICA8IURPQ1RZUEU...........vaHRtbD4KICA=",
  "response_headers": {
    "Content-Security-Policy": [
      "upgrade-insecure-requests"
    ],
    "content-type": [
      "text/html; charset=UTF-8"
    ],
    "date": [
      "Thu, 05 Feb 2026 20:16:24 GMT"
    ],
    "server": [
      "nginx"
    ],
    "strict-transport-security": [
      "max-age=31536000; preload"
    ],
    "vary": [
      "Accept-Encoding"
    ],
    "x-backend-server": [
      "libguides-us-1.springyaws.com"
    ],
    "x-content-type-options": [
      "nosniff"
    ],
    "x-springy-cache-disabled": [
      "0"
    ],
    "x-orig-content-encoding": [
      "gzip"
    ]
  }
}

Transformed record:

{
  "source": "LibGuides",
  "source_link": "https://libguides.mit.edu/bizcat",
  "timdex_record_id": "libguides:guides-383403",
  "title": "Business Databases by Category: Analyst reports",
  "citation": "Business Databases by Category: Analyst reports. MIT Libraries. LibGuide. https://libguides.mit.edu/bizcat",
  "content_type": [
    "LibGuide"
  ],
  "dates": [
    {
      "kind": "Accessed",
      "value": "2026-02-20"
    },
    {
      "kind": "Created",
      "value": "2015-09-13"
    },
    {
      "kind": "Modified",
      "value": "2026-02-02"
    }
  ],
  "format": "electronic resource",
  "fulltext": "MIT LibGuides Business Databases by Category Analyst reports Search this Guide Search Business Databases by Category: Analyst reports\nNavigate this site Business resources home & related guides Jump to: Analyst reports Articles Companies Countries Economic indicators Financial Markets Industries Market Research See also: Business databases A-Z Business & Management Librarian Shikha Sharma Contact: sharmash@mit.edu Room E53-168K 617.253.5670 More ways to get help Ask Us Ask a question, make\u00a0an appointment, give\u00a0feedback, or visit us. Analyst Reports Capital IQ Click on Research tab to find CFRA Equity Reports. more... less... New users can register for an account by following the  New User? link on the screen. LSEG Workspace LSEG Workspace is the new name and interface for ThomsonONE database. To find analyst reports  for a specific company (also known as sell-side, aftermarket, broker, or equity research), search by ticker symbol or company name in the top search box. Then, under News & Research menu, click on Company Research. Use filters on the left to refine your search. To screen for analyst reports based on a set of criteria, enter ADVRES in the search box to get a menu for advanced research. Use filters on the left to refine your search by industry, geography, contributor, keywords in the text, and more. Note: There is a daily limit of 150 pages for viewing and downloading analyst reports.  This limit resets at 12:00 AM Eastern Time daily. more... less... New user? Register for an account first. Next: Articles >>",
  "identifiers": [
    {
      "value": "383403",
      "kind": "LibGuide ID"
    }
  ],
  "languages": [
    "en"
  ],
  "links": [
    {
      "url": "https://libguides.mit.edu/bizcat/analysts"
    }
  ],
  "publishers": [
    {
      "name": "MIT Libraries"
    }
  ],
  "rights": [
    {
      "description": "Copyright MIT Libraries 2026"
    }
  ],
  "timdex_provenance": {
    "source": "libguides",
    "run_date": "2026-02-20",
    "run_id": "ae55cb4b-2b17-4caf-93cf-15955f416bb5",
    "run_record_offset": 0
  }
}

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

The pipeline lambdas will be updated to run browsertrix-harvester for libguides in the ETL StepFunction, producing the JSONLines that this new transformation requires

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-384

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

ghukill · 2026-02-24T14:32:21Z

tests/fixtures/libguides/libguides_api_guides_df.pkl

This pickled pandas dataframe is data from a LibGuides API request, thereby allowing tests to use it without an API call.

ghukill · 2026-02-24T14:33:56Z

transmogrifier/config.py

+LIBGUIDES_TOKEN_URL = os.getenv(
+    "LIBGUIDES_TOKEN_URL", "https://lgapi-us.libapps.com/1.2/oauth/token"
+)
+LIBGUIDES_GUIDES_URL = os.getenv(
+    "LIBGUIDES_GUIDES_URL", "https://lgapi-us.libapps.com/1.2/guides"
+)
+LIBGUIDES_API_TOKEN = os.getenv("LIBGUIDES_API_TOKEN")
+LIBGUIDES_CLIENT_ID = os.getenv("LIBGUIDES_CLIENT_ID")
+


Noting generally on this file, that we may be approaching a time when a bonafied Config class for Transgmorifier could be helpful.

Short of that, continued with module level constants.

ghukill · 2026-02-24T14:40:08Z

transmogrifier/sources/json/libguides.py

+
+
+# instantiate a LibGuidesAPIClient singleton
+libguides_api_client = LibGuidesAPIClient()


This line is important. This instantiates the LibGuidesAPIClient as a singleton, meaning it's instantiated once when this module is loaded and then never again.

Why?

Transmogrifier has a lot of opinionation that transformers should be stateless, hence a lot of the class methods that just want an input source record. Most of the time this is fine, but it gets tricky when we want to cache or reuse "expensive" data like an API call for all records in the run. By instantiating this once, module level, we don't need an __init__ method on the Libguides transformer class itself, and methods that need to remain @classmethod can remain so.

This could be problematic in a long running program, where this singleton would not get updated between runs. Thankfully, that's not an issue for us: Transmogrifier is also highly opinionated as a single source, single run application.

Overall flow for how this works:

CLI is run.

The source is libguides, so we instantiate the Libguides transformer.

This is the first time imports are called from this module, so it's the first time this singleton LibGuidesAPIClient is instantiated.

Transforms are performed, using the Libguides transformer, which in turn uses this pre-initialized API client.

Transform is done, program exits.

Future runs, process repeats.

I thought long and hard about this, and felt this was a good middle-ground pattern. Short of refactoring Transmogrifier to allow transformers to have their own __init__ methods and utilize state (self) more often, this felt like an understandable and safe way to ensure we don't query the LibGuides API multiple times for a single run.

Makes sense, very helpful context!

Why these changes are being introduced: A decision was made to migrate the libguides source from an OAI-PMH harvest of Dublin Core XML records to a Browsertrix-Harvester web crawl that produced HTML for each crawled libguides. This decision exceeds the scope of this commit. How this addresses that need: A new JSONLines transformer called `Libguides` has been created. As of this commit, the OAI-PMH XML transformer remains. The config is updated to point at this new transformer for all `libguides` source transformations. This transformer, similiar to `MITLibWebsite`, extracts metadata primarily from the full HTML of a web crawl, in this case LibGuides. LibGuides are particular amenable to this approach due to structured Dublin Core metadata being present in the <head> tag, which is identical to the DC metadata we got from OAI-PMH crawls. In addition to this DC metadata, we now have the full-text of the Libguide as well. In addition to the HTML, this transformer utilizes the Springshare Libguides API to retrieve additional metadata about the guides to help identify exclusions and augment some metadata. This data is retrieved once and cached for all records in the run. Side effects of this change: * The primary side effect is Transmogrifier expects JSONLines as input for the `libguides` source, as generated by our Browsertrix-Harvester. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-384

Why these changes are being introduced: With the new JSONLines, web crawl based transformer, the OAI-PMH XML based transformer is no longer needed. How this addresses that need: * Removes all artifacts of OAI XML transformer Side effects of this change: * Transmogrifier no longer supports OAI XML transform for libguides Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-384

ghukill · 2026-02-24T14:53:49Z

transmogrifier/sources/json/libguides.py

+        self._api_guides_df: pd.DataFrame | None = None
+
+    @property
+    def api_guides_df(self) -> pd.DataFrame:
+        if self._api_guides_df is None:
+            self._api_guides_df = self.fetch_guides(self.get_api_token())
+        return self._api_guides_df


These work in unison to cache the results of the API call, if and when it's called by the Libguides transformer class. It's "lazy" in the sense it doesn't fire immediately, but will cache after first use, for the run.

ehanson8

Works as expected and basically approved but one question about the new fixture approach

ehanson8 · 2026-02-24T19:09:20Z

tests/sources/json/test_libguides.py

+
+@pytest.fixture(autouse=True)


This is a break from our general practice of defining all fixtures in conftest.py, I can see the convenience given conftest.py can get unwieldy, even with comments, but I'm not sure how I feel about fixtures being spread across unit test files rather in a central location. But happy to discuss this!

When a fixture is in a file like this -- aka a "module" -- and has autouse=True, it'll get applied to all tests in the module, but no other tests. This makes it very convenient to apply to a lot of tests without passing to each, while remaining safe it doesn't apply to other tests.

I do like the convention of making an effort to locate fixtures centrally in conftest, as I think it encourages us to think of fixtures that can be reused across tests and modules. But I do think sometimes module-level fixtures, or helpers like our record stubbers, can be a glove fit in the right circumstances.

Thanks for raising this, appreciate the chance to air out this out more.

I am realizing, looking at the test module, that other fixtures that don't have autouse=True are in the module file as well, and not centrally located in conftest.

This is now.... getting into more subjective territory. My feeling is that if a fixture is only used by tests in a single module (file), then it belongs there, or at least is most handy there. There is a slight risk that locating fixtures in module will "hide" them, and someone could rewrite a similar or identical fixture for a different module not realizing it exists. But even then, I feel like testing suites are better suited by many, modularized test modules. Even if not 100% DRY, they are easy to work with. It's worth the risk.

As I said, I see the convenience and was just curious about your logic which is sound. All good here!

Interested in exploring this approach more personally--for the reason @ehanson8 stated that conftest.py can become somewhat unwieldy! 🤔

transmogrifier/sources/json/libguides.py

ehanson8 · 2026-02-24T19:17:42Z

transmogrifier/sources/json/libguides.py

+
+
+# instantiate a LibGuidesAPIClient singleton
+libguides_api_client = LibGuidesAPIClient()


Makes sense, very helpful context!

ehanson8

Appreciate the fixture discussion, approved!

jonavellecuerdo

Looks good to me! Thanks for the helpful and detailed comments. ✨

ghukill force-pushed the USE-384-html-transform-for-libguides branch from d87050d to 72a816c Compare February 24, 2026 14:01

ghukill commented Feb 24, 2026

View reviewed changes

ghukill added 3 commits February 24, 2026 09:51

Update dependencies and resolve CVEs

d1ee039

ghukill force-pushed the USE-384-html-transform-for-libguides branch from 72a816c to d1ee039 Compare February 24, 2026 14:51

ghukill commented Feb 24, 2026

View reviewed changes

ghukill marked this pull request as ready for review February 24, 2026 14:55

ghukill requested a review from a team as a code owner February 24, 2026 14:55

jonavellecuerdo self-assigned this Feb 24, 2026

ehanson8 reviewed Feb 24, 2026

View reviewed changes

ghukill requested a review from ehanson8 February 25, 2026 15:02

ehanson8 approved these changes Feb 25, 2026

View reviewed changes

jonavellecuerdo approved these changes Feb 25, 2026

View reviewed changes

ghukill merged commit dcfd8cb into main Feb 25, 2026
3 checks passed



		# instantiate a LibGuidesAPIClient singleton
		libguides_api_client = LibGuidesAPIClient()


		@pytest.fixture(autouse=True)

Conversation

ghukill commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Code review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ghukill commented Feb 24, 2026 •

edited

Loading

jonavellecuerdo Feb 25, 2026 •

edited

Loading