Skip to content

USE 384 - migrate libguides from OAI XML parsing to web crawl HTML parsing#269

Merged
ghukill merged 3 commits intomainfrom
USE-384-html-transform-for-libguides
Feb 25, 2026
Merged

USE 384 - migrate libguides from OAI XML parsing to web crawl HTML parsing#269
ghukill merged 3 commits intomainfrom
USE-384-html-transform-for-libguides

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Feb 24, 2026

Purpose and background context

This PR migrates the libguides source from a transformer expecting OAI-PMH Dublin Core XML records to a transformer expecting JSONLines of browsertrix-harvester web crawl output.

See the "long-term" strategy in this decision doc, and the subsequent engineering plan for more information.

This PR can be broken down into two main components:

  1. commit to add new JSONLines transformer
  2. commit to remove old XML parser

Overall, the change is pretty clear-cut. I've opted to add inline comments in the new transformer, to explain little bits and bobs and decisions along the way. The output of this transformer has been reviewed by stakeholders and is confirmed to be better records compared to the OAI XML transformer.

The input for Transmogrifier is the result of a libguides configured browsertrix-harvester crawl. This configuration was mapped out in USE-383, and is noted in the LibGuides confluence page.

Lastly, see USE-388 for background on how and where we exclude some LibGuides from TIMDEX (spoiler alert, it's here in Transmogrifier).

The point of providing this background is to contextualize that much of the mechanics and subjective work is complete and approved for LibGuides, with this PR being the implementation of those decisions.

How can a reviewer manually see the effects of these changes?

  1. Set Dev1 credentials in terminal

  2. Set env vars with values shared in slack

LIBGUIDES_CLIENT_ID=...
LIBGUIDES_API_TOKEN=...
  1. Run transmog, writing to a local dataset:
pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2026-02-20-full-extracted-records-to-index.jsonl \
-o /tmp/use384
  1. Start an ipython shell to run TDA and look at records:
pipenv run ipython
import json
from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset("/tmp/use384")

# rebuild metadata (first time dataset)
td.metadata.rebuild_dataset_metadata()

# load arbitrary record
record = next(td.read_dicts_iter(source="libguides", table="current_records"))
source_record = json.loads(record['source_record'])
transformed_record = json.loads(record['transformed_record'])

The source_record shows what browsertrix-harveseter provided, and transformed_record the TIMDEX record created.

And/or.... here they are for review as well!

Source record:

{
  "url": "https://libguides.mit.edu/bizcat",
  "status": "active",
  "cdx_warc_filename": "rec-d3274de52d0c-libguides-20260205201423787-6.warc.gz",
  "cdx_title": "Analyst reports - Business Databases by Category - LibGuides at MIT Libraries",
  "cdx_offset": "1001127",
  "cdx_length": "16655",
  "html_base64": "ICA8IURPQ1RZUEU...........vaHRtbD4KICA=",
  "response_headers": {
    "Content-Security-Policy": [
      "upgrade-insecure-requests"
    ],
    "content-type": [
      "text/html; charset=UTF-8"
    ],
    "date": [
      "Thu, 05 Feb 2026 20:16:24 GMT"
    ],
    "server": [
      "nginx"
    ],
    "strict-transport-security": [
      "max-age=31536000; preload"
    ],
    "vary": [
      "Accept-Encoding"
    ],
    "x-backend-server": [
      "libguides-us-1.springyaws.com"
    ],
    "x-content-type-options": [
      "nosniff"
    ],
    "x-springy-cache-disabled": [
      "0"
    ],
    "x-orig-content-encoding": [
      "gzip"
    ]
  }
}

Transformed record:

{
  "source": "LibGuides",
  "source_link": "https://libguides.mit.edu/bizcat",
  "timdex_record_id": "libguides:guides-383403",
  "title": "Business Databases by Category: Analyst reports",
  "citation": "Business Databases by Category: Analyst reports. MIT Libraries. LibGuide. https://libguides.mit.edu/bizcat",
  "content_type": [
    "LibGuide"
  ],
  "dates": [
    {
      "kind": "Accessed",
      "value": "2026-02-20"
    },
    {
      "kind": "Created",
      "value": "2015-09-13"
    },
    {
      "kind": "Modified",
      "value": "2026-02-02"
    }
  ],
  "format": "electronic resource",
  "fulltext": "MIT LibGuides Business Databases by Category Analyst reports Search this Guide Search Business Databases by Category: Analyst reports\nNavigate this site Business resources home & related guides Jump to: Analyst reports Articles Companies Countries Economic indicators Financial Markets Industries Market Research See also: Business databases A-Z Business & Management Librarian Shikha Sharma Contact: sharmash@mit.edu Room E53-168K 617.253.5670 More ways to get help Ask Us Ask a question, make\u00a0an appointment, give\u00a0feedback, or visit us. Analyst Reports Capital IQ Click on Research tab to find CFRA Equity Reports. more... less... New users can register for an account by following the  New User? link on the screen. LSEG Workspace LSEG Workspace is the new name and interface for ThomsonONE database. To find analyst reports  for a specific company (also known as sell-side, aftermarket, broker, or equity research), search by ticker symbol or company name in the top search box. Then, under News & Research menu, click on Company Research. Use filters on the left to refine your search. To screen for analyst reports based on a set of criteria, enter ADVRES in the search box to get a menu for advanced research. Use filters on the left to refine your search by industry, geography, contributor, keywords in the text, and more. Note: There is a daily limit of 150 pages for viewing and downloading analyst reports.  This limit resets at 12:00 AM Eastern Time daily. more... less... New user? Register for an account first. Next: Articles >>",
  "identifiers": [
    {
      "value": "383403",
      "kind": "LibGuide ID"
    }
  ],
  "languages": [
    "en"
  ],
  "links": [
    {
      "url": "https://libguides.mit.edu/bizcat/analysts"
    }
  ],
  "publishers": [
    {
      "name": "MIT Libraries"
    }
  ],
  "rights": [
    {
      "description": "Copyright MIT Libraries 2026"
    }
  ],
  "timdex_provenance": {
    "source": "libguides",
    "run_date": "2026-02-20",
    "run_id": "ae55cb4b-2b17-4caf-93cf-15955f416bb5",
    "run_record_offset": 0
  }
}

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES

  • The pipeline lambdas will be updated to run browsertrix-harvester for libguides in the ETL StepFunction, producing the JSONLines that this new transformation requires

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill force-pushed the USE-384-html-transform-for-libguides branch from d87050d to 72a816c Compare February 24, 2026 14:01
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pickled pandas dataframe is data from a LibGuides API request, thereby allowing tests to use it without an API call.

Comment on lines +139 to +147
LIBGUIDES_TOKEN_URL = os.getenv(
"LIBGUIDES_TOKEN_URL", "https://lgapi-us.libapps.com/1.2/oauth/token"
)
LIBGUIDES_GUIDES_URL = os.getenv(
"LIBGUIDES_GUIDES_URL", "https://lgapi-us.libapps.com/1.2/guides"
)
LIBGUIDES_API_TOKEN = os.getenv("LIBGUIDES_API_TOKEN")
LIBGUIDES_CLIENT_ID = os.getenv("LIBGUIDES_CLIENT_ID")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting generally on this file, that we may be approaching a time when a bonafied Config class for Transgmorifier could be helpful.

Short of that, continued with module level constants.



# instantiate a LibGuidesAPIClient singleton
libguides_api_client = LibGuidesAPIClient()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is important. This instantiates the LibGuidesAPIClient as a singleton, meaning it's instantiated once when this module is loaded and then never again.

Why?

Transmogrifier has a lot of opinionation that transformers should be stateless, hence a lot of the class methods that just want an input source record. Most of the time this is fine, but it gets tricky when we want to cache or reuse "expensive" data like an API call for all records in the run. By instantiating this once, module level, we don't need an __init__ method on the Libguides transformer class itself, and methods that need to remain @classmethod can remain so.

This could be problematic in a long running program, where this singleton would not get updated between runs. Thankfully, that's not an issue for us: Transmogrifier is also highly opinionated as a single source, single run application.

Overall flow for how this works:

  1. CLI is run.
  2. The source is libguides, so we instantiate the Libguides transformer.
  3. This is the first time imports are called from this module, so it's the first time this singleton LibGuidesAPIClient is instantiated.
  4. Transforms are performed, using the Libguides transformer, which in turn uses this pre-initialized API client.
  5. Transform is done, program exits.
  6. Future runs, process repeats.

I thought long and hard about this, and felt this was a good middle-ground pattern. Short of refactoring Transmogrifier to allow transformers to have their own __init__ methods and utilize state (self) more often, this felt like an understandable and safe way to ensure we don't query the LibGuides API multiple times for a single run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, very helpful context!

Why these changes are being introduced:

A decision was made to migrate the libguides source from an OAI-PMH harvest of Dublin Core XML records
to a Browsertrix-Harvester web crawl that produced HTML for each crawled libguides.  This decision
exceeds the scope of this commit.

How this addresses that need:

A new JSONLines transformer called `Libguides` has been created.  As of this commit, the OAI-PMH XML
transformer remains.  The config is updated to point at this new transformer for all `libguides` source
transformations.

This transformer, similiar to `MITLibWebsite`, extracts metadata primarily from the full HTML of a web
crawl, in this case LibGuides.  LibGuides are particular amenable to this approach due to structured
Dublin Core metadata being present in the <head> tag, which is identical to the DC metadata we got
from OAI-PMH crawls.

In addition to this DC metadata, we now have the full-text of the Libguide as well.

In addition to the HTML, this transformer utilizes the Springshare Libguides API to retrieve additional
metadata about the guides to help identify exclusions and augment some metadata.  This data is retrieved
once and cached for all records in the run.

Side effects of this change:
* The primary side effect is Transmogrifier expects JSONLines as input for the `libguides` source, as
generated by our Browsertrix-Harvester.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-384
Why these changes are being introduced:

With the new JSONLines, web crawl based transformer, the OAI-PMH XML based transformer is no longer
needed.

How this addresses that need:
* Removes all artifacts of OAI XML transformer

Side effects of this change:
* Transmogrifier no longer supports OAI XML transform for libguides

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-384
@ghukill ghukill force-pushed the USE-384-html-transform-for-libguides branch from 72a816c to d1ee039 Compare February 24, 2026 14:51
Comment on lines +67 to +73
self._api_guides_df: pd.DataFrame | None = None

@property
def api_guides_df(self) -> pd.DataFrame:
if self._api_guides_df is None:
self._api_guides_df = self.fetch_guides(self.get_api_token())
return self._api_guides_df
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These work in unison to cache the results of the API call, if and when it's called by the Libguides transformer class. It's "lazy" in the sense it doesn't fire immediately, but will cache after first use, for the run.

@ghukill ghukill marked this pull request as ready for review February 24, 2026 14:55
@ghukill ghukill requested a review from a team as a code owner February 24, 2026 14:55
@jonavellecuerdo jonavellecuerdo self-assigned this Feb 24, 2026
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected and basically approved but one question about the new fixture approach

Comment on lines +11 to +12

@pytest.fixture(autouse=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a break from our general practice of defining all fixtures in conftest.py, I can see the convenience given conftest.py can get unwieldy, even with comments, but I'm not sure how I feel about fixtures being spread across unit test files rather in a central location. But happy to discuss this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a fixture is in a file like this -- aka a "module" -- and has autouse=True, it'll get applied to all tests in the module, but no other tests. This makes it very convenient to apply to a lot of tests without passing to each, while remaining safe it doesn't apply to other tests.

I do like the convention of making an effort to locate fixtures centrally in conftest, as I think it encourages us to think of fixtures that can be reused across tests and modules. But I do think sometimes module-level fixtures, or helpers like our record stubbers, can be a glove fit in the right circumstances.

Thanks for raising this, appreciate the chance to air out this out more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am realizing, looking at the test module, that other fixtures that don't have autouse=True are in the module file as well, and not centrally located in conftest.

This is now.... getting into more subjective territory. My feeling is that if a fixture is only used by tests in a single module (file), then it belongs there, or at least is most handy there. There is a slight risk that locating fixtures in module will "hide" them, and someone could rewrite a similar or identical fixture for a different module not realizing it exists. But even then, I feel like testing suites are better suited by many, modularized test modules. Even if not 100% DRY, they are easy to work with. It's worth the risk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, I see the convenience and was just curious about your logic which is sound. All good here!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested in exploring this approach more personally--for the reason @ehanson8 stated that conftest.py can become somewhat unwieldy! 🤔



# instantiate a LibGuidesAPIClient singleton
libguides_api_client = LibGuidesAPIClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, very helpful context!

@ghukill ghukill requested a review from ehanson8 February 25, 2026 15:02
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the fixture discussion, approved!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the helpful and detailed comments. ✨

@ghukill ghukill merged commit dcfd8cb into main Feb 25, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants