USE 384 - migrate libguides from OAI XML parsing to web crawl HTML parsing#269
USE 384 - migrate libguides from OAI XML parsing to web crawl HTML parsing#269
Conversation
d87050d to
72a816c
Compare
There was a problem hiding this comment.
This pickled pandas dataframe is data from a LibGuides API request, thereby allowing tests to use it without an API call.
| LIBGUIDES_TOKEN_URL = os.getenv( | ||
| "LIBGUIDES_TOKEN_URL", "https://lgapi-us.libapps.com/1.2/oauth/token" | ||
| ) | ||
| LIBGUIDES_GUIDES_URL = os.getenv( | ||
| "LIBGUIDES_GUIDES_URL", "https://lgapi-us.libapps.com/1.2/guides" | ||
| ) | ||
| LIBGUIDES_API_TOKEN = os.getenv("LIBGUIDES_API_TOKEN") | ||
| LIBGUIDES_CLIENT_ID = os.getenv("LIBGUIDES_CLIENT_ID") | ||
|
|
There was a problem hiding this comment.
Noting generally on this file, that we may be approaching a time when a bonafied Config class for Transgmorifier could be helpful.
Short of that, continued with module level constants.
|
|
||
|
|
||
| # instantiate a LibGuidesAPIClient singleton | ||
| libguides_api_client = LibGuidesAPIClient() |
There was a problem hiding this comment.
This line is important. This instantiates the LibGuidesAPIClient as a singleton, meaning it's instantiated once when this module is loaded and then never again.
Why?
Transmogrifier has a lot of opinionation that transformers should be stateless, hence a lot of the class methods that just want an input source record. Most of the time this is fine, but it gets tricky when we want to cache or reuse "expensive" data like an API call for all records in the run. By instantiating this once, module level, we don't need an __init__ method on the Libguides transformer class itself, and methods that need to remain @classmethod can remain so.
This could be problematic in a long running program, where this singleton would not get updated between runs. Thankfully, that's not an issue for us: Transmogrifier is also highly opinionated as a single source, single run application.
Overall flow for how this works:
- CLI is run.
- The source is
libguides, so we instantiate theLibguidestransformer. - This is the first time imports are called from this module, so it's the first time this singleton
LibGuidesAPIClientis instantiated. - Transforms are performed, using the
Libguidestransformer, which in turn uses this pre-initialized API client. - Transform is done, program exits.
- Future runs, process repeats.
I thought long and hard about this, and felt this was a good middle-ground pattern. Short of refactoring Transmogrifier to allow transformers to have their own __init__ methods and utilize state (self) more often, this felt like an understandable and safe way to ensure we don't query the LibGuides API multiple times for a single run.
There was a problem hiding this comment.
Makes sense, very helpful context!
Why these changes are being introduced: A decision was made to migrate the libguides source from an OAI-PMH harvest of Dublin Core XML records to a Browsertrix-Harvester web crawl that produced HTML for each crawled libguides. This decision exceeds the scope of this commit. How this addresses that need: A new JSONLines transformer called `Libguides` has been created. As of this commit, the OAI-PMH XML transformer remains. The config is updated to point at this new transformer for all `libguides` source transformations. This transformer, similiar to `MITLibWebsite`, extracts metadata primarily from the full HTML of a web crawl, in this case LibGuides. LibGuides are particular amenable to this approach due to structured Dublin Core metadata being present in the <head> tag, which is identical to the DC metadata we got from OAI-PMH crawls. In addition to this DC metadata, we now have the full-text of the Libguide as well. In addition to the HTML, this transformer utilizes the Springshare Libguides API to retrieve additional metadata about the guides to help identify exclusions and augment some metadata. This data is retrieved once and cached for all records in the run. Side effects of this change: * The primary side effect is Transmogrifier expects JSONLines as input for the `libguides` source, as generated by our Browsertrix-Harvester. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-384
Why these changes are being introduced: With the new JSONLines, web crawl based transformer, the OAI-PMH XML based transformer is no longer needed. How this addresses that need: * Removes all artifacts of OAI XML transformer Side effects of this change: * Transmogrifier no longer supports OAI XML transform for libguides Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-384
72a816c to
d1ee039
Compare
| self._api_guides_df: pd.DataFrame | None = None | ||
|
|
||
| @property | ||
| def api_guides_df(self) -> pd.DataFrame: | ||
| if self._api_guides_df is None: | ||
| self._api_guides_df = self.fetch_guides(self.get_api_token()) | ||
| return self._api_guides_df |
There was a problem hiding this comment.
These work in unison to cache the results of the API call, if and when it's called by the Libguides transformer class. It's "lazy" in the sense it doesn't fire immediately, but will cache after first use, for the run.
ehanson8
left a comment
There was a problem hiding this comment.
Works as expected and basically approved but one question about the new fixture approach
|
|
||
| @pytest.fixture(autouse=True) |
There was a problem hiding this comment.
This is a break from our general practice of defining all fixtures in conftest.py, I can see the convenience given conftest.py can get unwieldy, even with comments, but I'm not sure how I feel about fixtures being spread across unit test files rather in a central location. But happy to discuss this!
There was a problem hiding this comment.
When a fixture is in a file like this -- aka a "module" -- and has autouse=True, it'll get applied to all tests in the module, but no other tests. This makes it very convenient to apply to a lot of tests without passing to each, while remaining safe it doesn't apply to other tests.
I do like the convention of making an effort to locate fixtures centrally in conftest, as I think it encourages us to think of fixtures that can be reused across tests and modules. But I do think sometimes module-level fixtures, or helpers like our record stubbers, can be a glove fit in the right circumstances.
Thanks for raising this, appreciate the chance to air out this out more.
There was a problem hiding this comment.
I am realizing, looking at the test module, that other fixtures that don't have autouse=True are in the module file as well, and not centrally located in conftest.
This is now.... getting into more subjective territory. My feeling is that if a fixture is only used by tests in a single module (file), then it belongs there, or at least is most handy there. There is a slight risk that locating fixtures in module will "hide" them, and someone could rewrite a similar or identical fixture for a different module not realizing it exists. But even then, I feel like testing suites are better suited by many, modularized test modules. Even if not 100% DRY, they are easy to work with. It's worth the risk.
There was a problem hiding this comment.
As I said, I see the convenience and was just curious about your logic which is sound. All good here!
There was a problem hiding this comment.
Interested in exploring this approach more personally--for the reason @ehanson8 stated that conftest.py can become somewhat unwieldy! 🤔
|
|
||
|
|
||
| # instantiate a LibGuidesAPIClient singleton | ||
| libguides_api_client = LibGuidesAPIClient() |
There was a problem hiding this comment.
Makes sense, very helpful context!
ehanson8
left a comment
There was a problem hiding this comment.
Appreciate the fixture discussion, approved!
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Looks good to me! Thanks for the helpful and detailed comments. ✨
Purpose and background context
This PR migrates the
libguidessource from a transformer expecting OAI-PMH Dublin Core XML records to a transformer expecting JSONLines of browsertrix-harvester web crawl output.See the "long-term" strategy in this decision doc, and the subsequent engineering plan for more information.
This PR can be broken down into two main components:
Overall, the change is pretty clear-cut. I've opted to add inline comments in the new transformer, to explain little bits and bobs and decisions along the way. The output of this transformer has been reviewed by stakeholders and is confirmed to be better records compared to the OAI XML transformer.
The input for Transmogrifier is the result of a
libguidesconfigured browsertrix-harvester crawl. This configuration was mapped out in USE-383, and is noted in the LibGuides confluence page.Lastly, see USE-388 for background on how and where we exclude some LibGuides from TIMDEX (spoiler alert, it's here in Transmogrifier).
The point of providing this background is to contextualize that much of the mechanics and subjective work is complete and approved for LibGuides, with this PR being the implementation of those decisions.
How can a reviewer manually see the effects of these changes?
Set Dev1 credentials in terminal
Set env vars with values shared in slack
The
source_recordshows what browsertrix-harveseter provided, andtransformed_recordthe TIMDEX record created.And/or.... here they are for review as well!
Source record:
{ "url": "https://libguides.mit.edu/bizcat", "status": "active", "cdx_warc_filename": "rec-d3274de52d0c-libguides-20260205201423787-6.warc.gz", "cdx_title": "Analyst reports - Business Databases by Category - LibGuides at MIT Libraries", "cdx_offset": "1001127", "cdx_length": "16655", "html_base64": "ICA8IURPQ1RZUEU...........vaHRtbD4KICA=", "response_headers": { "Content-Security-Policy": [ "upgrade-insecure-requests" ], "content-type": [ "text/html; charset=UTF-8" ], "date": [ "Thu, 05 Feb 2026 20:16:24 GMT" ], "server": [ "nginx" ], "strict-transport-security": [ "max-age=31536000; preload" ], "vary": [ "Accept-Encoding" ], "x-backend-server": [ "libguides-us-1.springyaws.com" ], "x-content-type-options": [ "nosniff" ], "x-springy-cache-disabled": [ "0" ], "x-orig-content-encoding": [ "gzip" ] } }Transformed record:
{ "source": "LibGuides", "source_link": "https://libguides.mit.edu/bizcat", "timdex_record_id": "libguides:guides-383403", "title": "Business Databases by Category: Analyst reports", "citation": "Business Databases by Category: Analyst reports. MIT Libraries. LibGuide. https://libguides.mit.edu/bizcat", "content_type": [ "LibGuide" ], "dates": [ { "kind": "Accessed", "value": "2026-02-20" }, { "kind": "Created", "value": "2015-09-13" }, { "kind": "Modified", "value": "2026-02-02" } ], "format": "electronic resource", "fulltext": "MIT LibGuides Business Databases by Category Analyst reports Search this Guide Search Business Databases by Category: Analyst reports\nNavigate this site Business resources home & related guides Jump to: Analyst reports Articles Companies Countries Economic indicators Financial Markets Industries Market Research See also: Business databases A-Z Business & Management Librarian Shikha Sharma Contact: sharmash@mit.edu Room E53-168K 617.253.5670 More ways to get help Ask Us Ask a question, make\u00a0an appointment, give\u00a0feedback, or visit us. Analyst Reports Capital IQ Click on Research tab to find CFRA Equity Reports. more... less... New users can register for an account by following the New User? link on the screen. LSEG Workspace LSEG Workspace is the new name and interface for ThomsonONE database. To find analyst reports for a specific company (also known as sell-side, aftermarket, broker, or equity research), search by ticker symbol or company name in the top search box. Then, under News & Research menu, click on Company Research. Use filters on the left to refine your search. To screen for analyst reports based on a set of criteria, enter ADVRES in the search box to get a menu for advanced research. Use filters on the left to refine your search by industry, geography, contributor, keywords in the text, and more. Note: There is a daily limit of 150 pages for viewing and downloading analyst reports. This limit resets at 12:00 AM Eastern Time daily. more... less... New user? Register for an account first. Next: Articles >>", "identifiers": [ { "value": "383403", "kind": "LibGuide ID" } ], "languages": [ "en" ], "links": [ { "url": "https://libguides.mit.edu/bizcat/analysts" } ], "publishers": [ { "name": "MIT Libraries" } ], "rights": [ { "description": "Copyright MIT Libraries 2026" } ], "timdex_provenance": { "source": "libguides", "run_date": "2026-02-20", "run_id": "ae55cb4b-2b17-4caf-93cf-15955f416bb5", "run_record_offset": 0 } }Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
libguidesin the ETL StepFunction, producing the JSONLines that this new transformation requiresWhat are the relevant tickets?
Code review