Merged
1 change: 1 addition & 0 deletions Pipfile
@@ -13,6 +13,7 @@ jsonlines = "*"
lxml = "*"
lxml-stubs = "*"
python-dateutil = "*"
requests = "*"
sentry-sdk = "*"
smart-open = {version = "*", extras = ["s3"]}
types-python-dateutil = "*"
1,591 changes: 856 additions & 735 deletions Pipfile.lock

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion README.md
@@ -13,13 +13,15 @@ flowchart TD
DSpace
GeoData
MARC
Others
transmogrifier((transmogrifier))
timdex-dataset
timdex-index-manager
ArchivesSpace[("ArchivesSpace<br>(EAD XML)")] --> transmogrifier
DSpace[("DSpace<br>(METS XML)")] --> transmogrifier
GeoData[("GeoData<br>(Aardvark JSON)")] --> transmogrifier
MARC[("Alma<br>(MARCXML)")] --> transmogrifier
Others[("*Other Sources")] --> transmogrifier
transmogrifier --> timdex-dataset["TIMDEX Parquet Dataset"]
timdex-dataset["TIMDEX Parquet Dataset"] --> timdex-index-manager((timdex-index-manager))
```
@@ -52,7 +54,9 @@ WORKSPACE=### Set to `dev` for local development, this will be set to `stage` an
### Optional

```shell
WARNING_ONLY_LOGGERS=### Comma-separated list of logger names to set as WARNING only, e.g. 'botocore,charset_normalizer,smart_open'
LIBGUIDES_API_TOKEN=### Libguides API token [required for libguides source]
LIBGUIDES_CLIENT_ID=### Libguides account id [required for libguides source]
```

## CLI commands
10 changes: 9 additions & 1 deletion pyproject.toml
@@ -7,7 +7,12 @@ disallow_untyped_defs = true
exclude = ["tests/", "output/"]

[[tool.mypy.overrides]]
module = ["bs4", "bs4.*"]
module = [
"bs4",
"bs4.*",
"pandas",
"requests"
]
ignore_missing_imports = true

[tool.pytest.ini_options]
@@ -42,12 +47,15 @@ ignore = [
"D205",
"D212",
"D402",
"EM101",
"EM102",
"G004",
"PLR0912",
"PLR0913",
"PLR0915",
"S321",
"TD002",
"TD003",
"TRY003"
]

36 changes: 36 additions & 0 deletions tests/fixtures/libguides/libguide.html
@@ -0,0 +1,36 @@
<!DOCTYPE html>
<html lang="en">

<head>
<title>Analyst reports - Business Databases by Category - LibGuides at MIT
Libraries</title>
<meta content="LibGuides: Business Databases by Category: Analyst reports"
name="DC.Title">
<meta content="Shikha Sharma" name="DC.Creator">
<meta content="Business" name="DC.Subject">
<meta content="Databases" name="DC.Subject">
<meta content="This is a libguide about business databases." name="DC.Description"/>
<meta content="MIT Libraries" name="DC.Publishers"/>
<meta content="Copyright MIT Libraries 2026" name="DC.Rights"/>
<meta content="en" name="DC.Language"/>
<meta content="https://libguides.mit.edu/bizcat/analysts" name="DC.Identifier"/>
<meta content="Sep 13, 2015" name="DC.Date.Created"/>
<meta content="Feb 2, 2026" name="DC.Date.Modified"/>
<meta content="LibGuides: Business Databases by Category: Analyst reports"
property="og:title"/>
<meta content="LibGuides: Business Databases by Category: Analyst reports"
property="og:description"/>
<meta content="website" property="og:type"/>
<meta content="https://libguides.mit.edu/bizcat/analysts" property="og:url"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="@springshare" name="twitter:site"/>
</head>

<body class="s-lg-guide-body">
<div><p>You should not find me.</p></div>
<div class="s-lib-header"><p>I am header information.</p></div>
<div class="s-lib-main"><p>I am the main content.</p></div>
<div><p>You should not find me either.</p></div>
</body>

</html>
8 changes: 8 additions & 0 deletions tests/fixtures/libguides/libguide_minimal_dc.html
@@ -0,0 +1,8 @@
<html>
<head>
<title>Test</title>
</head>
<body>
<div class="s-lib-main"><p>Content</p></div>
</body>
</html>
8 changes: 8 additions & 0 deletions tests/fixtures/libguides/libguide_non_url_identifier.html
@@ -0,0 +1,8 @@
<html>
<head>
<meta name="DC.Title" content="Test"/>
<meta name="DC.Identifier" content="ISBN:1234567890"/>
<meta name="DC.Subject" content=" "/>
</head>
<body></body>
</html>

Large diffs are not rendered by default.


This pickled pandas DataFrame holds data from a LibGuides API request, so tests can use it without making an API call.
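
For readers unfamiliar with the pattern, the snapshot-and-replay idea can be sketched with the standard library `pickle` module. This is an illustration only: a list of dicts stands in for the DataFrame so the example has no third-party dependencies, and the helper names and sample records are hypothetical. With pandas, the equivalent calls would be `DataFrame.to_pickle` and `pandas.read_pickle`.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_snapshot(data: Any, path: Path) -> None:
    """Serialize an API response once so tests can reuse it offline."""
    with path.open("wb") as file:
        pickle.dump(data, file)


def load_snapshot(path: Path) -> Any:
    """Load a previously saved snapshot. Only unpickle files you trust."""
    with path.open("rb") as file:
        return pickle.load(file)


# Hypothetical records standing in for the real API response.
api_response = [
    {"id": 1, "name": "Guide A"},
    {"id": 2, "name": "Guide B"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot_path = Path(tmp_dir) / "api_response.pkl"
    save_snapshot(api_response, snapshot_path)
    restored = load_snapshot(snapshot_path)
```

In a test suite, the load step would typically sit inside a fixture so that each test receives the saved data without touching the network.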

Binary file not shown.

This file was deleted.

This file was deleted.

This file was deleted.

113 changes: 0 additions & 113 deletions tests/fixtures/oai_dc/springshare/libguides/libguides_records.xml

This file was deleted.
