danruggi/ispider

ispider_core

ispider is a Python module for spidering websites

  • Multicore and multithreaded
  • Accepts hundreds or thousands of websites/domains as input
  • Spreads ("sparses") requests across domains to avoid repeated calls against the same domain
  • The httpx engine works in asyncio blocks of settings.ASYNC_BLOCK_SIZE requests, so total concurrency is ASYNC_BLOCK_SIZE * POOLS
  • Supports retry with different engines (httpx, curl, seleniumbase [testing])
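Under this model, the effective concurrency is simply the product of the two settings. A quick illustration (ASYNC_BLOCK_SIZE and POOLS are real settings from this README; the values here are just examples):

```python
# Effective concurrency of the httpx engine, per the formula above.
# ASYNC_BLOCK_SIZE and POOLS are real settings; the values are examples.
ASYNC_BLOCK_SIZE = 32  # concurrent async requests per process
POOLS = 64             # worker processes (CPU cores)

total_concurrency = ASYNC_BLOCK_SIZE * POOLS
print(total_concurrency)  # 2048 requests in flight
```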

It was designed for maximum speed, so it has some limitations:

  • As of v0.7, it does not support files (pdf, video, images, etc); it only processes HTML

HOW IT WORKS - SIMPLE

-- Crawl - Depth == 0

  • Get all the landing pages for domains in the provided list.
  • If "robots" is selected, download the robots.txt file.
  • If "sitemaps" is selected, parse the robots.txt and retrieve all the sitemaps.
  • All data is saved under USER_DATA/data/dumps/dom_tld.

-- Spider - Depth > 0

  • Extract all links from landing pages and sitemaps.
  • Download the HTML pages, extract internal links, and follow them recursively.

HOW IT WORKS - MORE DETAILED

Crawl - Depth == 0

  • Create task objects of the form ('https://domain.com', 'landing_page', 'domain.com', depth, retries, engine)
  • Add them to the LIFO queue qout
  • A thread retrieves elements from qout in variable-size blocks (depending on QUEUE_MAX_SIZE) and fills a FIFO queue qin
  • Workers (as many as settings.POOLS) take elements from qin and download them to USER_DATA/data/dumps/dom_tld
  • Landing pages are saved as _.html
  • Each worker processes the landing page; if the result is OK (status_code == 200), it tries to get robots.txt
  • On failure, it falls back to the next available engine
  • It creates a task ('https://domain.com/robots.txt', 'robots', 'domain.com', depth=1, retries=0, engine)
  • Each worker retrieves robots.txt; if "sitemaps" is listed in settings.CRAWL_METHODS, it attempts to collect all sitemaps from robots.txt and dom_tld/sitemaps.xml
  • It creates tasks ('https://domain.com/sitemap.xml', 'sitemaps', 'domain.com', depth=1, retries=0, engine), plus one for each additional sitemap found in robots.txt
  • Every successful or failed download is logged as a row in USER_FOLDER/jsons/crawl_conn_meta*json with all the information available from the engine; these files are useful for statistics/reports from the spider
  • When qin has been empty for a 90-second timeout, jobs stop.
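The queue flow above can be sketched roughly like this. It is a simplified, single-threaded illustration: the tuple layout and the qout/qin names follow this README, while make_task, BLOCK_SIZE, and the feeder function are illustrative assumptions, not the module's internals.

```python
# Hypothetical sketch of the crawl queue flow described above.
# qout (LIFO) collects new tasks; a feeder moves blocks into qin (FIFO),
# which the download workers consume.
import queue

QUEUE_MAX_SIZE = 100_000
BLOCK_SIZE = 4  # illustrative; the real block size is variable

qout = queue.LifoQueue(maxsize=QUEUE_MAX_SIZE)  # tasks are produced here
qin = queue.Queue(maxsize=QUEUE_MAX_SIZE)       # workers consume from here

def make_task(url, kind, dom_tld, depth=0, retries=0, engine='httpx'):
    # Matches the task tuple layout shown in the bullets above.
    return (url, kind, dom_tld, depth, retries, engine)

# Seed landing pages (depth == 0)
for dom in ['domain1.com', 'domain2.com']:
    qout.put(make_task(f'https://{dom}', 'landing_page', dom))

def feed_block():
    # A feeder thread would move one block of tasks from qout to qin.
    moved = 0
    while moved < BLOCK_SIZE and not qout.empty():
        qin.put(qout.get())
        moved += 1
    return moved

feed_block()
while not qin.empty():
    url, kind, dom, depth, retries, engine = qin.get()
    print(kind, url)  # a real worker would download and dump the page here
```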

Spider - Depth > 0

  • It reads entries from USER_FOLDER/jsons/crawl_conn_meta*json for the domains in the list
  • It retrieves landing pages and sitemaps
  • If sitemaps are compressed, it uncompresses them
  • It extracts all links from landing pages and sitemaps
  • It creates task objects ('https://domain.com/link1', 'internals', 'domain.com', depth=2, retries=0, engine)
  • It uses the same engine that served the last successful request to the domain TLD
  • It adds these objects to qout
  • A feeder thread moves blocks from qout to qin, spreading them across domains
  • It downloads all links, saves them, and records the connection data in JSON
  • It parses the HTML, extracts all INTERNAL links, and follows them recursively, increasing the depth
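The internal-link extraction step can be sketched with the standard library alone. The task tuple layout follows this README; internal_links and LinkExtractor are illustrative helpers, not the module's actual API.

```python
# Hypothetical sketch: parse a dumped HTML page and keep only INTERNAL
# links (same host as the base URL), producing the next-depth tasks.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def internal_links(base_url, html):
    # Resolve relative hrefs and keep only same-host results.
    base_host = urlparse(base_url).netloc
    parser = LinkExtractor()
    parser.feed(html)
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            out.append(absolute)
    return out

page = '<a href="/about">About</a> <a href="https://other.com/x">ext</a>'
tasks = [(u, 'internals', 'domain.com', 2, 0, 'httpx')
         for u in internal_links('https://domain.com/', page)]
```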

Schema

This is the architectural schema of the crawler/spider.

USAGE

Install it

pip install ispider

First use

from ispider_core import ISpider

if __name__ == '__main__':
    # See the README for the complete list of available parameters
    config_overrides = {
        'USER_FOLDER': '/Your/Dump/Folder',
        'POOLS': 64,
        'ASYNC_BLOCK_SIZE': 32,
        'MAXIMUM_RETRIES': 2,
        'CRAWL_METHODS': [],
        'CODES_TO_RETRY': [430, 503, 500, 429],
        'CURL_INSECURE': True,
        'ENGINES': ['curl'],
        'EXCLUDED_DOMAINS': ['facebook.com', 'instagram.com']
    }

    # Specify a list of domains
    doms = ['domain1.com', 'domain2.com']  # add more domains as needed

    # Run
    with ISpider(domains=doms, **config_overrides) as spider:
        spider.run()

TO KNOW

At first execution:

  • It creates the folder settings.USER_FOLDER

  • It creates settings.USER_FOLDER/data/ with dumps/ and jsons/ subfolders

  • settings.USER_FOLDER/data/dumps holds the downloaded websites

  • settings.USER_FOLDER/data/jsons holds the connection results for every request

SETTINGS

The current default settings are:

    """
    ## *********************************
    ## GENERIC SETTINGS
    # Output folder for controllers, dumps and jsons
    USER_FOLDER = "~/.ispider/"

    # Log level
    LOG_LEVEL = 'DEBUG'

    ## Status codes that trigger a retry, e.g. status_code = 430
    CODES_TO_RETRY = [430, 503, 500, 429]
    MAXIMUM_RETRIES = 2

    # Delay before retrying after one of the status codes above
    TIME_DELAY_RETRY = 0

    ## Number of concurrent connections in the same process during crawling
    # Concurrent connections per process
    ASYNC_BLOCK_SIZE = 4

    # Concurrent processes (number of cores used; check your CPU spec)
    POOLS = 4

    # Max timeout for connecting
    TIMEOUT = 5

    # This needs to be a list.
    # curl is used as a subprocess, so make sure it is installed on your system.
    # Retries use the next available engine:
    # the script begins with the super-fast httpx;
    # if that fails, it tries curl;
    # if that fails again, it tries seleniumbase with headless and uc mode activated.
    ENGINES = ['httpx', 'curl', 'seleniumbase']

    CURL_INSECURE = False

    ## *********************************
    # CRAWLER
    # Max file size dumped to disk.
    # This avoids big sitemaps with errors.
    MAX_CRAWL_DUMP_SIZE = 52428800

    # Max depth to follow in sitemaps
    SITEMAPS_MAX_DEPTH = 2

    # Crawler will get robots and sitemaps too
    CRAWL_METHODS = ['robots', 'sitemaps']

    ## *********************************
    ## SPIDER
    # Queue max size; up to 1 billion is fine on normal systems
    QUEUE_MAX_SIZE = 100000

    # Max depth to follow in websites
    WEBSITES_MAX_DEPTH = 2

    # This is not implemented yet
    MAX_PAGES_POR_DOMAIN = 1000000

    # This tries to exclude some kinds of files.
    # It also tests the first bytes of content for some common file types,
    # to exclude them even when the online element has no extension.
    EXCLUDED_EXTENSIONS = [
        "pdf", "csv",
        "mp3", "jpg", "jpeg", "png", "gif", "bmp", "tiff", "webp", "svg", "ico", "tif",
        "jfif", "eps", "raw", "cr2", "nef", "orf", "arw", "rw2", "sr2", "dng", "heif", "avif", "jp2", "jpx",
        "wdp", "hdp", "psd", "ai", "cdr", "ppsx",
        "ics", "ogv",
        "mpg", "mp4", "mov", "m4v",
        "zip", "rar"
    ]

    # Exclude all URLs matching any of these regex patterns
    EXCLUDED_EXPRESSIONS_URL = [
        # r'test',
    ]

    # If not empty, follow only URLs that match these regex patterns
    INCLUDED_EXPRESSIONS_URL = [
        # r'/\d{4}/\d{2}/\d{2}/',
    ]

    # Exclude specific domains from crawling/spidering.
    # Accepts values like "example.com" or full URLs.
    EXCLUDED_DOMAINS = []

    """

NOTES

  • Deduplication is not 100% safe: pages are sometimes downloaded multiple times and only skipped by the file check. On ~10 domains, duplicate checking adds only a small delay, but on 10,000 domains, after 500k links, the seen-URL list grows so big that checking whether a link was already downloaded slowed the spider considerably (from 30,000 urls/min to 300 urls/min). That's why I preferred to avoid keeping a list and left just the "check file" step.
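The "check file" approach might look roughly like this: derive the dump path for a URL and skip the download if the file already exists, so no in-memory URL list is needed. The dumps path scheme (USER_FOLDER/data/dumps/dom_tld) follows this README; the hashed filename is purely an illustrative assumption.

```python
# Hedged sketch of file-existence deduplication: O(1) memory,
# at the cost of a filesystem stat per candidate URL.
import hashlib
from pathlib import Path

def dump_path(user_folder, dom_tld, url):
    # Illustrative filename scheme: hash of the URL, not the real one.
    name = hashlib.md5(url.encode()).hexdigest() + '.html'
    return Path(user_folder) / 'data' / 'dumps' / dom_tld / name

def already_downloaded(user_folder, dom_tld, url):
    # Skip the download when the dump file already exists on disk.
    return dump_path(user_folder, dom_tld, url).exists()
```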

SEO checks (modular)

You can run independent SEO checks during crawling/spidering. Results are stored in each JSON response row under seo_issues.

Available checks:

  • response_crawlability: flags 3xx/4xx/5xx, redirect chains, and timeouts.
  • broken_links: generic status >= 400 detector.
  • http_status_503: dedicated 503 detector.
  • title_meta_quality: validates <title> and meta description length/presence and flags title == h1.
  • h1_too_long: validates H1 length threshold.
  • heading_structure: checks h1 count and heading-order skips.
  • indexability_canonical: checks canonical presence/self-reference, homepage canonicals, and noindex directives.
  • schema_news_article: detects NewsArticle structured data and required properties.
  • image_optimization: flags missing image dimensions/ALT and oversized hero hints.
  • internal_linking: flags weak anchors, no internal links, and too many external links.
  • url_hygiene: validates URL length/case/params/special chars and the newsroom pattern /yyyy/mm/dd/slug/.
  • content_length: flags thin content (default <250 words).
  • security_headers: checks HSTS, CSP, and X-Frame-Options.

SEO issue codes (priority + short description)

Code Priority Description
BROKEN_LINK medium URL returned an HTTP status code >= 400.
CANONICAL_MISSING medium Canonical tag is missing.
CANONICAL_NOT_SELF low Canonical URL is not self-referential.
CANONICAL_TO_HOMEPAGE high Canonical points to homepage from an internal page.
CONTENT_TOO_THIN medium Visible content word count is below the configured minimum.
H1_MISSING high No H1 heading found on the page.
H1_MULTIPLE high More than one H1 heading found.
H1_TOO_LONG low H1 text length exceeds configured maximum (SEO_H1_MAX_CHARS).
HEADING_ORDER_SKIP low Heading hierarchy skips levels (for example h2 -> h4).
HERO_IMAGE_FETCHPRIORITY_MISSING low First image is missing fetchpriority=high.
HERO_IMAGE_TOO_LARGE medium Hero image appears larger than configured size threshold.
HTTP_3XX low Response is a redirect (3xx).
HTTP_4XX high Response is a client error (4xx).
HTTP_5XX high Response is a server error (5xx).
HTTP_503 high Response specifically returned 503 Service Unavailable.
IMAGE_ALT_MISSING low At least one image is missing ALT text.
IMAGE_LAZY_LOADING_MISSING low Non-hero image missing loading=lazy.
META_DESCRIPTION_LENGTH low Meta description length is outside recommended range.
META_DESCRIPTION_MISSING medium Meta description is missing.
NOINDEX_DETECTED high noindex detected in meta robots or x-robots-tag.
NO_INTERNAL_LINKS medium No internal links found on the page.
REDIRECT_CHAIN medium Redirect chain length is greater than 1.
REQUEST_TIMEOUT high Request timed out.
SCHEMA_NEWSARTICLE_MISSING high NewsArticle JSON-LD schema not found.
SCHEMA_REQUIRED_FIELDS_MISSING high NewsArticle schema is missing required fields.
SECURITY_HEADERS_MISSING low One or more security headers are missing (HSTS, CSP, X-Frame-Options).
TITLE_EQUALS_H1 low <title> is identical to H1.
TITLE_LENGTH medium <title> length is outside recommended range.
TITLE_MISSING high <title> tag is missing.
TOO_MANY_EXTERNAL_LINKS low Unique external domains exceed configured threshold.
URL_HAS_PARAMETERS low URL contains query parameters.
URL_NEWS_PATTERN_MISMATCH medium URL does not match expected /yyyy/mm/dd/slug/ pattern.
URL_SPECIAL_CHARS low URL path contains special characters.
URL_TOO_LONG low URL length exceeds configured threshold.
URL_UPPERCASE low URL path contains uppercase letters.
WEAK_ANCHOR_TEXT low Generic anchor texts detected (for example “read more”, “click here”).

Configure with settings:

config_overrides = {
    'SEO_CHECKS_ENABLED': True,
    'SEO_ENABLED_CHECKS': ['response_crawlability', 'title_meta_quality', 'schema_news_article'],
    'SEO_DISABLED_CHECKS': ['http_status_503'],
    'SEO_H1_MAX_CHARS': 70,
}

Tip for Google News-focused runs: combine INCLUDED_EXPRESSIONS_URL with a day filter (example: r'^.*/2026/02/07/.*$') and keep response_crawlability, indexability_canonical, and schema_news_article enabled.

To add a new check, create a class in ispider_core/seo/checks/ with a name attribute and a run(resp) method, and register it in ispider_core/seo/runner.py.
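A minimal custom check might look like this. It assumes the response object exposes status_code and html attributes and that checks return a list of issue dicts; verify those assumptions against the real runner before registering it.

```python
# Hedged sketch of a custom SEO check following the name/run(resp)
# interface mentioned above. The resp attributes and the issue dict
# shape are assumptions, not the documented contract.
import re

class HtmlLangCheck:
    name = 'html_lang'

    def run(self, resp):
        # Flag successful pages whose <html> tag has no lang attribute.
        issues = []
        html = getattr(resp, 'html', '') or ''
        if resp.status_code == 200 and not re.search(r'<html[^>]*\blang=', html, re.I):
            issues.append({'code': 'HTML_LANG_MISSING', 'priority': 'low'})
        return issues
```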
