danruggi/ispider

ispider_core

ispider is a Python module for spidering websites

  • Multicore and multithreaded
  • Accepts hundreds or thousands of websites/domains as input
  • Spreads ("sparses") requests across domains to avoid repeated calls against the same domain
  • The httpx engine works in asyncio blocks of settings.ASYNC_BLOCK_SIZE requests, so total concurrency is ASYNC_BLOCK_SIZE * POOLS
  • Supports retry with different engines (httpx, curl, seleniumbase [testing])
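Under this model, the effective concurrency is simply the product of the two settings. A quick illustration (ASYNC_BLOCK_SIZE and POOLS are real settings from this README; the values here are just examples):

```python
# Effective concurrency of the httpx engine, per the formula above.
# ASYNC_BLOCK_SIZE and POOLS are real settings; the values are examples.
ASYNC_BLOCK_SIZE = 32  # concurrent async requests per process
POOLS = 64             # worker processes (CPU cores)

total_concurrency = ASYNC_BLOCK_SIZE * POOLS
print(total_concurrency)  # 2048 requests in flight
```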

It was designed for maximum speed, so it has some limitations:

  • As of v0.7, it does not support files (pdf, video, images, etc); it only processes HTML

HOW IT WORKS - SIMPLE

-- Crawl - Depth == 0

  • Get all the landing pages for domains in the provided list.
  • If "robots" is selected, download the robots.txt file.
  • If "sitemaps" is selected, parse the robots.txt and retrieve all the sitemaps.
  • All data is saved under USER_DATA/data/dumps/dom_tld.

-- Spider - Depth > 0

  • Extract all links from landing pages and sitemaps.
  • Download the HTML pages, extract internal links, and follow them recursively.

HOW IT WORKS - MORE DETAILED

Crawl - Depth == 0

  • Create task objects of the form ('https://domain.com', 'landing_page', 'domain.com', depth, retries, engine)
  • Add them to the LIFO queue qout
  • A thread retrieves elements from qout in variable-size blocks (depending on QUEUE_MAX_SIZE) and fills a FIFO queue qin
  • Workers (as many as settings.POOLS) take elements from qin and download them to USER_DATA/data/dumps/dom_tld
  • Landing pages are saved as _.html
  • Each worker processes the landing page; if the result is OK (status_code == 200), it tries to get robots.txt
  • On failure, it falls back to the next available engine
  • It creates a task ('https://domain.com/robots.txt', 'robots', 'domain.com', depth=1, retries=0, engine)
  • Each worker retrieves robots.txt; if "sitemaps" is listed in settings.CRAWL_METHODS, it attempts to collect all sitemaps from robots.txt and dom_tld/sitemaps.xml
  • It creates tasks ('https://domain.com/sitemap.xml', 'sitemaps', 'domain.com', depth=1, retries=0, engine), plus one for each additional sitemap found in robots.txt
  • Every successful or failed download is logged as a row in USER_FOLDER/jsons/crawl_conn_meta*json with all the information available from the engine; these files are useful for statistics/reports from the spider
  • When qin has been empty for a 90-second timeout, jobs stop.
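The queue flow above can be sketched roughly like this. It is a simplified, single-threaded illustration: the tuple layout and the qout/qin names follow this README, while make_task, BLOCK_SIZE, and the feeder function are illustrative assumptions, not the module's internals.

```python
# Hypothetical sketch of the crawl queue flow described above.
# qout (LIFO) collects new tasks; a feeder moves blocks into qin (FIFO),
# which the download workers consume.
import queue

QUEUE_MAX_SIZE = 100_000
BLOCK_SIZE = 4  # illustrative; the real block size is variable

qout = queue.LifoQueue(maxsize=QUEUE_MAX_SIZE)  # tasks are produced here
qin = queue.Queue(maxsize=QUEUE_MAX_SIZE)       # workers consume from here

def make_task(url, kind, dom_tld, depth=0, retries=0, engine='httpx'):
    # Matches the task tuple layout shown in the bullets above.
    return (url, kind, dom_tld, depth, retries, engine)

# Seed landing pages (depth == 0)
for dom in ['domain1.com', 'domain2.com']:
    qout.put(make_task(f'https://{dom}', 'landing_page', dom))

def feed_block():
    # A feeder thread would move one block of tasks from qout to qin.
    moved = 0
    while moved < BLOCK_SIZE and not qout.empty():
        qin.put(qout.get())
        moved += 1
    return moved

feed_block()
while not qin.empty():
    url, kind, dom, depth, retries, engine = qin.get()
    print(kind, url)  # a real worker would download and dump the page here
```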

Spider - Depth > 0

  • It reads entries from USER_FOLDER/jsons/crawl_conn_meta*json for the domains in the list
  • It retrieves landing pages and sitemaps
  • If sitemaps are compressed, it uncompresses them
  • It extracts all links from landing pages and sitemaps
  • It creates task objects ('https://domain.com/link1', 'internals', 'domain.com', depth=2, retries=0, engine)
  • It uses the same engine that served the last successful request to the domain TLD
  • It adds these objects to qout
  • A feeder thread moves blocks from qout to qin, spreading them across domains
  • It downloads all links, saves them, and records the connection data in JSON
  • It parses the HTML, extracts all INTERNAL links, and follows them recursively, increasing the depth
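The internal-link extraction step can be sketched with the standard library alone. The task tuple layout follows this README; internal_links and LinkExtractor are illustrative helpers, not the module's actual API.

```python
# Hypothetical sketch: parse a dumped HTML page and keep only INTERNAL
# links (same host as the base URL), producing the next-depth tasks.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def internal_links(base_url, html):
    # Resolve relative hrefs and keep only same-host results.
    base_host = urlparse(base_url).netloc
    parser = LinkExtractor()
    parser.feed(html)
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_host:
            out.append(absolute)
    return out

page = '<a href="/about">About</a> <a href="https://other.com/x">ext</a>'
tasks = [(u, 'internals', 'domain.com', 2, 0, 'httpx')
         for u in internal_links('https://domain.com/', page)]
```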

Schema

This is the architectural schema of the crawler/spider.

USAGE

Install it

pip install ispider

First use

from ispider_core import ISpider

if __name__ == '__main__':
    # See the README for the complete list of available parameters
    config_overrides = {
        'USER_FOLDER': '/Your/Dump/Folder',
        'POOLS': 64,
        'ASYNC_BLOCK_SIZE': 32,
        'MAXIMUM_RETRIES': 2,
        'CRAWL_METHODS': [],
        'CODES_TO_RETRY': [430, 503, 500, 429],
        'CURL_INSECURE': True,
        'ENGINES': ['curl'],
        'EXCLUDED_DOMAINS': ['facebook.com', 'instagram.com']
    }

    # Specify a list of domains
    doms = ['domain1.com', 'domain2.com']  # add more domains as needed

    # Run
    with ISpider(domains=doms, **config_overrides) as spider:
        spider.run()

TO KNOW

At first execution:

  • It creates the folder settings.USER_FOLDER

  • It creates settings.USER_FOLDER/data/ with dumps/ and jsons/ subfolders

  • settings.USER_FOLDER/data/dumps holds the downloaded websites

  • settings.USER_FOLDER/data/jsons holds the connection results for every request

SETTINGS

The current default settings are:

    """
    ## *********************************
    ## GENERIC SETTINGS
    # Output folder for controllers, dumps and jsons
    USER_FOLDER = "~/.ispider/"

    # Log level
    LOG_LEVEL = 'DEBUG'

    ## Status codes that trigger a retry, e.g. status_code = 430
    CODES_TO_RETRY = [430, 503, 500, 429]
    MAXIMUM_RETRIES = 2

    # Delay before retrying after one of the status codes above
    TIME_DELAY_RETRY = 0

    ## Number of concurrent connections in the same process during crawling
    # Concurrent connections per process
    ASYNC_BLOCK_SIZE = 4

    # Concurrent processes (number of cores used; check your CPU spec)
    POOLS = 4

    # Max timeout for connecting
    TIMEOUT = 5

    # This needs to be a list.
    # curl is used as a subprocess, so make sure it is installed on your system.
    # Retries use the next available engine:
    # the script begins with the super-fast httpx;
    # if that fails, it tries curl;
    # if that fails again, it tries seleniumbase with headless and uc mode activated.
    ENGINES = ['httpx', 'curl', 'seleniumbase']

    CURL_INSECURE = False

    ## *********************************
    # CRAWLER
    # Max file size dumped to disk.
    # This avoids big sitemaps with errors.
    MAX_CRAWL_DUMP_SIZE = 52428800

    # Max depth to follow in sitemaps
    SITEMAPS_MAX_DEPTH = 2

    # Crawler will get robots and sitemaps too
    CRAWL_METHODS = ['robots', 'sitemaps']

    ## *********************************
    ## SPIDER
    # Queue max size; up to 1 billion is fine on normal systems
    QUEUE_MAX_SIZE = 100000

    # Max depth to follow in websites
    WEBSITES_MAX_DEPTH = 2

    # This is not implemented yet
    MAX_PAGES_POR_DOMAIN = 1000000

    # This tries to exclude some kinds of files.
    # It also tests the first bytes of content for some common file types,
    # to exclude them even when the online element has no extension.
    EXCLUDED_EXTENSIONS = [
        "pdf", "csv",
        "mp3", "jpg", "jpeg", "png", "gif", "bmp", "tiff", "webp", "svg", "ico", "tif",
        "jfif", "eps", "raw", "cr2", "nef", "orf", "arw", "rw2", "sr2", "dng", "heif", "avif", "jp2", "jpx",
        "wdp", "hdp", "psd", "ai", "cdr", "ppsx",
        "ics", "ogv",
        "mpg", "mp4", "mov", "m4v",
        "zip", "rar"
    ]

    # Exclude all URLs matching any of these regex patterns
    EXCLUDED_EXPRESSIONS_URL = [
        # r'test',
    ]

    # If not empty, follow only URLs that match these regex patterns
    INCLUDED_EXPRESSIONS_URL = [
        # r'/\d{4}/\d{2}/\d{2}/',
    ]

    # Exclude specific domains from crawling/spidering.
    # Accepts values like "example.com" or full URLs.
    EXCLUDED_DOMAINS = []

    """

NOTES

  • Deduplication is not 100% safe: pages are sometimes downloaded multiple times and only skipped by the file check. On ~10 domains, duplicate checking adds only a small delay, but on 10,000 domains, after 500k links, the seen-URL list grows so big that checking whether a link was already downloaded slowed the spider considerably (from 30,000 urls/min to 300 urls/min). That's why I preferred to avoid keeping a list and left just the "check file" step.
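The "check file" approach might look roughly like this: derive the dump path for a URL and skip the download if the file already exists, so no in-memory URL list is needed. The dumps path scheme (USER_FOLDER/data/dumps/dom_tld) follows this README; the hashed filename is purely an illustrative assumption.

```python
# Hedged sketch of file-existence deduplication: O(1) memory,
# at the cost of a filesystem stat per candidate URL.
import hashlib
from pathlib import Path

def dump_path(user_folder, dom_tld, url):
    # Illustrative filename scheme: hash of the URL, not the real one.
    name = hashlib.md5(url.encode()).hexdigest() + '.html'
    return Path(user_folder) / 'data' / 'dumps' / dom_tld / name

def already_downloaded(user_folder, dom_tld, url):
    # Skip the download when the dump file already exists on disk.
    return dump_path(user_folder, dom_tld, url).exists()
```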

SEO checks (modular)

You can run independent SEO checks during crawling/spidering. Results are stored in each JSON response row under seo_issues.

Available checks:

  • response_crawlability: flags 3xx/4xx/5xx, redirect chains, and timeouts.
  • broken_links: generic status >= 400 detector.
  • http_status_503: dedicated 503 detector.
  • title_meta_quality: validates <title> and meta description length/presence and flags title == h1.
  • h1_too_long: validates H1 length threshold.
  • heading_structure: checks h1 count and heading-order skips.
  • indexability_canonical: checks canonical presence/self-reference, homepage canonicals, and noindex directives.
  • schema_news_article: detects NewsArticle structured data and required properties.
  • image_optimization: flags missing image dimensions/ALT and oversized hero hints.
  • internal_linking: flags weak anchors, no internal links, and too many external links.
  • url_hygiene: validates URL length/case/params/special chars and the newsroom pattern /yyyy/mm/dd/slug/.
  • content_length: flags thin content (default <250 words).
  • security_headers: checks HSTS, CSP, and X-Frame-Options.

SEO issue codes (priority + short description)

Code Priority Description
BROKEN_LINK medium URL returned an HTTP status code >= 400.
CANONICAL_MISSING medium Canonical tag is missing.
CANONICAL_NOT_SELF low Canonical URL is not self-referential.
CANONICAL_TO_HOMEPAGE high Canonical points to homepage from an internal page.
CONTENT_TOO_THIN medium Visible content word count is below the configured minimum.
H1_MISSING high No H1 heading found on the page.
H1_MULTIPLE high More than one H1 heading found.
H1_TOO_LONG low H1 text length exceeds configured maximum (SEO_H1_MAX_CHARS).
HEADING_ORDER_SKIP low Heading hierarchy skips levels (for example h2 -> h4).
HERO_IMAGE_FETCHPRIORITY_MISSING low First image is missing fetchpriority=high.
HERO_IMAGE_TOO_LARGE medium Hero image appears larger than configured size threshold.
HTTP_3XX low Response is a redirect (3xx).
HTTP_4XX high Response is a client error (4xx).
HTTP_5XX high Response is a server error (5xx).
HTTP_503 high Response specifically returned 503 Service Unavailable.
IMAGE_ALT_MISSING low At least one image is missing ALT text.
IMAGE_LAZY_LOADING_MISSING low Non-hero image missing loading=lazy.
META_DESCRIPTION_LENGTH low Meta description length is outside recommended range.
META_DESCRIPTION_MISSING medium Meta description is missing.
NOINDEX_DETECTED high noindex detected in meta robots or x-robots-tag.
NO_INTERNAL_LINKS medium No internal links found on the page.
REDIRECT_CHAIN medium Redirect chain length is greater than 1.
REQUEST_TIMEOUT high Request timed out.
SCHEMA_NEWSARTICLE_MISSING high NewsArticle JSON-LD schema not found.
SCHEMA_REQUIRED_FIELDS_MISSING high NewsArticle schema is missing required fields.
SECURITY_HEADERS_MISSING low One or more security headers are missing (HSTS, CSP, X-Frame-Options).
TITLE_EQUALS_H1 low <title> is identical to H1.
TITLE_LENGTH medium <title> length is outside recommended range.
TITLE_MISSING high <title> tag is missing.
TOO_MANY_EXTERNAL_LINKS low Unique external domains exceed configured threshold.
URL_HAS_PARAMETERS low URL contains query parameters.
URL_NEWS_PATTERN_MISMATCH medium URL does not match expected /yyyy/mm/dd/slug/ pattern.
URL_SPECIAL_CHARS low URL path contains special characters.
URL_TOO_LONG low URL length exceeds configured threshold.
URL_UPPERCASE low URL path contains uppercase letters.
WEAK_ANCHOR_TEXT low Generic anchor texts detected (for example “read more”, “click here”).

Configure with settings:

config_overrides = {
    'SEO_CHECKS_ENABLED': True,
    'SEO_ENABLED_CHECKS': ['response_crawlability', 'title_meta_quality', 'schema_news_article'],
    'SEO_DISABLED_CHECKS': ['http_status_503'],
    'SEO_H1_MAX_CHARS': 70,
}

Tip for Google News-focused runs: combine INCLUDED_EXPRESSIONS_URL with a day filter (example: r'^.*/2026/02/07/.*$') and keep response_crawlability, indexability_canonical, and schema_news_article enabled.

To add a new check, create a class in ispider_core/seo/checks/ with a name attribute and a run(resp) method, and register it in ispider_core/seo/runner.py.
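A minimal custom check might look like this. It assumes the response object exposes status_code and html attributes and that checks return a list of issue dicts; verify those assumptions against the real runner before registering it.

```python
# Hedged sketch of a custom SEO check following the name/run(resp)
# interface mentioned above. The resp attributes and the issue dict
# shape are assumptions, not the documented contract.
import re

class HtmlLangCheck:
    name = 'html_lang'

    def run(self, resp):
        # Flag successful pages whose <html> tag has no lang attribute.
        issues = []
        html = getattr(resp, 'html', '') or ''
        if resp.status_code == 200 and not re.search(r'<html[^>]*\blang=', html, re.I):
            issues.append({'code': 'HTML_LANG_MISSING', 'priority': 'low'})
        return issues
```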
