
RoverCrawler 🕷️🚀

🇺🇸 English | 🇪🇸 Español


Single-file web crawler for site structure mapping

RoverCrawler is a single-file Python web crawler designed to explore websites and generate a tree-mapped representation of their structure. It supports interactive mode, command-line usage, colored tree output, rate limiting, and exporting results — all without external project scaffolding.

Built for clarity, portability, and controlled crawling.


✨ Features

  • 📄 Single Python file (rovercrawler.py)

  • 🌳 Tree-based site structure mapping (default output)

  • 🎨 Subtle colored output (cross-platform via colorama)

  • 🧭 Interactive configuration mode

  • 🖥️ Full CLI support (argparse)

  • 🔍 Domain-restricted crawling (optional external links)

  • 🛑 Safety limits (max depth & max pages)

  • ⏱️ Rate limiting to avoid hammering servers

  • 📊 Crawl statistics (pages, links, errors, speed)

  • 📦 Export results to:

    • JSON
    • Plain text
  • 💻 Cross-platform (Windows / Linux / macOS)


🖥️ Installation

Clone this repository (Git must be installed):

git clone https://github.com/URDev4ever/RoverCrawler.git
cd RoverCrawler/

📦 Requirements

Python 3.8+ recommended.

External dependencies (install once):

pip install requests beautifulsoup4 colorama

🚀 Usage

1️⃣ Interactive Mode (recommended for manual scans)

Just run the script without arguments:

python rovercrawler.py

You will be prompted to configure:

  • Target URL
  • Max crawl depth
  • Max pages
  • Verbose mode
  • External link following
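A minimal sketch of what collecting those settings interactively could look like. The prompt wording, defaults, and function name here are illustrative, not the script's actual internals:

```python
def prompt_config():
    """Collect crawl settings interactively, falling back to defaults on empty input."""
    url = input("Target URL: ").strip()
    depth = int(input("Max crawl depth [3]: ") or 3)
    pages = int(input("Max pages [100]: ") or 100)
    verbose = input("Verbose mode? [y/N]: ").strip().lower() == "y"
    external = input("Follow external links? [y/N]: ").strip().lower() == "y"
    return {"url": url, "depth": depth, "pages": pages,
            "verbose": verbose, "external": external}
```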

2️⃣ Command-Line Mode (CLI)

Basic usage:

python rovercrawler.py https://example.com

With options:

python rovercrawler.py https://example.com -d 4 -p 200 -v --external

⚙️ Command-Line Options

Option              Description
------              -----------
url                 Target URL to crawl
-d, --depth         Maximum crawl depth
-p, --pages         Maximum pages to crawl
-v, --verbose       Enable verbose output
-e, --external      Follow external (out-of-domain) links
-t, --timeout       Request timeout (seconds)
--export-json FILE  Export results as JSON
--export-txt FILE   Export results as plain text
--no-banner         Disable ASCII banner
--no-colors         Disable colored output
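An argparse setup mirroring the table above could be sketched as follows. The defaults shown here are illustrative guesses, not the script's actual values:

```python
import argparse

def build_parser():
    """Build a parser matching the option table; defaults are illustrative."""
    p = argparse.ArgumentParser(description="RoverCrawler - site structure mapper")
    p.add_argument("url", help="Target URL to crawl")
    p.add_argument("-d", "--depth", type=int, default=3, help="Maximum crawl depth")
    p.add_argument("-p", "--pages", type=int, default=100, help="Maximum pages to crawl")
    p.add_argument("-v", "--verbose", action="store_true", help="Enable verbose output")
    p.add_argument("-e", "--external", action="store_true",
                   help="Follow external (out-of-domain) links")
    p.add_argument("-t", "--timeout", type=float, default=10.0,
                   help="Request timeout (seconds)")
    p.add_argument("--export-json", metavar="FILE", help="Export results as JSON")
    p.add_argument("--export-txt", metavar="FILE", help="Export results as plain text")
    p.add_argument("--no-banner", action="store_true", help="Disable ASCII banner")
    p.add_argument("--no-colors", action="store_true", help="Disable colored output")
    return p
```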

🌳 Output Example (Tree View)

/
├── /about
│   ├── /team
│   └── /history
├── /blog
│   ├── /post-1
│   └── /post-2
└── /contact

  • Internal links are shown in cyan
  • External links (if enabled) are marked and colored yellow
  • Output is depth-aware and loop-safe
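Rendering a tree like the one above from a nested path mapping can be sketched with a small recursive generator. The `{path: children}` dict shape is an assumption for illustration, not the script's actual data structure:

```python
def render_tree(node, prefix=""):
    """Yield box-drawing lines for a nested {path: children} dict (assumed shape)."""
    items = list(node.items())
    for i, (path, children) in enumerate(items):
        last = i == len(items) - 1
        yield prefix + ("└── " if last else "├── ") + path
        # Children of a last sibling get blank padding; others keep the "│" rail.
        yield from render_tree(children, prefix + ("    " if last else "│   "))

site = {"/about": {"/team": {}, "/history": {}}, "/contact": {}}
print("/")
for line in render_tree(site):
    print(line)
```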

📤 Exporting Results

Export to JSON

python rovercrawler.py https://example.com --export-json results.json

The JSON preserves the tree structure, ideal for post-processing or visualization.
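The exact export schema depends on the script, but assuming the JSON is a nested mapping of path to children (matching the tree view above), a post-processing step such as counting all discovered pages could look like:

```python
import json

def count_pages(tree):
    """Recursively count nodes in a nested {path: children} mapping (assumed schema)."""
    return sum(1 + count_pages(children) for children in tree.values())

# A tree shaped like the output example above:
tree = {"/about": {"/team": {}, "/history": {}},
        "/blog": {"/post-1": {}, "/post-2": {}},
        "/contact": {}}
print(count_pages(tree))  # 7
print(json.dumps(tree, indent=2))
```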


Export to Plain Text

python rovercrawler.py https://example.com --export-txt results.txt

  • Colors are automatically stripped
  • Includes crawl metadata and statistics
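Stripping colors before writing plain text typically means removing ANSI escape sequences. A common way to do this (a sketch, not necessarily how the script does it) is with a regular expression:

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def strip_colors(text):
    """Remove ANSI color escape sequences before writing plain-text output."""
    return ANSI_RE.sub("", text)

print(strip_colors("\x1b[36m/about\x1b[0m"))  # /about
```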

📊 Crawl Statistics

At the end of each crawl, RoverCrawler reports:

  • Pages crawled
  • Links discovered
  • Errors encountered
  • Total time elapsed
  • Average crawl speed (pages/sec)

Example:

Pages crawled: 87
Links found:  412
Errors:       2
Time elapsed: 12.4 seconds
Avg speed:    7.0 pages/sec
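The average speed is simply pages crawled divided by elapsed time, rounded for display:

```python
pages, elapsed = 87, 12.4
print(f"Avg speed: {pages / elapsed:.1f} pages/sec")  # Avg speed: 7.0 pages/sec
```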

🧠 Technical Notes

  • Uses BFS (Breadth-First Search) for predictable tree depth
  • Normalizes URLs (scheme, domain, path)
  • Skips common binary/static file extensions
  • Ignores fragments, mailto, javascript, tel links
  • Enforces rate limiting per request
  • Uses a single requests.Session() for efficiency
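The notes above can be combined into a minimal BFS crawl loop. This is a sketch under assumptions (function names, skip lists, and defaults are illustrative, and it only follows same-domain links), not RoverCrawler's actual implementation:

```python
import time
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

SKIP_EXT = (".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")
SKIP_SCHEMES = ("mailto:", "javascript:", "tel:")

def crawl(start_url, max_depth=3, max_pages=100, delay=0.5, timeout=10):
    session = requests.Session()        # one session reused for every request
    root = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])     # BFS: FIFO queue of (url, depth)
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            resp = session.get(url, timeout=timeout)
        except requests.RequestException:
            continue                    # count as an error, keep crawling
        pages.append(url)
        if depth < max_depth:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                href = urldefrag(urljoin(url, a["href"]))[0]  # drop #fragments
                if (href.lower().startswith(SKIP_SCHEMES)
                        or href.lower().endswith(SKIP_EXT)
                        or urlparse(href).netloc != root      # domain-restricted
                        or href in seen):
                    continue
                seen.add(href)
                queue.append((href, depth + 1))
        time.sleep(delay)               # rate limit between requests
    return pages
```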

⚠️ Disclaimer

RoverCrawler is intended for educational, research, and legitimate testing purposes. Always respect:

  • Website terms of service
  • robots.txt
  • Applicable local laws

You are responsible for how you use this tool.


Made with <3 by URDev.