IP reputation aggregator from 127 threat feeds with fast binary search lookups.


tn3w/IPBlocklist


🔒 IPBlocklist

Threat intelligence aggregator that collects, processes, and serves IP reputation data from 127 security feeds into an optimized binary format for fast lookups.

🚀 Key Features

  • ✅ Fast IP lookups in <1ms using binary search
  • ✅ 5.0M+ IPs and CIDR ranges from 127 threat intelligence feeds
  • ✅ Malware C&C servers, botnets, spam networks, compromised hosts
  • ✅ VPN providers, Tor nodes, datacenter/hosting ASNs
  • ✅ Optimized integer storage for minimal memory footprint
  • ✅ Support for both IPv4 and IPv6
  • ✅ Automated daily updates via GitHub Actions

📥 Download & Extract

The dataset is available as a downloadable binary file.

Threat Intelligence Data

The threat intelligence dataset is approximately 12MB.

# Download the file
wget https://github.com/tn3w/IPBlocklist/releases/latest/download/blocklist.bin

# Verify the file
ls -lh blocklist.bin
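
Beyond checking the file size, the header can be inspected to confirm the download looks like a blocklist. The sketch below assumes the layout documented under blocklist.bin (4-byte timestamp, 2-byte feed count) and little-endian byte order, as used by the Python loader in this README. `read_header` and `describe_header` are hypothetical helper names, and the sample bytes are made up for illustration.

```python
import io
import struct
from datetime import datetime, timezone


def read_header(stream):
    """Return (timestamp, feed_count) from the start of a blocklist stream."""
    header = stream.read(6)
    if len(header) != 6:
        raise ValueError("file too short to contain a blocklist header")
    # "<IH" = little-endian u32 timestamp followed by u16 feed count
    timestamp, feed_count = struct.unpack("<IH", header)
    return timestamp, feed_count


def describe_header(stream):
    """Format the header as a short human-readable summary."""
    timestamp, feed_count = read_header(stream)
    built = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return f"built {built:%Y-%m-%d %H:%M} UTC, {feed_count} feeds"


# In-memory stand-in for the real file (illustrative values only);
# with a real download, pass open("blocklist.bin", "rb") instead.
sample = io.BytesIO(struct.pack("<IH", 1_700_000_000, 127))
print(describe_header(sample))
```

A build date far in the past or a feed count of zero would suggest a stale or truncated download.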

📊 Architecture

feeds.json ──────────> aggregator.py ──────────> blocklist.bin
  (config)              (processor)              (threat intel)

📖 Overview

IPBlocklist downloads threat intelligence from multiple sources (malware C&C servers, botnets, spam networks, VPN providers, Tor nodes, etc.) and converts it into a compact binary format. IP addresses and CIDR ranges are stored as delta-encoded, varint-compressed integer ranges; once decoded into sorted lists, they can be looked up with binary search.

The system uses open-source security feeds configured in feeds.json, which are processed by aggregator.py into a unified blocklist.bin file.

📁 Data Models

feeds.json

Configuration file defining all threat intelligence sources. Each feed is an independent object with complete metadata.

Structure: Array of feed objects

[
    {
        "name": "feodotracker",
        "url": "https://feodotracker.abuse.ch/downloads/ipblocklist.txt",
        "description": "Feodo Tracker - Botnet C&C",
        "regex": "^(?![#;/])([0-9a-fA-F:.]+(?:/\\d+)?)",
        "base_score": 1.0,
        "confidence": 0.95,
        "flags": ["is_malware", "is_botnet", "is_c2_server"],
        "categories": ["malware", "botnet"]
    }
]
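
To illustrate how the `regex` field might be applied, the sketch below runs the pattern from the example feed against a few made-up lines. It assumes the pattern is matched once per line and that candidates are then validated with the standard `ipaddress` module; the README does not say how aggregator.py does this, and `extract_entries` is a hypothetical name.

```python
import ipaddress
import re

# Same pattern as in the feeds.json example, without the JSON escaping.
FEED_REGEX = re.compile(r"^(?![#;/])([0-9a-fA-F:.]+(?:/\d+)?)")


def extract_entries(text):
    """Return the IP/CIDR strings found at the start of each line."""
    entries = []
    for line in text.splitlines():
        match = FEED_REGEX.match(line.strip())
        if not match:
            continue  # comment line, blank line, or no leading address
        candidate = match.group(1)
        try:
            # The character class also matches stray hex letters,
            # so reject anything that is not a real address or network.
            ipaddress.ip_network(candidate, strict=False)
        except ValueError:
            continue
        entries.append(candidate)
    return entries


sample = """# comment line
192.0.2.1
198.51.100.0/24 ; trailing note
2001:db8::1
example.com
"""
print(extract_entries(sample))
```

The validation step matters because the character class alone would accept the leading "e" of "example.com" as a candidate.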

Required Fields:

  • name: Unique identifier for the feed
  • url: Download URL for the threat list
  • description: Human-readable description
  • regex: Pattern to extract IPs/CIDRs from feed content
  • base_score: Threat severity (0.0-1.0)
  • confidence: Data reliability (0.0-1.0)
  • flags: Boolean indicators (is_anycast, is_botnet, is_brute_force, is_c2_server, is_cdn, is_cloud, is_compromised, is_datacenter, is_forum_spammer, is_isp, is_malware, is_mobile, is_phishing, is_proxy, is_scanner, is_spammer, is_tor, is_vpn, is_web_attacker)
  • categories: Categories for scoring (anonymizer, attacks, botnet, compromised, infrastructure, malware, spam)

Optional Fields:

  • provider_name: VPN/hosting provider name
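
A feed object can be checked against the field list above before it is added to feeds.json. This is a minimal sketch, not part of the project: `validate_feed` is a hypothetical helper, and the type checks are assumptions inferred from the example object.

```python
REQUIRED_FIELDS = {
    "name": str,
    "url": str,
    "description": str,
    "regex": str,
    "base_score": (int, float),
    "confidence": (int, float),
    "flags": list,
    "categories": list,
}
KNOWN_CATEGORIES = (
    "anonymizer", "attacks", "botnet", "compromised",
    "infrastructure", "malware", "spam",
)


def validate_feed(feed):
    """Return a list of problems found in one feed object (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in feed:
            problems.append(f"missing field: {field}")
        elif not isinstance(feed[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    for field in ("base_score", "confidence"):
        value = feed.get(field)
        if isinstance(value, (int, float)) and not 0.0 <= value <= 1.0:
            problems.append(f"{field} outside 0.0-1.0: {value}")
    categories = feed.get("categories")
    if isinstance(categories, list):
        for category in categories:
            if category not in KNOWN_CATEGORIES:
                problems.append(f"unknown category: {category}")
    return problems


example_feed = {
    "name": "feodotracker",
    "url": "https://feodotracker.abuse.ch/downloads/ipblocklist.txt",
    "description": "Feodo Tracker - Botnet C&C",
    "regex": "^(?![#;/])([0-9a-fA-F:.]+(?:/\\d+)?)",
    "base_score": 1.0,
    "confidence": 0.95,
    "flags": ["is_malware", "is_botnet", "is_c2_server"],
    "categories": ["malware", "botnet"],
}
print(validate_feed(example_feed))
```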

datacenter_asns.json

List of Autonomous System Numbers (ASNs) associated with datacenter and hosting providers.

Structure: Array of ASN strings

["15169", "16509", "13335", "8075", "14061"]

This file is automatically generated when processing the datacenter_asns feed and can be used for O(1) ASN lookups to identify datacenter traffic.

blocklist.bin

Processed binary output with delta-encoded IP ranges for fast lookups.

Structure: Binary format with varint encoding

[4 bytes: timestamp (u32)]
[2 bytes: feed count (u16)]
For each feed:
  [1 byte: name length (u8)]
  [N bytes: feed name (utf-8)]
  [4 bytes: range count (u32)]
  For each range:
    [varint: from_delta]
    [varint: range_size]

Encoding:

  • Timestamp: Unix timestamp as 32-bit unsigned integer (the Python loader in this README reads all fixed-width fields as little-endian)
  • Feed names: Length-prefixed UTF-8 strings
  • Ranges: Delta-encoded start positions with varint compression
  • Range size: End - start encoded as varint
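
The write side of this format is not shown in this README, so the following is a sketch of an encoder inferred from the layout above and from the Python loader's decoding logic (each `from_delta` is relative to the previous range's start, little-endian fixed-width fields). `write_varint`, `encode_feed`, and `encode_blocklist` are hypothetical names.

```python
import struct


def write_varint(value):
    """Encode an unsigned int with 7 data bits per byte; high bit means more bytes follow."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_feed(name, ranges):
    """Encode one feed; ranges must be (start, end) pairs sorted by start."""
    raw_name = name.encode("utf-8")
    if len(raw_name) > 255:
        raise ValueError("feed name does not fit in a u8 length prefix")
    out = bytearray(struct.pack("<B", len(raw_name)) + raw_name)
    out += struct.pack("<I", len(ranges))
    previous_start = 0
    for start, end in ranges:
        out += write_varint(start - previous_start)  # from_delta
        out += write_varint(end - start)             # range_size
        previous_start = start
    return bytes(out)


def encode_blocklist(timestamp, feeds):
    """Encode the header followed by every feed in the dict."""
    out = bytearray(struct.pack("<IH", timestamp, len(feeds)))
    for name, ranges in feeds.items():
        out += encode_feed(name, ranges)
    return bytes(out)


blob = encode_blocklist(1, {"a": [(10, 12), (20, 20)]})
print(blob.hex(" "))
```

Because starts are sorted, each delta is small relative to the absolute address, which is what keeps most varints to a few bytes.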

Integer Conversion:

  • IPv4: 10.0.0.1 → 167772161
  • IPv6: 2001:db8::1 → 42540766411282592856903984951653826561
  • CIDR: 10.0.0.0/27 → (167772160, 167772191) (network to broadcast)
  • Single IP: Stored as range with size 0
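
These conversions can be reproduced with Python's standard `ipaddress` module. The helper names below (`ip_to_int`, `cidr_to_range`, `entry_to_range`) are hypothetical and only illustrate the mapping; they are not taken from aggregator.py.

```python
import ipaddress


def ip_to_int(ip):
    """Convert an IPv4 or IPv6 address string to its integer value."""
    return int(ipaddress.ip_address(ip))


def cidr_to_range(cidr):
    """Convert a CIDR block to an inclusive (first, last) integer pair."""
    network = ipaddress.ip_network(cidr, strict=False)
    return int(network.network_address), int(network.broadcast_address)


def entry_to_range(entry):
    """Convert a feed entry to a range; a single IP becomes a range of size 0."""
    if "/" in entry:
        return cidr_to_range(entry)
    value = ip_to_int(entry)
    return value, value


print(ip_to_int("10.0.0.1"))         # 167772161
print(ip_to_int("2001:db8::1"))      # 42540766411282592856903984951653826561
print(cidr_to_range("10.0.0.0/27"))  # (167772160, 167772191)
print(entry_to_range("10.0.0.1"))    # (167772161, 167772161)
```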

⚙️ aggregator.py

Downloads and processes all feeds in parallel, handling multiple formats and edge cases.
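
As a rough illustration of the parallel download step, the sketch below uses `ThreadPoolExecutor` with 10 workers and collects per-feed failures instead of aborting the whole run. The function names, the timeout, and the error-handling policy are assumptions and may differ from what aggregator.py does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_text(url, timeout=30):
    """Download one feed and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_feeds(feeds, fetch=fetch_text, max_workers=10):
    """Fetch every feed concurrently; return (results, errors) keyed by feed name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, feed["url"]): feed["name"] for feed in feeds}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as error:  # one bad feed should not stop the others
                errors[name] = str(error)
    return results, errors
```

Passing the fetch function as a parameter keeps the sketch testable without network access, for example `download_feeds(feeds, fetch=lambda url: "192.0.2.1\n")`.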

Features:

  • Parallel downloads with ThreadPoolExecutor (10 workers)
  • IPv4/IPv6 support with embedded address extraction
  • CIDR range expansion to [start, end] pairs
  • ASN resolution for datacenter and Tor networks
  • Deduplication and sorting for binary search
  • Regex-based parsing for diverse feed formats
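
Binary search over ranges is simplest when the sorted ranges do not overlap. The README mentions deduplication and sorting but does not say whether overlapping ranges are merged, so the following is only a sketch of one way to do it; `merge_ranges` is a hypothetical helper.

```python
def merge_ranges(ranges):
    """Deduplicate, sort, and merge overlapping or adjacent (start, end) pairs."""
    merged = []
    for start, end in sorted(set(ranges)):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous range: extend it.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


raw = [(5, 6), (0, 100), (7, 8), (0, 100), (101, 110), (200, 200)]
print(merge_ranges(raw))  # [(0, 110), (200, 200)]
```

Merging loses the distinction between a nested range and its parent, which is acceptable when the only question is whether an IP is covered by a feed.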

Special Handling:

  • datacenter_asns: Resolves ASNs to IP ranges via the RIPE API
  • tor_onionoo: Combines Tor relay list with known Tor ASNs
  • IPv6 mapped addresses: Extracts embedded IPv4 (::ffff:192.0.2.1)
  • 6to4 tunnels: Extracts IPv4 from 2002::/16 addresses
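
The two embedded-address cases can be handled with properties that Python's `ipaddress` module already provides (`IPv6Address.ipv4_mapped` and `IPv6Address.sixtofour`). The sketch below shows the idea; `normalize_address` is a hypothetical name, and aggregator.py may implement the extraction differently.

```python
import ipaddress


def normalize_address(ip):
    """Return the embedded IPv4 address if there is one, else the address itself."""
    address = ipaddress.ip_address(ip)
    if isinstance(address, ipaddress.IPv6Address):
        if address.ipv4_mapped is not None:  # ::ffff:a.b.c.d
            return address.ipv4_mapped
        if address.sixtofour is not None:    # 2002::/16 (6to4)
            return address.sixtofour
    return address


for candidate in ["::ffff:192.0.2.1", "2002:c000:201::1", "2001:db8::1", "10.0.0.1"]:
    print(candidate, "->", normalize_address(candidate))
```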

Usage:

python aggregator.py

Output: Creates or updates blocklist.bin with all processed feeds, and datacenter_asns.json with the datacenter ASN list.

🐍 Python Lookup Examples

Database Loader

import struct
import ipaddress
from typing import Dict, List, Tuple, Optional


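def read_varint(f) -> int:
    """Decode one unsigned varint: 7 data bits per byte, high bit set means more bytes follow."""
    result = shift = 0
    while True:
        chunk = f.read(1)
        if not chunk:
            raise EOFError("unexpected end of file inside varint")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result
        shift += 7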


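def binary_search(ranges: List[Tuple[int, int]], target: int) -> Optional[int]:
    """Return the index of a range containing target (the smallest one seen), or None.

    Assumes ranges are sorted by start and do not overlap; with nested or
    overlapping ranges, a containing range can be missed.
    """
    left, right = 0, len(ranges) - 1
    best_match = None
    best_size = float('inf')

    while left <= right:
        mid = (left + right) // 2
        start, end = ranges[mid]

        if start <= target <= end:
            size = end - start
            if size < best_size:
                best_size = size
                best_match = mid
            # Keep looking to the right for a tighter match.
            left = mid + 1
        elif target < start:
            right = mid - 1
        else:
            left = mid + 1

    return best_match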


class BlocklistLoader:
    def __init__(self, path: str = "blocklist.bin"):
        self.feeds: Dict[str, List[Tuple[int, int]]] = {}
        self.timestamp: int = 0
        self._load(path)

    def _load(self, path: str):
        with open(path, "rb") as f:
            self.timestamp = struct.unpack("<I", f.read(4))[0]
            feed_count = struct.unpack("<H", f.read(2))[0]

            for _ in range(feed_count):
                name_len = struct.unpack("<B", f.read(1))[0]
                feed_name = f.read(name_len).decode("utf-8")
                range_count = struct.unpack("<I", f.read(4))[0]

                ranges = []
                current = 0
                for _ in range(range_count):
                    current += read_varint(f)
                    size = read_varint(f)
                    ranges.append((current, current + size))

                self.feeds[feed_name] = ranges

    def check_ip(self, ip: str) -> List[str]:
        target = int(ipaddress.ip_address(ip))
        matches = []

        for feed_name, ranges in self.feeds.items():
            if binary_search(ranges, target) is not None:
                matches.append(feed_name)

        return matches


blocklist = BlocklistLoader()
result = blocklist.check_ip("8.8.8.8")
print(result)
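
As an alternative to the hand-written binary_search above, the standard library's `bisect` module can do the same lookup. This is a sketch under the assumption that the ranges within a feed are sorted and non-overlapping; `RangeIndex` is a hypothetical class, not part of the project.

```python
from bisect import bisect_right


class RangeIndex:
    """Point-in-range lookup over sorted, non-overlapping (start, end) pairs."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        # Separate list of starts so bisect compares plain integers.
        self.starts = [start for start, _ in self.ranges]

    def find(self, target):
        """Return the range containing target, or None."""
        # Index of the last range whose start is <= target.
        index = bisect_right(self.starts, target) - 1
        if index >= 0 and self.ranges[index][1] >= target:
            return self.ranges[index]
        return None


index = RangeIndex([(10, 20), (30, 30), (40, 50)])
print(index.find(15))  # (10, 20)
print(index.find(25))  # None
```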

Batch Lookup

def check_batch(blocklist: BlocklistLoader, ip_list: List[str]) -> Dict[str, List[str]]:
    results = {}
    for ip in ip_list:
        results[ip] = blocklist.check_ip(ip)
    return results


ips = ["10.0.0.1", "192.168.1.1", "8.8.8.8"]
results = check_batch(blocklist, ips)
for ip, feeds in results.items():
    print(f"{ip}: {feeds}")

Datacenter ASN Lookup

import json

def load_datacenter_asns(asn_file="datacenter_asns.json"):
    """Load datacenter ASNs into a set for O(1) lookups."""
    try:
        with open(asn_file) as f:
            return set(json.load(f))
    except (OSError, json.JSONDecodeError) as e:
        print(f"Error loading ASNs: {e}")
        return set()

def is_datacenter_asn(asn, asns=None):
    """Check if ASN belongs to a datacenter."""
    # Reload only when no set was passed; an empty set is a valid argument.
    if asns is None:
        asns = load_datacenter_asns()
    return asn.upper().replace("AS", "").strip() in asns

asns = load_datacenter_asns()
for asn in ["AS16509", "AS13335", "AS15169"]:
    result = "is" if is_datacenter_asn(asn, asns) else "is not"
    print(f"{asn} {result} a datacenter ASN")

Reputation Scoring

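import json
from typing import Dict

# Reuses BlocklistLoader and the blocklist instance from the Database Loader example.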


with open("feeds.json") as f:
    feeds_config = json.load(f)

sources = {feed["name"]: feed for feed in feeds_config}


def check_ip_with_reputation(blocklist: BlocklistLoader, ip: str) -> Dict:
    matches = blocklist.check_ip(ip)

    if not matches:
        return {"ip": ip, "score": 0.0, "feeds": []}

    flags = {}
    scores = {
        "anonymizer": [], "attacks": [], "botnet": [],
        "compromised": [], "infrastructure": [], "malware": [], "spam": []
    }

    for list_name in matches:
        source = sources.get(list_name)
        if not source:
            continue

        for flag in source.get("flags", []):
            flags[flag] = True

        provider = source.get("provider_name")
        if provider:
            flags["vpn_provider"] = provider

        base_score = source.get("base_score", 0.5)
        for category in source.get("categories", []):
            if category in scores:
                scores[category].append(base_score)

    total = 0.0
    for category_scores in scores.values():
        if not category_scores:
            continue
        combined = 1.0
        for score in sorted(category_scores, reverse=True):
            combined *= 1.0 - score
        total += 1.0 - combined

    return {
        "ip": ip,
        "score": min(total / 1.5, 1.0),
        "feeds": matches,
        **flags
    }


result = check_ip_with_reputation(blocklist, "8.8.8.8")
print(json.dumps(result, indent=2))
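
The scoring arithmetic above can be isolated into two small functions. Within a category, scores are combined as 1 - (1 - s1)(1 - s2)..., so several weak signals add up without exceeding 1.0; the per-category results are then summed, divided by 1.5, and capped at 1.0. `combine_category` and `overall_score` are hypothetical names, and the input values are invented for the worked example.

```python
def combine_category(scores):
    """Combine scores within one category: 1 minus the product of (1 - score)."""
    remaining = 1.0
    for score in scores:
        remaining *= 1.0 - score
    return 1.0 - remaining


def overall_score(category_scores):
    """Sum the per-category results, scale by 1.5, and cap at 1.0."""
    total = sum(combine_category(scores) for scores in category_scores.values() if scores)
    return min(total / 1.5, 1.0)


# malware: 1 - (0.2 * 0.5) = 0.9, spam: 0.3, total 1.2, score about 0.8
example = {"malware": [0.8, 0.5], "spam": [0.3], "botnet": []}
print(round(overall_score(example), 3))
```

With this scaling, a single category can contribute at most about 0.67 to the final score, so reaching 1.0 requires matches in more than one category.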

⚡ Performance Characteristics

Dataset Statistics:

  • Total feeds: 127
  • Individual IPs: 4.4M (4.4M IPv4, 6K IPv6)
  • CIDR ranges: 552K (545K IPv4, 7K IPv6)
  • Total entries: 5.0M
  • File size: 12MB (compressed with varint encoding)

Lookup Complexity:

  • Binary search: O(log n) per feed
  • Typical lookup: <1ms for 127 feeds with 5.0M entries
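
To make the O(log n) figure concrete, the sketch below computes the worst-case number of loop iterations for a binary search over n sorted entries, which is floor(log2 n) + 1. The helper name is hypothetical, and the result is a count of comparisons, not a timing measurement.

```python
def worst_case_probes(entry_count):
    """Maximum binary search iterations over entry_count sorted items."""
    # For positive n, int.bit_length() equals floor(log2(n)) + 1.
    return entry_count.bit_length() if entry_count > 0 else 0


print(worst_case_probes(5_000_000))        # 23
print(127 * worst_case_probes(5_000_000))  # 2921
```

Even in the hypothetical worst case where every one of the 127 feeds held all 5.0M entries, a lookup would need fewer than 3,000 probes.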

Memory Usage:

  • Delta encoding: ~2-3 bytes per range on disk (varint compressed)
  • Feed names: Length-prefixed UTF-8 strings
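  • Total size: ~12MB for the encoded data; a loader that decodes every range into Python tuples, like the example loader in this README, uses considerably more RAM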

💡 Use Cases

  • API Rate Limiting: Block known malicious IPs
  • Fraud Detection: Flag VPN/proxy/datacenter traffic
  • Security Analytics: Enrich logs with threat intelligence
  • Access Control: Restrict Tor exit nodes or anonymizers
  • Compliance: Block traffic from sanctioned networks

📜 License

Copyright 2025 TN3W

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.