
SchemaForge 🔨

Transform JSON Chaos into Analytics-Ready Data

Python 3.8+ • License: MIT • Code style: Black

Quick Start • Features • Documentation • CLI Reference


🎯 What is SchemaForge?

SchemaForge automatically discovers JSON schemas and converts them to analytics-ready formats. Stop wasting hours on manual data wrangling; let SchemaForge handle type detection, schema inference, and format conversion in seconds.

The Problem vs. The Solution

❌ Traditional Workflow

📄 JSON Files
    ↓ (manual analysis)
📝 Write Schemas
    ↓ (write conversion code)
🐛 Debug Type Errors
    ↓ (fix, repeat)
⏰ Hours Later...
    ↓
✅ Ready for Analysis

✅ SchemaForge Workflow

📄 JSON Files
    ↓ (one command)
🔍 Auto Schema Discovery
    ↓ (one command)
✅ Parquet/CSV/Avro/ORC/Feather
    ↓
⚡ Minutes Later!

Time Saved: Hours → Minutes | Errors: Many → Zero


✨ Features

🧠 Intelligent Schema Discovery

  • Advanced Type Detection - Strings, numbers, booleans, timestamps, URLs, emails, UUIDs, IPs, and more
  • Smart Pattern Recognition - Automatically detects enums, embedded JSON, and numeric strings
  • Statistical Analysis - Collects min/max, length stats, and value distributions
  • Nested Structure Handling - Flattens complex JSON with intuitive dot notation
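
For instance, the flattening step turns a nested record into dotted columns. A minimal illustration (input invented for the example):

record = {"user": {"id": 7, "address": {"city": "Berlin"}}}

# After flattening, the resulting columns would be:
#   user.id           -> 7
#   user.address.city -> "Berlin"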

πŸ“ Universal Format Support

Input: 11+ JSON formats auto-detected

  • Standard JSON Arrays • NDJSON • Wrapper Objects • GeoJSON • Socrata/OpenData • Single Objects • Python Literals • Embedded JSON

Output: 5 analytics-ready formats

  • Parquet (recommended) • CSV • Avro • ORC • Feather

🚀 Production-Ready Tools

  • Schema Validation - Verify data quality before processing
  • Performance Benchmarking - Measure and optimize your pipelines
  • Batch Processing - Convert multiple files in one command
  • Sampling Support - Handle massive files efficiently

⚡ High-Performance Architecture

  • Memory-Safe Processing - Automatic 80% memory limit protection (no OOM crashes)
  • Dynamic Scaling - Auto-scales worker processes based on available RAM
  • Smart Chunking - Automatically chunks large files to fit in memory
  • Parallel Processing - Multi-core conversion for maximum throughput

💾 Memory Management

SchemaForge is built for big data. It automatically manages system resources to ensure stability and speed.

Intelligent Resource Scaling

  1. Memory Protection: Continuously monitors RAM usage, capping at 80% (default) to keep your system responsive.
  2. Adaptive Workers: Calculates the optimal number of parallel workers (1-8) based on your specific hardware config.
  3. Smart Chunking: Large files (>50MB) are automatically processed in memory-safe chunks.

💡 Zero Configuration Needed: These optimizations happen automatically. Just run convert and let SchemaForge handle the resources.
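
For illustration only (this is not SchemaForge's internal code), the adaptive-worker rule above can be sketched in a few lines, assuming psutil for the RAM query (psutil is not among the listed requirements):

import psutil  # assumed dependency, used here only to read available RAM

MEMORY_CAP = 0.80        # the default 80% ceiling described above
CHUNK_THRESHOLD_MB = 50  # files above this size get chunked

def plan_workers(max_workers=8, per_worker_gb=1.0):
    """Pick a worker count (1-8) that keeps estimated usage under the cap.
    per_worker_gb is a hypothetical per-worker memory estimate."""
    budget_gb = psutil.virtual_memory().available / 1e9 * MEMORY_CAP
    return max(1, min(max_workers, int(budget_gb // per_worker_gb)))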

📦 Installation

# Clone the repository
git clone https://github.com/Syntax-Error-1337/SchemaForge.git
cd SchemaForge

# Install dependencies
pip install -r requirements.txt

Requirements: Python 3.8+, pandas, pyarrow, fastavro, ijson


🚀 Quick Start

Three Simple Steps

# 1️⃣ Place your JSON files in the data directory
cp your_data/*.json data/

# 2️⃣ Discover schemas
python -m src.cli scan-schemas

# 3️⃣ Convert to your preferred format
python -m src.cli convert --format parquet

That's it! Your data is now in output/, ready for analysis. 🎉

What Just Happened?

  • ✅ All JSON structures automatically analyzed
  • ✅ Types inferred with statistical confidence
  • ✅ Nested objects flattened intelligently
  • ✅ Schema reports generated (Markdown + JSON)
  • ✅ Data converted to optimized format
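
The converted files are ordinary columnar files, so any Arrow-compatible tool can read them directly. For example, with pandas (the filename below is a placeholder for whatever your input was called):

import pandas as pd

df = pd.read_parquet("output/your_data.parquet")  # placeholder filename
print(df.dtypes)   # the inferred types, carried into the columnar format
print(df.head())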

🔧 CLI Reference

Core Commands

Command         Purpose                           Example
scan-schemas    Analyze JSON structure            python -m src.cli scan-schemas
convert         Transform to analytics formats    python -m src.cli convert --format parquet
validate        Verify schema compliance          python -m src.cli validate
benchmark       Measure performance               python -m src.cli benchmark

scan-schemas - Discover JSON Schemas

python -m src.cli scan-schemas [OPTIONS]

Options:

  • --data-dir - Input directory (default: data)
  • --output-report - Report path (default: reports/schema_report.md)
  • --max-sample-size - Sample size for large files
  • --sampling-strategy - first or random sampling

Examples:

# Basic usage
python -m src.cli scan-schemas

# Large files with random sampling
python -m src.cli scan-schemas --max-sample-size 10000 --sampling-strategy random

# Custom directory
python -m src.cli scan-schemas --data-dir my_json_data --output-report custom/schema.md

Output:

  • schema_report.md - Human-readable documentation
  • schema_report.json - Machine-readable schema
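
Since schema_report.json is plain JSON, it is easy to consume programmatically. A minimal sketch (the report's exact layout depends on your data, so only generic inspection is shown):

import json

with open("reports/schema_report.json") as f:
    report = json.load(f)

print(list(report))  # top-level keys of the machine-readable report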

convert - Transform to Analytics Formats

python -m src.cli convert --format [parquet|csv|avro|orc|feather] [OPTIONS]

Options:

  • --format - Required: Output format
  • --data-dir - Input directory (default: data)
  • --output-dir - Output directory (default: output)
  • --schema-report - Schema JSON path (default: reports/schema_report.json)

Examples:

# Convert to Parquet (recommended for big data)
python -m src.cli convert --format parquet

# Convert to CSV (universal compatibility)
python -m src.cli convert --format csv

# Convert to Avro (schema evolution)
python -m src.cli convert --format avro

# Convert to Feather (fast I/O for Pandas/Arrow)
python -m src.cli convert --format feather

# Custom directories
python -m src.cli convert --format parquet --data-dir raw_data --output-dir lake/

⚠️ Note: Run scan-schemas first to generate the schema report.


validate - Verify Data Quality

python -m src.cli validate [OPTIONS]

Options:

  • --data-dir - Directory to validate (default: data)
  • --schema-report - Schema for validation (default: reports/schema_report.json)

Example:

python -m src.cli validate --data-dir production_data

benchmark - Performance Testing

python -m src.cli benchmark [OPTIONS]

Options:

  • --type - Benchmark type: schema, conversion, or all (default: all)
  • --formats - Formats to test (default: parquet,csv,avro,orc,feather)
  • --result-dir - Results directory (default: result)

Example:

python -m src.cli benchmark --type all --result-dir benchmarks/

💼 Use Cases

🏢 Data Engineering

Challenge: Inconsistent JSON from multiple APIs
Solution: Unified schema discovery and conversion
Result: 80% faster pipeline development

🔬 Research Data

Challenge: Diverse datasets from experiments and surveys
Solution: One-command conversion to analysis-ready formats
Result: More time analyzing, less time wrangling

🌐 Open Data

Challenge: Complex formats from Socrata/CKAN portals
Solution: Automatic column extraction and transformation
Result: Easy access to government datasets

πŸ—„οΈ Data Lakes

Challenge: Efficient storage for massive JSON collections
Solution: Convert to optimized columnar formats
Result: Better compression, faster queries, lower costs


📖 Documentation

Supported JSON Formats

SchemaForge automatically detects and handles:

  1. Standard JSON Array - [{...}, {...}]
  2. NDJSON - Newline-delimited records
  3. Wrapper Objects - {"data": [...], "meta": {...}}
  4. Socrata/OpenData - Array-based tabular format
  5. GeoJSON - Geographic feature collections
  6. Single Objects - Individual JSON records
  7. Python Literals - Python-style dict/list records
  8. Embedded JSON - JSON strings within fields
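
For illustration, the first three shapes look like this (sample data invented):

Standard JSON array:
  [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

NDJSON (one record per line):
  {"id": 1, "name": "a"}
  {"id": 2, "name": "b"}

Wrapper object:
  {"data": [{"id": 1, "name": "a"}], "meta": {"count": 1}}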

Schema Inference

Detected Types:

  • Basic: string, integer, float, boolean
  • Advanced: timestamp, url, email, uuid, ip_address
  • Structured: array<T>, object, json_string
  • Special: numeric_string, enum

Features:

  • ✅ Nested structure flattening (user.address.city)
  • ✅ Nullable field detection
  • ✅ Mixed type recognition
  • ✅ Statistical profiling (min/max, length, distributions)
  • ✅ Enum detection for categorical fields
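
As a rough illustration (made-up record; labels drawn from the type list above), a single sample record might be profiled like this:

record = {
    "id":     "550e8400-e29b-41d4-a716-446655440000",  # -> uuid
    "email":  "ada@example.com",                       # -> email
    "joined": "2024-01-15T09:30:00Z",                  # -> timestamp
    "score":  "42",                                    # -> numeric_string
    "tags":   ["a", "b"],                              # -> array<string>
    "active": True,                                    # -> boolean
}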

Complete Workflow Example

# 1. Scan and analyze
python -m src.cli scan-schemas --data-dir raw_data

# 2. Validate quality
python -m src.cli validate --data-dir raw_data

# 3. Convert for analytics
python -m src.cli convert --format parquet --output-dir processed/

# 4. Benchmark performance
python -m src.cli benchmark --type all

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test
pytest tests/test_schema_reader.py

Test Coverage:

  • ✅ All 11+ JSON formats
  • ✅ Type inference for all data types
  • ✅ Format conversion (Parquet/CSV/Avro/ORC/Feather)
  • ✅ Error handling and edge cases

🎯 Performance Tips

  1. Use Sampling for Large Files

    python -m src.cli scan-schemas --max-sample-size 10000 --sampling-strategy random
  2. Choose the Right Format

    • Parquet → Big data analytics (best compression)
    • Avro → Schema evolution & streaming
    • ORC → Hadoop/Hive ecosystems
    • Feather → Fast I/O for Pandas/Arrow workflows
    • CSV → Universal compatibility
  3. Monitor Performance

    python -m src.cli benchmark --type all

See BENCHMARK_OPTIMIZATION.md for a detailed optimization guide.


πŸ—οΈ Project Structure

SchemaForge/
├── data/              # Input JSON files
├── output/            # Converted files
├── reports/           # Schema reports (.md + .json)
├── result/            # Benchmark results
├── src/
│   ├── schema_reader/      # Schema inference engine
│   │   ├── core.py         # Main SchemaReader logic
│   │   ├── inference.py    # Type detection & analysis
│   │   ├── reporting.py    # Report generation
│   │   └── types.py        # Data models
│   ├── converter/          # Format conversion
│   │   ├── core.py         # Main Converter logic
│   │   ├── parquet.py      # Parquet support
│   │   ├── feather.py      # Feather support
│   │   ├── avro.py         # Avro support
│   │   ├── orc.py          # ORC support
│   │   └── csv.py          # CSV support
│   ├── benchmark/          # Performance testing
│   │   ├── core.py         # Benchmark suite
│   │   ├── schema.py       # Schema benchmarks
│   │   └── conversion.py   # Conversion benchmarks
│   ├── json_loader.py      # JSON format detection
│   ├── validator.py        # Schema validation
│   └── cli.py              # Command-line interface
└── tests/             # Test suite

🤝 Contributing

We welcome contributions! Here's how:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Make your changes with tests
  4. Run tests (pytest tests/ -v)
  5. Submit a Pull Request

Ideas for Contributions:

  • Schema versioning tools
  • Streaming processing for huge files
  • GUI/Web interface
  • Database export support
  • Additional output formats

πŸ“ License

MIT License - see LICENSE for details.


πŸ™ Acknowledgments

Built for data engineers, researchers, and developers who are tired of manual schema definitions.

Powered by: pandas • pyarrow • fastavro • ijson • pytest


📞 Support

Before opening an issue:

  1. Check existing issues
  2. Try python -m src.cli [command] --help
  3. Run pytest tests/ -v to verify installation

SchemaForge - Transform Data Chaos into Analytics Gold 🔨

⭐ Star us on GitHub if SchemaForge saved you time! ⭐

⬆ Back to Top
