Transform JSON Chaos into Analytics-Ready Data
Quick Start • Features • Documentation • CLI Reference
SchemaForge automatically discovers JSON schemas and converts them to analytics-ready formats. Stop wasting hours on manual data wrangling: let SchemaForge handle type detection, schema inference, and format conversion in seconds.
| | ❌ Traditional Workflow | ✅ SchemaForge Workflow |
|---|---|---|
| Time | Hours | Minutes |
| Errors | Many | Zero |
- Advanced Type Detection - Strings, numbers, booleans, timestamps, URLs, emails, UUIDs, IPs, and more
- Smart Pattern Recognition - Automatically detects enums, embedded JSON, and numeric strings
- Statistical Analysis - Collects min/max, length stats, and value distributions
- Nested Structure Handling - Flattens complex JSON with intuitive dot notation
Input: 11+ JSON formats auto-detected
- Standard JSON Arrays • NDJSON • Wrapper Objects • GeoJSON • Socrata/OpenData • Single Objects • Python Literals • Embedded JSON
Output: 5 analytics-ready formats
- Parquet (recommended) • CSV • Avro • ORC • Feather
- Schema Validation - Verify data quality before processing
- Performance Benchmarking - Measure and optimize your pipelines
- Batch Processing - Convert multiple files in one command
- Sampling Support - Handle massive files efficiently
- Memory-Safe Processing - Automatic 80% memory limit protection (no OOM crashes)
- Dynamic Scaling - Auto-scales worker processes based on available RAM
- Smart Chunking - Automatically chunks large files to fit in memory
- Parallel Processing - Multi-core conversion for maximum throughput
SchemaForge is built for big data. It automatically manages system resources to ensure stability and speed.
- Memory Protection: Continuously monitors RAM usage, capping at 80% (default) to keep your system responsive.
- Adaptive Workers: Calculates the optimal number of parallel workers (1-8) based on your specific hardware config.
- Smart Chunking: Large files (>50MB) are automatically processed in memory-safe chunks.
💡 Zero Configuration Needed: These optimizations happen automatically. Just run `convert` and let SchemaForge handle the resources.
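For the curious, here is a minimal sketch (not SchemaForge's internal code) of how a large top-level JSON array can be streamed in memory-safe chunks with ijson; the file path and chunk size are hypothetical:

```python
# Illustrative only: stream a large JSON array in fixed-size chunks instead of
# loading the whole file into memory. Path and chunk size are hypothetical.
import ijson

def iter_chunks(path, chunk_size=10_000):
    """Yield lists of records from a top-level JSON array, one chunk at a time."""
    chunk = []
    with open(path, "rb") as f:
        for record in ijson.items(f, "item"):   # 'item' = each element of the root array
            chunk.append(record)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# Hypothetical usage: count records without exhausting RAM
total = sum(len(chunk) for chunk in iter_chunks("data/large_events.json"))
print(f"records: {total}")
```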
```bash
# Clone the repository
git clone https://github.com/Syntax-Error-1337/SchemaForge.git
cd SchemaForge

# Install dependencies
pip install -r requirements.txt
```

Requirements: Python 3.8+, pandas, pyarrow, fastavro, ijson
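If you want to double-check that the dependencies installed correctly, here is a quick optional sanity check from Python:

```python
# Optional: confirm the core dependencies are installed and show their versions.
from importlib.metadata import version

for pkg in ("pandas", "pyarrow", "fastavro", "ijson"):
    print(pkg, version(pkg))
```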
```bash
# 1️⃣ Place your JSON files in the data directory
cp your_data/*.json data/

# 2️⃣ Discover schemas
python -m src.cli scan-schemas

# 3️⃣ Convert to your preferred format
python -m src.cli convert --format parquet
```

That's it! Your data is now in `output/`, ready for analysis.
- ✅ All JSON structures automatically analyzed
- ✅ Types inferred with statistical confidence
- ✅ Nested objects flattened intelligently
- ✅ Schema reports generated (Markdown + JSON)
- ✅ Data converted to optimized format
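To spot-check the result, you can open a converted file directly with pandas (the filename below is a placeholder; outputs follow your input file names):

```python
# Spot-check a converted Parquet file; "events.parquet" stands in for your own output.
import pandas as pd

df = pd.read_parquet("output/events.parquet")
print(df.dtypes)   # inferred column types
print(df.head())   # first few rows
```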
| Command | Purpose | Example |
|---|---|---|
| `scan-schemas` | Analyze JSON structure | `python -m src.cli scan-schemas` |
| `convert` | Transform to analytics formats | `python -m src.cli convert --format parquet` |
| `validate` | Verify schema compliance | `python -m src.cli validate` |
| `benchmark` | Measure performance | `python -m src.cli benchmark` |
```bash
python -m src.cli scan-schemas [OPTIONS]
```

Options:

- `--data-dir` - Input directory (default: `data`)
- `--output-report` - Report path (default: `reports/schema_report.md`)
- `--max-sample-size` - Sample size for large files
- `--sampling-strategy` - `first` or `random` sampling
Examples:

```bash
# Basic usage
python -m src.cli scan-schemas

# Large files with random sampling
python -m src.cli scan-schemas --max-sample-size 10000 --sampling-strategy random

# Custom directory
python -m src.cli scan-schemas --data-dir my_json_data --output-report custom/schema.md
```

Output:

- `schema_report.md` - Human-readable documentation
- `schema_report.json` - Machine-readable schema
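The JSON report can also be consumed programmatically; its exact layout is defined by SchemaForge, so the snippet below only peeks at the top level rather than assuming a structure:

```python
# Load the machine-readable schema report and inspect its top level.
import json

with open("reports/schema_report.json") as f:
    report = json.load(f)

if isinstance(report, dict):
    print(list(report.keys()))
else:
    print(f"{len(report)} top-level entries")
```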
```bash
python -m src.cli convert --format [parquet|csv|avro|orc|feather] [OPTIONS]
```

Options:

- `--format` - Required: Output format
- `--data-dir` - Input directory (default: `data`)
- `--output-dir` - Output directory (default: `output`)
- `--schema-report` - Schema JSON path (default: `reports/schema_report.json`)
Examples:

```bash
# Convert to Parquet (recommended for big data)
python -m src.cli convert --format parquet

# Convert to CSV (universal compatibility)
python -m src.cli convert --format csv

# Convert to Avro (schema evolution)
python -m src.cli convert --format avro

# Convert to Feather (fast I/O for Pandas/Arrow)
python -m src.cli convert --format feather

# Custom directories
python -m src.cli convert --format parquet --data-dir raw_data --output-dir lake/
```
⚠️ Note: Run `scan-schemas` first to generate the schema report.
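After converting, you can read each format back to verify the round trip; the file names below are placeholders that follow your own input file names:

```python
# Read converted outputs back with common libraries (paths are placeholders).
import pandas as pd
from fastavro import reader

print(pd.read_csv("output/events.csv").head())
print(pd.read_feather("output/events.feather").head())
print(pd.read_orc("output/events.orc").head())   # ORC reading uses pyarrow under the hood

with open("output/events.avro", "rb") as f:
    print(next(reader(f), None))                 # first Avro record as a dict
```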
```bash
python -m src.cli validate [OPTIONS]
```

Options:

- `--data-dir` - Directory to validate (default: `data`)
- `--schema-report` - Schema for validation (default: `reports/schema_report.json`)
Example:

```bash
python -m src.cli validate --data-dir production_data
```

```bash
python -m src.cli benchmark [OPTIONS]
```

Options:

- `--type` - Benchmark type: `schema`, `conversion`, or `all` (default: `all`)
- `--formats` - Formats to test (default: `parquet,csv,avro,orc,feather`)
- `--result-dir` - Results directory (default: `result`)

Example:

```bash
python -m src.cli benchmark --type all --result-dir benchmarks/
```

Challenge: Inconsistent JSON from multiple APIs
Solution: Unified schema discovery and conversion
Result: 80% faster pipeline development
Challenge: Diverse datasets from experiments and surveys
Solution: One-command conversion to analysis-ready formats
Result: More time analyzing, less time wrangling
Challenge: Complex formats from Socrata/CKAN portals
Solution: Automatic column extraction and transformation
Result: Easy access to government datasets
Challenge: Efficient storage for massive JSON collections
Solution: Convert to optimized columnar formats
Result: Better compression, faster queries, lower costs
SchemaForge automatically detects and handles:
- Standard JSON Array - `[{...}, {...}]`
- NDJSON - Newline-delimited records
- Wrapper Objects - `{"data": [...], "meta": {...}}`
- Socrata/OpenData - Array-based tabular format
- GeoJSON - Geographic feature collections
- Single Objects - Individual JSON records
- Embedded JSON - JSON strings within fields
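For reference, the shapes these formats take look roughly like this (the values are made up for illustration):

```python
# Made-up examples of the detected input shapes (illustration only).
standard_array = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

ndjson = '{"id": 1}\n{"id": 2}\n'                       # one JSON record per line

wrapper_object = {"data": [{"id": 1}], "meta": {"count": 1}}

geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [10.75, 59.91]},
         "properties": {"name": "Oslo"}},
    ],
}

embedded_json = {"id": 1, "payload": "{\"nested\": true}"}  # JSON string inside a field
```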
Detected Types:
- Basic: `string`, `integer`, `float`, `boolean`
- Advanced: `timestamp`, `url`, `email`, `uuid`, `ip_address`
- Structured: `array<T>`, `object`, `json_string`
- Special: `numeric_string`, `enum`
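As a rough illustration of how string values can be classified into some of the types above (this is not SchemaForge's inference code, just a simplified sketch):

```python
# Simplified, illustrative classifier for string values (not the real inference.py).
import re
from uuid import UUID
from datetime import datetime

def classify(value: str) -> str:
    try:
        UUID(value)
        return "uuid"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(value)
        return "timestamp"
    except ValueError:
        pass
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"https?://\S+", value):
        return "url"
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return "numeric_string"
    return "string"

print(classify("2024-01-31T12:00:00"))   # timestamp
print(classify("alice@example.com"))     # email
print(classify("42"))                    # numeric_string
```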
Features:
- ✅ Nested structure flattening (`user.address.city`)
- ✅ Nullable field detection
- ✅ Mixed type recognition
- ✅ Statistical profiling (min/max, length, distributions)
- ✅ Enum detection for categorical fields
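To see what dot-notation flattening produces, here is one way to mimic it with `pandas.json_normalize` (SchemaForge's own flattening lives in `src/schema_reader`; this only illustrates the output shape):

```python
# Illustrate the dot-notation column names produced by flattening a nested record.
import pandas as pd

records = [{"user": {"address": {"city": "Oslo"}, "id": 7}, "active": True}]
flat = pd.json_normalize(records, sep=".")
print(list(flat.columns))   # e.g. ['active', 'user.address.city', 'user.id']
```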
```bash
# 1. Scan and analyze
python -m src.cli scan-schemas --data-dir raw_data

# 2. Validate quality
python -m src.cli validate --data-dir raw_data

# 3. Convert for analytics
python -m src.cli convert --format parquet --output-dir processed/

# 4. Benchmark performance
python -m src.cli benchmark --type all
```

```bash
# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test
pytest tests/test_schema_reader.py
```

Test Coverage:
- ✅ All 11+ JSON formats
- ✅ Type inference for all data types
- ✅ Format conversion (Parquet/CSV/Avro/ORC/Feather)
- ✅ Error handling and edge cases
- Use Sampling for Large Files

  ```bash
  python -m src.cli scan-schemas --max-sample-size 10000 --sampling-strategy random
  ```

- Choose the Right Format
  - Parquet → Big data analytics (best compression)
  - Avro → Schema evolution & streaming
  - ORC → Hadoop/Hive ecosystems
  - Feather → Fast I/O for Pandas/Arrow workflows
  - CSV → Universal compatibility

- Monitor Performance

  ```bash
  python -m src.cli benchmark --type all
  ```
See BENCHMARK_OPTIMIZATION.md for a detailed optimization guide.
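A quick, hypothetical way to compare the on-disk footprint of the converted outputs when choosing a format (file names are placeholders):

```python
# Compare sizes of converted outputs in the output/ directory (names are placeholders).
from pathlib import Path

for path in sorted(Path("output").glob("events.*")):
    print(f"{path.name:20s} {path.stat().st_size / 1024:8.1f} KiB")
```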
```
SchemaForge/
├── data/                  # Input JSON files
├── output/                # Converted files
├── reports/               # Schema reports (.md + .json)
├── result/                # Benchmark results
├── src/
│   ├── schema_reader/     # Schema inference engine
│   │   ├── core.py        # Main SchemaReader logic
│   │   ├── inference.py   # Type detection & analysis
│   │   ├── reporting.py   # Report generation
│   │   └── types.py       # Data models
│   ├── converter/         # Format conversion
│   │   ├── core.py        # Main Converter logic
│   │   ├── parquet.py     # Parquet support
│   │   ├── feather.py     # Feather support
│   │   ├── avro.py        # Avro support
│   │   ├── orc.py         # ORC support
│   │   └── csv.py         # CSV support
│   ├── benchmark/         # Performance testing
│   │   ├── core.py        # Benchmark suite
│   │   ├── schema.py      # Schema benchmarks
│   │   └── conversion.py  # Conversion benchmarks
│   ├── json_loader.py     # JSON format detection
│   ├── validator.py       # Schema validation
│   └── cli.py             # Command-line interface
└── tests/                 # Test suite
```
We welcome contributions! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Make your changes with tests
- Run tests (`pytest tests/ -v`)
- Submit a Pull Request
Ideas for Contributions:
- Schema versioning tools
- Streaming processing for huge files
- GUI/Web interface
- Database export support
- Additional output formats
MIT License - see LICENSE for details.
Built for data engineers, researchers, and developers who are tired of manual schema definitions.
Powered by: pandas • pyarrow • fastavro • ijson • pytest
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README
Before opening an issue:
- Check existing issues
- Try `python -m src.cli [command] --help`
- Run `pytest tests/ -v` to verify installation
SchemaForge - Transform Data Chaos into Analytics Gold
⭐ Star us on GitHub if SchemaForge saved you time! ⭐