DataForge

DataForge is a high-performance CLI tool that automates the preparation and management of machine learning datasets. It helps you turn raw, messy data (such as videos and unsorted images) into clean, balanced, ready-to-train datasets with minimal effort.

📖 [Read the full documentation here](https://seregacodit.github.io/DataForge/)

Key Features

Parallel Processing: uses multiprocessing to handle thousands of files quickly.
Vectorized Calculations: uses NumPy for fast image comparison.
Smart Caching: incremental, MD5-based caching lets you work with large datasets on a NAS without recalculating everything.
Config: built with Pydantic v2 for safe and flexible settings via JSON or CLI.
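
The caching idea can be illustrated with a minimal sketch. It does not reproduce DataForge's actual cache; it assumes the cache key is an MD5 digest of a file's path, size, and modification time, so unchanged files are never re-read over the network. The names `cache_key` and `cached_value` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def cache_key(path: Path) -> str:
    """Build an MD5 key from file metadata (assumption: path, size, mtime)."""
    stat = path.stat()
    raw = f"{path.resolve()}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def cached_value(
    path: Path,
    cache: Dict[str, str],
    compute: Callable[[Path], str],
) -> str:
    """Return the cached result for a file, computing it only on a cache miss."""
    key = cache_key(path)
    if key not in cache:
        # Only new or modified files reach the expensive computation.
        cache[key] = compute(path)
    return cache[key]
```

With this layout, a second run over the same folder only pays for files whose metadata changed since the previous run.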

Available Commands

move — move files from source to target directory based on patterns.
slice — convert video files into sequences of images. Use --remove to delete the source video after a successful slice.
delete — safely remove files matching specific patterns.
dedup — find and remove visual duplicates using dHash.
- Threshold: similarity limit, as a percentage (0-100%).
- Core Size: higher values (e.g., 32, 64) detect small changes (like a moving car); lower values (e.g., 8) ignore noise.
clean-annotations — automatically find and delete "orphan" annotation files (XML/TXT) that no longer have a corresponding image.
convert-annotations — convert dataset labels between formats (e.g., Pascal VOC to YOLO).
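
To make the dedup parameters more concrete, here is a rough sketch of difference hashing (dHash) and a percentage-based comparison. It is not DataForge's implementation: it uses plain Python lists instead of NumPy to stay dependency-free, it expects a grayscale grid already resized to N rows by N + 1 columns (N playing the role of the core size), and it assumes the threshold means the minimum share of matching bits. All function names are hypothetical.

```python
from typing import List, Sequence


def dhash_bits(gray: Sequence[Sequence[int]]) -> List[int]:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = []
    for row in gray:
        for left, right in zip(row, row[1:]):
            # A bit is set when brightness drops from left to right.
            bits.append(1 if left > right else 0)
    return bits


def similarity_percent(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of matching bits between two equal-length hashes, in percent."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    differing = sum(x != y for x, y in zip(a, b))  # Hamming distance
    return 100.0 * (1 - differing / len(a))


def is_duplicate(a: Sequence[int], b: Sequence[int], threshold: float = 90.0) -> bool:
    """Assumption: two images are duplicates at or above the threshold."""
    return similarity_percent(a, b) >= threshold
```

A larger grid produces more bits, so small local changes flip a measurable share of them; a small grid averages such changes away, which matches the noise-tolerance described for low core sizes.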

Automation & Intervals

By default, commands run once. If you want to monitor a folder and process files as they appear, use the repeat flag:

Use -r to run the command in a cycle.
Set the delay between cycles with -s (seconds).
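
Conceptually, the repeat mode behaves like a loop that runs a task and then waits. The sketch below only illustrates that pattern and is not taken from the DataForge source; `run_repeatedly`, `max_cycles`, and the injectable `sleep` parameter are hypothetical names added so the loop can be bounded and tested.

```python
import time
from typing import Callable, Optional


def run_repeatedly(
    task: Callable[[], None],
    delay_seconds: float,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run task in a cycle, pausing delay_seconds between runs.

    max_cycles=None loops forever, mirroring a folder-monitoring mode.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        task()
        cycles += 1
        # Skip the pause after the final cycle of a bounded run.
        if max_cycles is None or cycles < max_cycles:
            sleep(delay_seconds)
    return cycles
```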

Quick Start

Clone the repository:

git clone https://github.com/SeregaCodit/DataForge.git
cd DataForge

Setup environment:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Check usage:

python data_forge.py --help             # See all commands
python data_forge.py {command} --help   # See arguments for a specific command

Workflow

For multiple tasks, you can modify start_all_tasks.sh and run them all in the background:

bash start_all_tasks.sh

To stop all running DataForge processes:

pkill -f data_forge.py

Configuration

You can manage all default settings in config.json. DataForge follows this priority: CLI Arguments > config.json > Internal Defaults.
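
The priority order can be sketched as a layered merge. This is an illustration only: it uses plain dictionaries rather than the project's Pydantic models, and the setting names (`threshold`, `core_size`, `sleep`) and their values are hypothetical placeholders, not DataForge's real keys or defaults.

```python
from typing import Any, Dict


def resolve_settings(
    cli_args: Dict[str, Any],
    config_file: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge settings so that CLI arguments > config file > internal defaults.

    A value of None means "not provided" and never overrides a lower layer.
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in config_file.items() if v is not None})
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


# Hypothetical values, for illustration only.
defaults = {"threshold": 85, "core_size": 16, "sleep": 60}
config_file = {"threshold": 90, "core_size": 32}
cli_args = {"threshold": 95, "sleep": None}

settings = resolve_settings(cli_args, config_file, defaults)
```

Here `threshold` comes from the CLI, `core_size` from the config file, and `sleep` falls back to the internal default because the CLI left it unset.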

