A tiny autograd engine and neural network library built from scratch for learning purposes.
Inspired by Andrej Karpathy's micrograd. Features are added incrementally to understand how autograd and neural networks work under the hood. The code includes detailed comments explaining lessons learned and challenges encountered along the way.
An educational project to deeply understand:
- Automatic differentiation (autograd) — how neural networks compute gradients
- Backpropagation and the chain rule — the mathematical foundation of training
- How neural network libraries like PyTorch work under the hood — low-level implementation details
Phases 1-3 complete. The core autograd engine, neural network primitives, and tensor support are fully implemented and working. The project includes:
- A scalar autograd engine (`Value` class) with full backpropagation
- Neural network building blocks (`Neuron`, `Layer`, `MLP`)
- Custom 2D tensor implementation with matrix multiplication
- Both scalar-based (`MLP`) and matrix-based (`TensorMLP`) neural networks
- Comprehensive test suite
Phase 1: Scalar autograd ✅
- `Value` class that wraps numbers and tracks computation history
- Basic operations: `+`, `-`, `*`, `**`, `neg`
- Activation functions: `tanh()`, `relu()`, `exp()`, `log()`
- Backward pass with topological sort
- Gradient accumulation for multi-path computation graphs
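To make the backward pass concrete, here is a minimal sketch of how a scalar autograd node can track its inputs and propagate gradients in reverse topological order. This is an illustrative toy, not microtensor's actual `Value` class (which also supports `**`, activations, and more):

```python
class Value:
    """Toy scalar autograd node (illustration only, not the library's Value)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(out)/d(self) = 1
            other.grad += out.grad  # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # product rule
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then propagate gradients in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a * a  # a is used on two paths, so its gradients accumulate
c.backward()
print(a.grad, b.grad)  # 7.0 2.0  (dc/da = b + 2a, dc/db = a)
```

Note how `+=` in each `_backward` closure is what makes gradient accumulation across multiple paths work.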
Phase 2: Neural network primitives ✅
- Activation functions (ReLU, tanh) via enum
- Softmax for probability distributions
- `Neuron`, `Layer`, `MLP` classes
- Training loop with gradient descent
- Trained on toy datasets
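The softmax step, for instance, can be sketched as follows. This is a hedged illustration of the idea on plain floats with a hypothetical function name, not the library's actual `softmax` signature:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)  # probabilities sum to 1; the largest logit gets the largest share
```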
Phase 3: Tensor support ✅
- Custom `Tensor` class built from scratch (no NumPy dependency)
- Matrix multiplication (`matmul`) and element-wise addition
- Integrated with autograd via `Value` objects
- `TensorMLP` implementation with full training loop
- Benchmark suite comparing ValueMLP, TensorMLP, and PyTorch
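A naive pure-Python matrix multiply of the kind `Tensor.matmul` performs can be sketched like this (an assumption about the approach on plain lists, not the actual `Tensor` code):

```python
def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m), triple nested loop over lists of lists
    n, k, m = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

Because the inner loop runs in the Python interpreter, this is exactly the hot path that the benchmarks below show losing to PyTorch's vectorized kernels.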
Phase 4: C++ backend 🚧 (in progress)
- Rewrite tensor operations in C++ (target: close the 1,000x+ gap with PyTorch)
- Python bindings with pybind11
- Learn low-level memory layout and optimization
- See `csrc/README.md` for build instructions
Phase 5: Real ML applications
- Implement common models (CNNs, RNNs, transformers)
- Train on real datasets (MNIST, CIFAR, etc.)
- Model serialization (save/load weights)
- Pretrained weights and fine-tuning
Phase 6: Training visualization
- Real-time loss/accuracy curves
- Weight and gradient distribution evolution
- Decision boundary animations (2D classification)
- Network activation heatmaps
- Interactive web frontend for training monitoring
```shell
git clone https://github.com/yourusername/microtensor.git
cd microtensor
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .  # installs microtensor + builds C++ extension
```

```text
microtensor/
├── microtensor/
│   ├── __init__.py            # Package exports
│   ├── engine.py              # Core autograd engine (Value class)
│   ├── tensor.py              # Custom 2D tensor implementation
│   ├── activations.py         # Activation enum + softmax function
│   └── neural_networks/
│       ├── value_nn.py        # Scalar-based NN (Neuron, Layer, MLP)
│       └── tensor_nn.py       # Matrix-based NN (TensorMLP)
├── csrc/
│   ├── matmul.cpp             # C++ tensor operations (pybind11)
│   └── README.md              # C++ build instructions
├── examples/
│   ├── train_value_mlp.py     # ValueMLP training demo
│   └── train_tensor_mlp.py    # TensorMLP training demo
├── tests/
│   ├── test_value_nn.py       # ValueMLP tests
│   ├── test_tensor.py         # Tensor operation tests
│   └── test_tensor_nn.py      # TensorMLP tests
├── benchmarks/
│   └── benchmark.py           # Performance benchmarking suite
├── setup.py                   # Package + C++ extension build config
└── requirements.txt           # Python dependencies
```
```python
from microtensor.engine import Value

a = Value(2.0)
b = Value(3.0)
c = a * b + a ** 2
c.backward()
print(a._grad)  # dc/da = b + 2a = 7.0
print(b._grad)  # dc/db = a = 2.0
```

```python
from microtensor.engine import Value
from microtensor.neural_networks.value_nn import Neuron, Layer, MLP
from microtensor.activations import Activation

# Single neuron with 3 inputs
neuron = Neuron(3, activation=Activation.TANH)
inputs = [Value(1.0), Value(2.0), Value(3.0)]
output = neuron(inputs)

# Layer with 3 inputs, 2 outputs
layer = Layer(3, 2, activation=Activation.RELU)
outputs = layer(inputs)  # returns list of 2 Values

# MLP: 3 inputs → 4 hidden → 1 output
mlp = MLP([3, 4, 1])
prediction = mlp(inputs)
```

```python
from microtensor.tensor import Tensor
from microtensor.engine import Value

# Create 2D tensors
a = Tensor([[1, 2], [3, 4]])
b = Tensor([[5, 6], [7, 8]])

# Matrix multiplication
c = a.matmul(b)  # [[19, 22], [43, 50]]

# Element-wise addition
d = a + b  # [[6, 8], [10, 12]]

# Tensors with Value objects (for autograd)
a = Tensor([[Value(1), Value(2)], [Value(3), Value(4)]])
```

```python
from microtensor.neural_networks.tensor_nn import TensorMLP
from microtensor.engine import Value

# Create model: 4 inputs → 5 hidden → 1 output
model = TensorMLP([4, 5, 1])

# Training data
xs = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
ys = [-1.0, 1.0]

# Training loop
for epoch in range(1000):
    # Forward pass
    predictions = []
    for x in xs:
        x_values = [Value(val) for val in x]
        output = model(x_values)
        predictions.append(output.data[0][0])

    # Compute loss
    loss = sum((pred - Value(y)) ** 2 for pred, y in zip(predictions, ys))

    # Backward pass and update
    model.zero_grad()
    loss.backward()
    model.step(lr=0.01)
```

```shell
# Train ValueMLP (scalar-based)
python examples/train_value_mlp.py --epochs 10000 --lr 0.01

# Train TensorMLP (matrix-based)
python examples/train_tensor_mlp.py --epochs 10000 --lr 0.01
```

All tensor operations are implemented in pure Python from scratch (no NumPy). Run the benchmark suite to measure performance:
```shell
python benchmarks/benchmark.py               # batch=1 (default)
python benchmarks/benchmark.py --batch 50    # batch=50 (realistic)
python benchmarks/benchmark.py --no-pytorch  # skip PyTorch comparison
python benchmarks/benchmark.py --runs 20     # more runs for stability
```

Current results (full training step, batch=50, vs PyTorch):
| Architecture | Params | ValueMLP | TensorMLP | PyTorch | Gap |
|---|---|---|---|---|---|
| 4→8→1 | 49 | 10.6ms | 14.9ms | 168μs | 89x |
| 4→16→8→1 | 225 | 148ms | 142ms | 109μs | 1,364x |
| 8→32→16→1 | 833 | 711ms | 817ms | 118μs | 6,950x |
| 8→64→32→8→1 | 2929 | 3.14s | 3.34s | 329μs | 10,170x |
Key observations:
- Pure Python is ~4,000x slower than PyTorch on average with batch=50
- The gap grows with batch size: PyTorch vectorizes, while microtensor loops in Python
- `TensorMLP` ≈ `ValueMLP` in speed: the tensor abstraction adds overhead without C-level optimization
- Largest model: 3 seconds vs 329 microseconds

Phase 4 target: close the 1,000x+ gap by implementing tensor ops in C++.
```shell
pip install -e .
pytest
```

Tests cover:
- Value autograd: Forward pass, backward pass, gradient computation
- Tensor operations: Shape inference, matrix multiplication, edge cases
- ValueMLP: Output shapes, parameter counts, backpropagation
- TensorMLP: Forward pass, gradients, zero_grad, step updates
The best way to understand something is to build it. Reading about backprop is one thing — implementing it yourself makes it stick.