Diffstat (limited to 'cnn_v3/docs')
 cnn_v3/docs/CNN_V3.md (-rw-r--r--) | 66
 cnn_v3/docs/HOWTO.md  (-rw-r--r--) | 50
 2 files changed, 48 insertions(+), 68 deletions(-)
diff --git a/cnn_v3/docs/CNN_V3.md b/cnn_v3/docs/CNN_V3.md
index 9d64fe3..3f8f7db 100644
--- a/cnn_v3/docs/CNN_V3.md
+++ b/cnn_v3/docs/CNN_V3.md
@@ -19,9 +19,7 @@ CNN v3 is a next-generation post-processing effect using:
- Training from both Blender renders and real photos
- Strict test framework: per-pixel bit-exact validation across all implementations
-**Status:** Design phase. G-buffer implementation is prerequisite.
-
-**Prerequisites:** G-buffer (GEOM_BUFFER.md) must be implemented first.
+**Status:** Phases 1–5 complete. Parity validated (max_err=4.88e-4 ≤ 1/255). Next: `train_cnn_v3.py` for FiLM MLP training.
---
@@ -40,17 +38,17 @@ G-Buffer (albedo, normal, depth, matID, UV)
U-Net
┌─────────────────────────────────────────┐
│ Encoder │
- │ enc0 (H×W, 8ch) ────────────skip──────┤
+ │ enc0 (H×W, 4ch) ────────────skip──────┤
│ ↓ down (avg pool 2×2) │
- │ enc1 (H/2×W/2, 16ch) ───────skip──────┤
+ │ enc1 (H/2×W/2, 8ch) ────────skip──────┤
│ ↓ down │
- │ bottleneck (H/4×W/4, 16ch) │
+ │ bottleneck (H/4×W/4, 8ch) │
│ │
│ Decoder │
- │ ↑ up (bilinear 2×) + skip enc1 │
- │ dec1 (H/2×W/2, 16ch) │
+ │ ↑ up (nearest ×2) + skip enc1 │
+ │ dec1 (H/2×W/2, 4ch) │
│ ↑ up + skip enc0 │
- │ dec0 (H×W, 8ch) │
+ │ dec0 (H×W, 4ch) │
└─────────────────────────────────────────┘
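The updated topology above (4/8-channel widths, avg-pool downsampling, nearest-neighbor upsampling, concatenated skips) can be sketched in PyTorch. This is illustrative only: the class name `UNetV3` is made up here, and the 20-channel input is taken from the feature-buffer layout in the parameter table, not from the project's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetV3(nn.Module):
    """Illustrative sketch of the v3 U-Net diagram (not the shipped code)."""
    def __init__(self, in_ch=20):
        super().__init__()
        self.enc0 = nn.Conv2d(in_ch, 4, 3, padding=1)
        self.enc1 = nn.Conv2d(4, 8, 3, padding=1)
        self.bottleneck = nn.Conv2d(8, 8, 1)           # 1x1: channel mixing only
        self.dec1 = nn.Conv2d(8 + 8, 4, 3, padding=1)  # bottleneck + enc1 skip
        self.dec0 = nn.Conv2d(4 + 4, 4, 3, padding=1)  # dec1 + enc0 skip

    def forward(self, x):
        e0 = F.relu(self.enc0(x))                          # H x W, 4ch
        e1 = F.relu(self.enc1(F.avg_pool2d(e0, 2)))        # H/2 x W/2, 8ch
        b = F.relu(self.bottleneck(F.avg_pool2d(e1, 2)))   # H/4 x W/4, 8ch
        u1 = F.interpolate(b, scale_factor=2, mode="nearest")
        d1 = F.relu(self.dec1(torch.cat([u1, e1], dim=1)))  # H/2 x W/2, 4ch
        u0 = F.interpolate(d1, scale_factor=2, mode="nearest")
        return self.dec0(torch.cat([u0, e0], dim=1))        # H x W, 4ch

y = UNetV3()(torch.zeros(1, 20, 16, 16))
assert y.shape == (1, 4, 16, 16)  # resolution preserved, 4 output channels
```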
@@ -80,14 +78,14 @@ A small MLP takes a conditioning vector `c` and outputs all γ/β:
c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1] (5D)
↓ Linear(5 → 16) → ReLU
↓ Linear(16 → N_film_params)
- → [γ_enc0(8ch), β_enc0(8ch), γ_enc1(16ch), β_enc1(16ch),
- γ_dec1(16ch), β_dec1(16ch), γ_dec0(8ch), β_dec0(8ch)]
- = 2 × (8+16+16+8) = 96 parameters output
+ → [γ_enc0(4ch), β_enc0(4ch), γ_enc1(8ch), β_enc1(8ch),
+ γ_dec1(4ch), β_dec1(4ch), γ_dec0(4ch), β_dec0(4ch)]
+ = 2 × (4+8+4+4) = 40 parameters output
```
**Runtime cost:** trivial (one MLP forward pass per frame, CPU-side).
**Training:** jointly trained with U-Net — backprop through FiLM to MLP.
-**Size:** MLP weights ~(5×16 + 16×96) × 2 bytes f16 ≈ 3 KB.
+**Size:** MLP weights ~(5×16 + 16×40) × 2 bytes f16 ≈ 1.4 KB.
**Why FiLM instead of just uniform parameters?**
- γ/β are per-channel, enabling fine-grained style control
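The FiLM path above is small enough to sketch in a few lines of numpy. Weight values here are random stand-ins for the trained MLP, and the helper name `film_params` is hypothetical; shapes and the [γ, β] slicing order follow the spec above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 5)), np.zeros(16)   # Linear(5 -> 16)
W2, b2 = rng.normal(size=(40, 16)), np.zeros(40)  # Linear(16 -> 40)

def film_params(c):
    """c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1]."""
    h = np.maximum(W1 @ c + b1, 0.0)   # ReLU
    out = W2 @ h + b2                  # 40 gamma/beta values
    # Per-layer split: 2 * (4 + 8 + 4 + 4) = 40
    params, i = {}, 0
    for name, n in zip(["enc0", "enc1", "dec1", "dec0"], [4, 8, 4, 4]):
        params[name] = (out[i:i + n], out[i + n:i + 2 * n])  # (gamma, beta)
        i += 2 * n
    return params

p = film_params(np.array([0.25, 0.5, 0.8, 0.0, 1.0]))
assert sum(g.size + b.size for g, b in p.values()) == 40
```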
@@ -346,38 +344,20 @@ All f16, little-endian, same packing as v2 (`pack2x16float`).
**CNN v3 target: ≤ 6 KB weights**
-| Component | Params | f16 bytes |
-|-----------|--------|-----------|
-| enc0: Conv(20→8, 3×3) | 20×8×9=1440 | 2880 |
-| enc1: Conv(8→16, 3×3) | 8×16×9=1152 | 2304 |
-| bottleneck: Conv(16→16, 3×3) | 16×16×9=2304 | 4608 |
-| dec1: Conv(32→8, 3×3) | 32×8×9=2304 | 4608 |
-| dec0: Conv(16→8, 3×3) | 16×8×9=1152 | 2304 |
-| output: Conv(8→4, 1×1) | 8×4=32 | 64 |
-| FiLM MLP (~96 outputs) | ~1600 | 3200 |
-| **Total** | | **~20 KB** |
-
-This exceeds target. **Mitigation strategies:**
-
-1. **Reduce channels:** [4, 8] instead of [8, 16] → cuts conv params by ~4×
-2. **1 level only:** remove H/4 level → drops bottleneck + one dec level
-3. **1×1 conv at bottleneck** (no spatial, just channel mixing)
-4. **FiLM only at bottleneck** → smaller MLP output
+**Implemented architecture (fits ≤ 4 KB):**
-**Conservative plan (fits ≤ 6 KB):**
-```
-enc0: Conv(20→4, 3×3) = 20×4×9 = 720 weights
-enc1: Conv(4→8, 3×3) = 4×8×9 = 288 weights
-bottleneck: Conv(8→8, 1×1) = 8×8×1 = 64 weights
-dec1: Conv(16→4, 3×3) = 16×4×9 = 576 weights
-dec0: Conv(12→4, 3×3) = 12×4×9 = 432 weights
-output: Conv(4→4, 1×1) = 4×4 = 16 weights
-FiLM MLP (5→24 outputs) = 5×16+16×24 = 464 weights
-Total: ~2560 weights × 2B = ~5.0 KB f16 ✓
-```
+| Component | Weights | Biases | Params (f16) |
+|-----------|---------|--------|--------------|
+| enc0: Conv(20→4, 3×3) | 20×4×9=720 | +4 | 724 |
+| enc1: Conv(4→8, 3×3) | 4×8×9=288 | +8 | 296 |
+| bottleneck: Conv(8→8, 1×1) | 8×8×1=64 | +8 | 72 |
+| dec1: Conv(16→4, 3×3) | 16×4×9=576 | +4 | 580 |
+| dec0: Conv(8→4, 3×3) | 8×4×9=288 | +4 | 292 |
+| FiLM MLP (5→16→40) | 5×16+16×40=720 | +16+40 | 776 |
+| **Total** | | | conv layers 1964 ≈ **3.9 KB f16** (+ FiLM MLP 776 ≈ 1.5 KB, CPU-side) |
-Note: enc0 input is 20ch (feature buffer), dec1 input is 16ch (8 bottleneck + 8 skip),
-dec0 input is 12ch (4 dec1 output + 8 enc0 skip). Skip connections concatenate.
+Skip connections: dec1 input = 8ch (bottleneck) + 8ch (enc1 skip) = 16ch.
+dec0 input = 4ch (dec1) + 4ch (enc0 skip) = 8ch.
---
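The table's arithmetic can be checked in plain Python: the conv layers alone account for the ~3.9 KB f16 figure, with the FiLM MLP (~1.5 KB, evaluated CPU-side per the doc) on top.

```python
# Parameter counts from the implemented-architecture table (weights + biases).
layers = {
    "enc0": 20 * 4 * 9 + 4,
    "enc1": 4 * 8 * 9 + 8,
    "bottleneck": 8 * 8 * 1 + 8,
    "dec1": 16 * 4 * 9 + 4,
    "dec0": 8 * 4 * 9 + 4,
}
film = 5 * 16 + 16 * 40 + 16 + 40   # FiLM MLP weights + biases

conv = sum(layers.values())
print(conv, conv * 2)               # 1964 conv params -> 3928 B (~3.9 KB f16)
print(film, (conv + film) * 2)      # 776 MLP params; 5480 B grand total
```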
diff --git a/cnn_v3/docs/HOWTO.md b/cnn_v3/docs/HOWTO.md
index 22266d3..425a33b 100644
--- a/cnn_v3/docs/HOWTO.md
+++ b/cnn_v3/docs/HOWTO.md
@@ -135,7 +135,7 @@ Mix freely; the dataloader treats all sample directories uniformly.
## 3. Training
-*(Network not yet implemented — this section will be filled as Phase 3+ lands.)*
+*(Script not yet written — see TODO.md. Architecture spec in `CNN_V3.md` §Training.)*
**Planned command:**
```bash
@@ -146,21 +146,15 @@ python3 cnn_v3/training/train_cnn_v3.py \
```
**FiLM conditioning** during training:
-- Beat/audio inputs are randomized per sample
-- Network learns to produce varied styles from same geometry
-
-**Validation:**
-```bash
-python3 cnn_v3/training/train_cnn_v3.py --validate \
- --checkpoint cnn_v3/weights/cnn_v3_weights.bin \
- --input test_frame.png
-```
+- Beat/audio inputs randomized per sample
+- MLP: `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net
+- Output: γ/β for enc0(4ch) + enc1(8ch) + dec1(4ch) + dec0(4ch) = 40 floats
---
-## 4. Running the CNN v3 Effect (Future)
+## 4. Running the CNN v3 Effect
-Once the C++ CNNv3Effect exists:
+`CNNv3Effect` is implemented. Wire it into a sequence:
```seq
# BPM 120
@@ -169,27 +163,32 @@ SEQUENCE 0 0 "Scene with CNN v3"
EFFECT + CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
```
-FiLM parameters are uploaded via uniform each frame:
+FiLM parameters uploaded each frame:
```cpp
cnn_v3_effect->set_film_params(
params.beat_phase, params.beat_time / 8.0f, params.audio_intensity,
style_p0, style_p1);
```
+FiLM γ/β default to identity (γ=1, β=0) until `train_cnn_v3.py` produces a trained MLP.
+
---
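The identity default noted above (γ=1, β=0 leaves activations untouched) is easy to see in a small numpy sketch; the `film` helper is illustrative, not the WGSL implementation.

```python
import numpy as np

def film(x, gamma, beta):
    """x: (C, H, W) activations; gamma/beta: (C,) per-channel FiLM params."""
    return gamma[:, None, None] * x + beta[:, None, None]

x = np.random.default_rng(1).normal(size=(4, 8, 8))
identity = film(x, np.ones(4), np.zeros(4))
assert np.allclose(identity, x)   # untrained default is a no-op
```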
## 5. Per-Pixel Validation
-The CNN v3 design requires exact parity between PyTorch, WGSL (HTML), and C++.
+C++ parity test passes: `src/tests/gpu/test_cnn_v3_parity.cc` (2 tests).
+
+```bash
+cmake -B build -DDEMO_BUILD_TESTS=ON && cmake --build build -j4
+cd build && ./test_cnn_v3_parity
+```
-*(Validation tooling not yet implemented.)*
+Results (8×8 test tensors, random weights):
+- enc0 max_err = 1.95e-3 ✓
+- dec1 max_err = 1.95e-3 ✓
+- final max_err = 4.88e-4 ✓ (all ≤ 1/255 = 3.92e-3)
-**Planned workflow:**
-1. Export test input + weights as JSON
-2. Run Python reference → save per-pixel output
-3. Run HTML WebGPU tool → compare against Python
-4. Run C++ `cnn_v3_test` tool → compare against Python
-5. All comparisons must pass at ≤ 1/255 per pixel
+Test vectors generated by `cnn_v3/training/gen_test_vectors.py` (PyTorch reference).
---
@@ -197,12 +196,13 @@ The CNN v3 design requires exact parity between PyTorch, WGSL (HTML), and C++.
| Phase | Status | Notes |
|-------|--------|-------|
-| 1 — G-buffer (raster + pack) | ✅ Done | Integrated, 35/35 tests pass |
-| 1 — G-buffer (SDF + shadow passes) | TODO | Placeholder in place |
+| 1 — G-buffer (raster + pack) | ✅ Done | Integrated, 36/36 tests pass |
+| 1 — G-buffer (SDF + shadow passes) | TODO | Placeholder: shadow=1, transp=0 |
| 2 — Training infrastructure | ✅ Done | blender_export.py, pack_*_sample.py |
| 3 — WGSL U-Net shaders | ✅ Done | 5 compute shaders + cnn_v3/common snippet |
-| 4 — C++ CNNv3Effect | ✅ Done | FiLM uniform upload, 35/35 tests pass |
-| 5 — Parity validation | TODO | Test vectors, ≤1/255 |
+| 4 — C++ CNNv3Effect | ✅ Done | FiLM uniform upload, 36/36 tests pass |
+| 5 — Parity validation | ✅ Done | test_cnn_v3_parity.cc, max_err=4.88e-4 |
+| 6 — FiLM MLP training | TODO | train_cnn_v3.py not yet written |
---