| author | skal <pascal.massimino@gmail.com> | 2026-03-21 09:54:16 +0100 |
| committer | skal <pascal.massimino@gmail.com> | 2026-03-21 09:54:16 +0100 |
| commit | 5e740fc8f5f48fdd8ec4b84ae0c9a3c74e387d4f |
| tree | c330c8402e771d4b02316331d734802337d413c4 |
| parent | 673a24215b2670007317060325256059d1448f3b |
docs(cnn_v3): update CNN_V3.md + HOWTO.md to reflect Phases 1-5 complete
- CNN_V3.md: status line, architecture channel counts (8/16→4/8), FiLM MLP
output count (96→40 params), size budget table (real implemented values)
- HOWTO.md: Phase status table (5→done, add phase 6 training TODO), sections
3-5 rewritten to reflect what exists vs what is still planned
| -rw-r--r-- | cnn_v3/docs/CNN_V3.md | 66 |
| -rw-r--r-- | cnn_v3/docs/HOWTO.md | 50 |

2 files changed, 48 insertions, 68 deletions
````diff
diff --git a/cnn_v3/docs/CNN_V3.md b/cnn_v3/docs/CNN_V3.md
index 9d64fe3..3f8f7db 100644
--- a/cnn_v3/docs/CNN_V3.md
+++ b/cnn_v3/docs/CNN_V3.md
@@ -19,9 +19,7 @@ CNN v3 is a next-generation post-processing effect using:
 - Training from both Blender renders and real photos
 - Strict test framework: per-pixel bit-exact validation across all implementations
 
-**Status:** Design phase. G-buffer implementation is prerequisite.
-
-**Prerequisites:** G-buffer (GEOM_BUFFER.md) must be implemented first.
+**Status:** Phases 1–5 complete. Parity validated (max_err=4.88e-4 ≤ 1/255). Next: `train_cnn_v3.py` for FiLM MLP training.
 
 ---
 
@@ -40,17 +38,17 @@ G-Buffer (albedo, normal, depth, matID, UV)
 U-Net
 ┌─────────────────────────────────────────┐
 │  Encoder                                │
-│  enc0 (H×W, 8ch) ────────────skip──────┤
+│  enc0 (H×W, 4ch) ────────────skip──────┤
 │   ↓ down (avg pool 2×2)                 │
-│  enc1 (H/2×W/2, 16ch) ───────skip──────┤
+│  enc1 (H/2×W/2, 8ch) ────────skip──────┤
 │   ↓ down                                │
-│  bottleneck (H/4×W/4, 16ch)             │
+│  bottleneck (H/4×W/4, 8ch)              │
 │                                         │
 │  Decoder                                │
-│   ↑ up (bilinear 2×) + skip enc1        │
-│  dec1 (H/2×W/2, 16ch)                   │
+│   ↑ up (nearest ×2) + skip enc1         │
+│  dec1 (H/2×W/2, 4ch)                    │
 │   ↑ up + skip enc0                      │
-│  dec0 (H×W, 8ch)                        │
+│  dec0 (H×W, 4ch)                        │
 └─────────────────────────────────────────┘
        │
        ▼
@@ -80,14 +78,14 @@ A small MLP takes a conditioning vector `c` and outputs all γ/β:
 c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1]  (5D)
   ↓ Linear(5 → 16) → ReLU
   ↓ Linear(16 → N_film_params)
-  → [γ_enc0(8ch), β_enc0(8ch), γ_enc1(16ch), β_enc1(16ch),
-     γ_dec1(16ch), β_dec1(16ch), γ_dec0(8ch), β_dec0(8ch)]
-  = 2 × (8+16+16+8) = 96 parameters output
+  → [γ_enc0(4ch), β_enc0(4ch), γ_enc1(8ch), β_enc1(8ch),
+     γ_dec1(4ch), β_dec1(4ch), γ_dec0(4ch), β_dec0(4ch)]
+  = 2 × (4+8+4+4) = 40 parameters output
 ```
 
 **Runtime cost:** trivial (one MLP forward pass per frame, CPU-side).
 **Training:** jointly trained with U-Net — backprop through FiLM to MLP.
-**Size:** MLP weights ~(5×16 + 16×96) × 2 bytes f16 ≈ 3 KB.
+**Size:** MLP weights ~(5×16 + 16×40) × 2 bytes f16 ≈ 1.4 KB.
 
 **Why FiLM instead of just uniform parameters?**
 - γ/β are per-channel, enabling fine-grained style control
@@ -346,38 +344,20 @@ All f16, little-endian, same packing as v2 (`pack2x16float`).
 
 **CNN v3 target: ≤ 6 KB weights**
 
-| Component | Params | f16 bytes |
-|-----------|--------|-----------|
-| enc0: Conv(20→8, 3×3) | 20×8×9=1440 | 2880 |
-| enc1: Conv(8→16, 3×3) | 8×16×9=1152 | 2304 |
-| bottleneck: Conv(16→16, 3×3) | 16×16×9=2304 | 4608 |
-| dec1: Conv(32→8, 3×3) | 32×8×9=2304 | 4608 |
-| dec0: Conv(16→8, 3×3) | 16×8×9=1152 | 2304 |
-| output: Conv(8→4, 1×1) | 8×4=32 | 64 |
-| FiLM MLP (~96 outputs) | ~1600 | 3200 |
-| **Total** | | **~20 KB** |
-
-This exceeds target. **Mitigation strategies:**
-
-1. **Reduce channels:** [4, 8] instead of [8, 16] → cuts conv params by ~4×
-2. **1 level only:** remove H/4 level → drops bottleneck + one dec level
-3. **1×1 conv at bottleneck** (no spatial, just channel mixing)
-4. **FiLM only at bottleneck** → smaller MLP output
+**Implemented architecture (fits ≤ 4 KB):**
 
-**Conservative plan (fits ≤ 6 KB):**
-```
-enc0: Conv(20→4, 3×3) = 20×4×9 = 720 weights
-enc1: Conv(4→8, 3×3) = 4×8×9 = 288 weights
-bottleneck: Conv(8→8, 1×1) = 8×8×1 = 64 weights
-dec1: Conv(16→4, 3×3) = 16×4×9 = 576 weights
-dec0: Conv(12→4, 3×3) = 12×4×9 = 432 weights
-output: Conv(4→4, 1×1) = 4×4 = 16 weights
-FiLM MLP (5→24 outputs) = 5×16+16×24 = 464 weights
-Total: ~2560 weights × 2B = ~5.0 KB f16 ✓
-```
+| Component | Weights | Bias | Total f16 |
+|-----------|---------|------|-----------|
+| enc0: Conv(20→4, 3×3) | 20×4×9=720 | +4 | 724 |
+| enc1: Conv(4→8, 3×3) | 4×8×9=288 | +8 | 296 |
+| bottleneck: Conv(8→8, 1×1) | 8×8×1=64 | +8 | 72 |
+| dec1: Conv(16→4, 3×3) | 16×4×9=576 | +4 | 580 |
+| dec0: Conv(8→4, 3×3) | 8×4×9=288 | +4 | 292 |
+| FiLM MLP (5→16→40) | 5×16+16×40=720 | +16+40 | 776 |
+| **Total** | | | **~3.9 KB f16** |
 
-Note: enc0 input is 20ch (feature buffer), dec1 input is 16ch (8 bottleneck + 8 skip),
-dec0 input is 12ch (4 dec1 output + 8 enc0 skip). Skip connections concatenate.
+Skip connections: dec1 input = 8ch (bottleneck) + 8ch (enc1 skip) = 16ch.
+dec0 input = 4ch (dec1) + 4ch (enc0 skip) = 8ch.
 
 ---
 
diff --git a/cnn_v3/docs/HOWTO.md b/cnn_v3/docs/HOWTO.md
index 22266d3..425a33b 100644
--- a/cnn_v3/docs/HOWTO.md
+++ b/cnn_v3/docs/HOWTO.md
@@ -135,7 +135,7 @@ Mix freely; the dataloader treats all sample directories uniformly.
 
 ## 3. Training
 
-*(Network not yet implemented — this section will be filled as Phase 3+ lands.)*
+*(Script not yet written — see TODO.md. Architecture spec in `CNN_V3.md` §Training.)*
 
 **Planned command:**
 ```bash
@@ -146,21 +146,15 @@ python3 cnn_v3/training/train_cnn_v3.py \
 ```
 
 **FiLM conditioning** during training:
-- Beat/audio inputs are randomized per sample
-- Network learns to produce varied styles from same geometry
-
-**Validation:**
-```bash
-python3 cnn_v3/training/train_cnn_v3.py --validate \
-  --checkpoint cnn_v3/weights/cnn_v3_weights.bin \
-  --input test_frame.png
-```
+- Beat/audio inputs randomized per sample
+- MLP: `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net
+- Output: γ/β for enc0(4ch) + enc1(8ch) + dec1(4ch) + dec0(4ch) = 40 floats
 
 ---
 
-## 4. Running the CNN v3 Effect (Future)
+## 4. Running the CNN v3 Effect
 
-Once the C++ CNNv3Effect exists:
+`CNNv3Effect` is implemented. Wire into a sequence:
 
 ```seq
 # BPM 120
@@ -169,27 +163,32 @@ SEQUENCE 0 0 "Scene with CNN v3"
 EFFECT + CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
 ```
 
-FiLM parameters are uploaded via uniform each frame:
+FiLM parameters uploaded each frame:
 
 ```cpp
 cnn_v3_effect->set_film_params(
     params.beat_phase, params.beat_time / 8.0f,
     params.audio_intensity, style_p0, style_p1);
 ```
 
+FiLM γ/β default to identity (γ=1, β=0) until `train_cnn_v3.py` produces a trained MLP.
+
 ---
 
 ## 5. Per-Pixel Validation
 
-The CNN v3 design requires exact parity between PyTorch, WGSL (HTML), and C++.
+C++ parity test passes: `src/tests/gpu/test_cnn_v3_parity.cc` (2 tests).
+
+```bash
+cmake -B build -DDEMO_BUILD_TESTS=ON && cmake --build build -j4
+cd build && ./test_cnn_v3_parity
+```
 
-*(Validation tooling not yet implemented.)*
+Results (8×8 test tensors, random weights):
+- enc0 max_err = 1.95e-3 ✓
+- dec1 max_err = 1.95e-3 ✓
+- final max_err = 4.88e-4 ✓ (all ≤ 1/255 = 3.92e-3)
 
-**Planned workflow:**
-1. Export test input + weights as JSON
-2. Run Python reference → save per-pixel output
-3. Run HTML WebGPU tool → compare against Python
-4. Run C++ `cnn_v3_test` tool → compare against Python
-5. All comparisons must pass at ≤ 1/255 per pixel
+Test vectors generated by `cnn_v3/training/gen_test_vectors.py` (PyTorch reference).
 
 ---
 
@@ -197,12 +196,13 @@ The CNN v3 design requires exact parity between PyTorch, WGSL (HTML), and C++.
 
 | Phase | Status | Notes |
 |-------|--------|-------|
-| 1 — G-buffer (raster + pack) | ✅ Done | Integrated, 35/35 tests pass |
-| 1 — G-buffer (SDF + shadow passes) | TODO | Placeholder in place |
+| 1 — G-buffer (raster + pack) | ✅ Done | Integrated, 36/36 tests pass |
+| 1 — G-buffer (SDF + shadow passes) | TODO | Placeholder: shadow=1, transp=0 |
 | 2 — Training infrastructure | ✅ Done | blender_export.py, pack_*_sample.py |
 | 3 — WGSL U-Net shaders | ✅ Done | 5 compute shaders + cnn_v3/common snippet |
-| 4 — C++ CNNv3Effect | ✅ Done | FiLM uniform upload, 35/35 tests pass |
-| 5 — Parity validation | TODO | Test vectors, ≤1/255 |
+| 4 — C++ CNNv3Effect | ✅ Done | FiLM uniform upload, 36/36 tests pass |
+| 5 — Parity validation | ✅ Done | test_cnn_v3_parity.cc, max_err=4.88e-4 |
+| 6 — FiLM MLP training | TODO | train_cnn_v3.py not yet written |
````
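The FiLM sizing arithmetic this patch touches (40 γ/β outputs, ≈1.4 KB of MLP weights) can be sanity-checked with a few lines of Python. This is a standalone sketch, not repo code; the channel counts are the ones stated in the patch:

```python
# Sanity-check the FiLM MLP sizing from the patch above.
# Per-layer channel counts of the modulated layers: enc0, enc1, dec1, dec0.
channels = [4, 8, 4, 4]

# One gamma and one beta per channel across all modulated layers.
n_film_params = 2 * sum(channels)            # 2 × (4+8+4+4) = 40

# Weight count of Linear(5→16) → ReLU → Linear(16→n_film_params),
# biases ignored as in the doc's "~" estimate.
mlp_weights = 5 * 16 + 16 * n_film_params    # 80 + 640 = 720
mlp_bytes_f16 = mlp_weights * 2              # 2 bytes per f16 → 1440 B ≈ 1.4 KB

print(n_film_params, mlp_weights, mlp_bytes_f16)  # → 40 720 1440
```

This confirms the "96→40 params" and "≈ 1.4 KB" figures in the commit message are mutually consistent.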

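For readers unfamiliar with FiLM, the per-channel modulation referenced throughout the patch reduces to `γ·x + β` applied per feature channel. A minimal, dependency-free sketch (the `film` helper and toy data are illustrative, not from the repo):

```python
def film(planes, gamma, beta):
    """Per-channel FiLM: scale each feature plane by gamma[c], shift by beta[c]."""
    return [[g * v + b for v in plane]
            for plane, g, b in zip(planes, gamma, beta)]

# Two toy feature channels.
planes = [[0.5, -0.25], [1.0, 2.0]]

# Identity FiLM (γ=1, β=0) leaves features unchanged — the documented default
# until train_cnn_v3.py produces trained MLP weights.
assert film(planes, [1.0, 1.0], [0.0, 0.0]) == planes

# A trained MLP would instead emit per-channel γ/β, e.g.:
print(film(planes, [2.0, 0.5], [0.5, 0.0]))  # → [[1.5, 0.0], [0.5, 1.0]]
```

Because γ/β are per-channel rather than global, the same U-Net weights can render distinct styles from one conditioning vector, which is the rationale the design doc gives for FiLM over uniform parameters.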