path: root/cnn_v3/docs/CNN_V3.md
author    skal <pascal.massimino@gmail.com>  2026-03-21 09:54:16 +0100
committer skal <pascal.massimino@gmail.com>  2026-03-21 09:54:16 +0100
commit    5e740fc8f5f48fdd8ec4b84ae0c9a3c74e387d4f (patch)
tree      c330c8402e771d4b02316331d734802337d413c4 /cnn_v3/docs/CNN_V3.md
parent    673a24215b2670007317060325256059d1448f3b (diff)
docs(cnn_v3): update CNN_V3.md + HOWTO.md to reflect Phases 1-5 complete
- CNN_V3.md: status line, architecture channel counts (8/16→4/8), FiLM MLP output count (96→40 params), size budget table (real implemented values)
- HOWTO.md: Phase status table (5→done, add phase 6 training TODO), sections 3-5 rewritten to reflect what exists vs what is still planned
Diffstat (limited to 'cnn_v3/docs/CNN_V3.md')
-rw-r--r-- | cnn_v3/docs/CNN_V3.md | 66
1 file changed, 23 insertions(+), 43 deletions(-)
diff --git a/cnn_v3/docs/CNN_V3.md b/cnn_v3/docs/CNN_V3.md
index 9d64fe3..3f8f7db 100644
--- a/cnn_v3/docs/CNN_V3.md
+++ b/cnn_v3/docs/CNN_V3.md
@@ -19,9 +19,7 @@ CNN v3 is a next-generation post-processing effect using:
- Training from both Blender renders and real photos
- Strict test framework: per-pixel bit-exact validation across all implementations
-**Status:** Design phase. G-buffer implementation is prerequisite.
-
-**Prerequisites:** G-buffer (GEOM_BUFFER.md) must be implemented first.
+**Status:** Phases 1–5 complete. Parity validated (max_err=4.88e-4 ≤ 1/255). Next: `train_cnn_v3.py` for FiLM MLP training.
---
@@ -40,17 +38,17 @@ G-Buffer (albedo, normal, depth, matID, UV)
U-Net
┌─────────────────────────────────────────┐
│ Encoder │
- │ enc0 (H×W, 8ch) ────────────skip──────┤
+ │ enc0 (H×W, 4ch) ────────────skip──────┤
│ ↓ down (avg pool 2×2) │
- │ enc1 (H/2×W/2, 16ch) ───────skip──────┤
+ │ enc1 (H/2×W/2, 8ch) ────────skip──────┤
│ ↓ down │
- │ bottleneck (H/4×W/4, 16ch) │
+ │ bottleneck (H/4×W/4, 8ch) │
│ │
│ Decoder │
- │ ↑ up (bilinear 2×) + skip enc1 │
- │ dec1 (H/2×W/2, 16ch) │
+ │ ↑ up (nearest ×2) + skip enc1 │
+ │ dec1 (H/2×W/2, 4ch) │
│ ↑ up + skip enc0 │
- │ dec0 (H×W, 8ch) │
+ │ dec0 (H×W, 4ch) │
└─────────────────────────────────────────┘
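The tiny U-Net above can be traced as a shape sketch. A minimal NumPy version, where `conv` is a stand-in that only mixes channels (a 1×1 matmul with random weights instead of the real trained 3×3 convolutions), just to verify the channel counts and skip concatenations:

```python
import numpy as np

def conv(x, c_out):
    # Stand-in for a trained 3x3 conv: random channel mix + ReLU (shapes only).
    k = np.random.randn(x.shape[-1], c_out).astype(np.float32) * 0.1
    return np.maximum(x @ k, 0.0)

def down(x):
    # 2x2 average pooling.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def up(x):
    # Nearest-neighbour 2x upsample.
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.randn(64, 64, 20).astype(np.float32)   # 20ch feature buffer
e0 = conv(x, 4)                                      # enc0: HxW, 4ch (skip)
e1 = conv(down(e0), 8)                               # enc1: H/2xW/2, 8ch (skip)
b = conv(down(e1), 8)                                # bottleneck: H/4xW/4, 8ch
d1 = conv(np.concatenate([up(b), e1], axis=-1), 4)   # dec1 in: 8+8=16ch -> 4ch
d0 = conv(np.concatenate([up(d1), e0], axis=-1), 4)  # dec0 in: 4+4=8ch -> 4ch
```

Running this confirms the skip-concatenation widths (16ch into dec1, 8ch into dec0) match the conv sizes in the budget table further down.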
@@ -80,14 +78,14 @@ A small MLP takes a conditioning vector `c` and outputs all γ/β:
c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1] (5D)
↓ Linear(5 → 16) → ReLU
↓ Linear(16 → N_film_params)
- → [γ_enc0(8ch), β_enc0(8ch), γ_enc1(16ch), β_enc1(16ch),
- γ_dec1(16ch), β_dec1(16ch), γ_dec0(8ch), β_dec0(8ch)]
- = 2 × (8+16+16+8) = 96 parameters output
+ → [γ_enc0(4ch), β_enc0(4ch), γ_enc1(8ch), β_enc1(8ch),
+ γ_dec1(4ch), β_dec1(4ch), γ_dec0(4ch), β_dec0(4ch)]
+ = 2 × (4+8+4+4) = 40 parameters output
```
**Runtime cost:** trivial (one MLP forward pass per frame, CPU-side).
**Training:** jointly trained with U-Net — backprop through FiLM to MLP.
-**Size:** MLP weights ~(5×16 + 16×96) × 2 bytes f16 ≈ 3 KB.
+**Size:** MLP weights ~(5×16 + 16×40) × 2 bytes f16 ≈ 1.4 KB.
**Why FiLM instead of just uniform parameters?**
- γ/β are per-channel, enabling fine-grained style control
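The 5→16→40 FiLM MLP can be sketched in a few lines. Random weights stand in for the trained f16 tensors, and the per-stage slicing order (enc0, enc1, dec1, dec0) is an assumption about the packing:

```python
import numpy as np

# Random stand-ins for the trained MLP weights.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 40)), np.zeros(40)

def film_params(c):
    h = np.maximum(c @ W1 + b1, 0.0)  # Linear(5 -> 16) + ReLU
    out = h @ W2 + b2                 # Linear(16 -> 40)
    # Slice into per-stage (gamma, beta): 2 * (4 + 8 + 4 + 4) = 40 values.
    params, i = {}, 0
    for name, n in [("enc0", 4), ("enc1", 8), ("dec1", 4), ("dec0", 4)]:
        params[name] = (out[i:i + n], out[i + n:i + 2 * n])
        i += 2 * n
    return params

c = np.array([0.25, 0.5, 0.8, 0.0, 1.0])  # beat_phase, beat_time/8, intensity, p0, p1
gamma_beta = film_params(c)
```

Since `c` changes once per frame, this forward pass runs CPU-side and only the 40 resulting γ/β scalars need to reach the GPU.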
@@ -346,38 +344,20 @@ All f16, little-endian, same packing as v2 (`pack2x16float`).
**CNN v3 target: ≤ 6 KB weights**
-| Component | Params | f16 bytes |
-|-----------|--------|-----------|
-| enc0: Conv(20→8, 3×3) | 20×8×9=1440 | 2880 |
-| enc1: Conv(8→16, 3×3) | 8×16×9=1152 | 2304 |
-| bottleneck: Conv(16→16, 3×3) | 16×16×9=2304 | 4608 |
-| dec1: Conv(32→8, 3×3) | 32×8×9=2304 | 4608 |
-| dec0: Conv(16→8, 3×3) | 16×8×9=1152 | 2304 |
-| output: Conv(8→4, 1×1) | 8×4=32 | 64 |
-| FiLM MLP (~96 outputs) | ~1600 | 3200 |
-| **Total** | | **~20 KB** |
-
-This exceeds target. **Mitigation strategies:**
-
-1. **Reduce channels:** [4, 8] instead of [8, 16] → cuts conv params by ~4×
-2. **1 level only:** remove H/4 level → drops bottleneck + one dec level
-3. **1×1 conv at bottleneck** (no spatial, just channel mixing)
-4. **FiLM only at bottleneck** → smaller MLP output
+**Implemented architecture (~5.4 KB, within the ≤ 6 KB target):**
-**Conservative plan (fits ≤ 6 KB):**
-```
-enc0: Conv(20→4, 3×3) = 20×4×9 = 720 weights
-enc1: Conv(4→8, 3×3) = 4×8×9 = 288 weights
-bottleneck: Conv(8→8, 1×1) = 8×8×1 = 64 weights
-dec1: Conv(16→4, 3×3) = 16×4×9 = 576 weights
-dec0: Conv(12→4, 3×3) = 12×4×9 = 432 weights
-output: Conv(4→4, 1×1) = 4×4 = 16 weights
-FiLM MLP (5→24 outputs) = 5×16+16×24 = 464 weights
-Total: ~2560 weights × 2B = ~5.0 KB f16 ✓
-```
+| Component | Weights | Bias | Total f16 |
+|-----------|---------|------|-----------|
+| enc0: Conv(20→4, 3×3) | 20×4×9=720 | +4 | 724 |
+| enc1: Conv(4→8, 3×3) | 4×8×9=288 | +8 | 296 |
+| bottleneck: Conv(8→8, 1×1) | 8×8×1=64 | +8 | 72 |
+| dec1: Conv(16→4, 3×3) | 16×4×9=576 | +4 | 580 |
+| dec0: Conv(8→4, 3×3) | 8×4×9=288 | +4 | 292 |
+| FiLM MLP (5→16→40) | 5×16+16×40=720 | +16+40 | 776 |
+| **Total** | | | **2740 ≈ 5.4 KB f16** |
-Note: enc0 input is 20ch (feature buffer), dec1 input is 16ch (8 bottleneck + 8 skip),
-dec0 input is 12ch (4 dec1 output + 8 enc0 skip). Skip connections concatenate.
+Skip connections: dec1 input = 8ch (bottleneck) + 8ch (enc1 skip) = 16ch.
+dec0 input = 4ch (dec1) + 4ch (enc0 skip) = 8ch.
---
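The table arithmetic can be checked directly. The conv layers alone come to 1964 f16 values (~3.9 KB); the FiLM MLP row adds 776 more, so the full model is 2740 values, i.e. 5480 bytes (~5.4 KB), under the 6 KB target:

```python
# Per-layer f16 value counts from the size budget table (weights + biases).
layers = {
    "enc0":       20 * 4 * 9 + 4,             # 724
    "enc1":       4 * 8 * 9 + 8,              # 296
    "bottleneck": 8 * 8 * 1 + 8,              # 72
    "dec1":       16 * 4 * 9 + 4,             # 580
    "dec0":       8 * 4 * 9 + 4,              # 292
    "film_mlp":   5 * 16 + 16 * 40 + 16 + 40, # 776
}
conv_total = sum(v for k, v in layers.items() if k != "film_mlp")
total = sum(layers.values())
print(conv_total, total, total * 2)  # 1964 convs, 2740 total -> 5480 bytes
```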