Diffstat (limited to 'cnn_v3')
-rw-r--r--  cnn_v3/README.md      |    3
-rw-r--r--  cnn_v3/docs/CNN_V3.md | 1111
2 files changed, 1113 insertions(+), 1 deletion(-)
diff --git a/cnn_v3/README.md b/cnn_v3/README.md
index fdbf648..a22d823 100644
--- a/cnn_v3/README.md
+++ b/cnn_v3/README.md
@@ -31,6 +31,7 @@ Add images directly to these directories and commit them.
## Status
-**TODO:** Define CNN v3 architecture and feature set.
+**Design phase.** Architecture defined, G-buffer prerequisite pending.
+See `cnn_v3/docs/CNN_V3.md` for full design.
See `cnn_v2/` for reference implementation.
diff --git a/cnn_v3/docs/CNN_V3.md b/cnn_v3/docs/CNN_V3.md
new file mode 100644
index 0000000..9d64fe3
--- /dev/null
+++ b/cnn_v3/docs/CNN_V3.md
@@ -0,0 +1,1111 @@
+# CNN v3: U-Net + FiLM
+
+**Technical Design Document**
+
+---
+
+## Overview
+
+CNN v3 is a next-generation post-processing effect using:
+- **U-Net architecture** — encoder/decoder with skip connections for multi-scale stylization
+- **FiLM conditioning** — Feature-wise Linear Modulation, enabling runtime style control via beat, audio, or manual parameters
+- **G-Buffer input** — richer geometric inputs (normals, depth, material) instead of plain RGBD
+- **Per-pixel testability** — exact match between PyTorch, HTML WebGPU, and C++ WebGPU
+
+**Key improvements over v2:**
+- Multi-scale processing (encoder captures global, decoder restores detail)
+- Runtime stylization without retraining (FiLM γ/β from beat/audio/time)
+- Richer scene understanding from G-buffer (normals, material IDs)
+- Training from both Blender renders and real photos
+- Strict test framework: per-pixel bit-exact validation across all implementations
+
+**Status:** Design phase.
+
+**Prerequisites:** G-buffer (`GEOM_BUFFER.md`) must be implemented first.
+
+---
+
+## Architecture
+
+### Pipeline Overview
+
+```
+G-Buffer (albedo, normal, depth, matID, UV)
+ │
+ ▼
+ FiLM Conditioning
+ (beat_time, audio_intensity, style_params)
+ │ → γ[], β[] per channel
+ ▼
+ U-Net
+ ┌─────────────────────────────────────────┐
+ │ Encoder │
+ │ enc0 (H×W, 8ch) ────────────skip──────┤
+ │ ↓ down (avg pool 2×2) │
+ │ enc1 (H/2×W/2, 16ch) ───────skip──────┤
+ │ ↓ down │
+ │ bottleneck (H/4×W/4, 16ch) │
+ │ │
+ │ Decoder │
+ │ ↑ up (nearest 2×) + skip enc1 │
+ │ dec1 (H/2×W/2, 16ch) │
+ │ ↑ up + skip enc0 │
+ │ dec0 (H×W, 8ch) │
+ └─────────────────────────────────────────┘
+ │
+ ▼
+ output RGBA (H×W)
+```
+
+FiLM is applied **inside each encoder/decoder block**, after each convolution.
+
+### U-Net Block (per level)
+
+```
+input → Conv 3×3 → BN (or none) → FiLM(γ,β) → ReLU → output
+```
+
+FiLM at level `l`:
+```
+FiLM(x, γ_l, β_l) = γ_l ⊙ x + β_l (per-channel affine)
+```
+
+γ and β are computed from the conditioning MLP, one γ/β pair per channel per level.
+
+### FiLM Conditioning
+
+A small MLP takes a conditioning vector `c` and outputs all γ/β:
+
+```
+c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1] (5D)
+ ↓ Linear(5 → 16) → ReLU
+ ↓ Linear(16 → N_film_params)
+ → [γ_enc0(8ch), β_enc0(8ch), γ_enc1(16ch), β_enc1(16ch),
+ γ_dec1(16ch), β_dec1(16ch), γ_dec0(8ch), β_dec0(8ch)]
+ = 2 × (8+16+16+8) = 96 parameters output
+```
+
+**Runtime cost:** trivial (one MLP forward pass per frame, CPU-side).
+**Training:** jointly trained with U-Net — backprop through FiLM to MLP.
+**Size:** MLP weights ~(5×16 + 16×96) × 2 bytes f16 ≈ 3 KB.
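
The MLP-to-γ/β plumbing can be sketched in a few lines. This is a numpy illustration for the [8, 16] configuration with untrained random weights standing in for the real (jointly trained) MLP; `film_params` and `LEVELS` are hypothetical names:

```python
import numpy as np

# Channel counts per FiLM site for the [8, 16] configuration → 2·(8+16+16+8) = 96 outputs.
LEVELS = {"enc0": 8, "enc1": 16, "dec1": 16, "dec0": 8}
N_OUT = 2 * sum(LEVELS.values())

# Untrained random weights stand in for the real trained MLP.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 5)), np.zeros(16)         # Linear(5 → 16)
W2, b2 = rng.normal(size=(N_OUT, 16)), np.zeros(N_OUT)  # Linear(16 → 96)

def film_params(cond):
    """cond: 5-vector → {level: (γ, β)} with one value per channel."""
    h = np.maximum(0.0, W1 @ cond + b1)  # ReLU
    out = W2 @ h + b2
    params, i = {}, 0
    for name, ch in LEVELS.items():
        params[name] = (out[i:i + ch], out[i + ch:i + 2 * ch])
        i += 2 * ch
    return params

cond = np.array([0.25, 0.5, 0.8, 0.0, 1.0])  # beat_phase, beat_time/8, audio, style_p0, style_p1
params = film_params(cond)
```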
+
+**Why FiLM instead of just uniform parameters?**
+- γ/β are per-channel, enabling fine-grained style control
+- Network learns to use beat/audio meaningfully during training
+- Same weights, different moods: dark/moody vs bright/energetic
+
+---
+
+## G-Buffer Passes
+
+The G-buffer is populated by two passes writing to the same textures, merged by depth.
+Textures need dual usage: `RENDER_ATTACHMENT | STORAGE_BINDING` → use `rgba16float`.
+
+```
+Pass 1: Rasterize triangles → MRT (fragment shader)
+ color[0]: albedo rgba16float material color (pre-lighting)
+ color[1]: normal_mat rg16float oct-normal XY + mat_id (u16 packed)
+ depth: depth32float hardware z-test + z-write
+
+Pass 2: SDF raymarching → compute shader
+ reads: depth32float texture (compare SDF hit depth vs rasterized)
+ writes: albedo, normal_mat storage textures where SDF depth < rasterized
+ writes: transparency r16float (volumetric density, not from rasterizer)
+ writes: shadow r8unorm (SDF soft shadow, or shared with light pass)
+
+Pass 3: Lighting / shadow pass → compute shader
+ reads: depth, normal_mat
+ writes: shadow r8unorm (shadow map lookup or SDF shadow ray)
+
+Pass 4: Pack → 32-byte CNN feature buffer (see below)
+ reads: all G-buffer textures + prev CNN output texture
+ computes: depth_grad (finite diff), samples albedo MIP 1 and MIP 2
+ writes: feat_tex0 (rgba32uint) + feat_tex1 (rgba32uint)
+```
+
+**Depth unification:** the SDF pass reads the rasterized depth32float, converts its hit
+distance to the same NDC depth value, and only overwrites when closer. Both sources end up
+in the same depth texture which the pack pass reads for `depth` and `depth_grad`.
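
The depth-unification rule can be sketched like this; a minimal numpy model, assuming standard [0,1] perspective depth and hypothetical `NEAR`/`FAR` values (the real constants come from the camera):

```python
import numpy as np

NEAR, FAR = 0.1, 100.0  # hypothetical camera planes

def view_z_to_ndc_depth(z):
    # Standard [0,1] perspective depth: 0 at the near plane, 1 at the far plane.
    return FAR * (z - NEAR) / (z * (FAR - NEAR))

def merge_sdf_into_gbuffer(depth_buf, albedo_buf, sdf_hit_z, sdf_albedo):
    """Overwrite G-buffer texels only where the SDF hit is closer."""
    ndc = view_z_to_ndc_depth(sdf_hit_z)
    closer = ndc < depth_buf
    depth_buf = np.where(closer, ndc, depth_buf)
    albedo_buf = np.where(closer[..., None], sdf_albedo, albedo_buf)
    return depth_buf, albedo_buf
```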
+
+---
+
+## Input Feature Buffer
+
+**20 channels, 32 bytes/pixel**, packed into two `rgba32uint` textures (8 u32 total).
+Mixed precision: geometric data as f16, color context and categorical data as u8.
+
+**UV is NOT stored** — computed from `coord / resolution` in every shader (free).
+
+---
+
+### Texture 0 — 4 u32, 8 × f16 (geometric, high precision)
+
+| u32 | f16 lo | f16 hi | Notes |
+|-----|--------|--------|-------|
+| [0] | albedo.r | albedo.g | pre-lighting material color |
+| [1] | albedo.b | normal.x | oct-encoded normal X |
+| [2] | normal.y | depth | 1/z normalized |
+| [3] | depth_grad.x | depth_grad.y | finite diff of depth, signed |
+
+Normal z is reconstructed during unpack. For view-space normals (z ≥ 0) the stored XY can be the raw unit-vector components, with `z = sqrt(max(0, 1 - nx² - ny²))`; world-space normals need the full octahedral decode.
+Depth gradient captures surface discontinuities and orientation cues for the CNN.
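
For reference, a numpy roundtrip of the classic full-sphere octahedral mapping; illustrative only, and the WGSL `unpack_oct_normal` must use the matching variant:

```python
import numpy as np

def _sgn(v):
    # sign that maps 0 → +1, as octahedral folding requires
    return np.where(v >= 0.0, 1.0, -1.0)

def oct_encode(n):
    """Unit vec3 → octahedral XY in [-1, 1] (full sphere)."""
    n = n / np.abs(n).sum()
    xy = np.asarray(n[:2], dtype=np.float64)
    if n[2] < 0.0:  # fold the lower hemisphere
        xy = (1.0 - np.abs(xy[::-1])) * _sgn(xy)
    return xy

def oct_decode(e):
    """Octahedral XY in [-1, 1] → unit vec3."""
    z = 1.0 - np.abs(e[0]) - np.abs(e[1])
    xy = np.asarray(e, dtype=np.float64)
    if z < 0.0:  # unfold
        xy = (1.0 - np.abs(xy[::-1])) * _sgn(xy)
    v = np.array([xy[0], xy[1], z])
    return v / np.linalg.norm(v)
```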
+
+---
+
+### Texture 1 — 4 u32, 12 × u8 + 1 spare u32 (context, low precision)
+
+| u32 | byte 0 | byte 1 | byte 2 | byte 3 |
+|-----|--------|--------|--------|--------|
+| [0] | mat_id | prev.r | prev.g | prev.b |
+| [1] | mip1.r | mip1.g | mip1.b | mip2.r |
+| [2] | mip2.g | mip2.b | shadow | transp. |
+| [3] | — spare — | | | |
+
+All packed via `pack4x8unorm`. Channels:
+- **mat_id**: object/material index (u8/255), carries style category
+- **prev.rgb**: previous CNN output (temporal feedback, recurrent)
+- **mip1.rgb**: albedo at MIP 1 (½ resolution) — medium-frequency color context
+- **mip2.rgb**: albedo at MIP 2 (¼ resolution) — low-frequency color context
+- **shadow**: shadow intensity [0=fully shadowed, 1=fully lit] from shadow pass
+- **transp.**: volumetric transparency [0=opaque, 1=transparent] for fog/smoke/volumetric light
+
+**Texture 1 is fully packed. u32[3] is reserved for future use.**
+
+---
+
+### Pack compute shader
+
+```wgsl
+@compute @workgroup_size(8, 8)
+fn pack_features(@builtin(global_invocation_id) id: vec3u) {
+ let coord = vec2i(id.xy);
+ let uv = (vec2f(coord) + 0.5) / resolution;
+
+ let albedo = textureLoad(gbuf_albedo, coord, 0).rgb;
+ let nm = textureLoad(gbuf_normal_mat, coord, 0);
+ let depth = sample_depth(coord); // from depth32float
+ let dzdx = (sample_depth(coord + vec2i(1,0)) - sample_depth(coord - vec2i(1,0))) * 0.5;
+ let dzdy = (sample_depth(coord + vec2i(0,1)) - sample_depth(coord - vec2i(0,1))) * 0.5;
+ let shadow = textureLoad(gbuf_shadow, coord, 0).r;
+ let transp = textureLoad(gbuf_transp, coord, 0).r;
+    let mat_id = unpack_mat_id(nm); // mat index as u8/255, already normalized to [0,1]
+ let normal = unpack_oct_normal(nm.rg); // vec2f
+
+ let mip1 = textureSampleLevel(gbuf_albedo, smplr, uv, 1.0).rgb;
+ let mip2 = textureSampleLevel(gbuf_albedo, smplr, uv, 2.0).rgb;
+    let prev = textureSampleLevel(prev_cnn_tex, smplr, uv, 0.0).rgb; // explicit LOD: textureSample is not allowed in compute shaders
+
+ textureStore(feat_tex0, coord, vec4u(
+ pack2x16float(albedo.rg),
+ pack2x16float(vec2(albedo.b, normal.x)),
+ pack2x16float(vec2(normal.y, depth)),
+ pack2x16float(vec2(dzdx, dzdy)),
+ ));
+ textureStore(feat_tex1, coord, vec4u(
+ pack4x8unorm(vec4(mat_id, prev.r, prev.g, prev.b)),
+ pack4x8unorm(vec4(mip1.r, mip1.g, mip1.b, mip2.r)),
+ pack4x8unorm(vec4(mip2.g, mip2.b, shadow, transp)),
+ 0u,
+ ));
+}
+```
+
+---
+
+### Full channel table (20 channels, 32 bytes/pixel)
+
+| # | Name | Prec | Source |
+|---|------|------|--------|
+| 0 | albedo.r | f16 | Raster/SDF material color |
+| 1 | albedo.g | f16 | |
+| 2 | albedo.b | f16 | |
+| 3 | normal.x | f16 | Oct-encoded, raster/SDF |
+| 4 | normal.y | f16 | |
+| 5 | depth | f16 | Unified depth (1/z) |
+| 6 | depth_grad.x | f16 | Finite diff of depth |
+| 7 | depth_grad.y | f16 | |
+| 8 | mat_id | u8 | Object index / 255 |
+| 9 | prev.r | u8 | Previous CNN output (temporal) |
+| 10 | prev.g | u8 | |
+| 11 | prev.b | u8 | |
+| 12 | mip1.r | u8 | Albedo MIP 1 (½ res) |
+| 13 | mip1.g | u8 | |
+| 14 | mip1.b | u8 | |
+| 15 | mip2.r | u8 | Albedo MIP 2 (¼ res) |
+| 16 | mip2.g | u8 | |
+| 17 | mip2.b | u8 | |
+| 18 | shadow | u8 | Shadow intensity [0=dark, 1=lit] |
+| 19 | transp. | u8 | Volumetric transparency [0=opaque, 1=clear] |
+
+UV computed in-shader. Bias = 1.0 implicit (standard NN, not stored).
+
+**Memory:** 1920×1080 × 32 bytes = **66 MB** feature buffer.
+Plus prev_cnn texture (RGBA8): **8 MB**.
+
+---
+
+### 16-byte fallback (budget-constrained)
+
+Drop temporal, MIPs, shadow, transparency. Geometric data only:
+
+| u32 | channels |
+|-----|----------|
+| [0] | albedo.rg (f16) |
+| [1] | albedo.b, normal.x (f16) |
+| [2] | normal.y, depth (f16) |
+| [3] | depth_grad.x, depth_grad.y (f16) |
+
+8 channels, 16 bytes/pixel = **33 MB**. No temporal coherence, no lighting context.
+
+---
+
+### Temporal parity testing
+
+Temporal breaks single-frame testability. Protocol:
+- **Static parity test**: set `prev_cnn = black` → fully deterministic, run on single frame
+- **Temporal parity test**: 2-frame sequence; frame 1's prev = frame 0's CNN output
+- Test vector NPZ includes prev as explicit input: `test_<n>_{feat, prev, cond, expected}.npz`
+
+---
+
+## Testability Framework
+
+**Goal:** Per-pixel bit-exact match (within f16 rounding tolerance) across:
+1. PyTorch reference (f32)
+2. HTML WebGPU validation tool
+3. C++ WebGPU runtime
+
+### Protocol
+
+**Step 1: Reference generation (PyTorch f32)**
+- Export test vectors: 4 canonical G-buffer images + conditioning vectors → expected output
+- Store as PNG + NPZ: `cnn_v3/tests/vectors/test_<n>_{feat,prev,cond,expected}.{png,npz}`
+
+**Step 2: f16 export**
+- Convert all weights to f16 (same as v2)
+- Stored in binary format (see below)
+
+**Step 3: Deterministic operations (no ambiguity between impls)**
+- Padding: `same` padding = zero-pad by `kernel_size//2`, explicit in all impls
+- Downsampling: **average pooling 2×2**, stride 2 (not max pool — identical in all)
+- Upsampling: **nearest neighbor** ×2 (no interpolation differences)
+- Activation: ReLU = `max(0, x)` (exact), sigmoid = `1/(1+exp(-x))` (numerically identical)
+- FiLM: `gamma * x + beta` applied per-channel (not per-pixel — channel broadcast)
+- No batch norm at inference (fold BN into conv weights during export)
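
The deterministic ops above can be pinned down as a numpy reference that all three implementations are checked against; a sketch (function names are illustrative):

```python
import numpy as np

def avg_pool2x2(x):
    """x: C×H×W with even H, W — deterministic 2×2 mean, stride 2."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample_nearest2x(x):
    """Nearest-neighbor ×2 — no interpolation differences between impls."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv3x3_zeropad(x, w):
    """x: C×H×W, w: O×C×3×3; explicit zero-pad by 1 ('same' output size)."""
    c, h, wd = x.shape
    xp = np.zeros((c, h + 2, wd + 2), dtype=x.dtype)
    xp[:, 1:-1, 1:-1] = x
    out = np.zeros((w.shape[0], h, wd), dtype=x.dtype)
    for o in range(w.shape[0]):
        for dy in range(3):
            for dx in range(3):
                out[o] += (w[o, :, dy, dx, None, None] * xp[:, dy:dy + h, dx:dx + wd]).sum(axis=0)
    return out
```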
+
+**Step 4: Validation**
+```bash
+python3 cnn_v3/training/validate_parity.py \
+ --weights cnn_v3/weights/model.bin \
+ --test-vectors cnn_v3/tests/vectors/ \
+ --tolerance 1 # max 1/255 per channel
+```
+
+Tolerance: f16 rounding introduces at most ~0.001 error. Display is 8-bit (1/255 ≈ 0.004).
+**Acceptance criterion:** max per-pixel per-channel absolute error ≤ 1/255.
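
The acceptance check itself is a one-liner worth writing once and reusing in all harnesses; a sketch of what `validate_parity.py` computes:

```python
import numpy as np

def check_parity(expected, actual, tol=1.0 / 255.0):
    """Max/mean per-pixel per-channel absolute error vs the f32 reference."""
    err = np.abs(np.asarray(expected, np.float32) - np.asarray(actual, np.float32))
    return {"max": float(err.max()), "mean": float(err.mean()),
            "pass": bool(err.max() <= tol)}
```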
+
+### Parity Checklist
+
+For each layer, verify:
+- [ ] Input shape matches
+- [ ] Weight layout matches (OIHW = out_ch × in_ch × kH × kW)
+- [ ] Padding: explicit zero-pad, not "reflect" or "replicate"
+- [ ] Convolution output shape matches
+- [ ] FiLM γ/β applied in correct order (after conv, before activation)
+- [ ] Skip connection: concatenation along channel axis (not add)
+- [ ] Upsample: nearest neighbor (not bilinear)
+
+---
+
+## Binary Format
+
+Extends CNN v2 binary format with:
+
+**Header (v3, 28 bytes):**
+
+| Offset | Type | Field | Description |
+|--------|------|-------|-------------|
+| 0x00 | u32 | magic | `0x33_4E_4E_43` ("CNN3") |
+| 0x04 | u32 | version | 3 |
+| 0x08 | u32 | num_enc_levels | U-Net encoder levels (typically 2) |
+| 0x0C | u32 | num_channels | Channels per level (e.g., [8,16]) |
+| 0x10 | u32 | in_channels | Feature buffer input channels (20) |
+| 0x14 | u32 | film_cond_dim | FiLM conditioning input size |
+| 0x18 | u32 | total_weights | Total f16 weight count |
+
+**Sections** (sequential after header):
+1. Encoder conv weights (per level)
+2. Decoder conv weights (per level)
+3. FiLM MLP weights (γ/β generator)
+
+All f16, little-endian, same packing as v2 (`pack2x16float`).
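
The 28-byte header maps directly onto `struct.pack`; a sketch of the writer (field values here are illustrative defaults, not final):

```python
import struct

MAGIC = 0x334E4E43  # little-endian u32 whose bytes spell "CNN3"

def write_header(num_enc_levels=2, num_channels=8, in_channels=20,
                 film_cond_dim=5, total_weights=0):
    """Seven little-endian u32 = 28 bytes, field order as in the table above."""
    return struct.pack("<7I", MAGIC, 3, num_enc_levels, num_channels,
                       in_channels, film_cond_dim, total_weights)
```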
+
+---
+
+## Size Budget
+
+**CNN v3 target: ≤ 6 KB weights**
+
+| Component | Params | f16 bytes |
+|-----------|--------|-----------|
+| enc0: Conv(20→8, 3×3) | 20×8×9=1440 | 2880 |
+| enc1: Conv(8→16, 3×3) | 8×16×9=1152 | 2304 |
+| bottleneck: Conv(16→16, 3×3) | 16×16×9=2304 | 4608 |
+| dec1: Conv(32→8, 3×3) | 32×8×9=2304 | 4608 |
+| dec0: Conv(16→8, 3×3) | 16×8×9=1152 | 2304 |
+| output: Conv(8→4, 1×1) | 8×4=32 | 64 |
+| FiLM MLP (~96 outputs) | ~1600 | 3200 |
+| **Total** | | **~20 KB** |
+
+This exceeds target. **Mitigation strategies:**
+
+1. **Reduce channels:** [4, 8] instead of [8, 16] → cuts conv params by ~4×
+2. **1 level only:** remove H/4 level → drops bottleneck + one dec level
+3. **1×1 conv at bottleneck** (no spatial, just channel mixing)
+4. **FiLM only at bottleneck** → smaller MLP output
+
+**Conservative plan (fits ≤ 6 KB):**
+```
+enc0: Conv(20→4, 3×3) = 20×4×9 = 720 weights
+enc1: Conv(4→8, 3×3) = 4×8×9 = 288 weights
+bottleneck: Conv(8→8, 1×1) = 8×8×1 = 64 weights
+dec1: Conv(16→4, 3×3) = 16×4×9 = 576 weights
+dec0:       Conv(8→4, 3×3)  = 8×4×9  = 288 weights
+output:     Conv(4→4, 1×1)  = 4×4    = 16 weights
+FiLM MLP (5→24 outputs) = 5×16+16×24 = 464 weights
+Total: ~2416 weights × 2B ≈ 4.8 KB f16 ✓
+```
+
+Note: enc0 input is 20ch (feature buffer), dec1 input is 16ch (8 bottleneck + 8 enc1 skip),
+dec0 input is 8ch (4 dec1 output + 4 enc0 skip). Skip connections concatenate.
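
These tallies are easy to get wrong by hand; a trivial helper makes the budget tables mechanically checkable:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k×k convolution (biases ignored/folded)."""
    return c_in * c_out * k * k

# Spot-check rows of the budget tables above:
assert conv_params(20, 8, 3) == 1440  # enc0, [8,16] variant
assert conv_params(20, 4, 3) == 720   # enc0, conservative plan
```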
+
+---
+
+## Training Data
+
+Two sample types feed the same model. The key to compatibility is **channel dropout**
+during training: geometric channels are randomly zeroed with probability p=0.3, forcing
+the network to learn useful behaviour even when channels are absent. Photo samples are
+then a natural zero-filled subset at inference.
+
+---
+
+### Pipeline A: Full G-buffer samples (Blender)
+
+Blender Cycles exports all 20 channels as render passes in a single multi-layer EXR.
+
+**Render passes required:**
+
+| Pass | Blender name | Maps to |
+|------|-------------|---------|
+| Beauty (target) | `Combined` | Training target RGBA |
+| Diffuse color | `DiffCol` | albedo.rgb |
+| World normal | `Normal` | normal.xy (octahedral encode in post) |
+| Depth | `Z` | depth (normalize by far plane) |
+| Object index | `IndexOB` | mat_id |
+| Shadow | `Shadow` | shadow (invert: 1−shadow_catcher) |
+| Alpha / transmission | `Alpha` | transp. (0=opaque, 1=clear) |
+
+depth_grad, mip1, mip2 computed from albedo/depth during pack, not a render pass.
+prev = **zero** during training (no temporal history for static frames).
+
+**Blender script: `cnn_v3/training/blender_export.py`**
+```python
+# Enable passes
+vl = bpy.context.scene.view_layers["ViewLayer"]
+vl.use_pass_diffuse_color = True
+vl.use_pass_normal = True
+vl.use_pass_z = True
+vl.use_pass_object_index = True
+vl.use_pass_shadow = True
+
+# Output: multi-layer EXR via compositor File Output node
+# One EXR per frame, all passes in separate layers
+
+# Run headless:
+# blender -b scene.blend -P blender_export.py -- --output renders/frame_###
+```
+
+**Post-processing: `cnn_v3/training/pack_blender_sample.py`**
+```bash
+python3 pack_blender_sample.py \
+ --exr renders/frame_001.exr \
+ --output dataset/full/sample_001/
+# Writes: albedo.png normal.png depth.png matid.png shadow.png transp.png target.png
+```
+
+depth_grad computed on-the-fly in the dataloader (same central-difference kernel as the runtime pack shader).
+mip1/mip2 computed from albedo via pyrDown (same as runtime).
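
The dataloader side of this can be sketched in numpy; `pyr_down` here is a 2×2 box filter standing in for `cv2.pyrDown`'s Gaussian — whichever filter is used must match the runtime MIP generation:

```python
import numpy as np

def depth_grad(depth):
    """Central differences, matching the runtime pack shader's finite diff."""
    dzdy, dzdx = np.gradient(depth.astype(np.float64))
    return dzdx, dzdy

def pyr_down(img):
    """2×2 box-filter decimation (stand-in for cv2.pyrDown's Gaussian)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    c = img[:h, :w]
    return 0.25 * (c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2])
```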
+
+---
+
+### Pipeline B: Simple photo samples (albedo + alpha only)
+
+Input: a photo (RGB) + optional alpha mask. No geometry data.
+Missing channels are **zero-filled** — the network degrades gracefully due to dropout training.
+
+| Feature buffer channel | Value |
+|-----------------------|-------|
+| albedo.rgb | Photo RGB |
+| normal.xy | **0, 0** (zero → network ignores) |
+| depth | **0** |
+| depth_grad.xy | **0, 0** |
+| mat_id | **0** |
+| prev.rgb | **0, 0, 0** (no history) |
+| mip1.rgb | Computed from photo (pyrDown ×1) |
+| mip2.rgb | Computed from photo (pyrDown ×2) |
+| shadow | **1.0** (assume fully lit) |
+| transp. | **1 − alpha** (from photo alpha channel, or 0 if no alpha) |
+
+mip1/mip2 are still meaningful (they come from albedo, which we have).
+`transp` from photo alpha lets the network see foreground/background separation when
+available (e.g. cutout photos, PNG with alpha).
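
Assembling the 20-channel tensor for a photo sample is mostly zero-filling; a sketch of the layout (assumes dimensions divisible by 4; `box_mip` is an illustrative box-filter stand-in for the runtime MIP sampling):

```python
import numpy as np

def box_mip(img, f):
    """f×f box-downsample then nearest-upsample back to full res."""
    h, w = (img.shape[0] // f) * f, (img.shape[1] // f) * f
    small = img[:h, :w].reshape(h // f, f, w // f, f, -1).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, f, axis=0), f, axis=1)

def photo_to_features(photo):
    """H×W×4 RGBA float in [0,1] → H×W×20 feature tensor (pipeline B layout)."""
    h, w, _ = photo.shape
    feat = np.zeros((h, w, 20), dtype=np.float32)
    feat[..., 0:3] = photo[..., :3]                # albedo.rgb
    # channels 3-8 (normal, depth, depth_grad, mat_id) and 9-11 (prev) stay zero
    feat[..., 12:15] = box_mip(photo[..., :3], 2)  # mip1.rgb
    feat[..., 15:18] = box_mip(photo[..., :3], 4)  # mip2.rgb
    feat[..., 18] = 1.0                            # shadow: assume fully lit
    feat[..., 19] = 1.0 - photo[..., 3]            # transp = 1 − alpha
    return feat
```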
+
+**Simple pack script: `cnn_v3/training/pack_photo_sample.py`**
+```bash
+python3 pack_photo_sample.py \
+ --photo photos/img_001.png \ # RGB or RGBA
+ --output dataset/simple/sample_001/
+# Writes: albedo.png [zeros for normal/depth/matid/shadow] target.png (= albedo, no GT style)
+```
+
+For photo samples there is **no ground-truth styled target** — they are used for:
+1. Fine-tuning after Blender pre-training (self-supervised or with manual target)
+2. Inference-only testing (visual validation, no loss computed)
+3. Parity testing (compare PyTorch vs WebGPU output on a photo input)
+
+---
+
+### Channel dropout (training robustness)
+
+Applied per-sample during dataloader `__getitem__`:
+
+```python
+import random
+
+GEOMETRIC_CHANNELS = [3, 4, 5, 6, 7]  # normal.xy, depth, depth_grad.xy
+CONTEXT_CHANNELS = [8, 18, 19] # mat_id, shadow, transp
+TEMPORAL_CHANNELS = [9, 10, 11] # prev.rgb
+
+def apply_channel_dropout(feat, p_geom=0.3, p_context=0.2, p_temporal=0.5):
+ if random.random() < p_geom:
+ feat[GEOMETRIC_CHANNELS] = 0.0 # simulate photo-only input
+ if random.random() < p_context:
+ feat[CONTEXT_CHANNELS] = 0.0
+ if random.random() < p_temporal:
+ feat[TEMPORAL_CHANNELS] = 0.0 # simulate first frame
+ return feat
+```
+
+This ensures the network produces reasonable output regardless of which channels
+are available, and that full and simple pipelines can share one set of weights.
+
+---
+
+### Dataset layout
+
+```
+cnn_v3/training/
+ dataset/
+ full/ # Blender samples (all 20 channels)
+ sample_000/
+ albedo.png # RGB
+ normal.png # RG oct-encoded (or zero)
+ depth.png # R float16 EXR or 16-bit PNG
+ matid.png # R u8
+ shadow.png # R u8
+ transp.png # R u8
+ target.png # RGBA styled target
+ simple/ # Photo samples (albedo+alpha only)
+ sample_000/
+ albedo.png # RGB (or RGBA if alpha available)
+ target.png # = albedo (no GT, inference/parity only)
+ test_vectors/
+ full_000_{feat,prev,cond,expected}.npz # parity: full G-buffer
+ simple_000_{feat,prev,cond,expected}.npz # parity: photo input
+```
+
+`feat.npz` stores the packed 20-channel float array (H×W×20, f32) ready for the model.
+`prev.npz` stores the previous-frame CNN output (H×W×3, f32), zero for static tests.
+`cond.npz` stores the FiLM conditioning vector (5-d).
+`expected.npz` stores the PyTorch f32 reference output (H×W×4, f32).
+
+---
+
+### Parity test matrix
+
+| Test | G-buffer | Prev | Notes |
+|------|----------|------|-------|
+| `full_static` | Blender sample | zero | Core correctness test |
+| `simple_static` | Photo (zeros for geom) | zero | Photo path correctness |
+| `full_temporal` | Blender frame 1 | frame 0 output | Temporal path |
+| `zero_input` | All zeros | zero | Degenerate stability check |
+
+All tests: max per-pixel per-channel absolute error ≤ 1/255 (PyTorch f32 vs WebGPU f16).
+
+---
+
+## Training Script: `train_cnn_v3.py`
+
+**Key differences from v2:**
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+def film_apply(x, gamma, beta):
+    # gamma, beta: shape [B, C] → broadcast to [B, C, 1, 1]
+    return gamma[:, :, None, None] * x + beta[:, :, None, None]
+
+class CNNv3(nn.Module):
+    def __init__(self, enc_channels=[4, 8], film_cond_dim=5):
+        super().__init__()
+        # Encoder
+        self.enc = nn.ModuleList([
+            nn.Conv2d(20, enc_channels[0], 3, padding=1),  # 20-ch feature buffer in
+            nn.Conv2d(enc_channels[0], enc_channels[1], 3, padding=1),
+        ])
+        # Bottleneck (1×1: channel mixing only)
+        self.bottleneck = nn.Conv2d(enc_channels[1], enc_channels[1], 1)
+        # Decoder (skip connections: concat → input channels double)
+        self.dec = nn.ModuleList([
+            nn.Conv2d(enc_channels[1] * 2, enc_channels[0], 3, padding=1),
+            nn.Conv2d(enc_channels[0] * 2, 4, 3, padding=1),
+        ])
+        # Final 1×1 conv (matches the binary format's output layer), no FiLM/ReLU
+        self.out_conv = nn.Conv2d(4, 4, 1)
+        # FiLM MLP: conditioning → γ/β per FiLM site; restrict this list
+        # (e.g. to encoder levels only) to shrink the MLP — see Size Budget
+        self.film_channels = enc_channels + [enc_channels[0], 4]  # enc0, enc1, dec1, dec0
+        self.film_mlp = nn.Sequential(
+            nn.Linear(film_cond_dim, 16), nn.ReLU(),
+            nn.Linear(16, 2 * sum(self.film_channels)),  # γ and β per channel
+        )
+
+    def _split_film(self, film):
+        # film: [B, 2·Σch] → list of (γ, β) per level, in film_channels order
+        params, i = [], 0
+        for ch in self.film_channels:
+            params.append((film[:, i:i + ch], film[:, i + ch:i + 2 * ch]))
+            i += 2 * ch
+        return params
+
+    def forward(self, gbuf, cond):
+        # FiLM params from conditioning
+        film = self._split_film(self.film_mlp(cond))
+        n_enc = len(self.enc)
+
+        # Encoder
+        skips = []
+        x = gbuf
+        for i, enc_layer in enumerate(self.enc):
+            x = F.relu(film_apply(enc_layer(x), *film[i]))  # conv → FiLM → ReLU
+            skips.append(x)
+            x = F.avg_pool2d(x, 2)                          # ½ resolution
+
+        # Bottleneck
+        x = F.relu(self.bottleneck(x))
+
+        # Decoder
+        for i, dec_layer in enumerate(self.dec):
+            x = F.interpolate(x, scale_factor=2, mode='nearest')  # ×2
+            x = torch.cat([x, skips[-(i + 1)]], dim=1)            # skip concat
+            x = F.relu(film_apply(dec_layer(x), *film[n_enc + i]))
+
+        return torch.sigmoid(self.out_conv(x))  # RGBA output [0,1]
+```
+
+**Export:** fold BN into conv weights (if BN used), quantize to f16, write binary v3.
+
+---
+
+## Training Pipeline Script: `cnn_v3/scripts/train_cnn_v3_full.sh`
+
+Modelled directly on `cnn_v2/scripts/train_cnn_v2_full.sh`. Same structure, same modes,
+extended for v3 specifics (dataset packing, FiLM, parity vectors).
+
+### Modes (same pattern as v2)
+
+```bash
+# Full pipeline: pack → train → export → build → validate
+./train_cnn_v3_full.sh
+
+# Train only (dataset already packed)
+./train_cnn_v3_full.sh --skip-pack
+
+# Validate only (skip training, use existing weights)
+./train_cnn_v3_full.sh --validate
+./train_cnn_v3_full.sh --validate checkpoints/checkpoint_epoch_100.pth
+
+# Export weights only
+./train_cnn_v3_full.sh --export-only checkpoints/checkpoint_epoch_100.pth
+
+# Pack dataset only (run once after new Blender renders or photos)
+./train_cnn_v3_full.sh --pack-only
+```
+
+### Pipeline steps
+
+```
+[1/5] Pack dataset pack_blender_sample.py / pack_photo_sample.py
+[2/5] Train train_cnn_v3.py
+[3/5] Export weights export_cnn_v3_weights.py → .bin + test vectors .npz
+[4/5] Build demo cmake --build build -j4 --target demo64k
+[5/5] Validate cnn_v3_test on all input images + parity check
+```
+
+Step 1 is skipped with `--skip-pack` (dataset already exists).
+Steps 3–5 can be run independently with `--export-only` / `--validate`.
+
+### Parameters
+
+**New vs v2:**
+
+| Flag | Default | Notes |
+|------|---------|-------|
+| `--enc-channels C` | `4,8` | Comma-separated encoder channel counts per level |
+| `--film-cond-dim N` | `5` | FiLM conditioning vector size |
+| `--input-mode MODE` | `simple` | `simple` (photo) or `full` (Blender G-buffer) |
+| `--channel-dropout-p F` | `0.3` | Dropout probability for geometric channels |
+| `--blender-dir DIR` | `training/blender_renders/` | Source EXRs for full mode |
+| `--photos-dir DIR` | `training/photos/` | Source PNGs for simple mode |
+| `--generate-vectors` | off | Also run `validate_parity.py` during export step |
+| `--skip-pack` | off | Skip dataset packing (step 1) |
+
+**Kept from v2 unchanged:**
+
+| Flag | Default |
+|------|---------|
+| `--epochs N` | 200 |
+| `--batch-size N` | 16 |
+| `--lr FLOAT` | 1e-3 |
+| `--checkpoint-every N` | 50 |
+| `--patch-size N` | 8 |
+| `--patches-per-image N` | 256 |
+| `--detector TYPE` | harris |
+| `--full-image` | off |
+| `--image-size N` | 256 |
+| `--input DIR` | `training/dataset/` |
+| `--target DIR` | `training/dataset/` (same — target is inside sample dirs) |
+| `--checkpoint-dir DIR` | `checkpoints/` |
+| `--validation-dir DIR` | `validation_results/` |
+| `--output-weights PATH` | `cnn_v3/weights/cnn_v3_weights.bin` |
+
+### Examples
+
+```bash
+# Quick debug run: 1 level, 5 epochs, simple photos
+./train_cnn_v3_full.sh --enc-channels 4,4 --epochs 5 --input-mode simple
+
+# Full Blender pipeline: 500 epochs, channel dropout, generate parity vectors
+./train_cnn_v3_full.sh \
+ --input-mode full \
+ --blender-dir training/blender_renders/ \
+ --enc-channels 4,8 \
+ --epochs 500 \
+ --channel-dropout-p 0.3 \
+ --generate-vectors
+
+# Re-validate existing weights without retraining
+./train_cnn_v3_full.sh --validate
+
+# Export only and open results
+./train_cnn_v3_full.sh --export-only checkpoints/checkpoint_epoch_200.pth \
+ --generate-vectors
+```
+
+### Validation output (step 5)
+
+Same pattern as v2: runs `cnn_v3_test` on each image in `--input`, writes
+`validation_results/<name>_output.png`, opens the folder.
+
+If `--generate-vectors` was passed during export: also runs `validate_parity.py`,
+prints per-implementation max error table:
+
+```
+Parity results:
+ HTML vs PyTorch: max=0.0039 mean=0.0008 ✓ PASS (threshold=0.0039)
+ C++ vs PyTorch: max=0.0039 mean=0.0007 ✓ PASS
+```
+
+---
+
+## WGSL Implementation
+
+**Compute shader approach** (same as v2, extended):
+
+```
+Pass 0: pack_gbuffer.wgsl — assemble G-buffer channels into storage texture
+Pass 1: cnn_v3_enc0.wgsl — encoder level 0 (20→4ch, 3×3)
+Pass 2: cnn_v3_enc1.wgsl — encoder level 1 (4→8ch, 3×3) + downsample
+Pass 3: cnn_v3_bottleneck.wgsl — bottleneck (8→8, 1×1)
+Pass 4: cnn_v3_dec1.wgsl — decoder level 1: upsample + skip + (16→4, 3×3)
+Pass 5: cnn_v3_dec0.wgsl — decoder level 0: upsample + skip + (8→4, 3×3)
+Pass 6: cnn_v3_output.wgsl — sigmoid + composite to framebuffer
+```
+
+FiLM γ/β values are computed CPU-side each frame and uploaded as a small uniform buffer.
+
+**Uniform: FiLM params (per-frame)**
+```wgsl
+struct FilmParams {
+    // γ/β for every FiLM site, packed vec4-aligned in a fixed order
+    // (enc0, enc1, dec1, dec0)
+    params: array<vec4f, 24>,  // room for 96 floats
+}
+// ~96 floats × 4 bytes = 384 bytes uniform buffer (well within limits)
+```
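
The CPU-side upload amounts to flattening the per-level γ/β arrays into a tight little-endian f32 buffer; a sketch (`film_uniform_bytes` is a hypothetical helper):

```python
import numpy as np

def film_uniform_bytes(gammas, betas):
    """Flatten per-level γ/β arrays into a little-endian f32 buffer,
    padded to a multiple of 16 bytes (WebGPU uniform alignment)."""
    vals = np.concatenate([np.concatenate([g, b]) for g, b in zip(gammas, betas)])
    raw = vals.astype('<f4').tobytes()
    return raw + b'\x00' * ((-len(raw)) % 16)
```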
+
+---
+
+## HTML Validation Tool: `cnn_v3/tools/index.html`
+
+**Base:** copy `cnn_v2/tools/cnn_v2_test/index.html`, adapt in-place.
+Single self-contained HTML file, no build step, open directly in browser.
+
+---
+
+### What is reused from v2 unchanged
+
+- Full CSS (drop zones, panels, layer-view grid, console, footer)
+- WebGPU init boilerplate (adapter, device, queue)
+- Drop zone + file input JS
+- `FULLSCREEN_QUAD_VS` vertex shader
+- Display / blit shader (output to canvas)
+- Layer viz shader (grayscale 4-channel split + 4× zoom)
+- Weight stats display (min/max per layer)
+- Video playback controls (play/pause, step frame)
+- Save PNG button, blend slider
+- Console logging
+
+---
+
+### Layout changes
+
+**Left sidebar** (replaces v2 left sidebar):
+```
+[ Drop .bin weights ]
+[ Weights Info panel ] ← same, but shows U-Net topology
+[ Weights Viz panel ] ← same, shows enc0/enc1/bottleneck/dec layers
+[ Input Mode toggle ] ← NEW: Simple (photo) / Full (G-buffer)
+[ FiLM Conditioning panel ] ← NEW: beat_phase, audio_intensity, style_p0, style_p1 sliders
+[ Temporal panel ] ← NEW: "Use temporal" toggle, "Capture prev frame" button
+```
+
+**Main canvas** (mostly same):
+```
+[ bottom float bar ]
+ Video controls | Blend | View mode | G-buffer channel | Save PNG
+```
+View modes (keyboard): `SPACE` = original, `D` = diff×10, `G` = G-buffer channel view.
+G-buffer channel selector: albedo / normal.xy / depth / depth_grad / shadow / transp / prev.
+
+**Right sidebar** (replaces v2 layer viz):
+```
+[ Layer Visualization panel ]
+ Buttons: Features | Enc0 | Enc1 | BN | Dec1 | Dec0 | Output
+ 4-channel grid (or 8-channel grid for Enc1/BN, shown as 2 rows)
+ Zoom view (4×, mouse-driven)
+```
+"Features" button shows the 20-channel feature buffer split across 5 rows of 4.
+
+---
+
+### Input modes
+
+**Simple mode (default):** drop one PNG or video.
+- Albedo = image RGB
+- Alpha → `transp = 1 − alpha` (if RGBA PNG)
+- All geometric channels (normal, depth, depth_grad, mat_id) = 0
+- Shadow = 1.0 (fully lit)
+- Prev = black (or captured from previous render)
+- Mip1/mip2 computed from albedo in PACK_SHADER
+
+**Full mode:** drop multiple PNGs by filename convention.
+The tool detects channel assignment by filename:
+```
+*albedo* or *color* → albedo (RGB)
+*normal* → normal (RG oct-encoded)
+*depth* → depth (R, 16-bit PNG or EXR)
+*matid* or *index* → mat_id (R u8)
+*shadow* → shadow (R u8)
+*transp* or *alpha* → transparency (R u8)
+```
+Drop all files at once (or one-by-one). Missing channels stay zero.
+Status bar shows which channels are loaded.
+
+---
+
+### New WGSL shaders (inline, same pattern as v2)
+
+| Shader | Replaces | Notes |
+|--------|----------|-------|
+| `PACK_SHADER` | `STATIC_SHADER` | 20ch into feat_tex0 + feat_tex1 (rgba32uint each) |
+| `ENC0_SHADER` | part of `CNN_SHADER` | Conv(20→4, 3×3) + FiLM + ReLU; writes enc0_tex |
+| `ENC1_SHADER` | | Conv(4→8, 3×3) + FiLM + ReLU + avg_pool2×2; writes enc1_tex (half-res) |
+| `BOTTLENECK_SHADER` | | Conv(8→8, 1×1) + FiLM + ReLU; writes bn_tex |
+| `DEC1_SHADER` | | nearest upsample×2 + concat(bn, enc1_skip) + Conv(16→4, 3×3) + FiLM + ReLU |
+| `DEC0_SHADER` | | nearest upsample×2 + concat(dec1, enc0_skip) + Conv(8→4, 3×3) + FiLM + ReLU |
+| `OUTPUT_SHADER` | | Conv(4→4, 1×1) + sigmoid → composites to canvas |
+
+FiLM γ/β computed JS-side from sliders (tiny MLP forward pass in JS), uploaded as uniform.
+
+---
+
+### Textures (GPU-side, all rgba32uint or rgba16float)
+
+| Name | Size | Format | Contents |
+|------|------|--------|----------|
+| `feat_tex0` | W×H | rgba32uint | feature buffer slots 0–7 (f16) |
+| `feat_tex1` | W×H | rgba32uint | feature buffer slots 8–19 (u8+spare) |
+| `enc0_tex` | W×H | rgba32uint | 4 channels f16 (enc0 output, skip) |
+| `enc1_tex` | W/2×H/2 | rgba32uint | 8 channels f16 (enc1 out, skip) — 8 × f16 fill one texel |
+| `bn_tex` | W/2×H/2 | rgba32uint | 8 channels f16 (bottleneck output) |
+| `dec1_tex` | W×H | rgba32uint | 4 channels f16 (dec1 output) |
+| `dec0_tex` | W×H | rgba32uint | 4 channels f16 (dec0 output) |
+| `prev_tex` | W×H | rgba8unorm | previous CNN output (temporal) |
+
+Skip connections: enc0_tex and enc1_tex are **kept alive** across the full forward pass
+(not ping-ponged away). DEC1 and DEC0 read them directly.
+
+---
+
+### Parity test mode
+
+Drop an NPZ file (from `validate_parity.py`) to activate:
+- Loads `feat`, `prev`, `cond`, `expected` arrays
+- Runs full forward pass on the packed features
+- Computes per-pixel per-channel absolute error vs `expected`
+- Reports: max error, mean error, pass/fail (threshold = 1/255)
+- Shows error map on canvas (amplified ×10, same as diff mode)
+
+---
+
+### File size estimate
+
+| Component | Approx size |
+|-----------|-------------|
+| HTML/CSS (reused) | ~4 KB |
+| JS logic (reused + new) | ~15 KB |
+| PACK_SHADER | ~1.5 KB |
+| ENC/DEC shaders (×6) | ~9 KB |
+| Display/viz shaders (reused) | ~3 KB |
+| **Total** | **~33 KB** |
+
+---
+
+### Usage
+
+```bash
+open cnn_v3/tools/index.html
+# or
+python3 -m http.server 8000
+# → http://localhost:8000/cnn_v3/tools/
+```
+
+---
+
+## Implementation Checklist
+
+Ordered for parallel execution where possible. Phases 1 and 2 are independent.
+
+**Architecture locked:** enc_channels = [4, 8]. See Size Budget for weight counts.
+
+---
+
+### Phase 0 — Stub G-buffer (unblocks everything else)
+
+Minimal compute pass, no real geometry. Lets CNN v3 be developed and trained
+before the real G-buffer exists. Wire the real G-buffer in Phase 7.
+
+- [ ] `src/effects/cnn_v3_stub_gbuf.wgsl` — compute shader:
+ - albedo = sample current framebuffer (RGBA)
+ - normal.xy = (0.5, 0.5) — neutral, pointing toward camera
+ - depth = 0.5 — constant mid-range
+ - depth_grad.xy = 0, 0
+ - mat_id = 0, prev.rgb = 0, shadow = 1.0, transp = 0.0
+ - mip1/mip2 sampled from framebuffer via `textureSampleLevel`
+ - writes feat_tex0 + feat_tex1 (2 × rgba32uint)
+- [ ] Wire into `CNNv3Effect::render()` as pass 0 (swapped out later for real G-buffer)
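+
+A NumPy sketch of the stub's per-pixel feature values (the channel order here
+is illustrative — the actual texel layout is defined by the pack pass, and the
+MIP channels are omitted):
+
+```python
+import numpy as np
+
+def stub_features(framebuffer: np.ndarray) -> np.ndarray:
+    """Build a 20-channel stub feature tensor from an HxWx4 RGBA
+    framebuffer in [0, 1], using the constants listed above."""
+    h, w, _ = framebuffer.shape
+    feat = np.zeros((h, w, 20), dtype=np.float32)
+    feat[..., 0:4] = framebuffer   # albedo = current framebuffer
+    feat[..., 4:6] = 0.5           # normal.xy — neutral, toward camera
+    feat[..., 6] = 0.5             # depth — constant mid-range
+    # depth_grad.xy, mat_id, prev.rgb stay zero
+    feat[..., 13] = 1.0            # shadow = 1.0
+    # transp stays 0.0; mip1/mip2 would be sampled from the framebuffer
+    return feat
+```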
+
+---
+
+### Phase 1 — Training infrastructure (parallel with Phase 2)
+
+**1a. PyTorch model**
+- [ ] `cnn_v3/training/train_cnn_v3.py`
+ - [ ] `CNNv3` class: U-Net [4,8], FiLM MLP (5→16→48), channel dropout
+ - [ ] `GBufferDataset`: loads 20-channel feature tensors from packed PNGs
+ - [ ] Training loop, checkpointing, grayscale/RGBA loss option
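+
+The FiLM MLP is tiny (5→16→48) and cheap enough to run per frame. A NumPy
+sketch of the forward pass with illustrative random weights (real weights
+come from the checkpoint/.bin):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# 5 -> 16 -> 48: cond vector in, gamma/beta for the FiLM'd levels out.
+W1 = rng.normal(size=(16, 5)).astype(np.float32)
+b1 = np.zeros(16, dtype=np.float32)
+W2 = rng.normal(size=(48, 16)).astype(np.float32)
+b2 = np.zeros(48, dtype=np.float32)
+
+def film_params(cond: np.ndarray) -> np.ndarray:
+    """cond: 5-vector (beat/audio/manual) -> 48 raw gamma/beta values,
+    split per level downstream."""
+    h = np.maximum(W1 @ cond + b1, 0.0)  # hidden ReLU
+    return W2 @ h + b2
+
+def film(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray) -> np.ndarray:
+    """Feature-wise linear modulation: per-channel scale and shift
+    on an HxWxC feature map."""
+    return gamma[None, None, :] * x + beta[None, None, :]
+```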
+
+**1b. Data preparation**
+- [ ] `cnn_v3/training/pack_photo_sample.py` — photo PNG → feat tensor (albedo + zeros)
+- [ ] `cnn_v3/training/pack_blender_sample.py` — multi-layer EXR → packed channel PNGs
+- [ ] `cnn_v3/training/blender_export.py` — headless Blender multi-pass render script
+ - passes: DiffCol, Normal, Z, IndexOB, Shadow, Alpha, Combined (target)
+
+**1c. Export and parity**
+- [ ] `cnn_v3/training/export_cnn_v3_weights.py` — checkpoint → binary v3 .bin (f16)
+- [ ] `cnn_v3/training/validate_parity.py`
+ - [ ] Generate test vectors (4 cases: full_static, simple_static, temporal, zero)
+ - [ ] Compare PyTorch f32 vs HTML WebGPU and C++ outputs
+ - [ ] Report max/mean error per channel, pass/fail at 1/255
+
+**1d. Pipeline script**
+- [ ] `cnn_v3/scripts/train_cnn_v3_full.sh` — pack → train → export → build → validate
+ - all flags from v2 + `--enc-channels`, `--film-cond-dim`, `--input-mode`, `--channel-dropout-p`, `--generate-vectors`, `--skip-pack`
+
+---
+
+### Phase 2 — WGSL shaders (parallel with Phase 1)
+
+All shaders: explicit zero-pad (not clamp), nearest-neighbor upsample,
+no batch norm at inference, `#include` existing snippets where possible.
+
+**2a. Pack pass** (replaces stub in Phase 0 when real G-buffer exists)
+- [ ] `src/effects/cnn_v3_pack.wgsl` — full 20-channel packer
+ - `#include "camera_common"` for depth linearization
+ - reads albedo MIPs via `textureSampleLevel(..., 1.0)` and `(..., 2.0)`
+ - reads prev_cnn_tex (persistent RGBA8 owned by effect)
+ - reads depth32float, normal, shadow, transp G-buffer textures
+ - computes depth_grad (finite diff), oct-encodes normal if needed
+ - writes feat_tex0 (f16×8) + feat_tex1 (u8×12, spare)
+
+**2b. U-Net compute shaders**
+- [ ] `src/effects/cnn_v3_enc0.wgsl` — Conv(20→4, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_enc1.wgsl` — Conv(4→8, 3×3) + FiLM + ReLU + avg_pool 2×2
+- [ ] `src/effects/cnn_v3_bottleneck.wgsl` — Conv(8→8, 1×1) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_dec1.wgsl` — nearest upsample×2 + concat enc1_skip + Conv(16→4, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_dec0.wgsl` — nearest upsample×2 + concat enc0_skip + Conv(8→4, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_output.wgsl` — Conv(4→4, 1×1) + sigmoid → composite to framebuffer
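+
+The per-pass resolutions and channel counts can be traced in a few lines —
+bookkeeping only, matching the texture table above (`unet_shapes` is a
+hypothetical helper, not project code):
+
+```python
+def unet_shapes(w: int, h: int) -> list:
+    """Output (name, width, height, channels) for each of the six
+    passes, given a WxH input and the locked [4, 8] channel config."""
+    return [
+        ("enc0",       w,      h,      4),  # Conv(20->4, 3x3)
+        ("enc1",       w // 2, h // 2, 8),  # Conv(4->8, 3x3) + pool 2x2
+        ("bottleneck", w // 2, h // 2, 8),  # Conv(8->8, 1x1)
+        ("dec1",       w,      h,      4),  # upsample + skip -> Conv(16->4)
+        ("dec0",       w,      h,      4),  # skip -> Conv(8->4)
+        ("output",     w,      h,      4),  # Conv(4->4, 1x1) + sigmoid
+    ]
+```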
+
+Reuse from existing shaders:
+- `pack2x16float` / `unpack2x16float` pattern (from CNN v2 shaders)
+- `pack4x8unorm` / `unpack4x8unorm` for feat_tex1
+
+**2c. Register shaders**
+- [ ] Add all shaders to `workspaces/main/assets.txt`
+- [ ] Add externs to `src/effects/shaders.h` + `src/effects/shaders.cc`
+
+---
+
+### Phase 3 — C++ effect
+
+- [ ] `src/effects/cnn_v3_effect.h` — class declaration
+ - textures: feat_tex0, feat_tex1, enc0_tex, enc1_tex (half-res), bn_tex (half-res), dec1_tex, dec0_tex
+ - **`WGPUTexture prev_cnn_tex_`** — persistent RGBA8, owned by effect, initialized black
+ - `FilmParams` uniform buffer (γ/β for 4 levels = 48 floats = 192 bytes)
+ - FiLM MLP weights (loaded from .bin, run CPU-side per frame)
+
+- [ ] `src/effects/cnn_v3_effect.cc` — implementation
+ - [ ] Constructor: create all textures at render resolution
+ - [ ] `render()`: 7-pass dispatch: stub_gbuf (or real) → enc0 → enc1 → bn → dec1 → dec0 → output
+ - [ ] Per-frame: run FiLM MLP (CPU), upload FilmParams uniform
+ - [ ] **After output pass: blit output → `prev_cnn_tex_`** (one GPU copy, cheap)
+ - [ ] `resize()`: recreate resolution-dependent textures (enc1/bn are half-res)
+
+- [ ] `cmake/DemoSourceLists.cmake` — add `cnn_v3_effect.cc` to COMMON_GPU_EFFECTS
+- [ ] `src/gpu/demo_effects.h` — add `#include "effects/cnn_v3_effect.h"`
+- [ ] `workspaces/main/timeline.seq` — add `EFFECT + CNNv3Effect`
+
+---
+
+### Phase 4 — Test scene (rotating cubes + fog SDF → G-buffer)
+
+Provides a real G-buffer for visual validation before the production G-buffer exists.
+Replaces the stub when ready.
+
+**4a. Raster G-buffer pass** (MRT)
+- [ ] `src/effects/cnn_v3_scene_raster.wgsl`
+ - Based on `src/effects/rotating_cube.wgsl`
+ - Fragment outputs: `@location(0)` albedo rgba16float, `@location(1)` normal+matid rg16float
+ - Depth: hardware depth32float
+ - mat_id from push constant / uniform (per-draw-call object index)
+
+**4b. Fog SDF pass** (compute)
+- [ ] `src/effects/cnn_v3_scene_sdf.wgsl`
+ - `#include "render/raymarching_id"` — provides `object_id` → mat_id
+ - `#include "render/shadows"` — `calc_shadow()` → shadow channel
+ - `#include "math/sdf_shapes"` — sdBox, sdSphere for fog/cube SDFs
+ - `#include "camera_common"` — ray setup
+ - Reads rasterized depth32float, overwrites G-buffer textures where SDF wins
+ - Writes transparency channel (volumetric fog density)
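+
+"Overwrite where SDF wins" is a per-texel depth test. A NumPy sketch of the
+rule (hypothetical helper, not the shader):
+
+```python
+import numpy as np
+
+def composite_sdf(gbuf_depth, sdf_depth, gbuf_albedo, sdf_albedo):
+    """Keep G-buffer values except where the SDF hit is closer;
+    depth arrays are HxW, albedo arrays HxWxC."""
+    closer = sdf_depth < gbuf_depth
+    depth = np.where(closer, sdf_depth, gbuf_depth)
+    albedo = np.where(closer[..., None], sdf_albedo, gbuf_albedo)
+    return depth, albedo
+```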
+
+**4c. C++ wrapper**
+- [ ] `src/effects/cnn_v3_scene_effect.h/.cc` — `CNNv3SceneEffect`
+ - Owns G-buffer textures (albedo rgba16float, normal_mat rg16float, depth32float, shadow r8unorm, transp r16float)
+ - Pass 1: raster rotating cubes → MRT
+ - Pass 2: SDF fog compute → overwrite where closer
+ - Pass 3: lighting/shadow pass
+ - Outputs are bound as inputs to `CNNv3Effect`'s pack pass
+- [ ] `cmake/DemoSourceLists.cmake` — add `.cc`
+- [ ] `src/gpu/demo_effects.h` — add include
+
+---
+
+### Phase 5 — C++ test
+
+Separate from v1/v2 tests. Uses `CNNv3SceneEffect` + `CNNv3Effect` together.
+
+- [ ] `src/tests/gpu/test_cnn_v3.cc`
+ - [ ] Scene renders (stub G-buffer + real scene G-buffer)
+ - [ ] CNN v3 forward pass with random/identity weights
+ - [ ] Prev frame blit verified (frame 0 → frame 1 temporal path)
+ - [ ] FiLM conditioning: verify different cond vectors produce different outputs
+ - [ ] Shader compilation (all 7 passes)
+- [ ] `cmake/DemoTests.cmake` — add test target
+
+---
+
+### Phase 6 — HTML validation tool
+
+- [ ] Copy `cnn_v2/tools/cnn_v2_test/index.html` → `cnn_v3/tools/index.html`
+- [ ] Replace `STATIC_SHADER` → `PACK_SHADER` (feat_tex0 + feat_tex1, mixed f16/u8)
+- [ ] Replace `CNN_SHADER` → 6 U-Net shaders (ENC0/ENC1/BN/DEC1/DEC0/OUTPUT)
+- [ ] Input mode toggle (Simple/Full) + filename-based channel detection
+- [ ] FiLM conditioning sliders + JS MLP forward pass (tiny, runs in JS)
+- [ ] Temporal: "capture prev frame" button + "use temporal" toggle
+- [ ] Layer viz: U-Net hierarchy buttons (Features/Enc0/Enc1/BN/Dec1/Dec0/Output)
+- [ ] G-buffer channel view (`G` key cycles: albedo/normal/depth/shadow/transp)
+- [ ] Parity test mode: drop NPZ → run → max error report + error map
+
+---
+
+### Phase 7 — Parity validation
+
+- [ ] Train model on photo samples (`--input-mode simple`, 200 epochs)
+- [ ] Export weights + generate test vectors (`--generate-vectors`)
+- [ ] HTML tool: drop .bin + test image → verify visual output
+- [ ] `validate_parity.py`: HTML vs PyTorch ≤ 1/255, C++ vs PyTorch ≤ 1/255
+- [ ] All 4 test cases pass: full_static, simple_static, temporal, zero_input
+- [ ] Wire `CNNv3SceneEffect` G-buffer into `CNNv3Effect` (replace stub)
+
+---
+
+### Phase 8 — Production G-buffer (future)
+
+Wire the real hybrid renderer G-buffer (GEOM_BUFFER.md) into CNNv3Effect,
+replacing `CNNv3SceneEffect`. Train on Blender full-pipeline samples.
+
+---
+
+## Differences from CNN v2
+
+| | CNN v2 | CNN v3 |
+|---|---|---|
+| Architecture | Flat N-layer chain | U-Net encoder/decoder |
+| Input | RGBD + positional enc | 20ch feature buffer (G-buffer + temporal + MIPs + shadow + transp.) |
+| Style control | Static (post-train) | FiLM: runtime γ/β from audio/beat |
+| Skip connections | None | Encoder→decoder concat |
+| Multi-scale | No | Yes (2 levels) |
+| Testability | HTML + C++ (informal) | Strict: test vectors, per-pixel tolerance |
+| Training data | Input/output image pairs | G-buffer render passes (Blender or photo) |
+| Weights | ~3.2 KB | ~3.4 KB (similar) |
+
+---
+
+## References
+
+- **FiLM:** "FiLM: Visual Reasoning with a General Conditioning Layer" (Perez et al., 2018)
+- **U-Net:** "U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015)
+- **G-Buffer design:** `doc/archive/GEOM_BUFFER.md`
+- **CNN v2 reference:** `cnn_v2/docs/CNN_V2.md`
+- **Binary format base:** `cnn_v2/docs/CNN_V2_BINARY_FORMAT.md`
+- **Effect workflow:** `doc/EFFECT_WORKFLOW.md`
+
+---
+
+**Document Version:** 1.0
+**Created:** 2026-03-19
+**Status:** Design phase — G-buffer prerequisite pending