Diffstat (limited to 'cnn_v3/docs')
-rw-r--r-- cnn_v3/docs/HOW_TO_CNN.md | 816
1 file changed, 816 insertions, 0 deletions
diff --git a/cnn_v3/docs/HOW_TO_CNN.md b/cnn_v3/docs/HOW_TO_CNN.md
new file mode 100644
index 0000000..8c41ab0
--- /dev/null
+++ b/cnn_v3/docs/HOW_TO_CNN.md
@@ -0,0 +1,816 @@
+# CNN v3 — Complete Pipeline Playbook
+
+U-Net + FiLM style-transfer pipeline: data collection → training → export → C++ integration → demo → parity test → HTML tool.
+
+---
+
+## Table of Contents
+
+1. [Overview](#0-overview)
+2. [Collecting Training Samples](#1-collecting-training-samples)
+ - [1a. From Real Photos](#1a-from-real-photos)
+ - [1b. From Blender (Full G-Buffer)](#1b-from-blender-full-g-buffer)
+ - [1c. Dataset Layout](#1c-dataset-layout)
+3. [Training the U-Net + FiLM](#2-training-the-u-net--film)
+4. [Exporting Weights](#3-exporting-weights)
+5. [Wiring into CNNv3Effect (C++)](#4-wiring-into-cnnv3effect-c)
+6. [Running a Demo](#5-running-a-demo)
+7. [Parity Testing](#6-parity-testing)
+8. [HTML WebGPU Tool](#7-html-webgpu-tool)
+9. [Appendix A — File Reference](#appendix-a--file-reference)
+10. [Appendix B — 20-Channel Feature Layout](#appendix-b--20-channel-feature-layout)
+
+---
+
+## 0. Overview
+
+CNN v3 is a 2-level U-Net with FiLM conditioning, designed to run in real time as a WebGPU compute effect inside the demo.
+
+**Architecture:**
+
+```
+Input: 20-channel G-buffer feature textures (rgba32uint)
+ │
+ enc0 ──── Conv(20→4, 3×3) + FiLM + ReLU ┐ full res
+ │ ↘ skip │
+ enc1 ──── AvgPool2×2 + Conv(4→8, 3×3) + FiLM ┐ ½ res
+ │ ↘ skip │
+ bottleneck AvgPool2×2 + Conv(8→8, 1×1) + ReLU ¼ res (no FiLM)
+ │ │
+ dec1 ←── upsample×2 + cat(enc1 skip) + Conv(16→4, 3×3) + FiLM
+ │ │ ½ res
+ dec0 ←── upsample×2 + cat(enc0 skip) + Conv(8→4, 3×3) + FiLM + sigmoid
+ full res → RGBA output
+```
+
+**FiLM MLP:** `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net.
+- Input: `[beat_phase, beat_norm, audio_intensity, style_p0, style_p1]`
+- Output: 40 γ/β values controlling style across all 4 FiLM layers
+
+**Weight budget:** ~3.9 KB f16 (fits ≤6 KB target)
+
+**Two data paths:**
+- **Simple mode** — real photos with zeroed geometric channels (normal, depth, matid)
+- **Full mode** — Blender G-buffer renders with all 20 channels populated
+
+**Pipeline summary:**
+
+```
+photos/Blender → pack → dataset/ → train_cnn_v3.py → checkpoint.pth
+ │
+ export_cnn_v3_weights.py
+ ┌─────────┴──────────┐
+ cnn_v3_weights.bin cnn_v3_film_mlp.bin
+ │
+ CNNv3Effect::upload_weights()
+ │
+ demo / HTML tool
+```
+
+---
+
+## 1. Collecting Training Samples
+
+Each sample is a directory containing 7 PNG files. The dataloader discovers samples by scanning for directories containing `albedo.png`.
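
The discovery rule can be sketched in a few lines. This is a hypothetical helper, not the shipped dataloader, but it mirrors the stated scanning behaviour:

```python
from pathlib import Path

def find_samples(root: str) -> list:
    """Return every directory under root that contains an albedo.png,
    mirroring the dataloader's sample-discovery rule."""
    return sorted(p.parent for p in Path(root).rglob("albedo.png"))
```

Directories without `albedo.png` (even if they contain other PNGs) are ignored.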
+
+### 1a. From Real Photos
+
+**What it does:** Converts one photo into a sample with zeroed geometric channels.
+The network handles this correctly because channel-dropout training (§2, *Channel dropout*)
+teaches it to work with or without geometry data.
+
+**Step 1 — Pack the photo:**
+```bash
+cd cnn_v3/training
+python3 pack_photo_sample.py \
+ --photo /path/to/photo.png \
+ --output dataset/simple/sample_001/
+```
+
+**What gets written:**
+
+| File | Content | Notes |
+|------|---------|-------|
+| `albedo.png` | Photo RGB uint8 | Source image |
+| `normal.png` | (128, 128, 0) uint8 | Neutral "no normal" → reconstructed (0,0,1) |
+| `depth.png` | All zeros uint16 | No depth data |
+| `matid.png` | All zeros uint8 | No material IDs |
+| `shadow.png` | 255 everywhere uint8 | Assume fully lit |
+| `transp.png` | 1 − alpha uint8 | 0 = opaque |
+| `target.png` | Copy of photo RGBA | **Placeholder — must be replaced** |
+
+**Step 2 — Provide a styled target:**
+
+`target.png` defaults to the input photo (identity style). You must replace it with
+your stylized ground truth before training:
+
+```bash
+cp my_stylized_version.png dataset/simple/sample_001/target.png
+```
+
+The network learns the mapping `albedo → target`. If target = albedo, the network
+learns identity (useful as sanity check, not for real training).
+
+**Batch packing:**
+```bash
+for f in photos/*.png; do
+ name=$(basename "${f%.png}")
+ python3 pack_photo_sample.py --photo "$f" \
+ --output dataset/simple/sample_${name}/
+done
+```
+
+**Pitfalls:**
+- Input must be RGB or RGBA; grayscale photos need `.convert('RGB')` first
+- `normal.png` B channel is always 0 (unused); only R and G channels carry oct-encoded XY
+- `mip1`/`mip2` are computed on-the-fly by the dataloader — not stored
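
As a sanity check on the neutral-normal convention, here is a minimal octahedral-encoding sketch. The exact variant (hemisphere fold, rounding) is an assumption of this sketch; the shipped packer may differ in details:

```python
import numpy as np

def oct_encode(n):
    """Octahedral-encode a unit normal to (x, y) in [0,1].
    The straight-up normal (0,0,1) maps to (0.5, 0.5), i.e. uint8 (128,128)."""
    n = np.asarray(n, dtype=np.float32)
    n = n / np.abs(n).sum()                  # project onto the octahedron
    xy = n[:2].copy()
    if n[2] < 0:                             # fold the lower hemisphere
        xy = (1.0 - np.abs(xy[::-1])) * np.where(xy >= 0.0, 1.0, -1.0)
    return xy * 0.5 + 0.5
```

This is why `(128, 128, 0)` is the correct "no normal" placeholder: it decodes back to the straight-up normal `(0, 0, 1)`.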
+
+---
+
+### 1b. From Blender (Full G-Buffer)
+
+Produces all 20 feature channels including normals, depth, mat IDs, and shadow.
+
+#### Blender requirements
+
+- Blender 3.x+, Cycles render engine
+- Object indices set: *Properties → Object → Relations → Object Index* must be > 0
+ for objects you want tracked in `matid` (IndexOB pass)
+
+#### Step 1 — Render EXRs
+
+```bash
+blender -b scene.blend -P cnn_v3/training/blender_export.py -- \
+ --output /tmp/renders/frame_### \
+ --width 640 --height 360 \
+ --start-frame 1 --end-frame 200
+```
+
+The `--` separator is **required**; arguments after it are passed to the Python script,
+not to Blender. The `###` run in `--output` is replaced by the zero-padded frame number.
+
+**Available flags:**
+
+| Flag | Default | Notes |
+|------|---------|-------|
+| `--output PATH` | `//renders/frame_###` | `//` = blend file directory; `###` = frame padding |
+| `--width N` | 640 | Render resolution |
+| `--height N` | 360 | Render resolution |
+| `--start-frame N` | scene start | First frame |
+| `--end-frame N` | scene end | Last frame |
+
+**Render pass → CNN channel mapping:**
+
+| Blender pass | EXR channels | CNN use |
+|-------------|-------------|---------|
+| Combined | `.R .G .B .A` | `target.png` (beauty, sRGB-converted) |
+| DiffCol | `.R .G .B` | `albedo.png` (linear → sRGB gamma 2.2) |
+| Normal | `.X .Y .Z` | `normal.png` (world-space, oct-encoded to RG) |
+| Z | `.R` | `depth.png` (mapped as 1/(z+1) → uint16) |
+| IndexOB | `.R` | `matid.png` (object index, clamped uint8) |
+| Shadow | `.R` | `shadow.png` (255 = lit, 0 = shadowed) |
+| Combined alpha | `.A` | `transp.png` (inverted: 0 = opaque) |
+
+**Pitfall:** Blender `Normal` pass uses `.X .Y .Z` channel names in the EXR, not `.R .G .B`.
+`pack_blender_sample.py` handles both naming conventions automatically.
+
+#### Step 2 — Pack EXRs into sample directories
+
+```bash
+python3 cnn_v3/training/pack_blender_sample.py \
+ --exr /tmp/renders/frame_0001.exr \
+ --output dataset/full/sample_0001/
+```
+
+**Dependencies:** `pip install openexr` (preferred) or `pip install imageio[freeimage]`
+
+**Batch packing:**
+```bash
+for exr in /tmp/renders/frame_*.exr; do
+ name=$(basename "${exr%.exr}")
+ python3 pack_blender_sample.py --exr "$exr" \
+ --output dataset/full/${name}/
+done
+```
+
+**What gets written:**
+
+| File | Source | Transform |
+|------|--------|-----------|
+| `albedo.png` | DiffCol pass | Linear → sRGB (γ=2.2), uint8 |
+| `normal.png` | Normal pass | XYZ unit → octahedral RG, uint8 |
+| `depth.png` | Z pass | 1/(z+1) normalized, uint16 |
+| `matid.png` | IndexOB pass | Clamped [0,255], uint8 |
+| `shadow.png` | Shadow pass | uint8 (255=lit) |
+| `transp.png` | Combined alpha | 1−alpha, uint8 |
+| `target.png` | Combined beauty | Linear → sRGB, RGBA uint8 |
+
+**Note:** `depth_grad`, `mip1`, `mip2` are computed on-the-fly by the dataloader. `prev.rgb` is always zero during training (no temporal history for static frames).
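
The depth transform from the table can be sketched as follows (the exact rounding behaviour is an assumption of this sketch):

```python
import numpy as np

def encode_depth(z):
    """Map camera-space depth z >= 0 to uint16 via d = 1/(z+1).
    z=0 (nearest) gives 65535; d -> 0 as z -> infinity."""
    d = 1.0 / (np.asarray(z, dtype=np.float32) + 1.0)
    return np.round(d * 65535.0).astype(np.uint16)
```

The mapping concentrates precision near the camera, which is where stylization detail matters most.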
+
+**Pitfalls:**
+- `DiffCol` pass not found → warning printed, albedo zeroed (not fatal; training continues)
+- `IndexOB` all zero if Object Index not set in Blender object properties
+- Alpha convention: Blender alpha=1 means opaque; `transp.png` inverts this (transp=0 opaque)
+- `Shadow` pass in Cycles must be explicitly enabled in Render Properties → Passes → Effects
+
+---
+
+### 1c. Dataset Layout
+
+```
+dataset/
+ simple/ ← photo samples, use --input-mode simple
+ sample_001/
+ albedo.png
+ normal.png
+ depth.png
+ matid.png
+ shadow.png
+ transp.png
+ target.png ← must be replaced with stylized target
+ sample_002/
+ ...
+ full/ ← Blender samples, use --input-mode full
+ sample_0001/
+ sample_0002/
+ ...
+```
+
+- If the `simple/` or `full/` subdir is absent, the dataloader scans the root directly
+- Minimum viable dataset: 1 sample (smoke test only); practical minimum ~50+ for training
+- You can mix Blender and photo samples in the same subdir; the dataloader treats them identically
+
+---
+
+## 2. Training the U-Net + FiLM
+
+The U-Net conv weights and FiLM MLP train **jointly** in a single run. No separate steps.
+
+### Prerequisites
+
+```bash
+pip install torch torchvision pillow numpy opencv-python
+cd cnn_v3/training
+```
+
+### Quick-start commands
+
+**Smoke test — 1 epoch, validates end-to-end without GPU:**
+```bash
+python3 train_cnn_v3.py --input dataset/ --epochs 1 \
+ --patch-size 32 --detector random
+```
+
+**Standard photo training (patch-based):**
+```bash
+python3 train_cnn_v3.py \
+ --input dataset/ \
+ --input-mode simple \
+ --epochs 200
+```
+
+**Blender G-buffer training:**
+```bash
+python3 train_cnn_v3.py \
+ --input dataset/ \
+ --input-mode full \
+ --epochs 200
+```
+
+**Full-image mode (better global coherence, slower):**
+```bash
+python3 train_cnn_v3.py \
+ --input dataset/ \
+ --input-mode full \
+ --full-image --image-size 256 \
+ --epochs 500
+```
+
+### Flag reference
+
+| Flag | Default | Notes |
+|------|---------|-------|
+| `--input DIR` | `training/dataset` | Dataset root; always set explicitly |
+| `--input-mode` | `simple` | `simple`=photos, `full`=Blender G-buffer |
+| `--epochs N` | 200 | 500 recommended for full-image mode |
+| `--batch-size N` | 16 | Reduce to 4–8 on GPU OOM |
+| `--lr F` | 1e-3 | Reduce to 1e-4 if loss oscillates or NaN |
+| `--patch-size N` | 64 | Smaller = faster epoch, less spatial context |
+| `--patches-per-image N` | 256 | Reduce for small datasets |
+| `--detector` | `harris` | `random` for smoke tests; `shi-tomasi` as alternative |
+| `--channel-dropout-p F` | 0.3 | Lower if all samples have geometry (Blender only) |
+| `--full-image` | off | Resize full image instead of patch crops |
+| `--image-size N` | 256 | Resize target; only used with `--full-image` |
+| `--enc-channels` | `4,8` | Must match C++ constants if changed |
+| `--film-cond-dim N` | 5 | Must match `CNNv3FiLMParams` field count in C++ |
+| `--checkpoint-dir DIR` | `checkpoints/` | Set per-experiment |
+| `--checkpoint-every N` | 50 | 0 to disable intermediate checkpoints |
+
+### Architecture at startup
+
+The model prints its parameter count:
+```
+Model: enc=[4, 8] film_cond_dim=5 params=2097 (~3.9 KB f16)
+```
+
+If `params` is much higher, `--enc-channels` was changed; update C++ constants accordingly.
+
+### FiLM joint training
+
+The conditioning vector `cond` is **randomised per sample** during training:
+```python
+cond = np.random.rand(5).astype(np.float32) # uniform [0,1]^5
+```
+This covers the full input space so the MLP is well-conditioned for any beat/audio/style
+combination. At inference, the real values are fed from `set_film_params()`.
+
+### Channel dropout
+
+Applied per-sample to make the model robust to missing channels:
+
+| Channel group | Channels | Drop probability |
+|---------------|----------|-----------------|
+| Geometric | normal.xy, depth, depth_grad.xy [3,4,5,6,7] | `channel_dropout_p` (default 0.3) |
+| Context | mat_id, shadow, transp [8,18,19] | `channel_dropout_p × 0.67` (~0.2) |
+| Temporal | prev.rgb [9,10,11] | 0.5 (always) |
+
+This is why a model trained on Blender data also works on photos (geometry zeroed).
+To disable dropout for a pure-Blender model: `--channel-dropout-p 0`.
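
A minimal NumPy sketch of this dropout scheme (the real implementation lives in `cnn_v3_utils.py` and may differ in details):

```python
import numpy as np

GEOMETRIC = [3, 4, 5, 6, 7]    # normal.xy, depth, depth_grad.xy
CONTEXT   = [8, 18, 19]        # mat_id, shadow, transp
TEMPORAL  = [9, 10, 11]        # prev.rgb

def channel_dropout(feat, p=0.3, rng=np.random):
    """Zero whole channel groups of a (20, H, W) feature tensor with the
    probabilities from the table above. Returns a modified copy."""
    feat = feat.copy()
    if rng.rand() < p:
        feat[GEOMETRIC] = 0.0
    if rng.rand() < p * 0.67:
        feat[CONTEXT] = 0.0
    if rng.rand() < 0.5:          # temporal channels always drop at 0.5
        feat[TEMPORAL] = 0.0
    return feat
```

Note the albedo and mip channels are never dropped; they are the minimum signal the network can always rely on.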
+
+### Checkpoints
+
+Saved as `.pth` at `checkpoints/checkpoint_epoch_N.pth`.
+
+Contents of each checkpoint:
+- `epoch` — epoch number
+- `model_state_dict` — all weights (conv + FiLM MLP)
+- `optimizer_state_dict` — Adam state (not needed for export)
+- `loss` — final avg batch loss
+- `config` — `{enc_channels, film_cond_dim, input_mode}` — **required by export script**
+
+The final checkpoint is always written even if `--checkpoint-every 0`.
+
+### Diagnosing training problems
+
+| Symptom | Likely cause | Fix |
+|---------|-------------|-----|
+| `RuntimeError: No samples found` | Wrong `--input` or missing `albedo.png` | Check dataset path |
+| Loss stuck at epoch 1 | Dataset too small | Add more samples |
+| Loss NaN from epoch 1 | Learning rate too high | Use `--lr 1e-4` |
+| CUDA OOM | Batch or patch too large | `--batch-size 4 --patch-size 32` |
+| Loss oscillates | LR too high late in training | Use `--lr 1e-4` or cosine schedule |
+| Loss drops then plateaus | Too few samples | Add more or use `--full-image` |
+
+---
+
+## 3. Exporting Weights
+
+Converts a trained `.pth` checkpoint to two raw binary files for the C++ runtime.
+
+```bash
+cd cnn_v3/training
+python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth
+# writes to export/ by default
+
+python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth \
+ --output /path/to/assets/
+```
+
+### Output files
+
+**`cnn_v3_weights.bin`** — conv+bias weights for all 5 passes, packed as f16-pairs-in-u32:
+
+| Layer | f16 count | Bytes |
+|-------|-----------|-------|
+| enc0 Conv(20→4,3×3)+bias | 724 | 1448 |
+| enc1 Conv(4→8,3×3)+bias | 296 | 592 |
+| bottleneck Conv(8→8,1×1)+bias | 72 | 144 |
+| dec1 Conv(16→4,3×3)+bias | 580 | 1160 |
+| dec0 Conv(8→4,3×3)+bias | 292 | 584 |
+| **Total** | **1964 f16** | **3928 bytes** |
+
+**`cnn_v3_film_mlp.bin`** — FiLM MLP weights as raw f32, row-major:
+
+| Layer | Shape | f32 count |
+|-------|-------|-----------|
+| L0 weight | (16, 5) | 80 |
+| L0 bias | (16,) | 16 |
+| L1 weight | (40, 16) | 640 |
+| L1 bias | (40,) | 40 |
+| **Total** | | **776 f32 = 3104 bytes** |
+
+The FiLM MLP is for CPU-side inference (future; see §4). The U-Net weights in
+`cnn_v3_weights.bin` are what you need immediately.
+
+### f16 packing format
+
+WGSL `get_w(buf, base, idx)` reads: `pair = buf[(base+idx)/2]`.
+- Even index → low 16 bits of u32
+- Odd index → high 16 bits of u32
+
+The export script produces this layout: `u32 = u16[0::2] | (u16[1::2] << 16)`.
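
A NumPy sketch of the same packing, mirroring the convention just described (function name is illustrative):

```python
import numpy as np

def pack_f16_pairs(values):
    """Pack an even-length float sequence into u32 words as f16 pairs:
    even index -> low 16 bits, odd index -> high 16 bits."""
    u16 = np.asarray(values, np.float16).view(np.uint16).astype(np.uint32)
    return u16[0::2] | (u16[1::2] << 16)
```

Unpacking index `idx` then reads word `idx // 2` and selects the low or high half, exactly as `get_w()` does in WGSL.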
+
+### Expected output
+
+```
+Checkpoint: epoch=200 loss=0.012345
+ enc_channels=[4, 8] film_cond_dim=5
+
+cnn_v3_weights.bin
+ 1964 f16 values → 982 u32 → 3928 bytes
+ Upload via CNNv3Effect::upload_weights(queue, data, 3928)
+
+cnn_v3_film_mlp.bin
+ L0: weight (16, 5) + bias (16,)
+ L1: weight (40, 16) + bias (40,)
+ 776 f32 values → 3104 bytes
+```
+
+### Pitfalls
+
+- **`enc_channels` mismatch:** if you changed `--enc-channels` during training, the layer size
+ assertion in the export script fires. The C++ weight-offset constants (`kEnc0Weights` etc.)
+ in `cnn_v3_effect.cc` must also be updated to match.
+- **Old checkpoint missing `config`:** if `config` key is absent (checkpoint from a very early
+ version), the script defaults to `enc_channels=[4,8], film_cond_dim=5`.
+- **`weights_only=True`:** requires PyTorch ≥ 2.0. If you get a warning, upgrade torch.
+
+---
+
+## 4. Wiring into CNNv3Effect (C++)
+
+### Class overview
+
+`CNNv3Effect` (in `cnn_v3/src/cnn_v3_effect.h/.cc`) implements the `Effect` base class.
+It owns:
+- 5 compute pipelines (enc0, enc1, bottleneck, dec1, dec0)
+- 5 params uniform buffers with per-pass `weight_offset` + FiLM γ/β
+- 1 shared storage buffer `weights_buf_` (~4 KB, read-only across all shaders)
+
+### Wiring in a `.seq` file
+
+```
+SEQUENCE 0 0 "Scene with CNN v3"
+ EFFECT + GBufferEffect prev_cnn -> gbuf_feat0 gbuf_feat1 0 60
+ EFFECT + CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
+```
+
+Or direct C++:
+```cpp
+#include "cnn_v3/src/cnn_v3_effect.h"
+
+auto cnn = std::make_shared<CNNv3Effect>(
+ ctx,
+ /*inputs=*/ {"gbuf_feat0", "gbuf_feat1"},
+ /*outputs=*/{"cnn_output"},
+ /*start=*/0.0f, /*end=*/60.0f);
+```
+
+### Uploading weights
+
+Load `cnn_v3_weights.bin` once at startup, before the first `render()`:
+
+```cpp
+// Read binary file
+std::vector<uint8_t> data;
+{
+ std::ifstream f("cnn_v3_weights.bin", std::ios::binary | std::ios::ate);
+ data.resize(f.tellg());
+ f.seekg(0);
+ f.read(reinterpret_cast<char*>(data.data()), data.size());
+}
+
+// Upload to GPU
+cnn->upload_weights(ctx.queue, data.data(), (uint32_t)data.size());
+```
+
+Before `upload_weights()`: all conv weights are zero, so output is `sigmoid(0) = 0.5` gray.
+After: output reflects trained style.
+
+### Setting FiLM parameters each frame
+
+Call before `render()` each frame:
+
+```cpp
+CNNv3FiLMParams fp;
+fp.beat_phase = params.beat_phase; // 0-1 within current beat
+fp.beat_norm = params.beat_time / 8.0f; // normalized 8-beat cycle
+fp.audio_intensity = params.audio_intensity; // peak audio level [0,1]
+fp.style_p0 = my_style_p0; // user-defined style param
+fp.style_p1 = my_style_p1;
+cnn->set_film_params(fp);
+cnn->render(encoder, params, nodes);
+```
+
+**Current `set_film_params` behaviour (placeholder):** applies a hardcoded linear mapping —
+audio modulates gamma, beat modulates beta. This is a heuristic until `cnn_v3_film_mlp.bin`
+is integrated as a CPU-side MLP.
+
+**Future MLP inference** (when integrating `cnn_v3_film_mlp.bin`):
+1. Load `cnn_v3_film_mlp.bin` → 4 matrices/biases in f32
+2. Run forward pass: `h = relu(cond @ L0_W.T + L0_b); out = h @ L1_W.T + L1_b`
+3. Split `out[40]` into per-layer γ/β and write into the Params structs directly
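
Until that lands, the exported file can be exercised from Python to validate its layout. A reference sketch, assuming the §3 layout of `cnn_v3_film_mlp.bin`:

```python
import numpy as np

def film_mlp_forward(weights, cond):
    """Forward pass over the raw f32 layout from section 3:
    L0 W(16,5), L0 b(16), L1 W(40,16), L1 b(40) = 776 floats total."""
    w = np.asarray(weights, np.float32)
    assert w.size == 776, "unexpected cnn_v3_film_mlp.bin size"
    L0_W, L0_b = w[:80].reshape(16, 5), w[80:96]
    L1_W, L1_b = w[96:736].reshape(40, 16), w[736:776]
    h = np.maximum(0.0, L0_W @ np.asarray(cond, np.float32) + L0_b)
    return L1_W @ h + L1_b        # 40 gamma/beta values
```

Load the file with `np.fromfile("cnn_v3_film_mlp.bin", dtype=np.float32)` and pass any 5-element conditioning vector.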
+
+### Uniform struct layout (for debugging)
+
+`CnnV3Params4ch` (enc0, dec1, dec0 — 64 bytes):
+```
+offset 0: weight_offset u32
+offset 4-31: padding (vec3u has align=16 in WGSL)
+offset 32: gamma[4] vec4f
+offset 48: beta[4] vec4f
+```
+
+`CnnV3ParamsEnc1` (enc1 — 96 bytes): same header, then `gamma_lo/hi` at 32/48, `beta_lo/hi` at 64/80.
+
+Static asserts in `cnn_v3_effect.h` verify exact sizes; a compile failure here means the
+WGSL layout diverged from the C++ struct.
+
+### Intermediate node names
+
+Internal textures are named `<output[0]>_enc0`, `_enc1`, `_bottleneck`, `_dec1`.
+These are declared in `declare_nodes()` at the correct fractional resolutions (W/2, W/4).
+Do not reference them from outside the effect unless debugging.
+
+### Pitfalls
+
+- **`upload_weights` size mismatch:** the call is a raw `wgpuQueueWriteBuffer`. If the `.bin`
+ was generated with different `enc_channels`, inference silently corrupts. Always verify sizes match.
+- **`set_film_params` must be called before `render()`** each frame; stale shadow copies from
+ the previous frame persist otherwise.
+- **GBufferEffect must precede CNNv3Effect** in the same command encoder.
+- **Bind groups are rebuilt each `render()`** — node texture views may change on resize.
+
+---
+
+## 5. Running a Demo
+
+### Build
+
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(nproc)
+./build/demo
+```
+
+### Expected visual output
+
+| Weights state | FiLM state | Expected output |
+|---------------|-----------|-----------------|
+| Not uploaded (zero) | any | Uniform gray (all channels ≈ 0.5) |
+| Uploaded | Identity (γ=1, β=0) | Stylization from conv weights only |
+| Uploaded | Varying beat_phase | Per-channel gamma/beta shift visible |
+| Uploaded | Full audio + beat | Full dynamic style modulation |
+
+### Sanity checks
+
+1. **Black output:** GBufferEffect likely didn't run. Confirm it precedes CNNv3Effect and
+ that `set_scene()` was called.
+
+2. **Uniform gray:** weights not uploaded. Check file path and that `upload_weights` was
+ called before the first `render()`.
+
+3. **Correct but static style:** `set_film_params` may be called with constant zeros.
+ Animate `beat_phase` 0→1 to verify FiLM response.
+
+4. **Resolution artefacts at enc1/bottleneck boundaries:** check that `W` and `H` are
+ divisible by 4 (required by the 2-level pooling chain).
+
+---
+
+## 6. Parity Testing
+
+The parity test validates that WGSL shaders produce bit-accurate results vs. the
+Python/NumPy reference implementation in `gen_test_vectors.py`.
+
+### Build and run
+
+```bash
+cmake -B build -DDEMO_BUILD_TESTS=ON
+cmake --build build -j4
+cd build && ./test_cnn_v3_parity
+```
+
+Two tests run:
+1. **Zero-weight test:** all conv weights zero → output must equal `sigmoid(0) = 0.5`
+ (deterministic, no reference vectors needed)
+2. **Random-weight test:** random weights from fixed seed=42 applied to an 8×8 test
+ tensor → WGSL output compared against Python-computed reference values
+
+### Pass criteria
+
+Tolerance: **max absolute error ≤ 1/255 = 3.92e-3** (one ULP in uint8 space)
+
+Current results (8×8 tensors):
+```
+enc0 max_err = 1.95e-3 ✓
+dec1 max_err = 1.95e-3 ✓
+final max_err = 4.88e-4 ✓
+```
+
+### Regenerating test vectors
+
+If you change `gen_test_vectors.py` or need to refresh the seed:
+
+```bash
+cd cnn_v3/training
+python3 gen_test_vectors.py --header > ../test_vectors.h
+```
+
+Then recompile the parity test. The `--header` flag emits the C header to stdout; everything
+else (self-test results) goes to stderr.
+
+### Parity rules baked into the shaders
+
+If results drift after shader edits, verify these invariants match the Python reference:
+
+| Rule | WGSL | Python (`gen_test_vectors.py`) |
+|------|------|-------------------------------|
+| Border padding | zero-pad (not clamp) | `np.pad(..., mode='constant')` |
+| Downsampling | AvgPool 2×2 exact | `0.25 * sum of 4 neighbours` |
+| Upsampling | `coord / 2` integer | `min(y//2, qH-1)` nearest |
+| Skip connections | channel concatenation | `np.concatenate([up, skip], axis=2)` |
+| FiLM application | after conv+bias, before ReLU | `max(0, γ·x + β)` |
+| Weight layout | OIHW, biases appended | `o * IN * K² + i * K² + ky*K + kx` |
+| f16 quantisation | rgba16float / rgba32uint boundaries | `np.float16(out).astype(np.float32)` |
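
For reference, the zero-pad convolution rule looks like this in NumPy. This is a simplified sketch of what `gen_test_vectors.py` computes; FiLM and activation are applied by the caller:

```python
import numpy as np

def conv3x3_zeropad(x, w, b):
    """3x3 convolution with zero border padding (not clamp).
    x: (H, W, Cin) input, w: (Cout, Cin, 3, 3) OIHW weights, b: (Cout,)."""
    H, W, Cin = x.shape
    Cout = w.shape[0]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode='constant')
    out = np.zeros((H, W, Cout), np.float32)
    for o in range(Cout):
        acc = np.full((H, W), b[o], np.float32)
        for i in range(Cin):
            for ky in range(3):
                for kx in range(3):
                    acc += w[o, i, ky, kx] * xp[ky:ky + H, kx:kx + W, i]
        out[..., o] = acc
    return out
```

If a shader edit drifts, comparing its output against a loop like this one usually localises which rule was broken.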
+
+### Pitfalls
+
+- **Test fails on null/headless backend:** the test requires a real GPU (Dawn/wgpu).
+ Will error early if the WebGPU device cannot be created.
+- **Consistent failure on random-weight test only:** `test_vectors.h` is out of sync.
+ Regenerate with `gen_test_vectors.py --header` and recompile.
+- **Consistent failure on both tests:** shader logic diverged from the parity rules above.
+
+---
+
+## 7. HTML WebGPU Tool
+
+### Current state
+
+There is no dedicated CNN v3 HTML tool yet.
+The CNN v2 tool (`cnn_v2/tools/cnn_v2_test/index.html`) is the reference pattern.
+
+### CNN v2 tool as reference
+
+The v2 tool is a single self-contained HTML file demonstrating:
+- Inline WGSL shaders (no build step)
+- Drag-and-drop `.bin` weight loading
+- Image/video file input
+- Intermediate layer visualisation
+- View modes: CNN output / original / diff×10
+- Side panel with per-layer weight statistics
+
+A v3 tool would follow the same pattern with a more complex texture chain.
+
+### What a CNN v3 HTML tool requires
+
+**WGSL shaders to inline** (resolve `#include "cnn_v3/common"` via JS string substitution):
+
+```js
+const common = `/* contents of cnn_v3_common.wgsl */`;
+const enc0_src = enc0_template.replace('#include "cnn_v3/common"', common);
+```
+
+**Texture chain:**
+
+| Texture | Format | Size |
+|---------|--------|------|
+| feat_tex0 (input) | rgba32uint | W × H |
+| feat_tex1 (input) | rgba32uint | W × H |
+| enc0_tex | rgba16float | W × H |
+| enc1_tex | rgba32uint | W/2 × H/2 |
+| bottleneck_tex | rgba32uint | W/4 × H/4 |
+| dec1_tex | rgba16float | W/2 × H/2 |
+| output_tex | rgba16float | W × H |
+
+`rgba32uint` textures cannot be sampled; use `textureLoad` — already done in the shaders.
+
+**Weight loading:**
+
+```js
+const resp = await fetch('cnn_v3_weights.bin');
+const buf = await resp.arrayBuffer();
+const gpu_buf = device.createBuffer({
+ size: buf.byteLength,
+ usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
+});
+device.queue.writeBuffer(gpu_buf, 0, buf);
+```
+
+**FiLM MLP inference (JS-side):**
+
+```js
+// Load cnn_v3_film_mlp.bin as Float32Array
+const mlp = new Float32Array(await (await fetch('cnn_v3_film_mlp.bin')).arrayBuffer());
+const L0_W = mlp.subarray(0, 80); // (16×5) row-major
+const L0_b = mlp.subarray(80, 96);
+const L1_W = mlp.subarray(96, 736); // (40×16) row-major
+const L1_b = mlp.subarray(736, 776);
+
+function mlp_forward(cond5) {
+ // h = relu(L0_W @ cond + L0_b)
+ const h = new Float32Array(16);
+ for (let o = 0; o < 16; o++) {
+ let s = L0_b[o];
+ for (let i = 0; i < 5; i++) s += L0_W[o * 5 + i] * cond5[i];
+ h[o] = Math.max(0, s);
+ }
+ // out = L1_W @ h + L1_b
+ const out = new Float32Array(40);
+ for (let o = 0; o < 40; o++) {
+ let s = L1_b[o];
+ for (let i = 0; i < 16; i++) s += L1_W[o * 16 + i] * h[i];
+ out[o] = s;
+ }
+ return out; // [γenc0×4, βenc0×4, γenc1×8, βenc1×8, γdec1×4, βdec1×4, γdec0×4, βdec0×4]
+}
+```
+
+The 40 outputs are split into per-layer γ/β and uploaded to the 5 params uniform buffers
+before each compute dispatch.
+
+**Input feature assembly from a photo:**
+
+For simple (photo-only) mode, build `feat_tex0` and `feat_tex1` from the image data:
+- `feat_tex0`: pack albedo RGB (f16×3), normal XY (128,128 neutral → 0.0 in oct), depth (0), depth_grad (0,0) as `pack2x16float` into rgba32uint
+- `feat_tex1`: pack mat_id (0), prev.rgb (0,0,0), mip1.rgb, mip2.rgb, shadow (1.0), transp (0) as `pack4x8unorm` into rgba32uint
+
+See `cnn_v3/shaders/gbuf_pack.wgsl` for the exact packing layout (mirrors `GBufferEffect`).
+
+### Serving locally
+
+Chrome requires a real HTTP server for WebGPU (not `file://`):
+
+```bash
+python3 -m http.server 8080
+# Open: http://localhost:8080/cnn_v3/tools/cnn_v3_test/index.html
+```
+
+### Browser requirements
+
+- Chrome 113+ with WebGPU enabled (default on desktop)
+- Firefox Nightly with `dom.webgpu.enabled = true`
+- Required features: check `device.features.has('shader-f16')` for f16 support;
+ fall back to f32 accumulation if absent
+
+### Pitfalls
+
+- `rgba32uint` requires `STORAGE` + `TEXTURE_BINDING` usage flags; missing either causes bind group creation failure
+- WGSL `#include "cnn_v3/common"` must be resolved via JS string replace before passing to `device.createShaderModule()`
+- Workgroup dispatch: `Math.ceil(W / 8)` × `Math.ceil(H / 8)` — same formula as C++
+- Cross-origin image loading requires CORS headers or same-origin hosting
+
+---
+
+## Appendix A — File Reference
+
+| File | Purpose |
+|------|---------|
+| `cnn_v3/training/blender_export.py` | Configure Blender Cycles passes, render multi-layer EXR |
+| `cnn_v3/training/pack_blender_sample.py` | EXR → sample PNG directory (7 files) |
+| `cnn_v3/training/pack_photo_sample.py` | Photo → zeroed-geometry sample directory |
+| `cnn_v3/training/cnn_v3_utils.py` | Dataset class, feature assembly, channel dropout, salient-point detection |
+| `cnn_v3/training/train_cnn_v3.py` | CNNv3 model definition, training loop, CLI |
+| `cnn_v3/training/export_cnn_v3_weights.py` | Checkpoint → `cnn_v3_weights.bin` + `cnn_v3_film_mlp.bin` |
+| `cnn_v3/training/gen_test_vectors.py` | NumPy reference forward pass + C header generator |
+| `cnn_v3/test_vectors.h` | Compiled-in test vectors (auto-generated, do not edit) |
+| `cnn_v3/src/cnn_v3_effect.h` | C++ class, Params structs, `CNNv3FiLMParams` API |
+| `cnn_v3/src/cnn_v3_effect.cc` | Effect implementation: pipelines, render, weight upload |
+| `cnn_v3/src/gbuffer_effect.h/.cc` | GBufferEffect: rasterise + pack G-buffer feature textures |
+| `src/tests/gpu/test_cnn_v3_parity.cc` | Per-pixel parity test (WGSL vs. Python reference) |
+| `cnn_v3/docs/CNN_V3.md` | Full architecture spec (U-Net, FiLM, WGSL uniform layouts) |
+| `cnn_v2/tools/cnn_v2_test/index.html` | HTML tool reference pattern (v2) |
+
+---
+
+## Appendix B — 20-Channel Feature Layout
+
+| Index | Channel | Source | Encoding |
+|-------|---------|--------|----------|
+| 0–2 | albedo.rgb | `albedo.png` | f32 [0,1] |
+| 3–4 | normal.xy | `normal.png` RG | oct-encoded f32 [0,1] |
+| 5 | depth | `depth.png` | f32 [0,1] (1/(z+1)) |
+| 6–7 | depth_grad.xy | computed from depth | central diff, signed |
+| 8 | mat_id | `matid.png` | f32 [0,1] |
+| 9–11 | prev.rgb | previous frame output | zero during training |
+| 12–14 | mip1.rgb | pyrdown(albedo) | f32 [0,1] |
+| 15–17 | mip2.rgb | pyrdown(mip1) | f32 [0,1] |
+| 18 | shadow | `shadow.png` | f32 [0,1] (1=lit) |
+| 19 | transp | `transp.png` | f32 [0,1] (0=opaque) |
+
+**Feature texture packing** (`feat_tex0` / `feat_tex1`, both `rgba32uint`):
+
+```
+feat_tex0 (4×u32 = 8 f16 channels via pack2x16float):
+ .x = pack2x16float(albedo.r, albedo.g)
+ .y = pack2x16float(albedo.b, normal.x)
+ .z = pack2x16float(normal.y, depth)
+ .w = pack2x16float(dgrad.x, dgrad.y)
+
+feat_tex1 (4×u32 = 12 u8 channels + padding via pack4x8unorm):
+ .x = pack4x8unorm(mat_id, prev.r, prev.g, prev.b)
+ .y = pack4x8unorm(mip1.r, mip1.g, mip1.b, mip2.r)
+ .z = pack4x8unorm(mip2.g, mip2.b, shadow, transp)
+ .w = 0 (unused, 8 reserved channels)
+```