# CNN v3 — Complete Pipeline Playbook
U-Net + FiLM style-transfer pipeline: data collection → training → export → C++ integration → demo → parity test → HTML tool.
---
## Table of Contents
1. [Overview](#0-overview)
2. [Collecting Training Samples](#1-collecting-training-samples)
- [1a. From Real Photos](#1a-from-real-photos)
- [1b. From Blender (Full G-Buffer)](#1b-from-blender-full-g-buffer)
- [1c. Dataset Layout](#1c-dataset-layout)
3. [Training the U-Net + FiLM](#2-training-the-u-net--film)
4. [Exporting Weights](#3-exporting-weights)
5. [Wiring into CNNv3Effect (C++)](#4-wiring-into-cnnv3effect-c)
6. [Running a Demo](#5-running-a-demo)
7. [Parity Testing](#6-parity-testing)
8. [HTML WebGPU Tool](#7-html-webgpu-tool)
9. [Appendix A — File Reference](#appendix-a--file-reference)
10. [Appendix B — 20-Channel Feature Layout](#appendix-b--20-channel-feature-layout)
---
## 0. Overview
CNN v3 is a 2-level U-Net with FiLM conditioning, designed to run in real-time as a WebGPU compute effect inside the demo.
**Architecture:**
```
Input: 20-channel G-buffer feature textures (rgba32uint)
│
enc0 ──── Conv(20→4, 3×3) + FiLM + ReLU ┐ full res
│ ↘ skip │
enc1 ──── AvgPool2×2 + Conv(4→8, 3×3) + FiLM ┐ ½ res
│ ↘ skip │
bottleneck AvgPool2×2 + Conv(8→8, 1×1) + ReLU ¼ res (no FiLM)
│ │
dec1 ←── upsample×2 + cat(enc1 skip) + Conv(16→4, 3×3) + FiLM
│ │ ½ res
dec0 ←── upsample×2 + cat(enc0 skip) + Conv(8→4, 3×3) + FiLM + sigmoid
full res → RGBA output
```
**FiLM MLP:** `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net.
- Input: `[beat_phase, beat_norm, audio_intensity, style_p0, style_p1]`
- Output: 40 γ/β values controlling style across all 4 FiLM layers
**Weight budget:** ~3.9 KB f16 (fits ≤6 KB target)
**Two data paths:**
- **Simple mode** — real photos with zeroed geometric channels (normal, depth, matid)
- **Full mode** — Blender G-buffer renders with all 20 channels populated
**Pipeline summary:**
```
photos/Blender → pack → dataset/ → train_cnn_v3.py → checkpoint.pth
│
export_cnn_v3_weights.py
┌─────────┴──────────┐
cnn_v3_weights.bin cnn_v3_film_mlp.bin
│
CNNv3Effect::upload_weights()
│
demo / HTML tool
```
---
## 1. Collecting Training Samples
Each sample is a directory containing 7 PNG files. The dataloader discovers samples by scanning for directories containing `albedo.png`.
### 1a. From Real Photos
**What it does:** Converts one photo into a sample with zeroed geometric channels.
The network handles this correctly because channel-dropout training (§2e) teaches it
to work with or without geometry data.
**Step 1 — Pack an input/target pair with `gen_sample`:**
```bash
cd cnn_v3/training
./gen_sample.sh /path/to/photo.png /path/to/stylized.png dataset/simple/sample_001/
```
`gen_sample.sh` is the recommended one-shot wrapper: it calls `pack_photo_sample.py` with both `--photo` and `--target` in a single step.
**What gets written:**
| File | Content | Notes |
|------|---------|-------|
| `albedo.png` | Photo RGB uint8 | Source image |
| `normal.png` | (128, 128, 0) uint8 | Neutral "no normal" → reconstructed (0,0,1) |
| `depth.png` | All zeros uint16 | No depth data |
| `matid.png` | All zeros uint8 | No material IDs |
| `shadow.png` | 255 everywhere uint8 | Assume fully lit |
| `transp.png` | 1 − alpha uint8 | 0 = opaque |
| `target.png` | Stylized target RGBA | Ground truth for training |
**Step 2 — Verify the target:**
The network learns the mapping `albedo → target`. If you pass the same image as both
input and target, the network learns identity (useful as sanity check, not for real
training). Confirm `target.png` looks correct before running training.
**Alternative — pack without a target yet:**
```bash
python3 pack_photo_sample.py \
--photo /path/to/photo.png \
--output dataset/simple/sample_001/
# target.png defaults to a copy of the input; replace it before training:
cp my_stylized_version.png dataset/simple/sample_001/target.png
```
**Batch packing:**
```bash
for f in photos/*.png; do
name=$(basename "${f%.png}")
./gen_sample.sh "$f" "targets/${name}_styled.png" \
dataset/simple/sample_${name}/
done
```
**Pitfalls:**
- Input must be RGB or RGBA; grayscale photos need `.convert('RGB')` first
- `normal.png` B channel is always 0 (unused); only R and G channels carry oct-encoded XY
- `mip1`/`mip2` are computed on-the-fly by the dataloader — not stored
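The octahedral encoding mentioned above can be sketched in Python. This is the standard octahedral mapping; the exact rounding used by `pack_photo_sample.py` may differ slightly:

```python
import numpy as np

def oct_encode(n):
    """Unit normal (x, y, z) -> two uint8 channels (R, G)."""
    n = np.asarray(n, dtype=np.float32)
    n = n / np.abs(n).sum()          # project onto the L1 unit octahedron
    xy = n[:2]
    if n[2] < 0.0:                   # fold the lower hemisphere over the upper
        xy = (1.0 - np.abs(xy[::-1])) * np.where(xy >= 0.0, 1.0, -1.0)
    return np.round((xy * 0.5 + 0.5) * 255.0).astype(np.uint8)

def oct_decode(rg):
    """Two uint8 channels -> approximate unit normal."""
    xy = rg.astype(np.float32) / 255.0 * 2.0 - 1.0
    z = 1.0 - np.abs(xy).sum()
    if z < 0.0:                      # undo the hemisphere fold
        xy = (1.0 - np.abs(xy[::-1])) * np.where(xy >= 0.0, 1.0, -1.0)
    n = np.array([xy[0], xy[1], z], dtype=np.float32)
    return n / np.linalg.norm(n)
```

Note that the neutral value `(128, 128)` from the table decodes to (approximately) the straight-up normal `(0, 0, 1)`, which is why zeroed-geometry photo samples use it.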
---
### 1b. From Blender (Full G-Buffer)
Produces all 20 feature channels including normals, depth, mat IDs, and shadow.
#### Blender requirements
- **Blender 4.5 LTS** recommended (`blender4` alias); 5.x also works
- Cycles render engine (set automatically by the script)
- Object indices set: *Properties → Object → Relations → Object Index* > 0
for objects you want tracked in `matid` (IndexOB pass)
#### Step 1 — Render EXRs
```bash
cd cnn_v3/training
blender4 -b input_3d/scene.blend -P blender_export.py -- \
--output tmp/renders/frames \
--width 640 --height 360 \
--start-frame 1 --end-frame 200 \
--view-layer RenderLayer
```
The `--` separator is **required**. `blender_export.py` uses native
`OPEN_EXR_MULTILAYER` render output — all enabled passes are written
automatically. One file per frame: `{output}/0001.exr`, `0002.exr`, …
Render progress (`Fra:/Mem:` spam) is suppressed; per-frame status goes to stderr.
**Available flags:**
| Flag | Default | Notes |
|------|---------|-------|
| `--output PATH` | `//renders/` | Output directory; `//` = blend file directory |
| `--width N` | 640 | Render resolution |
| `--height N` | 360 | Render resolution |
| `--start-frame N` | scene start | First frame |
| `--end-frame N` | scene end | Last frame |
| `--view-layer NAME` | first layer | View layer name; pass `?` to list available layers |
**Render pass → EXR channel → CNN file:**
| Blender pass | Native EXR channel | CNN file |
|-------------|--------------------|---------|
| Combined | `Combined.R/G/B/A` | `target.png` (beauty, linear→sRGB) |
| DiffCol | `DiffCol.R/G/B` | `albedo.png` (linear→sRGB γ2.2) |
| Normal | `Normal.X/Y/Z` | `normal.png` (oct-encoded RG) |
| Depth | `Depth.Z` | `depth.png` (1/(z+1) → uint16) |
| IndexOB | `IndexOB.X` | `matid.png` (object index, uint8) |
| Shadow | `Shadow.X` | `shadow.png` (255=lit; defaults to 255 if absent) |
| Combined alpha | `Combined.A` | `transp.png` (1−alpha, 0=opaque) |
**Note on Shadow pass:** Blender's Cycles Shadow pass may be absent for scenes
without shadow-casting lights or catcher objects; `pack_blender_sample.py` defaults
to 1.0 (fully lit) when the channel is missing.
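The scalar transforms from the table above can be sketched as follows (illustrative only; `pack_blender_sample.py` may clamp or round slightly differently):

```python
import numpy as np

def encode_depth(z):
    """Camera-space depth z in [0, inf) -> uint16 via 1/(z+1).

    z=0 maps to 65535 (near); large z approaches 0 (far).
    """
    d = 1.0 / (np.asarray(z, dtype=np.float32) + 1.0)
    return np.round(d * 65535.0).astype(np.uint16)

def encode_transp(alpha):
    """Combined alpha (1 = opaque) -> transp uint8 (0 = opaque)."""
    t = 1.0 - np.asarray(alpha, dtype=np.float32)
    return np.round(t * 255.0).astype(np.uint8)
```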
#### Step 2 — Pack EXRs into sample directories
```bash
python3 cnn_v3/training/pack_blender_sample.py \
--exr /tmp/renders/frame_0001.exr \
--output dataset/full/sample_0001/
```
**Dependencies:** `pip install openexr` (preferred) or `pip install imageio[freeimage]`
**Batch packing:**
```bash
for exr in /tmp/renders/frame_*.exr; do
name=$(basename "${exr%.exr}")
python3 pack_blender_sample.py --exr "$exr" \
--output dataset/full/${name}/
done
```
**What gets written:**
| File | Source | Transform |
|------|--------|-----------|
| `albedo.png` | DiffCol pass | Linear → sRGB (γ=2.2), uint8 |
| `normal.png` | Normal pass | XYZ unit → octahedral RG, uint8 |
| `depth.png` | Z pass | 1/(z+1) normalized, uint16 |
| `matid.png` | IndexOB pass | Clamped [0,255], uint8 |
| `shadow.png` | Shadow pass | uint8 (255=lit) |
| `transp.png` | Combined alpha | 1−alpha, uint8 |
| `target.png` | Combined beauty | Linear → sRGB, RGBA uint8 |
**Note:** `depth_grad`, `mip1`, `mip2` are computed on-the-fly by the dataloader. `prev.rgb` is always zero during training (no temporal history for static frames).
**Pitfalls:**
- `DiffCol` pass not found → warning printed, albedo zeroed (not fatal; training continues)
- `IndexOB` all zero if Object Index not set in Blender object properties
- Alpha convention: Blender alpha=1 means opaque; `transp.png` inverts this (transp=0 opaque)
- `Shadow` pass in Cycles must be explicitly enabled in Render Properties → Passes → Effects
#### Full example (Old House scene, 2 frames)
**Step 1 — Render:**
```bash
cd cnn_v3/training
blender4 -b input_3d/"Old House 2 3D Models.blend" -P blender_export.py -- \
--output tmp/renders/frames \
--width 640 --height 360 \
--start-frame 1 --end-frame 2 \
--view-layer RenderLayer
```
**Step 2 — Pack:**
```bash
cd cnn_v3/training
for exr in tmp/renders/frames/*.exr; do
name=$(basename "${exr%.exr}")
python3 pack_blender_sample.py \
--exr "$exr" \
--output dataset/full/sample_${name}/
done
```
---
### 1c. Dataset Layout
```
dataset/
simple/ ← photo samples, use --input-mode simple
sample_001/
albedo.png
normal.png
depth.png
matid.png
shadow.png
transp.png
target.png ← must be replaced with stylized target
sample_002/
...
full/ ← Blender samples, use --input-mode full
sample_0001/
sample_0002/
...
```
- If the `simple/` or `full/` subdirectory is absent, the dataloader scans the root directly
- Minimum viable dataset: 1 sample (smoke test only); in practice, aim for at least ~50 samples
- You can mix Blender and photo samples in the same subdir; the dataloader treats them identically
- `target.png` may differ in resolution from `albedo.png` — the dataloader resizes it to match albedo automatically (LANCZOS)
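The discovery rule (any directory containing `albedo.png` is one sample) can be sketched as follows. This is illustrative, not the dataloader's actual code:

```python
from pathlib import Path

def discover_samples(root):
    # A sample is any directory containing albedo.png, regardless of
    # whether it lives under simple/, full/, or the dataset root itself.
    return sorted(p.parent for p in Path(root).rglob("albedo.png"))
```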
---
## 2. Training the U-Net + FiLM
The U-Net conv weights and FiLM MLP train **jointly** in a single run. No separate steps.
### Prerequisites
```bash
pip install torch torchvision pillow numpy opencv-python
cd cnn_v3/training
```
**With `uv` (no pip needed):** dependencies are declared inline in `train_cnn_v3.py`
and installed automatically on first run:
```bash
cd cnn_v3/training
uv run train_cnn_v3.py --input dataset/ --epochs 1 --patch-size 32 --detector random
```
### Quick-start commands
**Smoke test — 1 epoch, validates end-to-end without GPU:**
```bash
python3 train_cnn_v3.py --input dataset/ --epochs 1 \
--patch-size 32 --detector random
```
**Standard photo training (patch-based):**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode simple \
--epochs 200
```
**Blender G-buffer training:**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode full \
--epochs 200
```
**Full-image mode (better global coherence, slower):**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode full \
--full-image --image-size 256 \
--epochs 500
```
### Flag reference
| Flag | Default | Notes |
|------|---------|-------|
| `--input DIR` | `training/dataset` | Dataset root; always set explicitly |
| `--input-mode` | `simple` | `simple`=photos, `full`=Blender G-buffer |
| `--epochs N` | 200 | 500 recommended for full-image mode |
| `--batch-size N` | 16 | Reduce to 4–8 on GPU OOM |
| `--lr F` | 1e-3 | Reduce to 1e-4 if loss oscillates or NaN |
| `--patch-size N` | 64 | Smaller = faster epoch, less spatial context |
| `--patches-per-image N` | 256 | Reduce for small datasets |
| `--detector` | `harris` | `random` for smoke tests; `shi-tomasi` as alternative |
| `--channel-dropout-p F` | 0.3 | Lower if all samples have geometry (Blender only) |
| `--full-image` | off | Resize full image instead of patch crops |
| `--image-size N` | 256 | Resize target; only used with `--full-image` |
| `--enc-channels` | `4,8` | Must match C++ constants if changed |
| `--film-cond-dim N` | 5 | Must match `CNNv3FiLMParams` field count in C++ |
| `--checkpoint-dir DIR` | `checkpoints/` | Set per-experiment |
| `--checkpoint-every N` | 50 | 0 to disable intermediate checkpoints |
### Architecture at startup
The model prints its parameter count:
```
Model: enc=[4, 8] film_cond_dim=5 params=2740 (~5.4 KB f16)
```
If `params` is much higher, `--enc-channels` was changed; update C++ constants accordingly.
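The printed count can be verified by hand. With the default `enc=[4, 8]` layout, the per-layer conv parameters sum to the 1964 f16 values exported in §3, and adding the 776 FiLM MLP values gives 2740:

```python
def conv_params(cin, cout, k):
    # weights (cin * cout * k * k) plus one bias per output channel
    return cin * cout * k * k + cout

unet = (conv_params(20, 4, 3)    # enc0:        724
        + conv_params(4, 8, 3)   # enc1:        296
        + conv_params(8, 8, 1)   # bottleneck:   72
        + conv_params(16, 4, 3)  # dec1:        580
        + conv_params(8, 4, 3))  # dec0:        292
film = (5 * 16 + 16) + (16 * 40 + 40)  # FiLM MLP: 776

assert unet == 1964
assert unet + film == 2740
```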
### Windows 10 + CUDA
**Prerequisites — run once in a CMD or PowerShell prompt:**
1. Install [Python 3.11](https://www.python.org/downloads/) (add to PATH).
2. Install the CUDA-enabled PyTorch wheel (pick the CUDA version that matches your driver — check with `nvidia-smi`):
```bat
:: CUDA 12.1 (most common for RTX 20/30/40 series)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
:: CUDA 11.8 (older drivers / GTX 10xx)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```
3. Install remaining deps:
```bat
pip install pillow numpy opencv-python
```
4. Verify GPU is visible:
```bat
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
**Training — from the repo root in CMD:**
```bat
cd cnn_v3\training
python train_cnn_v3.py --input dataset/ --epochs 200
```
The script auto-detects CUDA (`Device: cuda`). Forward slashes are fine in these paths on Windows; Python accepts both separators.
**Copying the dataset from macOS/Linux:**
Use `scp`, a USB drive, or any file share. The dataset is plain PNG files — no conversion needed.
```bat
:: example: copy from a network share
robocopy \\mac\share\cnn_v3\training\dataset dataset /E
```
**Tips:**
- If you get `CUDA OOM`: add `--batch-size 4 --patch-size 32`
- `nvidia-smi` in a second window shows live VRAM usage
- Checkpoints are `.pth` files — copy them back to macOS for export (`export_cnn_v3_weights.py` runs on any platform)
### FiLM joint training
The conditioning vector `cond` is **randomised per sample** during training:
```python
cond = np.random.rand(5).astype(np.float32) # uniform [0,1]^5
```
This covers the full input space so the MLP is well-conditioned for any beat/audio/style
combination at inference time. At inference, real values are fed from `set_film_params()`.
### Channel dropout
Applied per-sample to make the model robust to missing channels:
| Channel group | Channels | Drop probability |
|---------------|----------|-----------------|
| Geometric | normal.xy, depth, depth_grad.xy [3,4,5,6,7] | `channel_dropout_p` (default 0.3) |
| Context | mat_id, shadow, transp [8,18,19] | `channel_dropout_p × 0.67` (~0.2) |
| Temporal | prev.rgb [9,10,11] | 0.5 (always) |
This is why a model trained on Blender data also works on photos (geometry zeroed).
To disable dropout for a pure-Blender model: `--channel-dropout-p 0`.
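A sketch of the per-sample dropout described above, assuming each group is dropped as a unit (the training script's exact sampling may differ):

```python
import numpy as np

def channel_dropout(x, p=0.3, rng=None):
    """x: (20, H, W) feature array; zeroes whole channel groups per sample."""
    rng = rng or np.random.default_rng()
    groups = [
        ([3, 4, 5, 6, 7], p),         # geometric: normal.xy, depth, depth_grad.xy
        ([8, 18, 19],     p * 0.67),  # context: mat_id, shadow, transp
        ([9, 10, 11],     0.5),       # temporal: prev.rgb (fixed at 0.5)
    ]
    out = x.copy()
    for channels, prob in groups:
        if rng.random() < prob:
            out[channels] = 0.0
    return out
```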
### Checkpoints
Saved as `.pth` at `checkpoints/checkpoint_epoch_N.pth`.
Contents of each checkpoint:
- `epoch` — epoch number
- `model_state_dict` — all weights (conv + FiLM MLP)
- `optimizer_state_dict` — Adam state (not needed for export)
- `loss` — final avg batch loss
- `config` — `{enc_channels, film_cond_dim, input_mode}` — **required by export script**
The final checkpoint is always written even if `--checkpoint-every 0`.
### Diagnosing training problems
| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| `RuntimeError: No samples found` | Wrong `--input` or missing `albedo.png` | Check dataset path |
| Loss stuck at epoch 1 | Dataset too small | Add more samples |
| Loss NaN from epoch 1 | Learning rate too high | Use `--lr 1e-4` |
| CUDA OOM | Batch or patch too large | `--batch-size 4 --patch-size 32` |
| Loss oscillates | LR too high late in training | Use `--lr 1e-4` or cosine schedule |
| Loss drops then plateaus | Too few samples | Add more or use `--full-image` |
---
## 3. Exporting Weights
Converts a trained `.pth` checkpoint to two raw binary files for the C++ runtime.
```bash
cd cnn_v3/training
python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth
# writes to export/ by default
python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth \
--output /path/to/assets/
```
### Output files
**`cnn_v3_weights.bin`** — conv+bias weights for all 5 passes, packed as f16-pairs-in-u32:
| Layer | f16 count | Bytes |
|-------|-----------|-------|
| enc0 Conv(20→4,3×3)+bias | 724 | 1448 |
| enc1 Conv(4→8,3×3)+bias | 296 | 592 |
| bottleneck Conv(8→8,1×1)+bias | 72 | 144 |
| dec1 Conv(16→4,3×3)+bias | 580 | 1160 |
| dec0 Conv(8→4,3×3)+bias | 292 | 584 |
| **Total** | **1964 f16** | **3928 bytes** |
**`cnn_v3_film_mlp.bin`** — FiLM MLP weights as raw f32, row-major:
| Layer | Shape | f32 count |
|-------|-------|-----------|
| L0 weight | (16, 5) | 80 |
| L0 bias | (16,) | 16 |
| L1 weight | (40, 16) | 640 |
| L1 bias | (40,) | 40 |
| **Total** | | **776 f32 = 3104 bytes** |
The FiLM MLP is for CPU-side inference (future — see §4d). The U-Net weights in
`cnn_v3_weights.bin` are what you need immediately.
### f16 packing format
WGSL `get_w(buf, base, idx)` reads: `pair = buf[(base+idx)/2]`.
- Even index → low 16 bits of u32
- Odd index → high 16 bits of u32
The export script produces this layout: `u32 = u16[0::2] | (u16[1::2] << 16)`.
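The packing and the WGSL accessor can be mirrored in Python. This is an illustrative sketch; the export script's actual implementation may differ in details:

```python
import numpy as np

def pack_f16_pairs(values):
    """float list -> u32 array; even f16 index in low bits, odd in high bits."""
    h = np.asarray(values, dtype=np.float16).view(np.uint16).astype(np.uint32)
    if h.size % 2:
        h = np.append(h, np.uint32(0))  # pad to an even count
    return h[0::2] | (h[1::2] << 16)

def get_w(buf, base, idx):
    """Python mirror of the WGSL get_w accessor."""
    pair = int(buf[(base + idx) // 2])
    half = (pair >> 16) if (base + idx) % 2 else (pair & 0xFFFF)
    return float(np.array(half, dtype=np.uint16).view(np.float16))
```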
### Expected output
```
Checkpoint: epoch=200 loss=0.012345
enc_channels=[4, 8] film_cond_dim=5
cnn_v3_weights.bin
1964 f16 values → 982 u32 → 3928 bytes
Upload via CNNv3Effect::upload_weights(queue, data, 3928)
cnn_v3_film_mlp.bin
L0: weight (16, 5) + bias (16,)
L1: weight (40, 16) + bias (40,)
776 f32 values → 3104 bytes
```
### Pitfalls
- **`enc_channels` mismatch:** if you changed `--enc-channels` during training, the layer size
assertion in the export script fires. The C++ weight-offset constants (`kEnc0Weights` etc.)
in `cnn_v3_effect.cc` must also be updated to match.
- **Old checkpoint missing `config`:** if `config` key is absent (checkpoint from a very early
version), the script defaults to `enc_channels=[4,8], film_cond_dim=5`.
- **`weights_only=True`:** requires PyTorch ≥ 2.0. If you get a warning, upgrade torch.
---
## 4. Wiring into CNNv3Effect (C++)
### Class overview
`CNNv3Effect` (in `cnn_v3/src/cnn_v3_effect.h/.cc`) implements the `Effect` base class.
It owns:
- 5 compute pipelines (enc0, enc1, bottleneck, dec1, dec0)
- 5 params uniform buffers with per-pass `weight_offset` + FiLM γ/β
- 1 shared storage buffer `weights_buf_` (~4 KB, read-only across all shaders)
### Wiring in a `.seq` file
```
SEQUENCE 0 0 "Scene with CNN v3"
EFFECT + GBufferEffect prev_cnn -> gbuf_feat0 gbuf_feat1 0 60
EFFECT + CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
```
Or direct C++:
```cpp
#include "cnn_v3/src/cnn_v3_effect.h"
auto cnn = std::make_shared<CNNv3Effect>(
ctx,
/*inputs=*/ {"gbuf_feat0", "gbuf_feat1"},
/*outputs=*/{"cnn_output"},
/*start=*/0.0f, /*end=*/60.0f);
```
### Uploading weights
Load `cnn_v3_weights.bin` once at startup, before the first `render()`:
```cpp
// Read binary file
std::vector<uint8_t> data;
{
std::ifstream f("cnn_v3_weights.bin", std::ios::binary | std::ios::ate);
data.resize(f.tellg());
f.seekg(0);
    f.read(reinterpret_cast<char*>(data.data()), data.size());
}
// Upload to GPU
cnn->upload_weights(ctx.queue, data.data(), (uint32_t)data.size());
```
Before `upload_weights()`: all conv weights are zero, so output is `sigmoid(0) = 0.5` gray.
After: output reflects trained style.
### Setting FiLM parameters each frame
Call before `render()` each frame:
```cpp
CNNv3FiLMParams fp;
fp.beat_phase = params.beat_phase; // 0-1 within current beat
fp.beat_norm = params.beat_time / 8.0f; // normalized 8-beat cycle
fp.audio_intensity = params.audio_intensity; // peak audio level [0,1]
fp.style_p0 = my_style_p0; // user-defined style param
fp.style_p1 = my_style_p1;
cnn->set_film_params(fp);
cnn->render(encoder, params, nodes);
```
**Current `set_film_params` behaviour (placeholder):** applies a hardcoded linear mapping —
audio modulates gamma, beat modulates beta. This is a heuristic until `cnn_v3_film_mlp.bin`
is integrated as a CPU-side MLP.
**Future MLP inference** (when integrating `cnn_v3_film_mlp.bin`):
1. Load `cnn_v3_film_mlp.bin` → 4 matrices/biases in f32
2. Run forward pass: `h = relu(cond @ L0_W.T + L0_b); out = h @ L1_W.T + L1_b`
3. Split `out[40]` into per-layer γ/β and write into the Params structs directly
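Steps 1 and 2 can be prototyped in numpy using the shapes from §3 (a sketch under those assumptions; this inference path is not yet implemented in the codebase):

```python
import numpy as np

def load_film_mlp(path):
    """Parse cnn_v3_film_mlp.bin: raw f32, row-major, 776 values total."""
    raw = np.fromfile(path, dtype=np.float32)
    assert raw.size == 776, "unexpected file size"
    W0 = raw[0:80].reshape(16, 5)       # L0 weight
    b0 = raw[80:96]                      # L0 bias
    W1 = raw[96:736].reshape(40, 16)     # L1 weight
    b1 = raw[736:776]                    # L1 bias
    return W0, b0, W1, b1

def film_forward(cond, W0, b0, W1, b1):
    """cond: [beat_phase, beat_norm, audio_intensity, style_p0, style_p1]."""
    h = np.maximum(cond @ W0.T + b0, 0.0)  # Linear(5->16) + ReLU
    return h @ W1.T + b1                   # Linear(16->40): 40 gamma/beta values
```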
### Uniform struct layout (for debugging)
`CnnV3Params4ch` (enc0, dec1, dec0 — 64 bytes):
```
offset 0: weight_offset u32
offset 4-31: padding (vec3u has align=16 in WGSL)
offset 32: gamma[4] vec4f
offset 48: beta[4] vec4f
```
`CnnV3ParamsEnc1` (enc1 — 96 bytes): same header, then `gamma_lo/hi` at 32/48, `beta_lo/hi` at 64/80.
Static asserts in `cnn_v3_effect.h` verify exact sizes; a compile failure here means the
WGSL layout diverged from the C++ struct.
### Intermediate node names
Internal textures are named `