# CNN v3 — Complete Pipeline Playbook
U-Net + FiLM style-transfer pipeline: data collection → training → export → C++ integration → demo → parity test → HTML tool.
---
## Table of Contents
1. [Overview](#0-overview)
2. [Collecting Training Samples](#1-collecting-training-samples)
- [1a. From Real Photos](#1a-from-real-photos)
- [1b. From Blender (Full G-Buffer)](#1b-from-blender-full-g-buffer)
- [1c. Dataset Layout](#1c-dataset-layout)
3. [Training the U-Net + FiLM](#2-training-the-u-net--film)
4. [Exporting Weights](#3-exporting-weights)
5. [Wiring into CNNv3Effect (C++)](#4-wiring-into-cnnv3effect-c)
6. [Running a Demo](#5-running-a-demo)
7. [Parity Testing](#6-parity-testing)
8. [HTML WebGPU Tool](#7-html-webgpu-tool)
9. [Appendix A — File Reference](#appendix-a--file-reference)
10. [Appendix B — 20-Channel Feature Layout](#appendix-b--20-channel-feature-layout)
---
## 0. Overview
CNN v3 is a 2-level U-Net with FiLM conditioning, designed to run in real-time as a WebGPU compute effect inside the demo.
**Architecture:**
```
Input: 20-channel G-buffer feature textures (rgba32uint)
│
enc0 ──── Conv(20→4, 3×3) + FiLM + ReLU ┐ full res
│ ↘ skip │
enc1 ──── AvgPool2×2 + Conv(4→8, 3×3) + FiLM ┐ ½ res
│ ↘ skip │
bottleneck AvgPool2×2 + Conv(8→8, 1×1) + ReLU ¼ res (no FiLM)
│ │
dec1 ←── upsample×2 + cat(enc1 skip) + Conv(16→4, 3×3) + FiLM
│ │ ½ res
dec0 ←── upsample×2 + cat(enc0 skip) + Conv(8→4, 3×3) + FiLM + sigmoid
full res → RGBA output
```
**FiLM MLP:** `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net.
- Input: `[beat_phase, beat_norm, audio_intensity, style_p0, style_p1]`
- Output: 40 γ/β values controlling style across all 4 FiLM layers
**Weight budget:** ~3.9 KB f16 (fits ≤6 KB target)
**Two data paths:**
- **Simple mode** — real photos with zeroed geometric channels (normal, depth, matid)
- **Full mode** — Blender G-buffer renders with all 20 channels populated
**Pipeline summary:**
```
photos/Blender → pack → dataset/ → train_cnn_v3.py → checkpoint.pth
│
export_cnn_v3_weights.py
┌─────────┴──────────┐
cnn_v3_weights.bin cnn_v3_film_mlp.bin
│
CNNv3Effect::upload_weights()
│
demo / HTML tool
```
---
## 1. Collecting Training Samples
Each sample is a directory containing 7 PNG files. The dataloader discovers samples by scanning for directories containing `albedo.png`.
### 1a. From Real Photos
**What it does:** Converts one photo into a sample with zeroed geometric channels.
The network handles this correctly because channel-dropout training (§2e) teaches it
to work with or without geometry data.
**Step 1 — Pack an input/target pair with `gen_sample`:**
```bash
cd cnn_v3/training
./gen_sample.sh /path/to/photo.png /path/to/stylized.png dataset/simple/sample_001/
```
`gen_sample.sh` is the recommended one-shot wrapper: it calls `pack_photo_sample.py` with both `--photo` and `--target` in a single step.
**What gets written:**
| File | Content | Notes |
|------|---------|-------|
| `albedo.png` | Photo RGB uint8 | Source image |
| `normal.png` | (128, 128, 0) uint8 | Neutral "no normal" → reconstructed (0,0,1) |
| `depth.png` | All zeros uint16 | No depth data |
| `matid.png` | All zeros uint8 | No material IDs |
| `shadow.png` | 255 everywhere uint8 | Assume fully lit |
| `transp.png` | 1 − alpha uint8 | 0 = opaque |
| `target.png` | Stylized target RGBA | Ground truth for training |
**Step 2 — Verify the target:**
The network learns the mapping `albedo → target`. If you pass the same image as both
input and target, the network learns identity (useful as sanity check, not for real
training). Confirm `target.png` looks correct before running training.
**Alternative — pack without a target yet:**
```bash
python3 pack_photo_sample.py \
--photo /path/to/photo.png \
--output dataset/simple/sample_001/
# target.png defaults to a copy of the input; replace it before training:
cp my_stylized_version.png dataset/simple/sample_001/target.png
```
**Batch packing:**
```bash
for f in photos/*.png; do
name=$(basename "${f%.png}")
./gen_sample.sh "$f" "targets/${name}_styled.png" \
dataset/simple/sample_${name}/
done
```
**Pitfalls:**
- Input must be RGB or RGBA; grayscale photos need `.convert('RGB')` first
- `normal.png` B channel is always 0 (unused); only R and G channels carry oct-encoded XY
- `mip1`/`mip2` are computed on-the-fly by the dataloader — not stored
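The octahedral encoding mentioned above can be sketched in Python. This is the standard octahedral mapping; the exact rounding used by `pack_photo_sample.py` may differ slightly:

```python
import numpy as np

def oct_encode(n):
    """Unit normal (x, y, z) -> two uint8 channels (R, G)."""
    n = np.asarray(n, dtype=np.float32)
    n = n / np.abs(n).sum()          # project onto the L1 unit octahedron
    xy = n[:2]
    if n[2] < 0.0:                   # fold the lower hemisphere over the upper
        xy = (1.0 - np.abs(xy[::-1])) * np.where(xy >= 0.0, 1.0, -1.0)
    return np.round((xy * 0.5 + 0.5) * 255.0).astype(np.uint8)

def oct_decode(rg):
    """Two uint8 channels -> approximate unit normal."""
    xy = rg.astype(np.float32) / 255.0 * 2.0 - 1.0
    z = 1.0 - np.abs(xy).sum()
    if z < 0.0:                      # undo the hemisphere fold
        xy = (1.0 - np.abs(xy[::-1])) * np.where(xy >= 0.0, 1.0, -1.0)
    n = np.array([xy[0], xy[1], z], dtype=np.float32)
    return n / np.linalg.norm(n)
```

Note that the neutral value `(128, 128)` from the table decodes to (approximately) the straight-up normal `(0, 0, 1)`, which is why zeroed-geometry photo samples use it.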
---
### 1b. From Blender (Full G-Buffer)
Produces all 20 feature channels including normals, depth, mat IDs, and shadow.
#### Blender requirements
- **Blender 4.5 LTS** recommended (`blender4` alias); 5.x also works
- Cycles render engine (set automatically by the script)
- Object indices set: *Properties → Object → Relations → Object Index* > 0
for objects you want tracked in `matid` (IndexOB pass)
#### Step 1 — Render EXRs
```bash
cd cnn_v3/training
blender4 -b input_3d/scene.blend -P blender_export.py -- \
--output tmp/renders/frames \
--width 640 --height 360 \
--start-frame 1 --end-frame 200 \
--view-layer RenderLayer
```
The `--` separator is **required**. `blender_export.py` uses native
`OPEN_EXR_MULTILAYER` render output — all enabled passes are written
automatically. One file per frame: `{output}/0001.exr`, `0002.exr`, …
Render progress (`Fra:/Mem:` spam) is suppressed; per-frame status goes to stderr.
**Available flags:**
| Flag | Default | Notes |
|------|---------|-------|
| `--output PATH` | `//renders/` | Output directory; `//` = blend file directory |
| `--width N` | 640 | Render resolution |
| `--height N` | 360 | Render resolution |
| `--start-frame N` | scene start | First frame |
| `--end-frame N` | scene end | Last frame |
| `--view-layer NAME` | first layer | View layer name; pass `?` to list available layers |
**Render pass → EXR channel → CNN file:**
| Blender pass | Native EXR channel | CNN file |
|-------------|--------------------|---------|
| Combined | `Combined.R/G/B/A` | `target.png` (beauty, linear→sRGB) |
| DiffCol | `DiffCol.R/G/B` | `albedo.png` (linear→sRGB γ2.2) |
| Normal | `Normal.X/Y/Z` | `normal.png` (oct-encoded RG) |
| Depth | `Depth.Z` | `depth.png` (1/(z+1) → uint16) |
| IndexOB | `IndexOB.X` | `matid.png` (object index, uint8) |
| Shadow | `Shadow.X` | `shadow.png` (255=lit; defaults to 255 if absent) |
| Combined alpha | `Combined.A` | `transp.png` (1−alpha, 0=opaque) |
**Note on Shadow pass:** Blender's Cycles Shadow pass may be absent for scenes
without shadow-casting lights or catcher objects; `pack_blender_sample.py` defaults
to 1.0 (fully lit) when the channel is missing.
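The scalar transforms from the table above can be sketched as follows (illustrative only; `pack_blender_sample.py` may clamp or round slightly differently):

```python
import numpy as np

def encode_depth(z):
    """Camera-space depth z in [0, inf) -> uint16 via 1/(z+1).

    z=0 maps to 65535 (near); large z approaches 0 (far).
    """
    d = 1.0 / (np.asarray(z, dtype=np.float32) + 1.0)
    return np.round(d * 65535.0).astype(np.uint16)

def encode_transp(alpha):
    """Combined alpha (1 = opaque) -> transp uint8 (0 = opaque)."""
    t = 1.0 - np.asarray(alpha, dtype=np.float32)
    return np.round(t * 255.0).astype(np.uint8)
```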
#### Step 2 — Pack EXRs into sample directories
```bash
python3 cnn_v3/training/pack_blender_sample.py \
--exr /tmp/renders/frame_0001.exr \
--output dataset/full/sample_0001/
```
**Dependencies:** `pip install openexr` (preferred) or `pip install imageio[freeimage]`
**Batch packing:**
```bash
for exr in /tmp/renders/frame_*.exr; do
name=$(basename "${exr%.exr}")
python3 pack_blender_sample.py --exr "$exr" \
--output dataset/full/${name}/
done
```
**What gets written:**
| File | Source | Transform |
|------|--------|-----------|
| `albedo.png` | DiffCol pass | Linear → sRGB (γ=2.2), uint8 |
| `normal.png` | Normal pass | XYZ unit → octahedral RG, uint8 |
| `depth.png` | Z pass | 1/(z+1) normalized, uint16 |
| `matid.png` | IndexOB pass | Clamped [0,255], uint8 |
| `shadow.png` | Shadow pass | uint8 (255=lit) |
| `transp.png` | Combined alpha | 1−alpha, uint8 |
| `target.png` | Combined beauty | Linear → sRGB, RGBA uint8 |
**Note:** `depth_grad`, `mip1`, `mip2` are computed on-the-fly by the dataloader. `prev.rgb` is always zero during training (no temporal history for static frames).
**Pitfalls:**
- `DiffCol` pass not found → warning printed, albedo zeroed (not fatal; training continues)
- `IndexOB` all zero if Object Index not set in Blender object properties
- Alpha convention: Blender alpha=1 means opaque; `transp.png` inverts this (transp=0 opaque)
- `Shadow` pass in Cycles must be explicitly enabled in Render Properties → Passes → Effects
#### Full example (Old House scene, 2 frames)
**Step 1 — Render:**
```bash
cd cnn_v3/training
blender4 -b input_3d/"Old House 2 3D Models.blend" -P blender_export.py -- \
--output tmp/renders/frames \
--width 640 --height 360 \
--start-frame 1 --end-frame 2 \
--view-layer RenderLayer
```
**Step 2 — Pack:**
```bash
cd cnn_v3/training
for exr in tmp/renders/frames/*.exr; do
name=$(basename "${exr%.exr}")
python3 pack_blender_sample.py \
--exr "$exr" \
--output dataset/full/sample_${name}/
done
```
---
### 1c. Dataset Layout
```
dataset/
simple/ ← photo samples, use --input-mode simple
sample_001/
albedo.png
normal.png
depth.png
matid.png
shadow.png
transp.png
target.png ← must be replaced with stylized target
sample_002/
...
full/ ← Blender samples, use --input-mode full
sample_0001/
sample_0002/
...
```
- If the `simple/` or `full/` subdirectory is absent, the dataloader scans the root directly
- Minimum viable dataset: 1 sample (smoke test only); in practice, aim for at least ~50 samples
- You can mix Blender and photo samples in the same subdir; the dataloader treats them identically
- `target.png` may differ in resolution from `albedo.png` — the dataloader resizes it to match albedo automatically (LANCZOS)
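The discovery rule (any directory containing `albedo.png` is one sample) can be sketched as follows. This is illustrative, not the dataloader's actual code:

```python
from pathlib import Path

def discover_samples(root):
    # A sample is any directory containing albedo.png, regardless of
    # whether it lives under simple/, full/, or the dataset root itself.
    return sorted(p.parent for p in Path(root).rglob("albedo.png"))
```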
---
## 2. Training the U-Net + FiLM
The U-Net conv weights and FiLM MLP train **jointly** in a single run. No separate steps.
### Prerequisites
```bash
pip install torch torchvision pillow numpy opencv-python
cd cnn_v3/training
```
**With `uv` (no pip needed):** dependencies are declared inline in `train_cnn_v3.py`
and installed automatically on first run:
```bash
cd cnn_v3/training
uv run train_cnn_v3.py --input dataset/ --epochs 1 --patch-size 32 --detector random
```
### Quick-start commands
**Smoke test — 1 epoch, validates end-to-end without GPU:**
```bash
python3 train_cnn_v3.py --input dataset/ --epochs 1 \
--patch-size 32 --detector random
```
**Standard photo training (patch-based):**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode simple \
--epochs 200
```
**Blender G-buffer training:**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode full \
--epochs 200
```
**Full-image mode (better global coherence, slower):**
```bash
python3 train_cnn_v3.py \
--input dataset/ \
--input-mode full \
--full-image --image-size 256 \
--epochs 500
```
### Flag reference
| Flag | Default | Notes |
|------|---------|-------|
| `--input DIR` | `training/dataset` | Dataset root; always set explicitly |
| `--input-mode` | `simple` | `simple`=photos, `full`=Blender G-buffer |
| `--epochs N` | 200 | 500 recommended for full-image mode |
| `--batch-size N` | 16 | Reduce to 4–8 on GPU OOM |
| `--lr F` | 1e-3 | Reduce to 1e-4 if loss oscillates or NaN |
| `--patch-size N` | 64 | Smaller = faster epoch, less spatial context |
| `--patches-per-image N` | 256 | Reduce for small datasets |
| `--detector` | `harris` | `random` for smoke tests; `shi-tomasi` as alternative |
| `--channel-dropout-p F` | 0.3 | Lower if all samples have geometry (Blender only) |
| `--full-image` | off | Resize full image instead of patch crops |
| `--image-size N` | 256 | Resize target; only used with `--full-image` |
| `--enc-channels` | `4,8` | Must match C++ constants if changed |
| `--film-cond-dim N` | 5 | Must match `CNNv3FiLMParams` field count in C++ |
| `--checkpoint-dir DIR` | `checkpoints/` | Set per-experiment |
| `--checkpoint-every N` | 50 | 0 to disable intermediate checkpoints |
### Architecture at startup
The model prints its parameter count:
```
Model: enc=[4, 8] film_cond_dim=5 params=2740 (~5.4 KB f16)
```
If `params` is much higher, `--enc-channels` was changed; update C++ constants accordingly.
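The printed count can be verified by hand. With the default `enc=[4, 8]` layout, the per-layer conv parameters sum to the 1964 f16 values exported in §3, and adding the 776 FiLM MLP values gives 2740:

```python
def conv_params(cin, cout, k):
    # weights (cin * cout * k * k) plus one bias per output channel
    return cin * cout * k * k + cout

unet = (conv_params(20, 4, 3)    # enc0:        724
        + conv_params(4, 8, 3)   # enc1:        296
        + conv_params(8, 8, 1)   # bottleneck:   72
        + conv_params(16, 4, 3)  # dec1:        580
        + conv_params(8, 4, 3))  # dec0:        292
film = (5 * 16 + 16) + (16 * 40 + 40)  # FiLM MLP: 776

assert unet == 1964
assert unet + film == 2740
```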
### Windows 10 + CUDA
**Prerequisites — run once in a CMD or PowerShell prompt:**
1. Install [Python 3.11](https://www.python.org/downloads/) (add to PATH).
2. Install the CUDA-enabled PyTorch wheel (pick the CUDA version that matches your driver — check with `nvidia-smi`):
```bat
:: CUDA 12.1 (most common for RTX 20/30/40 series)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
:: CUDA 11.8 (older drivers / GTX 10xx)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```
3. Install remaining deps:
```bat
pip install pillow numpy opencv-python
```
4. Verify GPU is visible:
```bat
python -c "import torch; print(torch.cuda.get_device_name(0))"
```
**Training — from the repo root in CMD:**
```bat
cd cnn_v3\training
python train_cnn_v3.py --input dataset/ --epochs 200
```
The script auto-detects CUDA (`Device: cuda`). Forward slashes are fine in these paths on Windows; Python accepts both separators.
**Copying the dataset from macOS/Linux:**
Use `scp`, a USB drive, or any file share. The dataset is plain PNG files — no conversion needed.
```bat
:: example: copy from a network share
robocopy \\mac\share\cnn_v3\training\dataset dataset /E
```
**Tips:**
- If you get `CUDA OOM`: add `--batch-size 4 --patch-size 32`
- `nvidia-smi` in a second window shows live VRAM usage
- Checkpoints are `.pth` files — copy them back to macOS for export (`export_cnn_v3_weights.py` runs on any platform)
### FiLM joint training
The conditioning vector `cond` is **randomised per sample** during training:
```python
cond = np.random.rand(5).astype(np.float32) # uniform [0,1]^5
```
This covers the full input space so the MLP is well-conditioned for any beat/audio/style
combination at inference time. At inference, real values are fed from `set_film_params()`.
### Channel dropout
Applied per-sample to make the model robust to missing channels:
| Channel group | Channels | Drop probability |
|---------------|----------|-----------------|
| Geometric | normal.xy, depth, depth_grad.xy [3,4,5,6,7] | `channel_dropout_p` (default 0.3) |
| Context | mat_id, shadow, transp [8,18,19] | `channel_dropout_p × 0.67` (~0.2) |
| Temporal | prev.rgb [9,10,11] | 0.5 (always) |
This is why a model trained on Blender data also works on photos (geometry zeroed).
To disable dropout for a pure-Blender model: `--channel-dropout-p 0`.
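A sketch of the per-sample dropout described above, assuming each group is dropped as a unit (the training script's exact sampling may differ):

```python
import numpy as np

def channel_dropout(x, p=0.3, rng=None):
    """x: (20, H, W) feature array; zeroes whole channel groups per sample."""
    rng = rng or np.random.default_rng()
    groups = [
        ([3, 4, 5, 6, 7], p),         # geometric: normal.xy, depth, depth_grad.xy
        ([8, 18, 19],     p * 0.67),  # context: mat_id, shadow, transp
        ([9, 10, 11],     0.5),       # temporal: prev.rgb (fixed at 0.5)
    ]
    out = x.copy()
    for channels, prob in groups:
        if rng.random() < prob:
            out[channels] = 0.0
    return out
```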
### Checkpoints
Saved as `.pth` at `checkpoints/checkpoint_epoch_N.pth`.
Contents of each checkpoint:
- `epoch` — epoch number
- `model_state_dict` — all weights (conv + FiLM MLP)
- `optimizer_state_dict` — Adam state (not needed for export)
- `loss` — final avg batch loss
- `config` — `{enc_channels, film_cond_dim, input_mode}` — **required by export script**
The final checkpoint is always written even if `--checkpoint-every 0`.
### Diagnosing training problems
| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| `RuntimeError: No samples found` | Wrong `--input` or missing `albedo.png` | Check dataset path |
| Loss stuck at epoch 1 | Dataset too small | Add more samples |
| Loss NaN from epoch 1 | Learning rate too high | Use `--lr 1e-4` |
| CUDA OOM | Batch or patch too large | `--batch-size 4 --patch-size 32` |
| Loss oscillates | LR too high late in training | Use `--lr 1e-4` or cosine schedule |
| Loss drops then plateaus | Too few samples | Add more or use `--full-image` |
---
## 3. Exporting Weights
Converts a trained `.pth` checkpoint to two raw binary files for the C++ runtime.
```bash
cd cnn_v3/training
python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth
# writes to export/ by default
python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth \
--output /path/to/assets/
```
### Output files
**`cnn_v3_weights.bin`** — conv+bias weights for all 5 passes, packed as f16-pairs-in-u32:
| Layer | f16 count | Bytes |
|-------|-----------|-------|
| enc0 Conv(20→4,3×3)+bias | 724 | 1448 |
| enc1 Conv(4→8,3×3)+bias | 296 | 592 |
| bottleneck Conv(8→8,1×1)+bias | 72 | 144 |
| dec1 Conv(16→4,3×3)+bias | 580 | 1160 |
| dec0 Conv(8→4,3×3)+bias | 292 | 584 |
| **Total** | **1964 f16** | **3928 bytes** |
**`cnn_v3_film_mlp.bin`** — FiLM MLP weights as raw f32, row-major:
| Layer | Shape | f32 count |
|-------|-------|-----------|
| L0 weight | (16, 5) | 80 |
| L0 bias | (16,) | 16 |
| L1 weight | (40, 16) | 640 |
| L1 bias | (40,) | 40 |
| **Total** | | **776 f32 = 3104 bytes** |
The FiLM MLP is for CPU-side inference (future — see §4d). The U-Net weights in
`cnn_v3_weights.bin` are what you need immediately.
### f16 packing format
WGSL `get_w(buf, base, idx)` reads: `pair = buf[(base+idx)/2]`.
- Even index → low 16 bits of u32
- Odd index → high 16 bits of u32
The export script produces this layout: `u32 = u16[0::2] | (u16[1::2] << 16)`.
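The packing and the WGSL accessor can be mirrored in Python. This is an illustrative sketch; the export script's actual implementation may differ in details:

```python
import numpy as np

def pack_f16_pairs(values):
    """float list -> u32 array; even f16 index in low bits, odd in high bits."""
    h = np.asarray(values, dtype=np.float16).view(np.uint16).astype(np.uint32)
    if h.size % 2:
        h = np.append(h, np.uint32(0))  # pad to an even count
    return h[0::2] | (h[1::2] << 16)

def get_w(buf, base, idx):
    """Python mirror of the WGSL get_w accessor."""
    pair = int(buf[(base + idx) // 2])
    half = (pair >> 16) if (base + idx) % 2 else (pair & 0xFFFF)
    return float(np.array(half, dtype=np.uint16).view(np.float16))
```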
### Expected output
```
Checkpoint: epoch=200 loss=0.012345
enc_channels=[4, 8] film_cond_dim=5
cnn_v3_weights.bin
1964 f16 values → 982 u32 → 3928 bytes
Upload via CNNv3Effect::upload_weights(queue, data, 3928)
cnn_v3_film_mlp.bin
L0: weight (16, 5) + bias (16,)
L1: weight (40, 16) + bias (40,)
776 f32 values → 3104 bytes
```
### Pitfalls
- **`enc_channels` mismatch:** if you changed `--enc-channels` during training, the layer size
assertion in the export script fires. The C++ weight-offset constants (`kEnc0Weights` etc.)
in `cnn_v3_effect.cc` must also be updated to match.
- **Old checkpoint missing `config`:** if `config` key is absent (checkpoint from a very early
version), the script defaults to `enc_channels=[4,8], film_cond_dim=5`.
- **`weights_only=True`:** requires PyTorch ≥ 2.0. If you get a warning, upgrade torch.
---
## 4. Wiring into CNNv3Effect (C++)
### Class overview
`CNNv3Effect` (in `cnn_v3/src/cnn_v3_effect.h/.cc`) implements the `Effect` base class.
It owns:
- 5 compute pipelines (enc0, enc1, bottleneck, dec1, dec0)
- 5 params uniform buffers with per-pass `weight_offset` + FiLM γ/β
- 1 shared storage buffer `weights_buf_` (~4 KB, read-only across all shaders)
### Wiring in a `.seq` file
```
SEQUENCE 0 0 "Scene with CNN v3"
EFFECT + GBufferEffect prev_cnn -> gbuf_feat0 gbuf_feat1 0 60
EFFECT + CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
```
Or direct C++:
```cpp
#include "cnn_v3/src/cnn_v3_effect.h"
auto cnn = std::make_shared<CNNv3Effect>(
ctx,
/*inputs=*/ {"gbuf_feat0", "gbuf_feat1"},
/*outputs=*/{"cnn_output"},
/*start=*/0.0f, /*end=*/60.0f);
```
### Uploading weights
Load `cnn_v3_weights.bin` once at startup, before the first `render()`:
```cpp
// Read binary file
std::vector<uint8_t> data;
{
std::ifstream f("cnn_v3_weights.bin", std::ios::binary | std::ios::ate);
data.resize(f.tellg());
f.seekg(0);
    f.read(reinterpret_cast<char*>(data.data()), data.size());
}
// Upload to GPU
cnn->upload_weights(ctx.queue, data.data(), (uint32_t)data.size());
```
Before `upload_weights()`: all conv weights are zero, so output is `sigmoid(0) = 0.5` gray.
After: output reflects trained style.
### Setting FiLM parameters each frame
Call before `render()` each frame:
```cpp
CNNv3FiLMParams fp;
fp.beat_phase = params.beat_phase; // 0-1 within current beat
fp.beat_norm = params.beat_time / 8.0f; // normalized 8-beat cycle
fp.audio_intensity = params.audio_intensity; // peak audio level [0,1]
fp.style_p0 = my_style_p0; // user-defined style param
fp.style_p1 = my_style_p1;
cnn->set_film_params(fp);
cnn->render(encoder, params, nodes);
```
**Current `set_film_params` behaviour (placeholder):** applies a hardcoded linear mapping —
audio modulates gamma, beat modulates beta. This is a heuristic until `cnn_v3_film_mlp.bin`
is integrated as a CPU-side MLP.
**Future MLP inference** (when integrating `cnn_v3_film_mlp.bin`):
1. Load `cnn_v3_film_mlp.bin` → 4 matrices/biases in f32
2. Run forward pass: `h = relu(cond @ L0_W.T + L0_b); out = h @ L1_W.T + L1_b`
3. Split `out[40]` into per-layer γ/β and write into the Params structs directly
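Steps 1 and 2 can be prototyped in numpy using the shapes from §3 (a sketch under those assumptions; this inference path is not yet implemented in the codebase):

```python
import numpy as np

def load_film_mlp(path):
    """Parse cnn_v3_film_mlp.bin: raw f32, row-major, 776 values total."""
    raw = np.fromfile(path, dtype=np.float32)
    assert raw.size == 776, "unexpected file size"
    W0 = raw[0:80].reshape(16, 5)       # L0 weight
    b0 = raw[80:96]                      # L0 bias
    W1 = raw[96:736].reshape(40, 16)     # L1 weight
    b1 = raw[736:776]                    # L1 bias
    return W0, b0, W1, b1

def film_forward(cond, W0, b0, W1, b1):
    """cond: [beat_phase, beat_norm, audio_intensity, style_p0, style_p1]."""
    h = np.maximum(cond @ W0.T + b0, 0.0)  # Linear(5->16) + ReLU
    return h @ W1.T + b1                   # Linear(16->40): 40 gamma/beta values
```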
### Uniform struct layout (for debugging)
`CnnV3Params4ch` (enc0, dec1, dec0 — 64 bytes):
```
offset 0: weight_offset u32
offset 4-31: padding (vec3u has align=16 in WGSL)
offset 32: gamma[4] vec4f
offset 48: beta[4] vec4f
```
`CnnV3ParamsEnc1` (enc1 — 96 bytes): same header, then `gamma_lo/hi` at 32/48, `beta_lo/hi` at 64/80.
Static asserts in `cnn_v3_effect.h` verify exact sizes; a compile failure here means the
WGSL layout diverged from the C++ struct.
### Intermediate node names
Internal textures are named `