| author | skal <pascal.massimino@gmail.com> | 2026-03-21 10:27:50 +0100 |
|---|---|---|
| committer | skal <pascal.massimino@gmail.com> | 2026-03-21 10:27:50 +0100 |
| commit | e343021ac007549c76e58b27a361b11dd3f6a136 (patch) | |
| tree | a855b76dcc428752a09cbd192eabd16931baf804 /cnn_v3/docs | |
| parent | 1e8ccfc67c264ce054c59257ee7c17ec4a584a9e (diff) | |
feat(cnn_v3): export script + HOW_TO_CNN.md playbook
- export_cnn_v3_weights.py: .pth → cnn_v3_weights.bin (f16 packed u32) + cnn_v3_film_mlp.bin (f32)
- HOW_TO_CNN.md: full pipeline playbook (data collection, training, export, C++ wiring, parity, HTML tool)
- TODO.md: mark export script done
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diffstat (limited to 'cnn_v3/docs')
| -rw-r--r-- | cnn_v3/docs/HOW_TO_CNN.md | 816 |
1 file changed, 816 insertions, 0 deletions
diff --git a/cnn_v3/docs/HOW_TO_CNN.md b/cnn_v3/docs/HOW_TO_CNN.md
new file mode 100644
index 0000000..8c41ab0
--- /dev/null
+++ b/cnn_v3/docs/HOW_TO_CNN.md
@@ -0,0 +1,816 @@

# CNN v3 — Complete Pipeline Playbook

U-Net + FiLM style-transfer pipeline: data collection → training → export → C++ integration → demo → parity test → HTML tool.

---

## Table of Contents

1. [Overview](#0-overview)
2. [Collecting Training Samples](#1-collecting-training-samples)
   - [1a. From Real Photos](#1a-from-real-photos)
   - [1b. From Blender (Full G-Buffer)](#1b-from-blender-full-g-buffer)
   - [1c. Dataset Layout](#1c-dataset-layout)
3. [Training the U-Net + FiLM](#2-training-the-u-net--film)
4. [Exporting Weights](#3-exporting-weights)
5. [Wiring into CNNv3Effect (C++)](#4-wiring-into-cnnv3effect-c)
6. [Running a Demo](#5-running-a-demo)
7. [Parity Testing](#6-parity-testing)
8. [HTML WebGPU Tool](#7-html-webgpu-tool)
9. [Appendix A — File Reference](#appendix-a--file-reference)
10. [Appendix B — 20-Channel Feature Layout](#appendix-b--20-channel-feature-layout)

---

## 0. Overview

CNN v3 is a 2-level U-Net with FiLM conditioning, designed to run in real time as a WebGPU compute effect inside the demo.

**Architecture:**

```
Input: 20-channel G-buffer feature textures (rgba32uint)
  │
  enc0 ──── Conv(20→4, 3×3) + FiLM + ReLU ──────────┐      full res
  │    ↘ skip                                       │
  enc1 ──── AvgPool2×2 + Conv(4→8, 3×3) + FiLM ──┐  │      ½ res
  │    ↘ skip                                    │  │
  bottleneck AvgPool2×2 + Conv(8→8, 1×1) + ReLU  │  │      ¼ res (no FiLM)
  │                                              │  │
  dec1 ←── upsample×2 + cat(enc1 skip) + Conv(16→4, 3×3) + FiLM
  │                                                 │      ½ res
  dec0 ←── upsample×2 + cat(enc0 skip) + Conv(8→4, 3×3) + FiLM + sigmoid
  │
  full res → RGBA output
```

**FiLM MLP:** `Linear(5→16) → ReLU → Linear(16→40)` trained jointly with U-Net.
- Input: `[beat_phase, beat_norm, audio_intensity, style_p0, style_p1]`
- Output: 40 γ/β values controlling style across all 4 FiLM layers

**Weight budget:** ~3.9 KB f16 (fits the ≤6 KB target)

**Two data paths:**

- **Simple mode** — real photos with zeroed geometric channels (normal, depth, matid)
- **Full mode** — Blender G-buffer renders with all 20 channels populated

**Pipeline summary:**

```
photos/Blender → pack → dataset/ → train_cnn_v3.py → checkpoint.pth
                                          │
                              export_cnn_v3_weights.py
                             ┌───────────┴────────────┐
                    cnn_v3_weights.bin        cnn_v3_film_mlp.bin
                             │
                CNNv3Effect::upload_weights()
                             │
                     demo / HTML tool
```

---

## 1. Collecting Training Samples

Each sample is a directory containing 7 PNG files. The dataloader discovers samples by scanning for directories containing `albedo.png`.

### 1a. From Real Photos

**What it does:** converts one photo into a sample with zeroed geometric channels.
The network handles this correctly because channel-dropout training (§2e) teaches it
to work with or without geometry data.

**Step 1 — Pack the photo:**
```bash
cd cnn_v3/training
python3 pack_photo_sample.py \
    --photo /path/to/photo.png \
    --output dataset/simple/sample_001/
```

**What gets written:**

| File | Content | Notes |
|------|---------|-------|
| `albedo.png` | Photo RGB uint8 | Source image |
| `normal.png` | (128, 128, 0) uint8 | Neutral "no normal" → reconstructed (0,0,1) |
| `depth.png` | All zeros uint16 | No depth data |
| `matid.png` | All zeros uint8 | No material IDs |
| `shadow.png` | 255 everywhere uint8 | Assume fully lit |
| `transp.png` | 1 − alpha uint8 | 0 = opaque |
| `target.png` | Copy of photo RGBA | **Placeholder — must be replaced** |

**Step 2 — Provide a styled target:**

`target.png` defaults to the input photo (identity style).
You must replace it with your stylized ground truth before training:

```bash
cp my_stylized_version.png dataset/simple/sample_001/target.png
```

The network learns the mapping `albedo → target`. If target = albedo, the network
learns the identity (useful as a sanity check, not for real training).

**Batch packing:**
```bash
for f in photos/*.png; do
  name=$(basename "${f%.png}")
  python3 pack_photo_sample.py --photo "$f" \
      --output dataset/simple/sample_${name}/
done
```

**Pitfalls:**
- Input must be RGB or RGBA; grayscale photos need `.convert('RGB')` first
- `normal.png` B channel is always 0 (unused); only the R and G channels carry the oct-encoded XY
- `mip1`/`mip2` are computed on the fly by the dataloader — not stored

---

### 1b. From Blender (Full G-Buffer)

Produces all 20 feature channels, including normals, depth, material IDs, and shadow.

#### Blender requirements

- Blender 3.x+, Cycles render engine
- Object indices set: *Properties → Object → Relations → Object Index* must be > 0
  for objects you want tracked in `matid` (IndexOB pass)

#### Step 1 — Render EXRs

```bash
blender -b scene.blend -P cnn_v3/training/blender_export.py -- \
    --output /tmp/renders/frame_### \
    --width 640 --height 360 \
    --start-frame 1 --end-frame 200
```

The `--` separator is **required**; arguments after it are passed to the Python script,
not to Blender. Each `#` in `--output` is replaced by a zero-padded frame digit.
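The `--` handling presumably follows the standard Blender background-script pattern: slice `sys.argv` at the separator before handing the rest to `argparse`. A minimal sketch — the function name is ours and the flag set is illustrative, not the actual `blender_export.py` code:

```python
import argparse
import sys

def parse_args_after_separator(argv=None):
    """Blender consumes everything before '--'; only what follows is ours."""
    argv = sys.argv if argv is None else argv
    own = argv[argv.index("--") + 1:] if "--" in argv else []
    p = argparse.ArgumentParser()
    p.add_argument("--output", default="//renders/frame_###")
    p.add_argument("--width", type=int, default=640)
    p.add_argument("--height", type=int, default=360)
    p.add_argument("--start-frame", type=int, default=None)
    p.add_argument("--end-frame", type=int, default=None)
    return p.parse_args(own)
```

Without the `--`, Blender itself would try (and fail) to interpret `--width` as one of its own options.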
**Available flags:**

| Flag | Default | Notes |
|------|---------|-------|
| `--output PATH` | `//renders/frame_###` | `//` = blend-file directory; `###` = frame padding |
| `--width N` | 640 | Render resolution |
| `--height N` | 360 | Render resolution |
| `--start-frame N` | scene start | First frame |
| `--end-frame N` | scene end | Last frame |

**Render pass → CNN channel mapping:**

| Blender pass | EXR channels | CNN use |
|-------------|-------------|---------|
| Combined | `.R .G .B .A` | `target.png` (beauty, sRGB-converted) |
| DiffCol | `.R .G .B` | `albedo.png` (linear → sRGB gamma 2.2) |
| Normal | `.X .Y .Z` | `normal.png` (world-space, oct-encoded to RG) |
| Z | `.R` | `depth.png` (mapped as 1/(z+1) → uint16) |
| IndexOB | `.R` | `matid.png` (object index, clamped uint8) |
| Shadow | `.R` | `shadow.png` (255 = lit, 0 = shadowed) |
| Combined alpha | `.A` | `transp.png` (inverted: 0 = opaque) |

**Pitfall:** the Blender `Normal` pass uses `.X .Y .Z` channel names in the EXR, not `.R .G .B`.
`pack_blender_sample.py` handles both naming conventions automatically.
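The fallback reduces to a small lookup against the EXR header's channel set. An illustrative sketch of the idea — `resolve_pass_channels` is our name, not the actual `pack_blender_sample.py` code:

```python
def resolve_pass_channels(available, pass_name):
    """Pick channel names for a render pass, trying the '.X .Y .Z'
    convention first and falling back to '.R .G .B'.
    `available` is the set of channel names from the EXR header."""
    for suffixes in (("X", "Y", "Z"), ("R", "G", "B")):
        names = [f"{pass_name}.{s}" for s in suffixes]
        if all(n in available for n in names):
            return names
    raise KeyError(f"pass {pass_name!r} not found in EXR")
```

With the real OpenEXR bindings, `available` would come from `OpenEXR.InputFile(path).header()['channels'].keys()`.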
#### Step 2 — Pack EXRs into sample directories

```bash
python3 cnn_v3/training/pack_blender_sample.py \
    --exr /tmp/renders/frame_0001.exr \
    --output dataset/full/sample_0001/
```

**Dependencies:** `pip install openexr` (preferred) or `pip install imageio[freeimage]`

**Batch packing:**
```bash
for exr in /tmp/renders/frame_*.exr; do
  name=$(basename "${exr%.exr}")
  python3 pack_blender_sample.py --exr "$exr" \
      --output dataset/full/${name}/
done
```

**What gets written:**

| File | Source | Transform |
|------|--------|-----------|
| `albedo.png` | DiffCol pass | Linear → sRGB (γ=2.2), uint8 |
| `normal.png` | Normal pass | XYZ unit → octahedral RG, uint8 |
| `depth.png` | Z pass | 1/(z+1) normalized, uint16 |
| `matid.png` | IndexOB pass | Clamped [0,255], uint8 |
| `shadow.png` | Shadow pass | uint8 (255 = lit) |
| `transp.png` | Combined alpha | 1 − alpha, uint8 |
| `target.png` | Combined beauty | Linear → sRGB, RGBA uint8 |

**Note:** `depth_grad`, `mip1`, `mip2` are computed on the fly by the dataloader. `prev.rgb` is always zero during training (no temporal history for static frames).

**Pitfalls:**
- `DiffCol` pass not found → a warning is printed and the albedo is zeroed (not fatal; training continues)
- `IndexOB` is all zero if Object Index is not set in the Blender object properties
- Alpha convention: Blender alpha = 1 means opaque; `transp.png` inverts this (transp = 0 is opaque)
- The `Shadow` pass in Cycles must be explicitly enabled in Render Properties → Passes → Effects

---

### 1c. Dataset Layout

```
dataset/
  simple/                ← photo samples, use --input-mode simple
    sample_001/
      albedo.png
      normal.png
      depth.png
      matid.png
      shadow.png
      transp.png
      target.png         ← must be replaced with the stylized target
    sample_002/
    ...
  full/                  ← Blender samples, use --input-mode full
    sample_0001/
    sample_0002/
    ...
+``` + +- If `simple/` or `full/` subdir is absent the dataloader scans the root directly +- Minimum viable dataset: 1 sample (smoke test only); practical minimum ~50+ for training +- You can mix Blender and photo samples in the same subdir; the dataloader treats them identically + +--- + +## 2. Training the U-Net + FiLM + +The U-Net conv weights and FiLM MLP train **jointly** in a single run. No separate steps. + +### Prerequisites + +```bash +pip install torch torchvision pillow numpy opencv-python +cd cnn_v3/training +``` + +### Quick-start commands + +**Smoke test — 1 epoch, validates end-to-end without GPU:** +```bash +python3 train_cnn_v3.py --input dataset/ --epochs 1 \ + --patch-size 32 --detector random +``` + +**Standard photo training (patch-based):** +```bash +python3 train_cnn_v3.py \ + --input dataset/ \ + --input-mode simple \ + --epochs 200 +``` + +**Blender G-buffer training:** +```bash +python3 train_cnn_v3.py \ + --input dataset/ \ + --input-mode full \ + --epochs 200 +``` + +**Full-image mode (better global coherence, slower):** +```bash +python3 train_cnn_v3.py \ + --input dataset/ \ + --input-mode full \ + --full-image --image-size 256 \ + --epochs 500 +``` + +### Flag reference + +| Flag | Default | Notes | +|------|---------|-------| +| `--input DIR` | `training/dataset` | Dataset root; always set explicitly | +| `--input-mode` | `simple` | `simple`=photos, `full`=Blender G-buffer | +| `--epochs N` | 200 | 500 recommended for full-image mode | +| `--batch-size N` | 16 | Reduce to 4–8 on GPU OOM | +| `--lr F` | 1e-3 | Reduce to 1e-4 if loss oscillates or NaN | +| `--patch-size N` | 64 | Smaller = faster epoch, less spatial context | +| `--patches-per-image N` | 256 | Reduce for small datasets | +| `--detector` | `harris` | `random` for smoke tests; `shi-tomasi` as alternative | +| `--channel-dropout-p F` | 0.3 | Lower if all samples have geometry (Blender only) | +| `--full-image` | off | Resize full image instead of patch crops | +| 
| `--image-size N` | 256 | Resize target; only used with `--full-image` |
| `--enc-channels` | `4,8` | Must match the C++ constants if changed |
| `--film-cond-dim N` | 5 | Must match the `CNNv3FiLMParams` field count in C++ |
| `--checkpoint-dir DIR` | `checkpoints/` | Set per experiment |
| `--checkpoint-every N` | 50 | 0 to disable intermediate checkpoints |

### Architecture at startup

The model prints its parameter count:
```
Model: enc=[4, 8] film_cond_dim=5 params=2097 (~3.9 KB f16)
```

If `params` is much higher, `--enc-channels` was changed; update the C++ constants accordingly.

### FiLM joint training

The conditioning vector `cond` is **randomised per sample** during training:
```python
cond = np.random.rand(5).astype(np.float32)  # uniform [0,1]^5
```
This covers the full input space, so the MLP is well conditioned for any beat/audio/style
combination at inference time. At inference, real values are fed from `set_film_params()`.

### Channel dropout

Applied per sample to make the model robust to missing channels:

| Channel group | Channels | Drop probability |
|---------------|----------|-----------------|
| Geometric | normal.xy, depth, depth_grad.xy [3,4,5,6,7] | `channel_dropout_p` (default 0.3) |
| Context | mat_id, shadow, transp [8,18,19] | `channel_dropout_p × 0.67` (~0.2) |
| Temporal | prev.rgb [9,10,11] | 0.5 (always) |

This is why a model trained on Blender data also works on photos (geometry zeroed).
To disable dropout for a pure-Blender model: `--channel-dropout-p 0`.

### Checkpoints

Saved as `.pth` at `checkpoints/checkpoint_epoch_N.pth`.

Contents of each checkpoint:
- `epoch` — epoch number
- `model_state_dict` — all weights (conv + FiLM MLP)
- `optimizer_state_dict` — Adam state (not needed for export)
- `loss` — final average batch loss
- `config` — `{enc_channels, film_cond_dim, input_mode}` — **required by the export script**

The final checkpoint is always written even if `--checkpoint-every 0`.
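Given a checkpoint's `config`, the conv weight sizes the export step should produce can be derived from the layer shapes in the §0 architecture diagram. A sketch with hypothetical helper names (not project code); for the default `enc_channels=[4,8]` the per-layer counts match the export table in §3:

```python
def conv_f16_count(c_in, c_out, k):
    """f16 values for one conv layer: OIHW weights plus one bias per output."""
    return c_out * c_in * k * k + c_out

def expected_weight_f16(enc=(4, 8), in_ch=20):
    """Total f16 count of cnn_v3_weights.bin for a given enc-channel config."""
    e0, e1 = enc
    return sum([
        conv_f16_count(in_ch, e0, 3),   # enc0:  Conv(20→4, 3×3)  = 724
        conv_f16_count(e0, e1, 3),      # enc1:  Conv(4→8, 3×3)   = 296
        conv_f16_count(e1, e1, 1),      # bottleneck Conv(8→8,1×1) = 72
        conv_f16_count(2 * e1, e0, 3),  # dec1:  Conv(16→4, 3×3)  = 580 (cat doubles c_in)
        conv_f16_count(2 * e0, e0, 3),  # dec0:  Conv(8→4, 3×3)   = 292
    ])
```

A mismatch between `expected_weight_f16(config['enc_channels']) * 2` and the size of the exported `.bin` is an early warning that the C++ offset constants also need updating.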
### Diagnosing training problems

| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| `RuntimeError: No samples found` | Wrong `--input` or missing `albedo.png` | Check the dataset path |
| Loss stuck at epoch 1 | Dataset too small | Add more samples |
| Loss NaN from epoch 1 | Learning rate too high | Use `--lr 1e-4` |
| CUDA OOM | Batch or patch too large | `--batch-size 4 --patch-size 32` |
| Loss oscillates | LR too high late in training | Use `--lr 1e-4` or a cosine schedule |
| Loss drops then plateaus | Too few samples | Add more or use `--full-image` |

---

## 3. Exporting Weights

Converts a trained `.pth` checkpoint into two raw binary files for the C++ runtime.

```bash
cd cnn_v3/training
python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth
# writes to export/ by default

python3 export_cnn_v3_weights.py checkpoints/checkpoint_epoch_200.pth \
    --output /path/to/assets/
```

### Output files

**`cnn_v3_weights.bin`** — conv + bias weights for all 5 passes, packed as f16 pairs in u32:

| Layer | f16 count | Bytes |
|-------|-----------|-------|
| enc0 Conv(20→4,3×3)+bias | 724 | — |
| enc1 Conv(4→8,3×3)+bias | 296 | — |
| bottleneck Conv(8→8,1×1)+bias | 72 | — |
| dec1 Conv(16→4,3×3)+bias | 580 | — |
| dec0 Conv(8→4,3×3)+bias | 292 | — |
| **Total** | **1964 f16** | **3928 bytes** |

**`cnn_v3_film_mlp.bin`** — FiLM MLP weights as raw f32, row-major:

| Layer | Shape | f32 count |
|-------|-------|-----------|
| L0 weight | (16, 5) | 80 |
| L0 bias | (16,) | 16 |
| L1 weight | (40, 16) | 640 |
| L1 bias | (40,) | 40 |
| **Total** | | **776 f32 = 3104 bytes** |

The FiLM MLP is for CPU-side inference (future — see §4d). The U-Net weights in
`cnn_v3_weights.bin` are what you need immediately.

### f16 packing format

WGSL `get_w(buf, base, idx)` reads: `pair = buf[(base + idx) / 2]`.
- Even index → low 16 bits of the u32
- Odd index → high 16 bits of the u32

The export script produces this layout: `u32 = u16[0::2] | (u16[1::2] << 16)`.

### Expected output

```
Checkpoint: epoch=200 loss=0.012345
  enc_channels=[4, 8] film_cond_dim=5

cnn_v3_weights.bin
  1964 f16 values → 982 u32 → 3928 bytes
  Upload via CNNv3Effect::upload_weights(queue, data, 3928)

cnn_v3_film_mlp.bin
  L0: weight (16, 5) + bias (16,)
  L1: weight (40, 16) + bias (40,)
  776 f32 values → 3104 bytes
```

### Pitfalls

- **`enc_channels` mismatch:** if you changed `--enc-channels` during training, the layer-size
  assertion in the export script fires. The C++ weight-offset constants (`kEnc0Weights` etc.)
  in `cnn_v3_effect.cc` must also be updated to match.
- **Old checkpoint missing `config`:** if the `config` key is absent (checkpoint from a very
  early version), the script defaults to `enc_channels=[4,8], film_cond_dim=5`.
- **`weights_only=True`:** requires PyTorch ≥ 2.0. If you get a warning, upgrade torch.

---

## 4. Wiring into CNNv3Effect (C++)

### Class overview

`CNNv3Effect` (in `cnn_v3/src/cnn_v3_effect.h/.cc`) implements the `Effect` base class.
It owns:
- 5 compute pipelines (enc0, enc1, bottleneck, dec1, dec0)
- 5 params uniform buffers with per-pass `weight_offset` + FiLM γ/β
- 1 shared storage buffer `weights_buf_` (~4 KB, read-only across all shaders)

### Wiring in a `.seq` file

```
SEQUENCE 0 0 "Scene with CNN v3"
  EFFECT
    GBufferEffect prev_cnn -> gbuf_feat0 gbuf_feat1 0 60
  EFFECT
    CNNv3Effect gbuf_feat0 gbuf_feat1 -> sink 0 60
```

Or direct C++:
```cpp
#include "cnn_v3/src/cnn_v3_effect.h"

auto cnn = std::make_shared<CNNv3Effect>(
    ctx,
    /*inputs=*/ {"gbuf_feat0", "gbuf_feat1"},
    /*outputs=*/{"cnn_output"},
    /*start=*/0.0f, /*end=*/60.0f);
```

### Uploading weights

Load `cnn_v3_weights.bin` once at startup, before the first `render()`:

```cpp
// Read the binary file
std::vector<uint8_t> data;
{
  std::ifstream f("cnn_v3_weights.bin", std::ios::binary | std::ios::ate);
  data.resize(f.tellg());
  f.seekg(0);
  f.read(reinterpret_cast<char*>(data.data()), data.size());
}

// Upload to the GPU
cnn->upload_weights(ctx.queue, data.data(), (uint32_t)data.size());
```

Before `upload_weights()`, all conv weights are zero, so the output is `sigmoid(0) = 0.5` gray.
After, the output reflects the trained style.

### Setting FiLM parameters each frame

Call before `render()` each frame:

```cpp
CNNv3FiLMParams fp;
fp.beat_phase      = params.beat_phase;        // 0-1 within the current beat
fp.beat_norm       = params.beat_time / 8.0f;  // normalized 8-beat cycle
fp.audio_intensity = params.audio_intensity;   // peak audio level [0,1]
fp.style_p0        = my_style_p0;              // user-defined style param
fp.style_p1        = my_style_p1;
cnn->set_film_params(fp);
cnn->render(encoder, params, nodes);
```

**Current `set_film_params` behaviour (placeholder):** applies a hardcoded linear mapping —
audio modulates gamma, beat modulates beta. This is a heuristic until `cnn_v3_film_mlp.bin`
is integrated as a CPU-side MLP.

**Future MLP inference** (when integrating `cnn_v3_film_mlp.bin`):
1. Load `cnn_v3_film_mlp.bin` → 4 matrices/biases in f32
2. Run the forward pass: `h = relu(cond @ L0_W.T + L0_b); out = h @ L1_W.T + L1_b`
3. Split `out[40]` into per-layer γ/β and write them into the Params structs directly

### Uniform struct layout (for debugging)

`CnnV3Params4ch` (enc0, dec1, dec0 — 64 bytes):
```
offset 0:    weight_offset u32
offset 4-31: padding (vec3u has align=16 in WGSL)
offset 32:   gamma[4] vec4f
offset 48:   beta[4] vec4f
```

`CnnV3ParamsEnc1` (enc1 — 96 bytes): same header, then `gamma_lo/hi` at 32/48 and `beta_lo/hi` at 64/80.

Static asserts in `cnn_v3_effect.h` verify the exact sizes; a compile failure here means the
WGSL layout diverged from the C++ struct.

### Intermediate node names

Internal textures are named `<output[0]>_enc0`, `_enc1`, `_bottleneck`, `_dec1`.
These are declared in `declare_nodes()` at the correct fractional resolutions (W/2, W/4).
Do not reference them from outside the effect unless debugging.

### Pitfalls

- **`upload_weights` size mismatch:** the call is a raw `wgpuQueueWriteBuffer`. If the `.bin`
  was generated with different `enc_channels`, inference silently corrupts. Always verify that
  the sizes match.
- **`set_film_params` must be called before `render()`** each frame; stale shadow copies from
  the previous frame persist otherwise.
- **GBufferEffect must precede CNNv3Effect** in the same command encoder.
- **Bind groups are rebuilt each `render()`** — node texture views may change on resize.

---
## 5. Running a Demo

### Build

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/demo
```

### Expected visual output

| Weights state | FiLM state | Expected output |
|---------------|-----------|-----------------|
| Not uploaded (zero) | any | Uniform gray (all channels ≈ 0.5) |
| Uploaded | Identity (γ=1, β=0) | Stylization from conv weights only |
| Uploaded | Varying beat_phase | Per-channel gamma/beta shift visible |
| Uploaded | Full audio + beat | Full dynamic style modulation |

### Sanity checks

1. **Black output:** GBufferEffect likely didn't run. Confirm that it precedes CNNv3Effect and
   that `set_scene()` was called.

2. **Uniform gray:** weights not uploaded. Check the file path and that `upload_weights` was
   called before the first `render()`.

3. **Correct but static style:** `set_film_params` may be called with constant zeros.
   Animate `beat_phase` 0→1 to verify the FiLM response.

4. **Resolution artefacts at enc1/bottleneck boundaries:** check that `W` and `H` are
   divisible by 4 (required by the 2-level pooling chain).

---

## 6. Parity Testing

The parity test validates that the WGSL shaders produce bit-accurate results vs. the
Python/NumPy reference implementation in `gen_test_vectors.py`.

### Build and run

```bash
cmake -B build -DDEMO_BUILD_TESTS=ON
cmake --build build -j4
cd build && ./test_cnn_v3_parity
```

Two tests run:
1. **Zero-weight test:** all conv weights zero → the output must equal `sigmoid(0) = 0.5`
   (deterministic, no reference vectors needed)
2. **Random-weight test:** random weights from a fixed seed (42) applied to an 8×8 test
   tensor → the WGSL output is compared against Python-computed reference values

### Pass criteria

Tolerance: **max absolute error ≤ 1/255 = 3.92e-3** (one ULP in uint8 space)

Current results (8×8 tensors):
```
enc0  max_err = 1.95e-3 ✓
dec1  max_err = 1.95e-3 ✓
final max_err = 4.88e-4 ✓
```

### Regenerating test vectors

If you change `gen_test_vectors.py` or need to refresh the seed:

```bash
cd cnn_v3/training
python3 gen_test_vectors.py --header > ../test_vectors.h
```

Then recompile the parity test. The `--header` flag emits pure C to stdout; everything else
(self-test results) goes to stderr.

### Parity rules baked into the shaders

If results drift after shader edits, verify that these invariants match the Python reference:

| Rule | WGSL | Python (`gen_test_vectors.py`) |
|------|------|-------------------------------|
| Border padding | zero-pad (not clamp) | `np.pad(..., mode='constant')` |
| Downsampling | AvgPool 2×2 exact | `0.25 * sum of 4 neighbours` |
| Upsampling | `coord / 2` integer | `min(y//2, qH-1)` nearest |
| Skip connections | channel concatenation | `np.concatenate([up, skip], axis=2)` |
| FiLM application | after conv+bias, before ReLU | `max(0, γ·x + β)` |
| Weight layout | OIHW, biases appended | `o * IN * K² + i * K² + ky*K + kx` |
| f16 quantisation | rgba16float / rgba32uint boundaries | `np.float16(out).astype(np.float32)` |

### Pitfalls

- **Test fails on a null/headless backend:** the test requires a real GPU (Dawn/wgpu).
  It will error early if the WebGPU device cannot be created.
- **Consistent failure on the random-weight test only:** `test_vectors.h` is out of sync.
  Regenerate it with `gen_test_vectors.py --header` and recompile.
- **Consistent failure on both tests:** the shader logic diverged from the parity rules above.

---

## 7. HTML WebGPU Tool

### Current state

There is no dedicated CNN v3 HTML tool yet.
The CNN v2 tool (`cnn_v2/tools/cnn_v2_test/index.html`) is the reference pattern.

### CNN v2 tool as reference

The v2 tool is a single self-contained HTML file demonstrating:
- Inline WGSL shaders (no build step)
- Drag-and-drop `.bin` weight loading
- Image/video file input
- Intermediate-layer visualisation
- View modes: CNN output / original / diff×10
- A side panel with per-layer weight statistics

A v3 tool follows the same pattern with a more complex texture chain.

### What a CNN v3 HTML tool requires

**WGSL shaders to inline** (resolve `#include "cnn_v3/common"` via JS string substitution):

```js
const common = `/* contents of cnn_v3_common.wgsl */`;
const enc0_src = enc0_template.replace('#include "cnn_v3/common"', common);
```

**Texture chain:**

| Texture | Format | Size |
|---------|--------|------|
| feat_tex0 (input) | rgba32uint | W × H |
| feat_tex1 (input) | rgba32uint | W × H |
| enc0_tex | rgba16float | W × H |
| enc1_tex | rgba32uint | W/2 × H/2 |
| bottleneck_tex | rgba32uint | W/4 × H/4 |
| dec1_tex | rgba16float | W/2 × H/2 |
| output_tex | rgba16float | W × H |

`rgba32uint` textures cannot be sampled; use `textureLoad` — already done in the shaders.
**Weight loading:**

```js
const resp = await fetch('cnn_v3_weights.bin');
const buf = await resp.arrayBuffer();
const gpu_buf = device.createBuffer({
  size: buf.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
device.queue.writeBuffer(gpu_buf, 0, buf);
```

**FiLM MLP inference (JS-side):**

```js
// Load cnn_v3_film_mlp.bin as a Float32Array
const mlp = new Float32Array(await (await fetch('cnn_v3_film_mlp.bin')).arrayBuffer());
const L0_W = mlp.subarray(0, 80);    // (16×5) row-major
const L0_b = mlp.subarray(80, 96);
const L1_W = mlp.subarray(96, 736);  // (40×16) row-major
const L1_b = mlp.subarray(736, 776);

function mlp_forward(cond5) {
  // h = relu(L0_W @ cond + L0_b)
  const h = new Float32Array(16);
  for (let o = 0; o < 16; o++) {
    let s = L0_b[o];
    for (let i = 0; i < 5; i++) s += L0_W[o * 5 + i] * cond5[i];
    h[o] = Math.max(0, s);
  }
  // out = L1_W @ h + L1_b
  const out = new Float32Array(40);
  for (let o = 0; o < 40; o++) {
    let s = L1_b[o];
    for (let i = 0; i < 16; i++) s += L1_W[o * 16 + i] * h[i];
    out[o] = s;
  }
  return out; // [γenc0×4, βenc0×4, γenc1×8, βenc1×8, γdec1×4, βdec1×4, γdec0×4, βdec0×4]
}
```

The 40 outputs are split into per-layer γ/β and uploaded to the 5 params uniform buffers
before each compute dispatch.

**Input feature assembly from a photo:**

For simple (photo-only) mode, build `feat_tex0` and `feat_tex1` from the image data:
- `feat_tex0`: pack albedo RGB (f16×3), normal XY (128,128 neutral → 0.0 in oct), depth (0), depth_grad (0,0) as `pack2x16float` into rgba32uint
- `feat_tex1`: pack mat_id (0), prev.rgb (0,0,0), mip1.rgb, mip2.rgb, shadow (1.0), transp (0) as `pack4x8unorm` into rgba32uint

See `cnn_v3/shaders/gbuf_pack.wgsl` for the exact packing layout (mirrors `GBufferEffect`).
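To prototype the feature assembly outside the browser, the two WGSL packing intrinsics can be mimicked in NumPy. A sketch under the feat_tex0/feat_tex1 layouts described above — the function names mirror the WGSL builtins, but this is illustrative code, not part of the project:

```python
import numpy as np

def pack2x16float(a, b):
    """WGSL pack2x16float equivalent: a -> low 16 bits, b -> high 16 bits."""
    lo = np.array([a], dtype=np.float16).view(np.uint16)[0]
    hi = np.array([b], dtype=np.float16).view(np.uint16)[0]
    return np.uint32(lo) | (np.uint32(hi) << np.uint32(16))

def pack4x8unorm(r, g, b, a):
    """WGSL pack4x8unorm equivalent: four [0,1] floats -> one u32, r in the low byte."""
    vals = [int(round(min(max(v, 0.0), 1.0) * 255.0)) for v in (r, g, b, a)]
    return np.uint32(vals[0] | (vals[1] << 8) | (vals[2] << 16) | (vals[3] << 24))

# feat_tex0.x for one pixel, per the layout above:
#   x = pack2x16float(albedo_r, albedo_g)
```

Running the same packing on the CPU makes it easy to diff a hand-built `feat_tex0`/`feat_tex1` against what the shader sees.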
### Serving locally

Chrome requires a real HTTP server for WebGPU (not `file://`):

```bash
python3 -m http.server 8080
# Open: http://localhost:8080/cnn_v3/tools/cnn_v3_test/index.html
```

### Browser requirements

- Chrome 113+ with WebGPU enabled (the default on desktop)
- Firefox Nightly with `dom.webgpu.enabled = true`
- Required features: check `device.features.has('shader-f16')` for f16 support;
  fall back to f32 accumulation if absent

### Pitfalls

- `rgba32uint` requires the `STORAGE` + `TEXTURE_BINDING` usage flags; missing either causes bind-group creation to fail
- WGSL `#include "cnn_v3/common"` must be resolved via a JS string replace before passing the source to `device.createShaderModule()`
- Workgroup dispatch: `Math.ceil(W / 8)` × `Math.ceil(H / 8)` — the same formula as in C++
- Cross-origin image loading requires CORS headers or same-origin hosting

---

## Appendix A — File Reference

| File | Purpose |
|------|---------|
| `cnn_v3/training/blender_export.py` | Configure Blender Cycles passes, render multi-layer EXR |
| `cnn_v3/training/pack_blender_sample.py` | EXR → sample PNG directory (7 files) |
| `cnn_v3/training/pack_photo_sample.py` | Photo → zeroed-geometry sample directory |
| `cnn_v3/training/cnn_v3_utils.py` | Dataset class, feature assembly, channel dropout, salient-point detection |
| `cnn_v3/training/train_cnn_v3.py` | CNNv3 model definition, training loop, CLI |
| `cnn_v3/training/export_cnn_v3_weights.py` | Checkpoint → `cnn_v3_weights.bin` + `cnn_v3_film_mlp.bin` |
| `cnn_v3/training/gen_test_vectors.py` | NumPy reference forward pass + C header generator |
| `cnn_v3/test_vectors.h` | Compiled-in test vectors (auto-generated, do not edit) |
| `cnn_v3/src/cnn_v3_effect.h` | C++ class, Params structs, `CNNv3FiLMParams` API |
| `cnn_v3/src/cnn_v3_effect.cc` | Effect implementation: pipelines, render, weight upload |
| `cnn_v3/src/gbuffer_effect.h/.cc` | GBufferEffect: rasterise + pack G-buffer feature textures |
| `src/tests/gpu/test_cnn_v3_parity.cc` | Per-pixel parity test (WGSL vs. Python reference) |
| `cnn_v3/docs/CNN_V3.md` | Full architecture spec (U-Net, FiLM, WGSL uniform layouts) |
| `cnn_v2/tools/cnn_v2_test/index.html` | HTML tool reference pattern (v2) |

---

## Appendix B — 20-Channel Feature Layout

| Index | Channel | Source | Encoding |
|-------|---------|--------|----------|
| 0–2 | albedo.rgb | `albedo.png` | f32 [0,1] |
| 3–4 | normal.xy | `normal.png` RG | oct-encoded f32 [0,1] |
| 5 | depth | `depth.png` | f32 [0,1] (1/(z+1)) |
| 6–7 | depth_grad.xy | computed from depth | central diff, signed |
| 8 | mat_id | `matid.png` | f32 [0,1] |
| 9–11 | prev.rgb | previous frame output | zero during training |
| 12–14 | mip1.rgb | pyrdown(albedo) | f32 [0,1] |
| 15–17 | mip2.rgb | pyrdown(mip1) | f32 [0,1] |
| 18 | shadow | `shadow.png` | f32 [0,1] (1 = lit) |
| 19 | transp | `transp.png` | f32 [0,1] (0 = opaque) |

**Feature texture packing** (`feat_tex0` / `feat_tex1`, both `rgba32uint`):

```
feat_tex0 (4×u32 = 8 f16 channels via pack2x16float):
  .x = pack2x16float(albedo.r, albedo.g)
  .y = pack2x16float(albedo.b, normal.x)
  .z = pack2x16float(normal.y, depth)
  .w = pack2x16float(dgrad.x, dgrad.y)

feat_tex1 (4×u32 = 12 u8 channels + padding via pack4x8unorm):
  .x = pack4x8unorm(mat_id, prev.r, prev.g, prev.b)
  .y = pack4x8unorm(mip1.r, mip1.g, mip1.b, mip2.r)
  .z = pack4x8unorm(mip2.g, mip2.b, shadow, transp)
  .w = 0 (unused, 8 reserved channels)
```
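For CPU-side debugging of packed feature texels, the inverse of `pack4x8unorm` is a one-liner. A sketch (our helper, not project code):

```python
def unpack4x8unorm(word):
    """Inverse of WGSL pack4x8unorm: one u32 -> four floats in [0,1], low byte first."""
    w = int(word)
    return tuple(((w >> (8 * i)) & 0xFF) / 255.0 for i in range(4))

# Recover shadow and transp from a feat_tex1.z texel, per the layout above:
#   mip2_g, mip2_b, shadow, transp = unpack4x8unorm(texel_z)
```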
