Diffstat (limited to 'cnn_v3/docs')
-rw-r--r--  cnn_v3/docs/CNN_V3.md              |  87
-rw-r--r--  cnn_v3/docs/HOWTO.md               |  47
-rw-r--r--  cnn_v3/docs/HOW_TO_CNN.md          |  32
-rw-r--r--  cnn_v3/docs/cnn_v3_architecture.png | bin 254783 -> 256685 bytes
-rw-r--r--  cnn_v3/docs/gen_architecture_png.py |  18
5 files changed, 98 insertions(+), 86 deletions(-)
diff --git a/cnn_v3/docs/CNN_V3.md b/cnn_v3/docs/CNN_V3.md
index d775e2b..a197a1d 100644
--- a/cnn_v3/docs/CNN_V3.md
+++ b/cnn_v3/docs/CNN_V3.md
@@ -19,7 +19,7 @@ CNN v3 is a next-generation post-processing effect using:
- Training from both Blender renders and real photos
- Strict test framework: per-pixel bit-exact validation across all implementations
-**Status:** Phases 1–5 complete. Parity validated (max_err=4.88e-4 ≤ 1/255). Next: `train_cnn_v3.py` for FiLM MLP training.
+**Status:** Phases 1–7 complete. Architecture upgraded to enc_channels=[8,16] for improved capacity. Parity test and runtime updated. Next: training pass.
---
@@ -52,14 +52,14 @@ A small MLP takes a conditioning vector `c` and outputs all γ/β:
c = [beat_phase, beat_time/8, audio_intensity, style_p0, style_p1] (5D)
↓ Linear(5 → 16) → ReLU
↓ Linear(16 → N_film_params)
- → [γ_enc0(4ch), β_enc0(4ch), γ_enc1(8ch), β_enc1(8ch),
- γ_dec1(4ch), β_dec1(4ch), γ_dec0(4ch), β_dec0(4ch)]
- = 2 × (4+8+4+4) = 40 parameters output
+ → [γ_enc0(8ch), β_enc0(8ch), γ_enc1(16ch), β_enc1(16ch),
+ γ_dec1(8ch), β_dec1(8ch), γ_dec0(4ch), β_dec0(4ch)]
+ = 2 × (8+16+8+4) = 72 parameters output
```
**Runtime cost:** trivial (one MLP forward pass per frame, CPU-side).
**Training:** jointly trained with U-Net — backprop through FiLM to MLP.
-**Size:** MLP weights ~(5×16 + 16×40) × 2 bytes f16 ≈ 1.4 KB.
+**Size:** MLP weights ~(5×16 + 16×72) × 2 bytes f16 ≈ 2.5 KB.
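The 5→16→72 MLP above can be sketched in a few lines of NumPy (a shape-level sanity check only; weight values here are random placeholders, real ones come from `cnn_v3_film_mlp.bin`):

```python
import numpy as np

rng = np.random.default_rng(0)
W0, b0 = rng.standard_normal((16, 5)), np.zeros(16)   # Linear(5 -> 16)
W1, b1 = rng.standard_normal((72, 16)), np.zeros(72)  # Linear(16 -> 72)

def film_forward(c):
    h = np.maximum(W0 @ c + b0, 0.0)            # ReLU
    out = W1 @ h + b1                           # 72 gamma/beta values
    # Split per level: gamma/beta for enc0(8), enc1(16), dec1(8), dec0(4)
    sizes = [8, 8, 16, 16, 8, 8, 4, 4]          # gamma_enc0, beta_enc0, ...
    return np.split(out, np.cumsum(sizes)[:-1])

c = np.array([0.25, 0.5, 0.8, 0.0, 1.0])        # 5D conditioning vector
parts = film_forward(c)
```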
**Why FiLM instead of just uniform parameters?**
- γ/β are per-channel, enabling fine-grained style control
@@ -318,22 +318,25 @@ All f16, little-endian, same packing as v2 (`pack2x16float`).
## Size Budget
-**CNN v3 target: ≤ 6 KB weights**
+**CNN v3 target: ≤ 6 KB weights (conv only); current arch prioritises quality**
-**Implemented architecture (fits ≤ 4 KB):**
+**Implemented architecture (enc_channels=[8,16] — ~15.3 KB conv f16):**
| Component | Weights | Bias | Total f16 |
|-----------|---------|------|-----------|
-| enc0: Conv(20→4, 3×3) | 20×4×9=720 | +4 | 724 |
-| enc1: Conv(4→8, 3×3) | 4×8×9=288 | +8 | 296 |
-| bottleneck: Conv(8→8, 3×3, dil=2) | 8×8×9=576 | +8 | 584 |
-| dec1: Conv(16→4, 3×3) | 16×4×9=576 | +4 | 580 |
-| dec0: Conv(8→4, 3×3) | 8×4×9=288 | +4 | 292 |
-| FiLM MLP (5→16→40) | 5×16+16×40=720 | +16+40 | 776 |
-| **Total conv** | | | **~4.84 KB f16** |
+| enc0: Conv(20→8, 3×3) | 20×8×9=1440 | +8 | 1448 |
+| enc1: Conv(8→16, 3×3) | 8×16×9=1152 | +16 | 1168 |
+| bottleneck: Conv(16→16, 3×3, dil=2) | 16×16×9=2304 | +16 | 2320 |
+| dec1: Conv(32→8, 3×3) | 32×8×9=2304 | +8 | 2312 |
+| dec0: Conv(16→4, 3×3) | 16×4×9=576 | +4 | 580 |
+| **Total conv** | | | **7828 f16 = ~15.3 KB** |
+| FiLM MLP (5→16→72) | 5×16+16×72=1232 | +16+72 | 1320 |
+| **Total incl. MLP** | | | **9148 f16 = ~17.9 KB** |
-Skip connections: dec1 input = 8ch (bottleneck) + 8ch (enc1 skip) = 16ch.
-dec0 input = 4ch (dec1) + 4ch (enc0 skip) = 8ch.
+Skip connections: dec1 input = 16ch (bottleneck up) + 16ch (enc1 skip) = 32ch.
+dec0 input = 8ch (dec1 up) + 8ch (enc0 skip) = 16ch.
+
+**Smaller variant (enc_channels=[4,8] — ~4.84 KB conv f16):** fits 6 KB target but has lower representational capacity. Train with `--enc-channels 4,8` if size-critical.
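The totals in the table above can be reproduced with a few lines of Python (a sanity check of the arithmetic, not project code):

```python
# f16 count per conv layer = in*out*9 weights + out biases (3x3 kernels).
layers = {"enc0": (20, 8), "enc1": (8, 16), "bottleneck": (16, 16),
          "dec1": (32, 8), "dec0": (16, 4)}
conv_f16 = sum(ci * co * 9 + co for ci, co in layers.values())
mlp_f16 = 5 * 16 + 16 + 16 * 72 + 72    # FiLM MLP (5->16->72) incl. biases
assert conv_f16 == 7828                  # ~15.3 KB as f16
assert conv_f16 + mlp_f16 == 9148        # ~17.9 KB incl. MLP
```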
---
@@ -507,7 +510,7 @@ All tests: max per-pixel per-channel absolute error ≤ 1/255 (PyTorch f32 vs We
```python
class CNNv3(nn.Module):
- def __init__(self, enc_channels=[4,8], film_cond_dim=5):
+ def __init__(self, enc_channels=[8,16], film_cond_dim=5):
super().__init__()
# Encoder
self.enc = nn.ModuleList([
@@ -681,11 +684,11 @@ Parity results:
```
Pass 0: pack_gbuffer.wgsl — assemble G-buffer channels into storage texture
-Pass 1: cnn_v3_enc0.wgsl — encoder level 0 (20→4ch, 3×3)
-Pass 2: cnn_v3_enc1.wgsl — encoder level 1 (4→8ch, 3×3) + downsample
-Pass 3: cnn_v3_bottleneck.wgsl — bottleneck (8→8, 3×3, dilation=2)
-Pass 4: cnn_v3_dec1.wgsl — decoder level 1: upsample + skip + (16→4, 3×3)
-Pass 5: cnn_v3_dec0.wgsl — decoder level 0: upsample + skip + (8→4, 3×3)
+Pass 1: cnn_v3_enc0.wgsl — encoder level 0 (20→8ch, 3×3)
+Pass 2: cnn_v3_enc1.wgsl — encoder level 1 (8→16ch, 3×3) + downsample
+Pass 3: cnn_v3_bottleneck.wgsl — bottleneck (16→16, 3×3, dilation=2)
+Pass 4: cnn_v3_dec1.wgsl — decoder level 1: upsample + skip + (32→8, 3×3)
+Pass 5: cnn_v3_dec0.wgsl — decoder level 0: upsample + skip + (16→4, 3×3)
Pass 6: cnn_v3_output.wgsl — sigmoid + composite to framebuffer
```
@@ -788,11 +791,11 @@ Status bar shows which channels are loaded.
| Shader | Replaces | Notes |
|--------|----------|-------|
| `PACK_SHADER` | `STATIC_SHADER` | 20ch into feat_tex0 + feat_tex1 (rgba32uint each) |
-| `ENC0_SHADER` | part of `CNN_SHADER` | Conv(20→4, 3×3) + FiLM + ReLU; writes enc0_tex |
-| `ENC1_SHADER` | | Conv(4→8, 3×3) + FiLM + ReLU + avg_pool2×2; writes enc1_tex (half-res) |
-| `BOTTLENECK_SHADER` | | Conv(8→8, 3×3, dilation=2) + ReLU; writes bn_tex |
-| `DEC1_SHADER` | | nearest upsample×2 + concat(bn, enc1_skip) + Conv(16→4, 3×3) + FiLM + ReLU |
-| `DEC0_SHADER` | | nearest upsample×2 + concat(dec1, enc0_skip) + Conv(8→4, 3×3) + FiLM + ReLU |
+| `ENC0_SHADER` | part of `CNN_SHADER` | Conv(20→8, 3×3) + FiLM + ReLU; writes enc0_tex (rgba32uint, 8ch) |
+| `ENC1_SHADER` | | Conv(8→16, 3×3) + FiLM + ReLU + avg_pool2×2; writes enc1_lo+enc1_hi (2× rgba32uint, 16ch split) |
+| `BOTTLENECK_SHADER` | | Conv(16→16, 3×3, dilation=2) + ReLU; writes bn_lo+bn_hi (2× rgba32uint, 16ch split) |
+| `DEC1_SHADER` | | nearest upsample×2 + concat(bn, enc1_skip) + Conv(32→8, 3×3) + FiLM + ReLU; writes dec1_tex (rgba32uint, 8ch) |
+| `DEC0_SHADER` | | nearest upsample×2 + concat(dec1, enc0_skip) + Conv(16→4, 3×3) + FiLM + ReLU; writes rgba16float |
| `OUTPUT_SHADER` | | Conv(4→4, 1×1) + sigmoid → composites to canvas |
FiLM γ/β computed JS-side from sliders (tiny MLP forward pass in JS), uploaded as uniform.
@@ -805,15 +808,15 @@ FiLM γ/β computed JS-side from sliders (tiny MLP forward pass in JS), uploaded
|------|------|--------|----------|
| `feat_tex0` | W×H | rgba32uint | feature buffer slots 0–7 (f16) |
| `feat_tex1` | W×H | rgba32uint | feature buffer slots 8–19 (u8+spare) |
-| `enc0_tex` | W×H | rgba32uint | 4 channels f16 (enc0 output, skip) |
-| `enc1_tex` | W/2×H/2 | rgba32uint | 8 channels f16 (enc1 out, skip) — 2 texels per pixel |
-| `bn_tex` | W/2×H/2 | rgba32uint | 8 channels f16 (bottleneck output) |
-| `dec1_tex` | W×H | rgba32uint | 4 channels f16 (dec1 output) |
-| `dec0_tex` | W×H | rgba32uint | 4 channels f16 (dec0 output) |
+| `enc0_tex` | W×H | rgba32uint | 8 channels f16 (enc0 output, skip) |
+| `enc1_lo` + `enc1_hi` | W/2×H/2 each | rgba32uint | 16 channels f16 split (enc1 out, skip) |
+| `bn_lo` + `bn_hi` | W/4×H/4 each | rgba32uint | 16 channels f16 split (bottleneck output) |
+| `dec1_tex` | W/2×H/2 | rgba32uint | 8 channels f16 (dec1 output) |
+| `dec0_tex` | W×H | rgba16float | 4 channels f16 (final RGBA output) |
| `prev_tex` | W×H | rgba16float | previous CNN output (temporal, `F16X8`) |
-Skip connections: enc0_tex and enc1_tex are **kept alive** across the full forward pass
-(not ping-ponged away). DEC1 and DEC0 read them directly.
+Skip connections: enc0_tex (8ch) and enc1_lo/enc1_hi (16ch split) are **kept alive** across the
+full forward pass (not ping-ponged away). DEC1 and DEC0 read them directly.
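The lo/hi split can be illustrated CPU-side: each rgba32uint texel holds 8 f16 channels, two per u32, matching the low-bits-first layout of WGSL `pack2x16float`. A minimal NumPy sketch (the helper name is hypothetical, not project code):

```python
import numpy as np

def pack_16ch(feat):                     # feat: (H, W, 16) float32
    # Reinterpret f16 bits as u16, then pair them into u32 words.
    h = feat.astype(np.float16).view(np.uint16).astype(np.uint32)
    u32 = h[..., 0::2] | (h[..., 1::2] << 16)   # (H, W, 8) u32
    return u32[..., :4], u32[..., 4:]           # lo texture, hi texture

feat = np.random.rand(2, 2, 16).astype(np.float32)
lo, hi = pack_16ch(feat)                 # channels 0-7 in lo, 8-15 in hi
```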
---
@@ -856,7 +859,7 @@ python3 -m http.server 8000
Ordered for parallel execution where possible. Phases 1 and 2 are independent.
-**Architecture locked:** enc_channels = [4, 8]. See Size Budget for weight counts.
+**Architecture:** enc_channels = [8, 16]. See Size Budget for weight counts.
---
@@ -881,7 +884,7 @@ before the real G-buffer exists. Wire real G-buffer in Phase 5.
**1a. PyTorch model**
- [ ] `cnn_v3/training/train_cnn_v3.py`
- - [ ] `CNNv3` class: U-Net [4,8], FiLM MLP (5→16→48), channel dropout
+ - [ ] `CNNv3` class: U-Net [8,16], FiLM MLP (5→16→72), channel dropout
- [ ] `GBufferDataset`: loads 20-channel feature tensors from packed PNGs
- [ ] Training loop, checkpointing, grayscale/RGBA loss option
@@ -919,11 +922,11 @@ no batch norm at inference, `#include` existing snippets where possible.
- writes feat_tex0 (f16×8) + feat_tex1 (u8×12, spare)
**2b. U-Net compute shaders**
-- [ ] `src/effects/cnn_v3_enc0.wgsl` — Conv(20→4, 3×3) + FiLM + ReLU
-- [ ] `src/effects/cnn_v3_enc1.wgsl` — Conv(4→8, 3×3) + FiLM + ReLU + avg_pool 2×2
-- [ ] `src/effects/cnn_v3_bottleneck.wgsl` — Conv(8→8, 1×1) + FiLM + ReLU
-- [ ] `src/effects/cnn_v3_dec1.wgsl` — nearest upsample×2 + concat enc1_skip + Conv(16→4, 3×3) + FiLM + ReLU
-- [ ] `src/effects/cnn_v3_dec0.wgsl` — nearest upsample×2 + concat enc0_skip + Conv(8→4, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_enc0.wgsl` — Conv(20→8, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_enc1.wgsl` — Conv(8→16, 3×3) + FiLM + ReLU + avg_pool 2×2
+- [ ] `src/effects/cnn_v3_bottleneck.wgsl` — Conv(16→16, 3×3, dilation=2) + ReLU
+- [ ] `src/effects/cnn_v3_dec1.wgsl` — nearest upsample×2 + concat enc1_skip + Conv(32→8, 3×3) + FiLM + ReLU
+- [ ] `src/effects/cnn_v3_dec0.wgsl` — nearest upsample×2 + concat enc0_skip + Conv(16→4, 3×3) + FiLM + ReLU
- [ ] `src/effects/cnn_v3_output.wgsl` — Conv(4→4, 1×1) + sigmoid → composite to framebuffer
Reuse from existing shaders:
@@ -941,7 +944,7 @@ Reuse from existing shaders:
- [ ] `src/effects/cnn_v3_effect.h` — class declaration
- textures: feat_tex0, feat_tex1, enc0_tex, enc1_tex (half-res), bn_tex (half-res), dec1_tex, dec0_tex
- **`WGPUTexture prev_cnn_tex_`** — persistent RGBA8, owned by effect, initialized black
- - `FilmParams` uniform buffer (γ/β for 4 levels = 48 floats = 192 bytes)
+ - `FilmParams` uniform buffer (γ/β for 4 levels = 72 floats = 288 bytes)
- FiLM MLP weights (loaded from .bin, run CPU-side per frame)
- [ ] `src/effects/cnn_v3_effect.cc` — implementation
diff --git a/cnn_v3/docs/HOWTO.md b/cnn_v3/docs/HOWTO.md
index 9a3efdf..ff8793f 100644
--- a/cnn_v3/docs/HOWTO.md
+++ b/cnn_v3/docs/HOWTO.md
@@ -267,22 +267,30 @@ Two source files:
```bash
cd cnn_v3/training
-# Patch-based (default) — 64×64 patches around Harris corners
-python3 train_cnn_v3.py \
+# Recommended: [8,16] channels + multi-scale loss (matches runtime)
+uv run python3 train_cnn_v3.py \
--input dataset/ \
- --input-mode simple \
- --epochs 200
+ --enc-channels 8,16 \
+ --epochs 5000 \
+ --checkpoint-dir checkpoints_8_16
# Full-image mode (resizes to 256×256)
-python3 train_cnn_v3.py \
+uv run python3 train_cnn_v3.py \
--input dataset/ \
- --input-mode full \
+ --enc-channels 8,16 \
--full-image --image-size 256 \
- --epochs 500
+ --epochs 5000
+
+# Size-budget variant [4,8] (fits 6 KB)
+uv run python3 train_cnn_v3.py \
+ --input dataset/ \
+ --enc-channels 4,8 \
+ --epochs 5000
# Quick smoke test: 1 epoch, small patches, random detector
-python3 train_cnn_v3.py \
+uv run python3 train_cnn_v3.py \
--input dataset/ --epochs 1 \
+ --enc-channels 8,16 \
--patch-size 32 --detector random
```

@@ -318,7 +326,7 @@ All other flags (`--epochs`, `--lr`, `--checkpoint-dir`, `--enc-channels`, etc.)
| `--detector` | `harris` | `harris` \| `shi-tomasi` \| `fast` \| `gradient` \| `random` |
| `--channel-dropout-p F` | `0.3` | Dropout prob for geometric channels |
| `--full-image` | off | Resize full image instead of cropping patches |
-| `--enc-channels C` | `4,8` | Encoder channel counts, comma-separated |
+| `--enc-channels C` | `4,8` | Encoder channel counts, comma-separated. `8,16` matches the current runtime; `4,8` fits the 6 KB size budget |
| `--film-cond-dim N` | `5` | FiLM conditioning input size |
| `--epochs N` | `200` | Training epochs |
| `--batch-size N` | `16` | Batch size |
@@ -397,6 +405,7 @@ Test vectors generated by `cnn_v3/training/gen_test_vectors.py` (PyTorch referen
| 5 — Parity validation | ✅ Done | test_cnn_v3_parity.cc, max_err=4.88e-4 |
| 6 — FiLM MLP training | ✅ Done | train_cnn_v3.py + cnn_v3_utils.py written |
| 7 — G-buffer visualizer (C++) | ✅ Done | GBufViewEffect, 36/36 tests pass |
| 7 — Sample loader (web tool) | ✅ Done | "Load sample directory" in cnn_v3/tools/ |
+| 8 — Architecture upgrade [8,16] | ✅ Done | enc_channels=[8,16], multi-scale loss, 16ch textures split into lo/hi pairs |
---
@@ -408,10 +417,10 @@ The common snippet provides `get_w()` and `unpack_8ch()`.
| Pass | Shader | Input(s) | Output | Dims |
|------|--------|----------|--------|------|
-| enc0 | `cnn_v3_enc0.wgsl` | feat_tex0+feat_tex1 (20ch) | enc0_tex rgba16float (4ch) | full |
-| enc1 | `cnn_v3_enc1.wgsl` | enc0_tex (AvgPool2×2 inline) | enc1_tex rgba32uint (8ch) | ½ |
-| bottleneck | `cnn_v3_bottleneck.wgsl` | enc1_tex (AvgPool2×2 inline) | bottleneck_tex rgba32uint (8ch) | ¼ |
-| dec1 | `cnn_v3_dec1.wgsl` | bottleneck_tex + enc1_tex (skip) | dec1_tex rgba16float (4ch) | ½ |
+| enc0 | `cnn_v3_enc0.wgsl` | feat_tex0+feat_tex1 (20ch) | enc0_tex rgba32uint (8ch) | full |
+| enc1 | `cnn_v3_enc1.wgsl` | enc0_tex (AvgPool2×2 inline) | enc1_lo+enc1_hi rgba32uint (16ch split) | ½ |
+| bottleneck | `cnn_v3_bottleneck.wgsl` | enc1_lo+enc1_hi (AvgPool2×2 inline) | bn_lo+bn_hi rgba32uint (16ch split) | ¼ |
+| dec1 | `cnn_v3_dec1.wgsl` | bn_lo+bn_hi + enc1_lo+enc1_hi (skip) | dec1_tex rgba32uint (8ch) | ½ |
| dec0 | `cnn_v3_dec0.wgsl` | dec1_tex + enc0_tex (skip) | output_tex rgba16float (4ch) | full |
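The data flow of the dec1 row above (nearest upsample ×2, concat the enc1 skip, yielding the 32-channel conv input) can be sketched at the shape level; spatial sizes are illustrative only:

```python
import numpy as np

bn = np.zeros((16, 16, 16))        # (C=16, H/4, W/4): bottleneck output
enc1 = np.zeros((16, 32, 32))      # (C=16, H/2, W/2): enc1 skip, kept alive
up = bn.repeat(2, axis=1).repeat(2, axis=2)   # nearest upsample -> (16, 32, 32)
dec1_in = np.concatenate([up, enc1], axis=0)  # (32, 32, 32): Conv(32->8) input
```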
**Parity rules baked into the shaders:**
@@ -437,12 +446,12 @@ FiLM γ/β are computed CPU-side by the FiLM MLP (Phase 4) and uploaded each fra
**Weight offsets** (f16 units, including bias):
| Layer | Weights | Bias | Total f16 |
|-------|---------|------|-----------|
-| enc0 | 20×4×9=720 | +4 | 724 |
-| enc1 | 4×8×9=288 | +8 | 296 |
-| bottleneck | 8×8×9=576 | +8 | 584 |
-| dec1 | 16×4×9=576 | +4 | 580 |
-| dec0 | 8×4×9=288 | +4 | 292 |
-| **Total** | | | **2476 f16 = ~4.84 KB** |
+| enc0 | 20×8×9=1440 | +8 | 1448 |
+| enc1 | 8×16×9=1152 | +16 | 1168 |
+| bottleneck | 16×16×9=2304 | +16 | 2320 |
+| dec1 | 32×8×9=2304 | +8 | 2312 |
+| dec0 | 16×4×9=576 | +4 | 580 |
+| **Total** | | | **7828 f16 = ~15.3 KB** |
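The per-layer start offsets inside `cnn_v3_weights.bin` follow from the table, assuming layers are packed back-to-back in table order (the `kEnc0Weights`-style C++ constants would mirror these values):

```python
# f16 start offset of each layer = running sum of the preceding totals.
sizes = {"enc0": 1448, "enc1": 1168, "bottleneck": 2320,
         "dec1": 2312, "dec0": 580}
offsets, cursor = {}, 0
for name, n in sizes.items():
    offsets[name] = cursor
    cursor += n
assert offsets["enc1"] == 1448 and offsets["dec0"] == 7248
assert cursor == 7828                   # total f16 count, ~15.3 KB
```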
**Asset IDs** (registered in `workspaces/main/assets.txt` + `src/effects/shaders.cc`):
`SHADER_CNN_V3_COMMON`, `SHADER_CNN_V3_ENC0`, `SHADER_CNN_V3_ENC1`,
diff --git a/cnn_v3/docs/HOW_TO_CNN.md b/cnn_v3/docs/HOW_TO_CNN.md
index 09db97c..11ed260 100644
--- a/cnn_v3/docs/HOW_TO_CNN.md
+++ b/cnn_v3/docs/HOW_TO_CNN.md
@@ -358,7 +358,7 @@ uv run train_cnn_v3.py \
The model prints its parameter count:
```
-Model: enc=[4, 8] film_cond_dim=5 params=3252 (~6.4 KB f16)
+Model: enc=[8, 16] film_cond_dim=5 params=9148 (~17.9 KB f16)
```
If `params` is much higher, `--enc-channels` was changed; update C++ constants accordingly.
@@ -492,12 +492,12 @@ WEIGHTS_CNN_V3_FILM_MLP, BINARY, weights/cnn_v3_film_mlp.bin, "CNN v3 FiLM MLP w
| Layer | f16 count | Bytes |
|-------|-----------|-------|
-| enc0 Conv(20→4,3×3)+bias | 724 | — |
-| enc1 Conv(4→8,3×3)+bias | 296 | — |
-| bottleneck Conv(8→8,3×3,dil=2)+bias | 584 | — |
-| dec1 Conv(16→4,3×3)+bias | 580 | — |
-| dec0 Conv(8→4,3×3)+bias | 292 | — |
-| **Total** | **2476 f16** | **4952 bytes** |
+| enc0 Conv(20→8,3×3)+bias | 1448 | — |
+| enc1 Conv(8→16,3×3)+bias | 1168 | — |
+| bottleneck Conv(16→16,3×3,dil=2)+bias | 2320 | — |
+| dec1 Conv(32→8,3×3)+bias | 2312 | — |
+| dec0 Conv(16→4,3×3)+bias | 580 | — |
+| **Total** | **7828 f16** | **15656 bytes** |
**`cnn_v3_film_mlp.bin`** — FiLM MLP weights as raw f32, row-major:
@@ -505,9 +505,9 @@ WEIGHTS_CNN_V3_FILM_MLP, BINARY, weights/cnn_v3_film_mlp.bin, "CNN v3 FiLM MLP w
|-------|-------|-----------|
| L0 weight | (16, 5) | 80 |
| L0 bias | (16,) | 16 |
-| L1 weight | (40, 16) | 640 |
-| L1 bias | (40,) | 40 |
-| **Total** | | **776 f32 = 3104 bytes** |
+| L1 weight | (72, 16) | 1152 |
+| L1 bias | (72,) | 72 |
+| **Total** | | **1320 f32 = 5280 bytes** |
The FiLM MLP is for CPU-side inference (future — see §4d). The U-Net weights in
`cnn_v3_weights.bin` are what you need immediately.
@@ -524,16 +524,16 @@ The export script produces this layout: `u32 = u16[0::2] | (u16[1::2] << 16)`.
```
Checkpoint: epoch=200 loss=0.012345
- enc_channels=[4, 8] film_cond_dim=5
+ enc_channels=[8, 16] film_cond_dim=5
cnn_v3_weights.bin
- 2476 f16 values → 1238 u32 → 4952 bytes
- Upload via CNNv3Effect::upload_weights(queue, data, 4952)
+ 7828 f16 values → 3914 u32 → 15656 bytes
+ Upload via CNNv3Effect::upload_weights(queue, data, 15656)
cnn_v3_film_mlp.bin
L0: weight (16, 5) + bias (16,)
- L1: weight (40, 16) + bias (40,)
- 776 f32 values → 3104 bytes
+ L1: weight (72, 16) + bias (72,)
+ 1320 f32 values → 5280 bytes
```
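The u16→u32 packing the export summary describes (`u32 = u16[0::2] | (u16[1::2] << 16)`) can be sketched with placeholder zero weights; the counts match the summary above:

```python
import numpy as np

w = np.zeros(7828, dtype=np.float16)          # flat f16 conv weights
u16 = w.view(np.uint16).astype(np.uint32)     # reinterpret f16 bits
u32 = u16[0::2] | (u16[1::2] << 16)           # 3914 packed words
data = u32.tobytes()                          # 15656 bytes (native byte order)
```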
### Pitfalls
@@ -542,7 +542,7 @@ cnn_v3_film_mlp.bin
assertion in the export script fires. The C++ weight-offset constants (`kEnc0Weights` etc.)
in `cnn_v3_effect.cc` must also be updated to match.
- **Old checkpoint missing `config`:** if `config` key is absent (checkpoint from a very early
- version), the script defaults to `enc_channels=[4,8], film_cond_dim=5`.
+ version), the script defaults to `enc_channels=[8,16], film_cond_dim=5`.
- **`weights_only=True`:** requires PyTorch ≥ 2.0. If you get a warning, upgrade torch.
---
diff --git a/cnn_v3/docs/cnn_v3_architecture.png b/cnn_v3/docs/cnn_v3_architecture.png
index 2116c2b..474f488 100644
--- a/cnn_v3/docs/cnn_v3_architecture.png
+++ b/cnn_v3/docs/cnn_v3_architecture.png
Binary files differ
diff --git a/cnn_v3/docs/gen_architecture_png.py b/cnn_v3/docs/gen_architecture_png.py
index bd60a97..1c2ff65 100644
--- a/cnn_v3/docs/gen_architecture_png.py
+++ b/cnn_v3/docs/gen_architecture_png.py
@@ -108,20 +108,20 @@ def dim_label(x, y, txt):
box(EX, Y_IN, BW, BH_IO, C_IO, 'G-Buffer Features',
'20 channels · full res')
-box(EX, Y_E0, BW, BH, C_ENC, 'enc0 Conv(20→4, 3×3) + FiLM + ReLU',
- 'full res · 4 ch')
+box(EX, Y_E0, BW, BH, C_ENC, 'enc0 Conv(20→8, 3×3) + FiLM + ReLU',
+ 'full res · 8 ch')
-box(EX, Y_E1, BW, BH, C_ENC, 'enc1 Conv(4→8, 3×3) + FiLM + ReLU',
- '½ res · 8 ch · (AvgPool↓ on input)')
+box(EX, Y_E1, BW, BH, C_ENC, 'enc1 Conv(8→16, 3×3) + FiLM + ReLU',
+ '½ res · 16 ch · (AvgPool↓ on input)')
box(BX, Y_BN, BW_BN, BH_BN, C_BN,
- 'bottleneck Conv(8→8, 3×3, dilation=2) + ReLU',
- '¼ res · 8 ch · no FiLM · effective RF ≈ 10 px @ ½res')
+ 'bottleneck Conv(16→16, 3×3, dilation=2) + ReLU',
+ '¼ res · 16 ch · no FiLM · effective RF ≈ 10 px @ ½res')
-box(DX, Y_D1, BW, BH, C_DEC, 'dec1 Conv(16→4, 3×3) + FiLM + ReLU',
- '½ res · 4 ch · (upsample↑ + cat enc1 skip)')
+box(DX, Y_D1, BW, BH, C_DEC, 'dec1 Conv(32→8, 3×3) + FiLM + ReLU',
+ '½ res · 8 ch · (upsample↑ + cat enc1 skip)')
-box(DX, Y_D0, BW, BH, C_DEC, 'dec0 Conv(8→4, 3×3) + FiLM + sigmoid',
+box(DX, Y_D0, BW, BH, C_DEC, 'dec0 Conv(16→4, 3×3) + FiLM + sigmoid',
'full res · 4 ch · (upsample↑ + cat enc0 skip)')
box(DX, Y_OUT, BW, BH_IO, C_IO, 'RGBA Output',