Diffstat (limited to 'doc')
-rw-r--r--  doc/AI_RULES.md                     19
-rw-r--r--  doc/CNN_DEBUG.md                    43
-rw-r--r--  doc/CNN_EFFECT.md                  378
-rw-r--r--  doc/CNN_RGBD_GRAYSCALE_SUMMARY.md  134
-rw-r--r--  doc/COMPLETED.md                    16
-rw-r--r--  doc/CONTRIBUTING.md                 15
-rw-r--r--  doc/EFFECT_WORKFLOW.md             228
-rw-r--r--  doc/HOWTO.md                        19
-rw-r--r--  doc/RECIPE.md                        4
9 files changed, 726 insertions, 130 deletions
diff --git a/doc/AI_RULES.md b/doc/AI_RULES.md
index d18a0cc..1a4ee78 100644
--- a/doc/AI_RULES.md
+++ b/doc/AI_RULES.md
@@ -5,3 +5,22 @@
- Prefer small, reviewable commits
- All `cmake --build` commands must use the `-j4` option for parallel building.
- after a task, a 'big' final commit should contain a short handoff tag like "handoff(Gemini):..." if you're gemini-cli, or "handoff(Claude): ..." if you're claude-code.
+
+## Adding Visual Effects
+
+**IMPORTANT:** When adding new visual effects, follow the complete workflow in `doc/EFFECT_WORKFLOW.md`.
+
+**Required steps (must complete ALL):**
+1. Create effect files (.h, .cc, .wgsl)
+2. Add shader to `workspaces/main/assets.txt`
+3. Add `.cc` to CMakeLists.txt GPU_SOURCES (BOTH sections: headless and normal)
+4. Include header in `src/gpu/demo_effects.h`
+5. Add to timeline with `EFFECT +` (priority modifier is REQUIRED)
+6. Add to test list in `src/tests/gpu/test_demo_effects.cc`
+7. Build and verify: `cmake --build build -j4 && cd build && ./test_demo_effects`
+
+**Common mistakes to avoid:**
+- Missing priority modifier in timeline (`EFFECT` must be `EFFECT +`, `EFFECT =`, or `EFFECT -`)
+- Adding `.cc` to only one CMakeLists.txt section (need BOTH headless and normal)
+- Wrong asset ID (check assets.txt entry name → `ASSET_SHADER_<NAME>`)
+- Forgetting to add to test file
diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md
new file mode 100644
index 0000000..dba0b60
--- /dev/null
+++ b/doc/CNN_DEBUG.md
@@ -0,0 +1,43 @@
+# CNN Effect Black Screen Bug - Resolution (2026-02)
+
+## Problem
+CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started.
+
+## Root Causes
+
+### Bug 1: Framebuffer Capture Timing
+**Location**: `src/gpu/effect.cc`
+**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene).
+**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run.
+
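+A minimal C++ sketch of the corrected ordering, using stand-in types
+(`Texture`, `Effect`, and `blit` are placeholders; the real interfaces live in
+`src/gpu/effect.cc`):
+
+```cpp
+#include <utility>
+#include <vector>
+
+struct Texture {};  // placeholder for a GPU texture handle
+struct Effect {
+  virtual bool needs_framebuffer_capture() const { return false; }
+  virtual void render(Texture& src, Texture& dst) = 0;
+  virtual ~Effect() = default;
+};
+void blit(const Texture& src, Texture& dst) { dst = src; }  // copy helper
+
+void run_post_chain(std::vector<Effect*>& effects, Texture& framebuffer_a,
+                    Texture& framebuffer_b, Texture& captured_frame) {
+  // Capture ONCE, before any ping-pong swap, so every CNN layer sees the
+  // original scene instead of a half-written output buffer.
+  for (const Effect* e : effects) {
+    if (e->needs_framebuffer_capture()) {
+      blit(framebuffer_a, captured_frame);
+      break;
+    }
+  }
+  Texture* ping = &framebuffer_a;
+  Texture* pong = &framebuffer_b;
+  for (Effect* e : effects) {  // swaps below no longer affect the capture
+    e->render(*ping, *pong);
+    std::swap(ping, pong);
+  }
+}
+```
+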
+### Bug 2: Missing Uniforms Update ⚠️ CRITICAL
+**Location**: `src/gpu/effects/cnn_effect.cc`
+**Issue**: `CNNEffect::update_bind_group()` never updated the `uniforms_` buffer, so `uniforms.resolution` stayed uninitialized (0,0 or garbage) → the UV calculation `p.xy / uniforms.resolution` produced NaN/Inf → all texture samples returned black.
+**Fix**: Added uniforms update before bind group creation (lines 132-142):
+```cpp
+const CommonPostProcessUniforms u = {
+ .resolution = {(float)width_, (float)height_},
+ .aspect_ratio = (float)width_ / (float)height_,
+ .time = 0.0f,
+ .beat = 0.0f,
+ .audio_intensity = 0.0f,
+};
+uniforms_.update(ctx_.queue, u);
+```
+
+## Key Lessons
+
+1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters
+2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts
+3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors
+4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes
+
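+For lesson 4 in WebGPU terms (standard `webgpu.h` C API; `target_view` is an
+assumed parameter name, not taken from this codebase):
+
+```cpp
+#include <webgpu/webgpu.h>
+
+// loadOp = Load keeps whatever the target already contains. If an effect
+// then issues no draw call, the stale (possibly black) texels pass through
+// without any validation error.
+WGPURenderPassColorAttachment make_attachment(WGPUTextureView target_view) {
+  WGPURenderPassColorAttachment att = {};
+  att.view = target_view;
+  att.loadOp = WGPULoadOp_Load;     // preserve existing contents
+  att.storeOp = WGPUStoreOp_Store;
+  return att;
+}
+```
+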
+## Files Modified
+- `src/gpu/effect.cc`: Lines 308-346 (capture timing)
+- `src/gpu/effects/cnn_effect.cc`: Lines 132-142 (uniforms update)
+
+## Verification
+Test: `demo64k --seek 11.5`
+- ✅ Scene visible with RotatingCube
+- ✅ CNN stylization applied
+- ✅ All 3 layers process with correct original texture reference
diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md
index 9045739..d51c187 100644
--- a/doc/CNN_EFFECT.md
+++ b/doc/CNN_EFFECT.md
@@ -6,157 +6,281 @@ Neural network-based stylization for rendered scenes.
## Overview
-The CNN effect applies trainable convolutional neural network layers to post-process 3D rendered output, enabling artistic stylization (e.g., painterly, sketch, cel-shaded effects) with minimal runtime overhead.
+Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.
**Key Features:**
-- Multi-layer convolutions (3×3, 5×5, 7×7 kernels)
+- Position-aware layer 0 (coordinate input for vignetting, edge effects)
+- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
+- Original input available to all layers via framebuffer capture
+- Configurable final blend with original scene
- Modular WGSL shader architecture
-- Hardcoded weights (trained offline)
-- Residual connections for stable learning
+- Hardcoded weights (trained offline via PyTorch)
- ~5-8 KB binary footprint
---
## Architecture
-### File Structure
-
-```
-src/gpu/effects/
- cnn_effect.h # CNNEffect class
- cnn_effect.cc # Implementation
+### RGBD → Grayscale Pipeline
-workspaces/main/shaders/cnn/
- cnn_activation.wgsl # Activation functions (tanh, ReLU, sigmoid, leaky_relu)
- cnn_conv3x3.wgsl # 3×3 convolution
- cnn_conv5x5.wgsl # 5×5 convolution
- cnn_conv7x7.wgsl # 7×7 convolution
- cnn_weights_generated.wgsl # Weight arrays (generated by training script)
- cnn_layer.wgsl # Main shader (composes above snippets)
-```
+**Input:** RGBD (RGB + inverse depth D=1/z)
+**Output:** Grayscale (1 channel)
+**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
-### Shader Composition
+**Architecture:**
+- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
+- **Final layer (N-1):** Conv2d(7→1) - output grayscale
-`cnn_layer.wgsl` uses `#include` directives (resolved by `ShaderComposer`):
```wgsl
-#include "common_uniforms"
-#include "cnn_activation"
-#include "cnn_conv3x3"
-#include "cnn_weights_generated"
+// Inner layers: 7→4 (RGBD output)
+fn cnn_conv3x3_7to4(
+ tex: texture_2d<f32>,
+ samp: sampler,
+ uv: vec2<f32>,
+ resolution: vec2<f32>,
+  original: vec4<f32>,               // Original RGBD [-1,1]
+  weights: array<array<f32, 8>, 36>  // 9 pos × 4 out × (7 weights + bias)
+) -> vec4<f32>
+
+// Final layer: 7→1 (grayscale output)
+fn cnn_conv3x3_7to1(
+ tex: texture_2d<f32>,
+ samp: sampler,
+ uv: vec2<f32>,
+ resolution: vec2<f32>,
+ original: vec4<f32>,
+  weights: array<array<f32, 8>, 9>  // 9 pos × (7 weights + bias)
+) -> f32
```
----
+**Input normalization:**
+- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
+- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
+- **Grayscale** computed from normalized RGBD: `0.2126*R + 0.7152*G + 0.0722*B`
+- **Inter-layer data** stays in [-1,1] (no denormalization)
+- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]
-## Usage
+**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
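+
+As a CPU-side cross-check, a small reference sketch (not repository code)
+that folds the texture and UV normalization into one helper:
+
+```cpp
+#include <array>
+
+// Builds the 7-channel layer input from an RGBD texel and UV, both in [0,1].
+std::array<float, 7> make_layer_input(std::array<float, 4> rgbd_01,
+                                      float u_01, float v_01) {
+  std::array<float, 7> in{};
+  for (int i = 0; i < 4; ++i) in[i] = (rgbd_01[i] - 0.5f) * 2.0f;  // RGBD
+  in[4] = (u_01 - 0.5f) * 2.0f;                                    // U
+  in[5] = (v_01 - 0.5f) * 2.0f;                                    // V
+  // Grayscale from the *normalized* RGB (Rec. 709 weights), already in [-1,1].
+  in[6] = 0.2126f * in[0] + 0.7152f * in[1] + 0.0722f * in[2];
+  return in;
+}
+// Display output is denormalized at the very end: out_01 = (result + 1.0f) * 0.5f
+```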
-### C++ Integration
+### Multi-Layer Architecture
-```cpp
-#include "gpu/effects/cnn_effect.h"
+CNNEffect supports multi-layer networks via automatic effect chaining:
-// Create effect (1 layer for now, expandable to 4)
-auto cnn = std::make_shared<CNNEffect>(ctx, /*num_layers=*/1);
+1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
+2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
+3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
+4. **Original input binding**: All layers access original via `@binding(4)`
+5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`
-// Add to timeline
-timeline.add_effect(cnn, start_time, end_time);
-```
+**Framebuffer Capture API:**
+- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
+- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
+- Generic mechanism usable by any effect
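+
+A sketch of how an effect opts in (`needs_framebuffer_capture()` is the hook
+listed above; the exact signature and surrounding class shape are assumed):
+
+```cpp
+#include "gpu/effect.h"  // Effect base class
+
+class MyCaptureEffect : public Effect {
+ public:
+  // MainSequence sees this flag and blits the chain input into the
+  // "captured_frame" auxiliary texture before any post-effect runs.
+  bool needs_framebuffer_capture() const override { return true; }
+  // The effect can then bind that texture (CNN uses @binding(4))
+  // to read the original scene from its shader.
+};
+```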
-### Timeline Example
+### File Structure
```
-SEQUENCE 10.0 0
- EFFECT CNNEffect 10.0 15.0 0 # Apply CNN stylization for 5 seconds
+src/gpu/effects/
+ cnn_effect.h/cc # CNNEffect class + framebuffer capture
+
+workspaces/main/shaders/cnn/
+ cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu
+ cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware)
+ cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware)
+ cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware)
+ cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py)
+ cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py)
```
---
-## Training Workflow (Planned)
+## Training Workflow
+
+### 1. Prepare Training Data
+
+Collect input/target image pairs:
+- **Input:** RGBA (RGB + depth as alpha channel, D=1/z)
+- **Target:** Grayscale stylized output
-**Step 1: Prepare Training Data**
```bash
-# Collect before/after image pairs
-# - Before: Raw 3D render
-# - After: Target artistic style (hand-painted, filtered, etc.)
+training/input/img_000.png # RGBA render (RGB + depth)
+training/output/img_000.png # Grayscale target
```
-**Step 2: Train Network**
+**Note:** Input images must be RGBA where alpha = inverse depth (1/z)
+
+### 2. Train Network
+
```bash
-python scripts/train_cnn.py \
- --input rendered_scene.png \
- --target stylized_scene.png \
+python3 training/train_cnn.py \
+ --input training/input \
+ --target training/output \
+ --layers 1 \
+ --kernel-sizes 3 \
+ --epochs 500 \
+ --checkpoint-every 50
+```
+
+**Multi-layer example (3 layers with varying kernel sizes):**
+```bash
+python3 training/train_cnn.py \
+ --input training/input \
+ --target training/output \
--layers 3 \
- --kernel_sizes 3,5,3 \
- --epochs 100
+ --kernel-sizes 3,5,3 \
+ --epochs 1000 \
+ --checkpoint-every 100
+```
+
+**Note:** Training script auto-generates:
+- `cnn_weights_generated.wgsl` - weight arrays for all layers
+- `cnn_layer.wgsl` - shader with layer switches and original input binding
+
+**Resume from checkpoint:**
+```bash
+python3 training/train_cnn.py \
+ --input training/input \
+ --target training/output \
+ --resume training/checkpoints/checkpoint_epoch_200.pth
```
-**Step 3: Export Weights**
-```python
-# scripts/train_cnn.py automatically generates:
-# workspaces/main/shaders/cnn/cnn_weights_generated.wgsl
+**Export WGSL from checkpoint (no training):**
+```bash
+python3 training/train_cnn.py \
+ --export-only training/checkpoints/checkpoint_epoch_200.pth \
+ --output workspaces/main/shaders/cnn/cnn_weights_generated.wgsl
```
-**Step 4: Rebuild**
+### 3. Rebuild Demo
+
+Training script auto-generates both `cnn_weights_generated.wgsl` and `cnn_layer.wgsl`:
```bash
cmake --build build -j4
+./build/demo64k
```
---
-## Implementation Details
+## Usage
+
+### C++ Integration
-### Convolution Function Signature
+**Single layer (manual):**
+```cpp
+#include "gpu/effects/cnn_effect.h"
-```wgsl
-fn cnn_conv3x3(
- tex: texture_2d<f32>,
- samp: sampler,
- uv: vec2<f32>,
- resolution: vec2<f32>,
- weights: array<mat4x4<f32>, 9>, # 9 samples × 4×4 matrix
- bias: vec4<f32>
-) -> vec4<f32>
+CNNEffectParams p;
+p.layer_index = 0;
+p.total_layers = 1;
+p.blend_amount = 1.0f;
+auto cnn = std::make_shared<CNNEffect>(ctx, p);
+timeline.add_effect(cnn, start_time, end_time);
```
-- Samples 9 pixels (3×3 neighborhood)
-- Applies 4×4 weight matrix per sample (RGBA channels)
-- Returns weighted sum + bias (pre-activation)
+**Multi-layer (automatic via timeline compiler):**
-### Weight Storage
+Use the timeline syntax below; `seq_compiler` expands it into one CNNEffect instance per layer.
-Weights are stored as WGSL constants:
-```wgsl
-const weights_layer0: array<mat4x4<f32>, 9> = array(
- mat4x4<f32>(1.0, 0.0, 0.0, 0.0, ...), # Center pixel
- mat4x4<f32>(0.0, 0.0, 0.0, 0.0, ...), # Neighbor 1
- // ... 7 more matrices
-);
-const bias_layer0 = vec4<f32>(0.0, 0.0, 0.0, 0.0);
+### Timeline Examples
+
+**Single-layer CNN (full stylization):**
+```
+SEQUENCE 10.0 0
+ EFFECT + Hybrid3DEffect 0.00 5.00
+ EFFECT + CNNEffect 0.50 5.00 layers=1
```
-### Residual Connection
+**Multi-layer CNN with blend:**
+```
+SEQUENCE 10.0 0
+ EFFECT + Hybrid3DEffect 0.00 5.00
+ EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
+```
-Final layer adds original input:
-```wgsl
-if (params.use_residual != 0) {
- let input = textureSample(txt, smplr, uv);
- result = input + result * 0.3; # Blend 30% stylization
+Expands to:
+```cpp
+// Layer 0 (captures original, blend=1.0)
+{
+ CNNEffectParams p;
+ p.layer_index = 0;
+ p.total_layers = 3;
+ p.blend_amount = 1.0f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
+}
+// Layer 1 (blend=1.0)
+{
+ CNNEffectParams p;
+ p.layer_index = 1;
+ p.total_layers = 3;
+ p.blend_amount = 1.0f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
+}
+// Layer 2 (final blend=0.7)
+{
+ CNNEffectParams p;
+ p.layer_index = 2;
+ p.total_layers = 3;
+ p.blend_amount = 0.7f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```
---
-## Multi-Layer Rendering (Future)
+## Shader Structure
-For N layers, use ping-pong textures:
+**Bindings:**
+```wgsl
+@group(0) @binding(0) var smplr: sampler;
+@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input
+@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
+@group(0) @binding(3) var<uniform> params: CNNLayerParams;
+@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured)
+```
+
+**Fragment shader logic:**
+```wgsl
+@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
+ let uv = p.xy / uniforms.resolution;
+ let input = textureSample(txt, smplr, uv); // Layer N-1 output
+ let original = textureSample(original_input, smplr, uv); // Layer 0 input
+
+ var result = vec4<f32>(0.0);
+
+ if (params.layer_index == 0) {
+ result = cnn_conv3x3_with_coord(txt, smplr, uv, uniforms.resolution,
+ rgba_weights_layer0, coord_weights_layer0, bias_layer0);
+ result = cnn_tanh(result);
+ }
+ // ... other layers
+ // Blend with ORIGINAL input (not previous layer)
+ return mix(original, result, params.blend_amount);
+}
```
-Pass 0: input → temp_a (conv + activate)
-Pass 1: temp_a → temp_b (conv + activate)
-Pass 2: temp_b → temp_a (conv + activate)
-Pass 3: temp_a → screen (conv + activate + residual)
+
+**Weight Storage:**
+
+**Inner layers (7→4 RGBD output):**
+```wgsl
+// Structure: array<array<f32, 8>, 36>
+// 9 positions × 4 output channels, each with 7 weights + bias
+const weights_layer0: array<array<f32, 8>, 36> = array(
+ array<f32, 8>(w0_r, w0_g, w0_b, w0_d, w0_u, w0_v, w0_gray, bias0), // pos0_ch0
+ array<f32, 8>(w1_r, w1_g, w1_b, w1_d, w1_u, w1_v, w1_gray, bias1), // pos0_ch1
+ // ... 34 more entries
+);
```
-**Current Status:** Single-layer implementation. Multi-pass infrastructure ready but not exposed.
+**Final layer (7→1 grayscale output):**
+```wgsl
+// Structure: array<array<f32, 8>, 9>
+// 9 positions, each with 7 weights + bias
+const weights_layerN: array<array<f32, 8>, 9> = array(
+ array<f32, 8>(w0_r, w0_g, w0_b, w0_d, w0_u, w0_v, w0_gray, bias0), // pos0
+ // ... 8 more entries
+);
+```
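+
+To make the indexing concrete, a CPU-side reference of the final 7→1 layer
+under this layout. This is illustrative only; in particular it assumes the
+exporter spreads the layer's single bias across the 9 rows so that summing
+the row biases reproduces it (check `cnn_weights_generated.wgsl` for the
+actual convention):
+
+```cpp
+#include <array>
+
+using Input7 = std::array<float, 7>;     // [R, G, B, D, U, V, gray] in [-1,1]
+using WeightRow = std::array<float, 8>;  // 7 weights + bias
+
+// taps[pos] holds the 7-channel input at each of the 9 sample positions.
+float conv3x3_7to1(const std::array<Input7, 9>& taps,
+                   const std::array<WeightRow, 9>& weights) {
+  float acc = 0.0f;
+  for (int pos = 0; pos < 9; ++pos) {
+    for (int c = 0; c < 7; ++c) acc += weights[pos][c] * taps[pos][c];
+    acc += weights[pos][7];  // per-row bias contribution
+  }
+  return acc;  // grayscale in [-1,1]; the final layer has no activation
+}
+```
+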
---
@@ -164,60 +288,72 @@ Pass 3: temp_a → screen (conv + activate + residual)
| Component | Size | Notes |
|-----------|------|-------|
-| `cnn_activation.wgsl` | ~200 B | 4 activation functions |
-| `cnn_conv3x3.wgsl` | ~400 B | 3×3 convolution logic |
-| `cnn_conv5x5.wgsl` | ~600 B | 5×5 convolution logic |
-| `cnn_conv7x7.wgsl` | ~800 B | 7×7 convolution logic |
-| `cnn_layer.wgsl` | ~800 B | Main shader |
-| `cnn_effect.cc` | ~300 B | C++ implementation |
-| **Weights (variable)** | **2-6 KB** | Depends on network depth/width |
-| **Total** | **5-9 KB** | Acceptable for 64k demo |
+| Activation functions | ~200 B | 4 functions |
+| Conv3x3 (standard + coord) | ~500 B | Both variants |
+| Conv5x5 (standard + coord) | ~700 B | Both variants |
+| Conv7x7 (standard + coord) | ~900 B | Both variants |
+| Main shader | ~800 B | Layer composition |
+| C++ implementation | ~300 B | Effect class |
+| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
+| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
+| **Total** | **5-9 KB** | Acceptable for 64k |
-**Optimization Strategies:**
+**Optimization strategies:**
- Quantize weights (float32 → int8)
- Prune near-zero weights
-- Share weights across layers
-- Use separable convolutions (not yet implemented)
+- Use separable convolutions
---
## Testing
```bash
-# Run effect test
-./build/test_demo_effects
-
-# Visual test in demo
-./build/demo64k # CNN appears in timeline if added
+./build/test_demo_effects # CNN construction/shader tests
+./build/demo64k # Visual test
```
-**Test Coverage:**
-- Construction/initialization
-- Shader compilation
-- Bind group creation
-- Render pass execution
-
---
+## Blend Parameter Behavior
+
+**blend_amount** controls final compositing with original:
+- `blend=0.0`: Pure original (no CNN effect)
+- `blend=0.5`: 50% original + 50% CNN
+- `blend=1.0`: Pure CNN output (full stylization)
+
+**Important:** Blend uses captured layer 0 input, not previous layer output.
+
+**Example use cases:**
+- `blend=1.0`: Full stylization (default)
+- `blend=0.7`: Subtle effect preserving original details
+- `blend=0.3`: Light artistic touch
+
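+The composite is a single per-channel lerp; a C++ equivalent for reference:
+
+```cpp
+// original = captured layer-0 input, result = CNN output (per channel).
+float blend_channel(float original, float result, float blend_amount) {
+  return original * (1.0f - blend_amount) + result * blend_amount;  // mix()
+}
+```
+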
## Troubleshooting
**Shader compilation fails:**
- Check `cnn_weights_generated.wgsl` syntax
-- Verify all snippets registered in `shaders.cc::InitShaderComposer()`
+- Verify snippets registered in `shaders.cc::InitShaderComposer()`
+- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)
**Black/corrupted output:**
-- Weights likely untrained (using placeholder identity)
-- Check residual blending factor (0.3 default)
+- Weights untrained (identity placeholder)
+- Check `captured_frame` auxiliary texture is registered
+- Verify layer priorities in timeline are sequential
+
+**Wrong blend result:**
+- Ensure layer 0 has `needs_framebuffer_capture() == true`
+- Check MainSequence framebuffer capture logic
+- Verify `original_input` binding is populated
-**Performance issues:**
-- Reduce kernel sizes (7×7 → 3×3)
-- Decrease layer count
-- Profile with `--hot-reload` to measure frame time
+**Training loss not decreasing:**
+- Lower learning rate (`--learning-rate 0.0001`)
+- More epochs (`--epochs 1000`)
+- Check input/target image alignment
---
## References
-- **Shader Composition:** `doc/SEQUENCE.md` (shader parameters)
-- **Effect System:** `src/gpu/effect.h` (Effect base class)
-- **Training (external):** TensorFlow/PyTorch CNN tutorials
+- **Training Script:** `training/train_cnn.py`
+- **Shader Composition:** `doc/SEQUENCE.md`
+- **Effect System:** `src/gpu/effect.h`
diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
new file mode 100644
index 0000000..4c13693
--- /dev/null
+++ b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
@@ -0,0 +1,134 @@
+# CNN RGBD→Grayscale Architecture Implementation
+
+## Summary
+
+Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input.
+
+## Changes Made
+
+### Architecture
+
+**Input:** RGBD (4 channels: RGB + inverse depth D=1/z)
+**Output:** Grayscale (1 channel)
+**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
+
+**Layer Configuration:**
+- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation
+- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation
+
+### Input Normalization (all to [-1,1])
+
+- **RGBD:** `(rgbd - 0.5) * 2`
+- **UV coords:** `(uv - 0.5) * 2`
+- **Grayscale:** `(0.2126*R + 0.7152*G + 0.0722*B - 0.5) * 2`
+
+**Rationale:** Zero-centered inputs for tanh activation, better gradient flow.
+
+### Modified Files
+
+**Training (`/Users/skal/demo/training/train_cnn.py`):**
+1. Removed `CoordConv2d` class
+2. Updated `SimpleCNN`:
+ - Inner layers: `Conv2d(7, 4)` - RGBD output
+ - Final layer: `Conv2d(7, 1)` - grayscale output
+3. Updated `forward()`:
+ - Normalize RGBD/coords/gray to [-1,1]
+ - Concatenate 7-channel input for each layer
+ - Apply tanh (inner) or none (final)
+ - Denormalize final output
+4. Updated `export_weights_to_wgsl()`:
+ - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values)
+ - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values)
+5. Updated `generate_layer_shader()`:
+ - Use `cnn_conv3x3_7to4` for inner layers
+ - Use `cnn_conv3x3_7to1` for final layer
+ - Denormalize outputs from [-1,1] to [0,1]
+6. Updated `ImagePairDataset`:
+ - Load RGBA input (was RGB)
+
+**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):**
+1. Added `cnn_conv3x3_7to4()`:
+ - 7-channel input: [RGBD, uv_x, uv_y, gray]
+ - 4-channel output: RGBD
+ - Weights: `array<array<f32, 8>, 36>`
+2. Added `cnn_conv3x3_7to1()`:
+ - 7-channel input: [RGBD, uv_x, uv_y, gray]
+ - 1-channel output: grayscale
+ - Weights: `array<array<f32, 8>, 9>`
+
+**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):**
+1. Updated architecture section with RGBD→grayscale pipeline
+2. Updated training data requirements (RGBA input)
+3. Updated weight storage format
+
+### No C++ Changes
+
+CNNLayerParams and bind groups remain unchanged.
+
+## Data Flow
+
+1. Layer 0 captures original RGBD to `captured_frame`
+2. Each layer:
+ - Samples previous layer output (RGBD in [0,1])
+ - Normalizes RGBD to [-1,1]
+ - Computes UV coords and grayscale, normalizes to [-1,1]
+ - Concatenates 7-channel input
+ - Applies convolution with layer-specific weights
+ - Outputs RGBD (inner) or grayscale (final) in [-1,1]
+ - Applies tanh (inner only)
+ - Denormalizes to [0,1] for texture storage
+ - Blends with original
+
+## Next Steps
+
+1. **Prepare RGBD training data:**
+ - Input: RGBA images (RGB + depth in alpha)
+ - Target: Grayscale stylized output
+
+2. **Train network:**
+ ```bash
+ python3 training/train_cnn.py \
+ --input training/input \
+ --target training/output \
+ --layers 3 \
+ --epochs 1000
+ ```
+
+3. **Verify generated shaders:**
+ - Check `cnn_weights_generated.wgsl` structure
+ - Check `cnn_layer.wgsl` uses new conv functions
+
+4. **Test in demo:**
+ ```bash
+ cmake --build build -j4
+ ./build/demo64k
+ ```
+
+## Design Rationale
+
+**Why [-1,1] normalization?**
+- Centered inputs for tanh (operates best around 0)
+- Better gradient flow
+- Standard ML practice for normalized data
+
+**Why RGBD throughout vs RGB?**
+- Depth information propagates through network
+- Enables depth-aware stylization
+- Consistent 4-channel processing
+
+**Why 7-channel input?**
+- Coordinates: position-dependent effects (vignettes)
+- Grayscale: luminance-aware processing
+- RGBD: full color+depth information
+- Enables richer feature learning
+
+## Testing Checklist
+
+- [ ] Train network with RGBD input data
+- [ ] Verify `cnn_weights_generated.wgsl` structure
+- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions
+- [ ] Build demo without errors
+- [ ] Visual test: inner layers show RGBD evolution
+- [ ] Visual test: final layer produces grayscale
+- [ ] Visual test: blending works correctly
+- [ ] Compare quality with previous RGB→RGB architecture
diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md
index d1c89af..2336f62 100644
--- a/doc/COMPLETED.md
+++ b/doc/COMPLETED.md
@@ -29,6 +29,22 @@ Detailed historical documents have been moved to `doc/archive/` for reference:
Use `read @doc/archive/FILENAME.md` to access archived documents.
+## Recently Completed (February 10, 2026)
+
+- [x] **WGPU Boilerplate Factorization**
+ - **Goal**: Reduce repetitive WGPU code via builder pattern helpers
+ - **Implementation**:
+ - Created `BindGroupLayoutBuilder` and `BindGroupBuilder` for declarative bind group creation
+ - Created `RenderPipelineBuilder` to simplify pipeline setup with ShaderComposer integration
+ - Created `SamplerCache` singleton to deduplicate sampler instances
+ - Refactored `post_process_helper.cc`, `cnn_effect.cc`, `rotating_cube_effect.cc`
+ - **Result**:
+    - Bind group creation: 19 call sites, each reduced from 14 lines to 4
+    - Pipeline creation: 30-50 lines reduced to 8
+    - Sampler deduplication: 6 separate instances now served from the cache
+    - Total: 122 lines of boilerplate removed; binary size unchanged (6.3M debug)
+    - All tests pass; the builders also prevent binding-index errors
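+
+    A sketch of the resulting call-site shape (hypothetical fluent API; the
+    exact method names are not recorded in this entry):
+
+    ```cpp
+    // Illustrative only; see src/gpu for the real builder interface.
+    WGPUBindGroupLayout layout = BindGroupLayoutBuilder(device)
+        .sampler(0)   // @binding(0)
+        .texture(1)   // @binding(1)
+        .uniform(2)   // @binding(2)
+        .build();     // one declarative chain replaces ~14 lines of
+                      // WGPUBindGroupLayoutEntry boilerplate
+    ```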
+
## Recently Completed (February 9, 2026)
- [x] **External Library Size Measurement (Task #76)**
diff --git a/doc/CONTRIBUTING.md b/doc/CONTRIBUTING.md
index 9cd785b..98df873 100644
--- a/doc/CONTRIBUTING.md
+++ b/doc/CONTRIBUTING.md
@@ -65,12 +65,15 @@ See `doc/CODING_STYLE.md` for detailed examples.
## Development Protocols
### Adding Visual Effect
-1. Implement `Effect` subclass in `src/gpu/demo_effects.cc`
-2. Add to workspace `timeline.seq` (e.g., `workspaces/main/timeline.seq`)
-3. **Update `test_demo_effects.cc`**:
- - Add to test list
- - Increment `EXPECTED_*_COUNT`
-4. Verify:
+1. Create effect class files (use `tools/shadertoy/convert_shadertoy.py` or templates)
+2. Add shader to `workspaces/main/assets.txt`
+3. Add effect `.cc` file to `CMakeLists.txt` GPU_SOURCES (both sections)
+4. Include header in `src/gpu/demo_effects.h`
+5. Add to workspace `timeline.seq` (e.g., `workspaces/main/timeline.seq`)
+6. **Update `src/tests/gpu/test_demo_effects.cc`**:
+ - Add to `post_process_effects` list (lines 80-93) or `scene_effects` list (lines 125-137)
+ - Example: `{"MyEffect", std::make_shared<MyEffect>(fixture.ctx())},`
+7. Verify:
```bash
cmake -S . -B build -DDEMO_BUILD_TESTS=ON
cmake --build build -j4 --target test_demo_effects
diff --git a/doc/EFFECT_WORKFLOW.md b/doc/EFFECT_WORKFLOW.md
new file mode 100644
index 0000000..45c47b7
--- /dev/null
+++ b/doc/EFFECT_WORKFLOW.md
@@ -0,0 +1,228 @@
+# Effect Creation Workflow
+
+**Target Audience:** AI coding agents and developers
+
+Automated checklist for adding new visual effects to the demo.
+
+---
+
+## Quick Reference
+
+**For ShaderToy conversions:** Use `tools/shadertoy/convert_shadertoy.py` then follow steps 3-8 below.
+
+**For custom effects:** Follow all steps 1-8.
+
+---
+
+## Step-by-Step Workflow
+
+### 1. Create Effect Files
+
+**Location:**
+- Header: `src/gpu/effects/<effect_name>_effect.h`
+- Implementation: `src/gpu/effects/<effect_name>_effect.cc`
+- Shader: `workspaces/main/shaders/<effect_name>.wgsl`
+
+**Naming Convention:**
+- Class name: `<EffectName>Effect` (e.g., `TunnelEffect`, `PlasmaEffect`)
+- Files: `<effect_name>_effect.*` (snake_case)
+
+**Base Class:**
+- Post-process effects: inherit from `PostProcessEffect`
+- Scene effects: inherit from `Effect`
+
+**Template:** See `tools/shadertoy/template.*` or use `convert_shadertoy.py`
+
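+A minimal post-process skeleton under these conventions (a sketch; check
+existing effects in `src/gpu/effects/` for the exact base-class constructor
+and hooks, since the header path and `GpuContext` name here are assumptions):
+
+```cpp
+// src/gpu/effects/tunnel_effect.h (illustrative)
+#pragma once
+#include "gpu/effect.h"  // assumed location of the base classes
+
+class TunnelEffect : public PostProcessEffect {
+ public:
+  explicit TunnelEffect(GpuContext& ctx)
+      : PostProcessEffect(ctx, AssetId::ASSET_SHADER_TUNNEL) {}  // ID from step 2
+};
+```
+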
+### 2. Add Shader to Assets
+
+**File:** `workspaces/main/assets.txt`
+
+**Format:**
+```
+SHADER_<UPPER_SNAKE_NAME>, NONE, shaders/<effect_name>.wgsl, "Effect description"
+```
+
+**Example:**
+```
+SHADER_TUNNEL, NONE, shaders/tunnel.wgsl, "Tunnel effect shader"
+```
+
+**Asset ID:** Will be `AssetId::ASSET_SHADER_<UPPER_SNAKE_NAME>` in C++
+
+### 3. Add to CMakeLists.txt
+
+**File:** `CMakeLists.txt`
+
+**Action:** Add `src/gpu/effects/<effect_name>_effect.cc` to **BOTH** GPU_SOURCES sections:
+- Headless mode section (around line 141-167)
+- Normal mode section (around line 171-197)
+
+**Location:** After similar effects (post-process with post-process, scene with scene)
+
+**Example:**
+```cmake
+# In headless section (line ~152):
+ src/gpu/effects/solarize_effect.cc
+ src/gpu/effects/tunnel_effect.cc # <-- Add here
+ src/gpu/effects/chroma_aberration_effect.cc
+
+# In normal section (line ~183):
+ src/gpu/effects/solarize_effect.cc
+ src/gpu/effects/tunnel_effect.cc # <-- Add here
+ src/gpu/effects/chroma_aberration_effect.cc
+```
+
+### 4. Include in demo_effects.h
+
+**File:** `src/gpu/demo_effects.h`
+
+**Action:** Add include directive:
+```cpp
+#include "gpu/effects/<effect_name>_effect.h"
+```
+
+**Location:** Alphabetically with other effect includes
+
+### 5. Add to Timeline
+
+**File:** `workspaces/main/timeline.seq`
+
+**Format:**
+```
+SEQUENCE <start_time> <priority>
+ EFFECT <+|=|-> <EffectName>Effect <local_start> <local_end> [params...]
+```
+
+**Priority Modifiers (REQUIRED):**
+- `+` : Increment priority
+- `=` : Same priority as previous effect
+- `-` : Decrement priority (for backgrounds)
+
+**Example:**
+```
+SEQUENCE 0.0 0
+ EFFECT + TunnelEffect 0.0 10.0
+```
+
+**Common Mistake:** Missing priority modifier (`+`, `=`, `-`) after EFFECT keyword
+
+### 6. Update Tests
+
+**File:** `src/tests/gpu/test_demo_effects.cc`
+
+**Action:** Add effect to appropriate list:
+
+**Post-Process Effects (lines 80-93):**
+```cpp
+{"TunnelEffect", std::make_shared<TunnelEffect>(fixture.ctx())},
+```
+
+**Scene Effects (lines 125-137):**
+```cpp
+{"TunnelEffect", std::make_shared<TunnelEffect>(fixture.ctx())},
+```
+
+**3D Effects:** If the effect requires Renderer3D, add it to the `requires_3d` check (lines 148-151)
+
+### 7. Build and Test
+
+```bash
+# Full build
+cmake --build build -j4
+
+# Run effect tests
+cmake -S . -B build -DDEMO_BUILD_TESTS=ON
+cmake --build build -j4 --target test_demo_effects
+cd build && ./test_demo_effects
+
+# Run all tests
+cd build && ctest
+```
+
+### 8. Verify
+
+**Checklist:**
+- [ ] Effect compiles without errors
+- [ ] Effect appears in timeline
+- [ ] test_demo_effects passes
+- [ ] Effect renders correctly: `./build/demo64k`
+- [ ] No shader compilation errors
+- [ ] Follows naming conventions
+
+---
+
+## Common Issues
+
+### Build Error: "no member named 'ASSET_SHADER_...'"
+
+**Cause:** Shader not in assets.txt or wrong asset ID name
+
+**Fix:**
+1. Check `workspaces/main/assets.txt` has shader entry
+2. Asset ID is `ASSET_` + uppercase entry name (e.g., `SHADER_TUNNEL` → `ASSET_SHADER_TUNNEL`)
+
+### Build Error: "undefined symbol for architecture"
+
+**Cause:** Effect not in CMakeLists.txt GPU_SOURCES
+
+**Fix:** Add `.cc` file to BOTH sections (headless and normal mode)
+
+### Timeline Parse Error: "Expected '+', '=', or '-'"
+
+**Cause:** Missing priority modifier after EFFECT keyword
+
+**Fix:** Use `EFFECT +`, `EFFECT =`, or `EFFECT -` (never just `EFFECT`)
+
+### Test Failure: Effect not in test list
+
+**Cause:** Effect not added to test_demo_effects.cc
+
+**Fix:** Add to `post_process_effects` or `scene_effects` list
+
+---
+
+## Automation Script Example
+
+```bash
+#!/bin/bash
+# Example automation for AI agents
+
+EFFECT_NAME="$1" # CamelCase (e.g., "Tunnel")
+# Split on capitals, lowercase, strip the leading underscore (portable across BSD/GNU sed)
+SNAKE_NAME=$(echo "$EFFECT_NAME" | sed 's/\([A-Z]\)/_\1/g' | tr '[:upper:]' '[:lower:]' | sed 's/^_//')
+UPPER_NAME=$(echo "$SNAKE_NAME" | tr '[:lower:]' '[:upper:]')
+
+echo "Creating effect: $EFFECT_NAME"
+echo " Snake case: $SNAKE_NAME"
+echo " Upper case: $UPPER_NAME"
+
+# 1. Generate files (if using ShaderToy)
+# ./tools/shadertoy/convert_shadertoy.py shader.txt "$EFFECT_NAME"
+
+# 2. Add to assets.txt
+echo "SHADER_${UPPER_NAME}, NONE, shaders/${SNAKE_NAME}.wgsl, \"${EFFECT_NAME} effect\"" \
+ >> workspaces/main/assets.txt
+
+# 3. Add to CMakeLists.txt (both sections)
+# Use Edit tool to add to both GPU_SOURCES sections
+
+# 4. Add include to demo_effects.h
+# Use Edit tool to add #include line
+
+# 5. Add to timeline.seq
+# Use Edit tool to add EFFECT line with priority modifier
+
+# 6. Add to test file
+# Use Edit tool to add to appropriate test list
+
+# 7. Build
+cmake --build build -j4
+```
+
+---
+
+## See Also
+
+- `tools/shadertoy/README.md` - ShaderToy conversion guide
+- `doc/SEQUENCE.md` - Timeline format documentation
+- `doc/CONTRIBUTING.md` - General contribution guidelines
+- `src/gpu/effects/` - Existing effect examples
diff --git a/doc/HOWTO.md b/doc/HOWTO.md
index bdc0214..a57a161 100644
--- a/doc/HOWTO.md
+++ b/doc/HOWTO.md
@@ -86,12 +86,29 @@ make run_util_tests # Utility tests
---
+## Training
+
+```bash
+./training/train_cnn.py --layers 3 --kernel-sizes 3,5,3 --epochs 10000 --batch_size 8 --input training/input/ --target training/output/ --checkpoint-every 1000
+```
+
+Generate shaders from checkpoint:
+```bash
+./training/train_cnn.py --export-only training/checkpoints/checkpoint_epoch_7000.pth
+```
+
+**Note:** Kernel sizes must match the generated shader functions:
+- 3×3 kernel → `cnn_conv3x3_7to4` (36 weight rows: 9 positions × 4 output channels, each row = 7 weights + bias)
+- 5×5 kernel → `cnn_conv5x5_7to4` (100 weight rows: 25 positions × 4 output channels)
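+
+The row counts follow directly from kernel size × output channels; a quick
+C++ check (reference only):
+
+```cpp
+// Each row holds 7 weights + 1 bias (array<f32, 8> in WGSL).
+constexpr int weight_rows(int k, int out_channels) {
+  return k * k * out_channels;
+}
+static_assert(weight_rows(3, 4) == 36);   // cnn_conv3x3_7to4
+static_assert(weight_rows(5, 4) == 100);  // cnn_conv5x5_7to4
+static_assert(weight_rows(3, 1) == 9);    // cnn_conv3x3_7to1
+```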
+
+---
+
## Timeline
Edit `workspaces/main/timeline.seq`:
```text
SEQUENCE 0.0 0
- EFFECT HeptagonEffect 0.0 60.0 0
+ EFFECT + HeptagonEffect 0.0 60.0 0
```
Rebuild to apply. See `doc/SEQUENCE.md`.
diff --git a/doc/RECIPE.md b/doc/RECIPE.md
index 6404391..d563027 100644
--- a/doc/RECIPE.md
+++ b/doc/RECIPE.md
@@ -157,8 +157,8 @@ void MyEffect::render(WGPUTextureView prev, WGPUTextureView target,
**.seq syntax:**
```
-EFFECT MyEffect 0.0 10.0 strength=0.5 speed=3.0
-EFFECT MyEffect 10.0 20.0 strength=2.0 # speed keeps previous value
+EFFECT + MyEffect 0.0 10.0 strength=0.5 speed=3.0
+EFFECT = MyEffect 10.0 20.0 strength=2.0 # speed keeps previous value
```
**Example:** `src/gpu/effects/flash_effect.cc`, `src/gpu/effects/chroma_aberration_effect.cc`