Diffstat (limited to 'cnn_v1/docs')
-rw-r--r--  cnn_v1/docs/CNN.md                          79
-rw-r--r--  cnn_v1/docs/CNN_BIAS_FIX_2026-02.md         85
-rw-r--r--  cnn_v1/docs/CNN_DEBUG.md                    43
-rw-r--r--  cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md        189
-rw-r--r--  cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md  136
-rw-r--r--  cnn_v1/docs/CNN_TEST_TOOL.md               244
-rw-r--r--  cnn_v1/docs/CNN_V1_EFFECT.md               400
7 files changed, 1176 insertions, 0 deletions
diff --git a/cnn_v1/docs/CNN.md b/cnn_v1/docs/CNN.md
new file mode 100644
index 0000000..5d9a667
--- /dev/null
+++ b/cnn_v1/docs/CNN.md
@@ -0,0 +1,79 @@
+# Convolutional Neural Net Shader (CNN) post-processing
+
+**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass)
+
+## Idea
+
+Process the rendered 3D scene with a multi-layer CNN trained offline.
+Input: a rendered scene.
+Output: a 'stylized' scene produced by CNN post-processing.
+
+**See `CNN_V1_EFFECT.md` for implementation details, usage, and API reference.**
+
+## Shader implementation
+
+### input / output
+
+One texture buffer is needed per CNN layer.
+Input: (r,g,b,1/z) for layer 0 (the rendered 3D scene), or the output of layer N-1 for layer N.
+Output: (r,g,b,alpha). The 1/z information is not needed in the output (it can be fetched from the layer 0 input).
+
+### size of one layer
+
+Notation:
+S: the number of input samples from layer N-1.
+Example: a 3x3 input window -> S = 3x3 = 9.
+
+Each of the S samples holds 4 values (r,g,b, w=1/z).
+
+Each sample is processed by a mat4 matrix: 4 inputs => 4 outputs.
+
+Weight matrix = S x mat4
+
+Final bias: 4 values.
+
+WGSL code example: See file CNN.shader
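+
+As a quick sanity check of the sizes this notation implies, a standalone Python sketch (illustrative only, not part of the codebase):
+
+```python
+# One mat4 (16 floats) per sampled position, plus a single 4-value bias.
+def layer_param_count(kernel_size: int) -> tuple[int, int]:
+    s = kernel_size * kernel_size      # number of input samples, e.g. 3x3 -> 9
+    weights = s * 16                   # S mat4 matrices
+    bias = 4                           # one bias value per output channel
+    return weights, bias
+
+w, b = layer_param_count(3)
+print(w, b, (w + b) * 4)               # 144 weights, 4 bias, 592 bytes as f32
+```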
+
+### Layers
+
+Do we need 3 or 4 layers?
+Several different shaders, one per layer.
+Ping-pong the input/output texture buffers between layers?
+
+## Implementation Status
+
+**Completed:**
+- ✅ Modular WGSL shader architecture (6 snippet files)
+- ✅ CNNEffect C++ class (single-layer rendering)
+- ✅ ShaderComposer integration (#include resolution)
+- ✅ Asset registration (7 new shader assets)
+- ✅ Test coverage (test_demo_effects.cc)
+- ✅ Placeholder identity weights for testing
+
+**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total**
+
+**Pending:**
+- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights
+- ⏳ Multi-layer rendering with ping-pong textures
+- ⏳ Weight quantization for size optimization
+
+---
+
+## Training (To Be Implemented)
+
+The layer weight/bias data are hard-coded in the shaders.
+Training workflow:
+
+1. Prepare image pairs (before: raw render, after: target style)
+2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png`
+3. Script generates `cnn_weights_generated.wgsl`
+4. Rebuild: `cmake --build build -j4`
+
+**Reference:** File `CNN.py` contains training example (needs adaptation).
+
+Need a repository of reference image pairs (before/after) for training and validation.
+Each input image is randomly sampled into 3×3 patches of (r,g,b,1/z) input samples,
+which are trained to match the corresponding (r,g,b,a) output.
+
+Training generates the .wgsl code for layers' shaders.
+
diff --git a/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md b/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md
new file mode 100644
index 0000000..26db8eb
--- /dev/null
+++ b/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md
@@ -0,0 +1,85 @@
+# CNN Bias Accumulation Fix (2026-02-11)
+
+## Problem
+Bias was being added multiple times in the shader convolution loops (once per kernel position), causing a mismatch between PyTorch training and WGSL inference.
+
+## Root Cause
+**Location**: `training/train_cnn.py:381, 398`
+
+When exporting weights to WGSL, bias was replicated for every kernel position. The shader loops through positions doing:
+```wgsl
+sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0
+```
+
+For a 3×3 kernel (9 positions), the bias was added 9×; for a 5×5 kernel, 25×.
+
+## Fix
+Divide bias by `num_positions` during export:
+```python
+# Final layer (7→1)
+v1.append(f"{bias[0] / num_positions:.6f}")
+
+# Inner layers (7→4)
+v1.append(f"{bias[out_c] / num_positions:.6f}")
+```
+
+Shader accumulates bias × num_positions = original bias (correct).
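+
+A minimal numeric check of why the division works (standalone Python, not the actual exporter code):
+
+```python
+# The shader adds the exported bias once per kernel position, so exporting
+# bias / num_positions makes the accumulated total equal the trained bias.
+num_positions = 9                      # 3x3 kernel
+trained_bias = 0.42
+exported_bias = trained_bias / num_positions
+
+accumulated = sum(exported_bias for _ in range(num_positions))
+assert abs(accumulated - trained_bias) < 1e-6
+```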
+
+---
+
+## Additional Improvements
+
+### 1. RGBA Output Support
+**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input:
+```python
+alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy()
+output_rgba = np.concatenate([output, alpha], axis=2)
+Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA')
+```
+
+Intermediate layers also save RGBA if 4-channel.
+
+### 2. Debug Hex Output
+**Both tools** support `--debug-hex` to print first 8 pixels as hex:
+```bash
+./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex
+./build/cnn_test input.png output.png --debug-hex
+```
+
+Output format: `[0] 0xRRGGBBAA` for pixel-level comparison.
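+
+For reference, the same `[i] 0xRRGGBBAA` formatting can be reproduced in a few lines of Python (hypothetical helper, mirroring the documented output format):
+
+```python
+def dump_hex(pixels: bytes, count: int = 8) -> None:
+    """Print the first `count` pixels of a flat RGBA8 buffer as [i] 0xRRGGBBAA."""
+    for i in range(count):
+        r, g, b, a = pixels[4 * i:4 * i + 4]
+        print(f"[{i}] 0x{r:02X}{g:02X}{b:02X}{a:02X}")
+```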
+
+### 3. Cleanup
+Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving).
+
+---
+
+## Files Modified
+- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex
+- `tools/cnn_test.cc`: --debug-hex, remove linear_png
+- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias
+
+## Testing
+```bash
+# Train with fixed export
+./training/train_cnn.py --input training/input/ --target training/output/ \
+ --layers 3 --kernel_sizes 3,3,3 --epochs 5000
+
+# Generate ground truth
+./training/train_cnn.py --infer input.png --export-only checkpoint.pth \
+ --output ground_truth.png --debug-hex
+
+# Run GPU tool
+./build/cnn_test input.png tool_output.png --debug-hex
+
+# Compare hex output for first 8 pixels
+```
+
+---
+
+## Status
+✅ Bias accumulation bug fixed
+✅ RGBA output with alpha preservation
+✅ Debug hex comparison tool
+✅ Weights regenerated
+
+Commit: `8ff8c56`
diff --git a/cnn_v1/docs/CNN_DEBUG.md b/cnn_v1/docs/CNN_DEBUG.md
new file mode 100644
index 0000000..ba220a0
--- /dev/null
+++ b/cnn_v1/docs/CNN_DEBUG.md
@@ -0,0 +1,43 @@
+# CNN Effect Black Screen Bug - Resolution (2026-02)
+
+## Problem
+The CNN post-processing effect showed a black screen when activated at 11.50s, even though the scene rendered correctly before the CNN started.
+
+## Root Causes
+
+### Bug 1: Framebuffer Capture Timing
+**Location**: `src/gpu/effect.cc`
+**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene).
+**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run.
+
+### Bug 2: Missing Uniforms Update ⚠️ CRITICAL
+**Location**: `src/effects/cnn_effect.cc`
+**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black.
+**Fix**: Added uniforms update before bind group creation (lines 132-142):
+```cpp
+const CommonPostProcessUniforms u = {
+ .resolution = {(float)width_, (float)height_},
+ .aspect_ratio = (float)width_ / (float)height_,
+ .time = 0.0f,
+ .beat = 0.0f,
+ .audio_intensity = 0.0f,
+};
+uniforms_.update(ctx_.queue, u);
+```
+
+## Key Lessons
+
+1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters
+2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts
+3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors
+4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes
+
+## Files Modified
+- `src/gpu/effect.cc`: Lines 308-346 (capture timing)
+- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update)
+
+## Verification
+Test: `demo64k --seek 11.5`
+- ✅ Scene visible with RotatingCube
+- ✅ CNN stylization applied
+- ✅ All 3 layers process with correct original texture reference
diff --git a/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md b/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md
new file mode 100644
index 0000000..8664157
--- /dev/null
+++ b/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md
@@ -0,0 +1,189 @@
+# CNN Shader Flatten Mode - Technical Analysis
+
+**Status:** Analysis complete - flatten mode NOT RECOMMENDED
+
+**Date:** February 2026
+
+---
+
+## Context
+
+Current CNN architecture uses **3 sequential render passes** (linear chaining):
+- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
+- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
+- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
+
+Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
+
+---
+
+## Current Architecture
+
+**Shader Structure:**
+- 1 pipeline with layer branching (`layer_index` uniform)
+- 5 bindings: sampler, input texture, uniforms, layer params, original capture
+- Total shader size: ~8 KB (snippets + weights)
+
+**Performance Profile:**
+- 3 render pass dispatches
+- 2 framebuffer writes + reads between layers
+- Memory bandwidth: ~2× framebuffer size per layer
+- Register pressure: Low (per-layer isolation)
+
+**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
+
+---
+
+## Flatten Approaches Evaluated
+
+### Option A: Full Flatten (All 3 Layers)
+
+**Cascading Receptive Field:**
+
+To compute final output at position (x, y):
+- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
+- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
+- Each Layer 0 output needs 5×5 neighborhood of input samples
+
+**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
+
+**Intermediate Storage (per thread/pixel):**
+```
+Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
+Layer 1 outputs: 3×3 positions × 4 channels = 36 floats
+ TOTAL = 136 floats (544 bytes)
+```
+
+**GPU Register Pressure:**
+- Modern GPUs: 32-64 KB registers per SM, shared across warps
+- 544 bytes/thread → max 64 threads/SM (**low occupancy**)
+- Current multi-pass: ~4-8 bytes/thread (high occupancy)
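+
+These figures follow directly from the cascaded receptive fields; a small Python sketch reproduces them (assumptions: stride 1, fully overlapping neighborhoods, 4 intermediate channels):
+
+```python
+# Per-pixel live storage when all layers are flattened into one shader.
+def flatten_storage(kernels, channels=4):
+    need = 1            # output positions required from the layer above (start: 1 final pixel)
+    floats = 0
+    for k in reversed(kernels[1:]):        # walk from the last layer down to layer 1
+        need += k - 1                      # receptive field grows by k-1 per layer
+        floats += need * need * channels   # outputs of the layer below that must stay live
+    input_extent = need + kernels[0] - 1   # raw input footprint for layer 0
+    return floats, input_extent
+
+floats, extent = flatten_storage([5, 3, 3])
+print(floats, floats * 4, extent)          # 136 floats, 544 bytes, 9x9 input sampling
+```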
+
+**Pros:**
+- 1 dispatch vs 3 (reduce CPU overhead)
+- Zero framebuffer bandwidth between layers
+
+**Cons:**
+- **Severe register pressure** (10-20× increase)
+- Reduced occupancy → potential performance loss
+- Complex shader (harder debug, larger binary)
+- 9×9 input sampling
+
+**Assessment:** ❌ **Not Recommended**
+Register cost outweighs bandwidth savings.
+
+---
+
+### Option B: Partial Flatten (Layers 1 + 2)
+
+Keep Layer 0 separate, flatten only Layers 1 and 2.
+
+**Pass Structure:**
+1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
+2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader
+
+**Intermediate Storage:**
+```
+Layer 0 samples: 3×3 × 4 = 36 floats (read once)
+Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
+ TOTAL = 72 floats (288 bytes)
+```
+
+**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs
+
+**Pros:**
+- 2 passes vs 3 (33% reduction)
+- 1 framebuffer write saved
+- More manageable register usage
+
+**Cons:**
+- Still significant register pressure (288 bytes vs ~8 bytes baseline)
+- Medium complexity increase
+- Layer 0 (heaviest kernel) still separate
+
+**Assessment:** ⚠️ **Marginal Benefit**
+Saves 1 pass but register cost still high.
+
+---
+
+### Option C: Keep Current Multi-Pass ✅
+
+**Rationale:**
+- Current architecture well-suited to GPU design (high throughput via parallelism)
+- Minimal register usage → high occupancy → hides memory latency
+- Framebuffer bandwidth cost < register pressure cost
+- Clean separation aids debugging/iteration
+- Modular (easy to add/remove layers)
+
+**Alternative Optimizations (if bandwidth critical):**
+1. Merge passes via render pass load/store ops (Vulkan subpasses)
+2. Reduce intermediate channel count (4→3 or 2)
+3. Hybrid: Compute shaders + workgroup shared memory
+4. Layer pruning (2-layer vs 3-layer quality comparison)
+
+---
+
+## Recommendation
+
+**✅ Keep current multi-pass architecture**
+
+### Decision Matrix
+
+| Factor | Multi-Pass | Partial Flatten | Full Flatten |
+|--------|-----------|----------------|--------------|
+| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
+| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
+| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
+| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
+| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
+| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
+
+**Modern GPU Architecture Favors:**
+- High parallelism (many small threads) over complex threads
+- Hiding latency via occupancy over minimizing operations
+- Memory bandwidth via caching, not elimination
+
+---
+
+## Alternative: Compute Shader + Shared Memory
+
+**If bandwidth becomes critical:**
+- Use compute shader with workgroup shared memory
+- Load tile + halos into shared memory (9×9 input samples)
+- Compute all 3 layers for tile interior (avoids redundant sampling)
+- Requires explicit synchronization (`workgroupBarrier`)
+
+**Trade-offs:**
+- ✅ Low register pressure + low bandwidth
+- ❌ Compute pipeline complexity (no render pass integration)
+- ❌ Tile edge handling
+- ❌ Larger code size
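+
+The shared-memory budget is easy to bound; a Python sketch assuming an 8×8 workgroup tile, a 4-pixel halo (covering the 9×9 receptive field), and 4-channel f32 storage:
+
+```python
+# Workgroup shared memory needed per tile for the compute-shader variant.
+def tile_shared_mem_bytes(tile=8, halo=4, channels=4, bytes_per_value=4):
+    extent = tile + 2 * halo           # pixels that must be resident per tile
+    return extent * extent * channels * bytes_per_value
+
+print(tile_shared_mem_bytes())         # 4096 bytes, well under typical 16-32 KB limits
+```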
+
+---
+
+## Conclusion
+
+Current 3-pass architecture is **appropriate for demo64k**:
+- Size-efficient (modular shaders)
+- Performance adequate (bandwidth not bottleneck)
+- Maintainable (clean layer isolation)
+
+**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint.
+
+### Size Optimization Alternatives (Better ROI)
+
+If size optimization critical, focus on:
+1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
+2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
+3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
+
+These yield better size/performance than shader architecture changes.
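+
+As an illustration of option 1, a minimal symmetric 8-bit quantization of the weight buffer (NumPy sketch; a real exporter would also need to emit matching dequantization code in WGSL):
+
+```python
+import numpy as np
+
+def quantize_int8(weights: np.ndarray):
+    """Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127]."""
+    scale = float(np.abs(weights).max()) / 127.0
+    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
+    return q, scale
+
+w = np.random.randn(290 * 4).astype(np.float32)   # 290 vec4s of f32 (~4.6 KB)
+q, scale = quantize_int8(w)
+print(w.nbytes, "->", q.nbytes, "bytes")          # 4640 -> 1160 (plus one scale per tensor)
+```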
+
+---
+
+## References
+
+- `CNN_V1_EFFECT.md` - CNN implementation details
+- `CNN.md` - High-level CNN design
+- `../src/cnn_effect.cc` - Current implementation
+- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
diff --git a/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md b/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md
new file mode 100644
index 0000000..3439f2c
--- /dev/null
+++ b/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md
@@ -0,0 +1,136 @@
+# CNN RGBD→Grayscale Architecture Implementation
+
+## Summary
+
+Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input.
+
+## Changes Made
+
+### Architecture
+
+**Input:** RGBD (4 channels: RGB + inverse depth D=1/z)
+**Output:** Grayscale (1 channel)
+**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
+
+**Layer Configuration:**
+- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation
+- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation
+
+### Input Normalization (all to [-1,1])
+
+- **RGBD:** `(rgbd - 0.5) * 2`
+- **UV coords:** `(uv - 0.5) * 2`
+- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter)
+
+**Rationale:** Zero-centered inputs for tanh activation, better gradient flow.
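+
+A NumPy sketch of how the 7-channel, zero-centered layer input could be assembled (illustrative; it mirrors the normalization rules above, not the exact training code):
+
+```python
+import numpy as np
+
+LUMA = np.array([0.2126, 0.7152, 0.0722])
+
+def make_layer_input(layer_rgbd, uv, original_rgbd):
+    """All arrays are (H, W, C) in [0, 1]; uv holds per-pixel coordinates in [0, 1]."""
+    rgbd_n = (layer_rgbd - 0.5) * 2.0          # previous layer output -> [-1, 1]
+    uv_n = (uv - 0.5) * 2.0                    # coordinates -> [-1, 1]
+    orig_n = (original_rgbd - 0.5) * 2.0       # captured original -> [-1, 1]
+    gray = orig_n[..., :3] @ LUMA              # luminance of the centered original RGB
+    return np.concatenate([rgbd_n, uv_n, gray[..., None]], axis=-1)   # (H, W, 7)
+```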
+
+### Modified Files
+
+**Training (`/Users/skal/demo/training/train_cnn.py`):**
+1. Removed `CoordConv2d` class
+2. Updated `SimpleCNN`:
+ - Inner layers: `Conv2d(7, 4)` - RGBD output
+ - Final layer: `Conv2d(7, 1)` - grayscale output
+3. Updated `forward()`:
+ - Normalize RGBD/coords/gray to [-1,1]
+ - Concatenate 7-channel input for each layer
+ - Apply tanh (inner) or none (final)
+ - Denormalize final output
+4. Updated `export_weights_to_wgsl()`:
+ - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values)
+ - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values)
+5. Updated `generate_layer_shader()`:
+ - Use `cnn_conv3x3_7to4` for inner layers
+ - Use `cnn_conv3x3_7to1` for final layer
+ - Denormalize outputs from [-1,1] to [0,1]
+6. Updated `ImagePairDataset`:
+ - Load RGBA input (was RGB)
+
+**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):**
+1. Added `cnn_conv3x3_7to4()`:
+ - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
+ - 4-channel output: RGBD
+ - Weights: `array<array<f32, 8>, 36>`
+2. Added `cnn_conv3x3_7to1()`:
+ - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
+ - 1-channel output: grayscale
+ - Weights: `array<array<f32, 8>, 9>`
+3. Optimized: gray computed once in caller using `dot()`, not per-function
+
+**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):**
+1. Updated architecture section with RGBD→grayscale pipeline
+2. Updated training data requirements (RGBA input)
+3. Updated weight storage format
+
+### No C++ Changes
+
+CNNLayerParams and bind groups remain unchanged.
+
+## Data Flow
+
+1. Layer 0 captures original RGBD to `captured_frame`
+2. Each layer:
+ - Samples previous layer output (RGBD in [0,1])
+ - Normalizes RGBD to [-1,1]
+ - Computes gray once using `dot()` (fs_main level)
+ - Normalizes UV coords to [-1,1] (inside conv functions)
+ - Concatenates 7-channel input
+ - Applies convolution with layer-specific weights
+ - Outputs RGBD (inner) or grayscale (final) in [-1,1]
+ - Applies tanh (inner only)
+ - Denormalizes to [0,1] for texture storage
+ - Blends with original
+
+## Next Steps
+
+1. **Prepare RGBD training data:**
+ - Input: RGBA images (RGB + depth in alpha)
+ - Target: Grayscale stylized output
+
+2. **Train network:**
+ ```bash
+ python3 training/train_cnn.py \
+ --input training/input \
+ --target training/output \
+ --layers 3 \
+ --epochs 1000
+ ```
+
+3. **Verify generated shaders:**
+ - Check `cnn_weights_generated.wgsl` structure
+ - Check `cnn_layer.wgsl` uses new conv functions
+
+4. **Test in demo:**
+ ```bash
+ cmake --build build -j4
+ ./build/demo64k
+ ```
+
+## Design Rationale
+
+**Why [-1,1] normalization?**
+- Centered inputs for tanh (operates best around 0)
+- Better gradient flow
+- Standard ML practice for normalized data
+
+**Why RGBD throughout vs RGB?**
+- Depth information propagates through network
+- Enables depth-aware stylization
+- Consistent 4-channel processing
+
+**Why 7-channel input?**
+- Coordinates: position-dependent effects (vignettes)
+- Grayscale: luminance-aware processing
+- RGBD: full color+depth information
+- Enables richer feature learning
+
+## Testing Checklist
+
+- [ ] Train network with RGBD input data
+- [ ] Verify `cnn_weights_generated.wgsl` structure
+- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions
+- [ ] Build demo without errors
+- [ ] Visual test: inner layers show RGBD evolution
+- [ ] Visual test: final layer produces grayscale
+- [ ] Visual test: blending works correctly
+- [ ] Compare quality with previous RGB→RGB architecture
diff --git a/cnn_v1/docs/CNN_TEST_TOOL.md b/cnn_v1/docs/CNN_TEST_TOOL.md
new file mode 100644
index 0000000..4307894
--- /dev/null
+++ b/cnn_v1/docs/CNN_TEST_TOOL.md
@@ -0,0 +1,244 @@
+# CNN Shader Testing Tool
+
+Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).
+
+---
+
+## Purpose
+
+- Validate trained weights against ground truth
+- Debug CNN layer behavior in isolation
+- Generate test outputs for training workflow
+- Match Python training script's inference mode
+
+---
+
+## Architecture
+
+**Two implementations:**
+
+1. **CNN v1** (render pipeline, texture atlas weights)
+ - 3 fixed layers
+ - RGBA16Float intermediates
+ - BGRA8Unorm final output
+
+2. **CNN v2** (compute shaders, storage buffer weights)
+ - Dynamic layer count from binary
+ - 7D static features (RGBD + UV + sin + bias)
+ - RGBA32Uint packed f16 intermediates
+ - Storage buffer: ~3-5 KB weights
+
+**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
+- Synchronous texture-to-CPU readback
+- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
+- Protected with STRIP_ALL (0 bytes in release)
+
+---
+
+## Usage
+
+```bash
+cnn_test input.png output.png [OPTIONS]
+
+OPTIONS:
+ --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights)
+ --weights PATH Load weights from .bin (forces CNN v2, overrides layer config)
+ --blend F Final blend amount (0.0-1.0, default: 1.0)
+ --format ppm|png Output format (default: png)
+ --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights)
+ --save-intermediates DIR Save intermediate layers to directory
+ --debug-hex Print first 8 pixels as hex (debug)
+ --help Show usage
+```
+
+**Examples:**
+```bash
+# CNN v1 (render pipeline, 3 layers)
+./build/cnn_test input.png output.png --cnn-version 1
+
+# CNN v2 (compute, storage buffer, uses asset system weights)
+./build/cnn_test input.png output.png --cnn-version 2
+
+# CNN v2 with runtime weight loading (loads layer config from .bin)
+./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin
+
+# 50% blend with original (v2)
+./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5
+
+# Debug hex dump
+./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
+```
+
+**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments.
+
+---
+
+## Implementation Details
+
+### Core Readback Utility
+
+**File:** `src/gpu/texture_readback.{h,cc}`
+
+**Function:**
+```cpp
+std::vector<uint8_t> read_texture_pixels(
+ WGPUInstance instance,
+ WGPUDevice device,
+ WGPUTexture texture,
+ int width,
+ int height);
+```
+
+**Features:**
+- Returns BGRA8 format (4 bytes per pixel)
+- Synchronous blocking operation
+- Cross-platform async callback handling (Win32 vs Native API)
+- Automatic staging buffer creation and cleanup
+
+**Refactored OffscreenRenderTarget:**
+```cpp
+std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
+#if !defined(STRIP_ALL)
+ return read_texture_pixels(instance_, device_, texture_, width_, height_);
+#else
+ return std::vector<uint8_t>();
+#endif
+}
+```
+
+### CNN v1 Pipeline (Render)
+
+**Fixed 3-layer architecture:**
+- Ping-pong RGBA16Float textures
+- CNNLayerParams (binding 3): layer_index, blend_amount
+- Shader composer resolves #include directives
+
+### CNN v2 Pipeline (Compute)
+
+**Dynamic layer architecture:**
+1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
+2. **Layer computes:** N layers from binary weights (3-5 typically)
+ - Storage buffer weights (read-only)
+ - RGBA32Uint packed f16 textures (ping-pong)
+ - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
+3. **Readback:** RGBA32Uint → f16 decode → u8 clamp
+
+**Binary format:** Header (20B) + layer info (20B×N) + f16 weights
+
+**Weight Loading:**
+- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`)
+- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports)
+ - Layer count and kernel sizes parsed from binary header
+ - Overrides any `--layers` or `--cnn-version` arguments
+ - Enables runtime testing of training checkpoints without rebuild
+
+---
+
+## Build Integration
+
+**CMakeLists.txt:**
+
+1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
+2. Tool target:
+```cmake
+add_executable(cnn_test
+ tools/cnn_test.cc
+ src/tests/common/webgpu_test_fixture.cc
+ src/tests/common/offscreen_render_target.cc
+ ${PLATFORM_SOURCES}
+ ${GEN_DEMO_CC})
+
+target_link_libraries(cnn_test PRIVATE
+ gpu util procedural ${DEMO_LIBS})
+
+add_dependencies(cnn_test generate_demo_assets)
+
+target_compile_definitions(cnn_test PRIVATE
+ STB_IMAGE_IMPLEMENTATION
+ STB_IMAGE_WRITE_IMPLEMENTATION)
+```
+
+**Build:**
+```bash
+cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
+cmake --build build -j4
+```
+
+---
+
+## Validation Workflow (CNN v2)
+
+### 1. Train and Export
+```bash
+# Train and export weights
+./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
+```
+
+### 2. Tool Inference
+```bash
+# Run tool with v2
+./build/cnn_test training/input/img_000.png output.png --cnn-version 2
+```
+
+### 3. Visual Comparison
+Compare output.png with training/target_X/img_000.png
+
+---
+
+## Status
+
+**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation.
+
+**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool.
+- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
+- Matches CNNv2Effect architecture
+- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code
+- Root cause under investigation (weight indexing? texture sampling? activation clamping?)
+- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation
+
+---
+
+## Technical Notes (Readback Fix)
+
+**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5)
+
+**Root Cause:** Callback mode mismatch
+- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`)
+- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode)
+- Callback never fired → timeout → empty buffer
+
+**Fix Applied:**
+1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents`
+2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)`
+3. Added pre-mapping device poll to ensure copy completes
+
+**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110
+
+**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes
+
+---
+
+## Limitations
+
+- **CNN v1:** Produces incorrect output, use for debugging only
+- **Single image:** Batch processing requires shell loop
+- **No real-time preview:** Offline processing only
+- **Input formats:** limited to what stb_image can load (PNG, JPEG, BMP, TGA)
+
+---
+
+## Technical Notes
+
+**CNN v2 f16 decoding:**
+- RGBA32Uint texture stores 8×f16 as 4×u32
+- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
+- Handles denormals, infinity, NaN
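+
+The unpacking step can be prototyped on the CPU with NumPy (sketch; assumes the first half occupies the low 16 bits of each u32 — the tool's actual decoder lives in C++):
+
+```python
+import numpy as np
+
+def decode_rgba32uint_f16(texels_u32: np.ndarray) -> np.ndarray:
+    """texels_u32: (..., 4) uint32 texels, each u32 packing two half floats.
+    Returns (..., 8) float32 clamped to [0, 1] (multiply by 255 for u8)."""
+    lo = (texels_u32 & 0xFFFF).astype(np.uint16)   # low 16 bits of each u32
+    hi = (texels_u32 >> 16).astype(np.uint16)      # high 16 bits
+    halves = np.stack([lo, hi], axis=-1).reshape(*texels_u32.shape[:-1], 8)
+    return np.clip(halves.view(np.float16).astype(np.float32), 0.0, 1.0)
+```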
+
+**Cross-platform:**
+- macOS, Linux (native WebGPU)
+- Windows (mingw-w64 cross-compile)
+
+**Size impact:**
+- Debug/STRIP_ALL=OFF: compiled
+- STRIP_ALL=ON: 0 bytes (compiled out)
+- FINAL_STRIP=ON: tool not built
diff --git a/cnn_v1/docs/CNN_V1_EFFECT.md b/cnn_v1/docs/CNN_V1_EFFECT.md
new file mode 100644
index 0000000..40f095e
--- /dev/null
+++ b/cnn_v1/docs/CNN_V1_EFFECT.md
@@ -0,0 +1,400 @@
+# CNN Post-Processing Effect
+
+Neural network-based stylization for rendered scenes.
+
+---
+
+## Overview
+
+Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.
+
+**Key Features:**
+- Position-aware layer 0 (coordinate input for vignetting, edge effects)
+- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
+- Original input available to all layers via framebuffer capture
+- Configurable final blend with original scene
+- Modular WGSL shader architecture
+- Hardcoded weights (trained offline via PyTorch)
+- ~5-8 KB binary footprint
+
+---
+
+## Architecture
+
+### RGBD → Grayscale Pipeline
+
+**Input:** RGBD (RGB + inverse depth D=1/z)
+**Output:** Grayscale (1 channel)
+**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
+
+**Architecture:**
+- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
+- **Final layer (N-1):** Conv2d(7→1) - output grayscale
+
+```wgsl
+// Inner layers: 7→4 (RGBD output, vec4-optimized)
+fn cnn_conv3x3_7to4(
+  tex: texture_2d<f32>,
+  samp: sampler,
+  uv: vec2<f32>,
+  resolution: vec2<f32>,
+  gray: f32,                        // Grayscale in [-1,1]
+  weights: array<vec4<f32>, 72>     // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
+) -> vec4<f32>
+
+// Final layer: 7→1 (grayscale output, vec4-optimized)
+fn cnn_conv3x3_7to1(
+  tex: texture_2d<f32>,
+  samp: sampler,
+  uv: vec2<f32>,
+  resolution: vec2<f32>,
+  gray: f32,
+  weights: array<vec4<f32>, 18>     // 9 pos × 2 vec4 (8 floats per filter)
+) -> f32
+```
+
+**Input normalization:**
+- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
+- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
+- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
+- **Inter-layer data** stays in [-1,1] (no denormalization)
+- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]
+
+**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
+
+### Multi-Layer Architecture
+
+CNNEffect supports multi-layer networks via automatic effect chaining:
+
+1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
+2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
+3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
+4. **Original input binding**: All layers access original via `@binding(4)`
+5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`
+
+**Framebuffer Capture API:**
+- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
+- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
+- Generic mechanism usable by any effect
+
+### File Structure
+
+```
+src/effects/
+ cnn_effect.h/cc # CNNEffect class + framebuffer capture
+
+workspaces/main/shaders/cnn/
+ cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu
+ cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware)
+ cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware)
+ cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware)
+ cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py)
+ cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py)
+```
+
+---
+
+## Training Workflow
+
+### 1. Prepare Training Data
+
+Input/target image pairs:
+```
+training/input/img_000.png # RGBA (RGB + alpha)
+training/output/img_000.png # Grayscale target
+```
+
+**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily.
+
+### 2. Train Network
+
+**Patch-based (Recommended)** - Preserves natural pixel scale:
+```bash
+python3 training/train_cnn.py \
+ --input training/input --target training/output \
+ --patch-size 32 --patches-per-image 64 --detector harris \
+ --layers 3 --kernel-sizes 3,5,3 \
+ --epochs 5000 --batch-size 16 --checkpoint-every 1000
+```
+
+**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)
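+
+For intuition, a sketch of detector-guided patch selection with OpenCV (hypothetical helper; the actual selection logic lives in `train_cnn.py`):
+
+```python
+import cv2
+import numpy as np
+
+def sample_patches(img_rgba, patch=32, count=64):
+    """Pick `count` patch centers where the Harris corner response is strongest."""
+    gray = cv2.cvtColor(img_rgba[..., :3], cv2.COLOR_RGB2GRAY).astype(np.float32)
+    response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
+    half = patch // 2
+    response[:half, :] = response[-half:, :] = -np.inf   # keep patches fully inside the image
+    response[:, :half] = response[:, -half:] = -np.inf
+    flat = np.argsort(response, axis=None)[::-1][:count] # strongest responses first
+    ys, xs = np.unravel_index(flat, response.shape)
+    return [img_rgba[y - half:y + half, x - half:x + half] for y, x in zip(ys, xs)]
+```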
+
+**Full-image (Legacy)** - Resizes to 256×256:
+```bash
+python3 training/train_cnn.py \
+ --input training/input --target training/output \
+ --layers 3 --kernel-sizes 3,5,3 \
+ --epochs 10000 --batch-size 8 --checkpoint-every 1000
+```
+
+**Auto-generates:**
+- `cnn_weights_generated.wgsl` - Weight arrays
+- `cnn_layer.wgsl` - Layer shader
+
+### 3. Export & Validate
+
+```bash
+# Export shaders
+./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth
+
+# Generate ground truth
+./training/train_cnn.py --infer input.png \
+ --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
+```
+
+### 4. Rebuild Demo
+
+```bash
+cmake --build build -j4 && ./build/demo64k
+```
+
+---
+
+## Usage
+
+### C++ Integration
+
+**Single layer (manual):**
+```cpp
+#include "effects/cnn_effect.h"
+
+CNNEffectParams p;
+p.layer_index = 0;
+p.total_layers = 1;
+p.blend_amount = 1.0f;
+auto cnn = std::make_shared<CNNEffect>(ctx, p);
+timeline.add_effect(cnn, start_time, end_time);
+```
+
+**Multi-layer (automatic via timeline compiler):**
+
+Use timeline syntax - `seq_compiler` expands to multiple instances.
+
+### Timeline Examples
+
+**Single-layer CNN (full stylization):**
+```
+SEQUENCE 10.0 0
+ EFFECT + Hybrid3DEffect 0.00 5.00
+ EFFECT + CNNEffect 0.50 5.00 layers=1
+```
+
+**Multi-layer CNN with blend:**
+```
+SEQUENCE 10.0 0
+ EFFECT + Hybrid3DEffect 0.00 5.00
+ EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
+```
+
+Expands to:
+```cpp
+// Layer 0 (captures original, blend=1.0)
+{
+ CNNEffectParams p;
+ p.layer_index = 0;
+ p.total_layers = 3;
+ p.blend_amount = 1.0f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
+}
+// Layer 1 (blend=1.0)
+{
+ CNNEffectParams p;
+ p.layer_index = 1;
+ p.total_layers = 3;
+ p.blend_amount = 1.0f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
+}
+// Layer 2 (final blend=0.7)
+{
+ CNNEffectParams p;
+ p.layer_index = 2;
+ p.total_layers = 3;
+ p.blend_amount = 0.7f;
+ seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
+}
+```
+
+---
+
+## Shader Structure
+
+**Bindings:**
+```wgsl
+@group(0) @binding(0) var smplr: sampler;
+@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input
+@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
+@group(0) @binding(3) var<uniform> params: CNNLayerParams;
+@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured)
+```
+
+**Fragment shader logic:**
+```wgsl
+@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
+ let uv = p.xy / uniforms.resolution;
+ let original_raw = textureSample(original_input, smplr, uv);
+ let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1]
+ let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
+ var result = vec4<f32>(0.0);
+
+ if (params.layer_index == 0) {
+ result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
+ weights_layer0);
+ result = cnn_tanh(result);
+ }
+ else if (params.layer_index == 1) {
+ result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
+ gray, weights_layer1);
+ result = cnn_tanh(result);
+ }
+ // ... other layers
+
+ // Blend with ORIGINAL input (not previous layer)
+ return mix(original_raw, result, params.blend_amount);
+}
+```
+
+**Weight Storage (vec4-optimized):**
+
+**Inner layers (7→4 RGBD output):**
+```wgsl
+// Structure: array<vec4<f32>, 72>
+// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
+const weights_layer0: array<vec4<f32>, 72> = array(
+ vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights)
+ vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias)
+ vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights)
+ vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias)
+ // ... 68 more vec4s
+);
+```
+
+**Final layer (7→1 grayscale output):**
+```wgsl
+// Structure: array<vec4<f32>, 18>
+// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
+const weights_layerN: array<vec4<f32>, 18> = array(
+ vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights)
+ vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias)
+ // ... 16 more vec4s
+);
+```
+
+**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs.
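+
+A scalar Python check of that identity for one filter tap (illustrative only):
+
+```python
+# Two 4-wide dot products, with the bias folded into the constant-1 lane,
+# equal seven weighted inputs plus a bias.
+rgbd = [0.2, -0.5, 0.7, 0.1]        # r, g, b, d (already in [-1, 1])
+in1  = [0.3, -0.3, 0.4, 1.0]        # u, v, gray, constant 1.0
+w0   = [0.11, -0.07, 0.05, 0.02]    # weights applied to rgbd
+w1   = [0.01, 0.03, -0.02, 0.25]    # weights for u, v, gray; bias in the last slot
+
+dot = lambda a, b: sum(x * y for x, y in zip(a, b))
+vec4_form = dot(w0, rgbd) + dot(w1, in1)
+scalar_form = sum(w * x for w, x in zip(w0 + w1[:3], rgbd + in1[:3])) + w1[3]
+assert abs(vec4_form - scalar_form) < 1e-12
+```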
+
+---
+
+## Size Budget
+
+| Component | Size | Notes |
+|-----------|------|-------|
+| Activation functions | ~200 B | 4 functions |
+| Conv3x3 (standard + coord) | ~500 B | Both variants |
+| Conv5x5 (standard + coord) | ~700 B | Both variants |
+| Conv7x7 (standard + coord) | ~900 B | Both variants |
+| Main shader | ~800 B | Layer composition |
+| C++ implementation | ~300 B | Effect class |
+| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
+| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
+| **Total** | **5-9 KB** | Acceptable for 64k |
+
+**Optimization strategies:**
+- Quantize weights (float32 → int8)
+- Prune near-zero weights
+- Use separable convolutions
+
+---
+
+## Testing
+
+```bash
+./build/test_demo_effects # CNN construction/shader tests
+./build/demo64k # Visual test
+```
+
+---
+
+## Blend Parameter Behavior
+
+**blend_amount** controls final compositing with original:
+- `blend=0.0`: Pure original (no CNN effect)
+- `blend=0.5`: 50% original + 50% CNN
+- `blend=1.0`: Pure CNN output (full stylization)
+
+**Important:** Blend uses captured layer 0 input, not previous layer output.
+
+**Example use cases:**
+- `blend=1.0`: Full stylization (default)
+- `blend=0.7`: Subtle effect preserving original details
+- `blend=0.3`: Light artistic touch
+
+## Troubleshooting
+
+**Shader compilation fails:**
+- Check `cnn_weights_generated.wgsl` syntax
+- Verify snippets registered in `shaders.cc::InitShaderComposer()`
+- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)
+
+**Black/corrupted output:**
+- Weights untrained (identity placeholder)
+- Check `captured_frame` auxiliary texture is registered
+- Verify layer priorities in timeline are sequential
+
+**Wrong blend result:**
+- Ensure layer 0 has `needs_framebuffer_capture() == true`
+- Check MainSequence framebuffer capture logic
+- Verify `original_input` binding is populated
+
+**Training loss not decreasing:**
+- Lower learning rate (`--learning-rate 0.0001`)
+- More epochs (`--epochs 1000`)
+- Check input/target image alignment
+
+---
+
+## Vec4 Optimization
+
+**Architecture:** Weights stored as vec4 pairs for SIMD efficiency.
+
+**Input representation:**
+```wgsl
+let rgbd = textureSample(...); // vec4: [r, g, b, d]
+let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0]
+```
+
+**Weight indexing:**
+```wgsl
+var pos = 0; // Direct weight array index
+for (var dy = -1; dy <= 1; dy++) {
+ for (var dx = -1; dx <= 1; dx++) {
+ // Unrolled channel loop (4 output channels)
+ sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1);
+ sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1);
+ sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1);
+ sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1);
+ pos += 8; // 4 channels × 2 vec4s per channel
+ }
+}
+```
+
+**Benefits:**
+- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs)
+- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
+- **Bias integration:** Free via `[..., 1.0]` component (no separate add)
+- **Code simplicity:** Eliminates inner loop, direct indexing with `pos`
+- **Performance:** 2-3× GPU throughput improvement over scalar version
+
+**Weight layout per filter (8 floats):**
+- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights)
+- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)
+
+**3×3 kernel sizes:**
+- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
+- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)
+
+---
+
+## References
+
+- **Training Script:** `training/train_cnn.py`
+- **Shader Composition:** `doc/SEQUENCE.md`
+- **Effect System:** `src/gpu/effect.h`