Diffstat (limited to 'doc')
-rw-r--r--  doc/AUDIO_WAV_DRIFT_BUG.md        |  17
-rw-r--r--  doc/AUXILIARY_TEXTURE_INIT.md     |   2
-rw-r--r--  doc/BUILD.md                      |   8
-rw-r--r--  doc/CMAKE_MODULES.md              |  22
-rw-r--r--  doc/CNN.md                        |  79
-rw-r--r--  doc/CNN_BIAS_FIX_2026-02.md       |  85
-rw-r--r--  doc/CNN_DEBUG.md                  |  43
-rw-r--r--  doc/CNN_EFFECT.md                 | 400
-rw-r--r--  doc/CNN_FLATTEN_ANALYSIS.md       | 189
-rw-r--r--  doc/CNN_RGBD_GRAYSCALE_SUMMARY.md | 136
-rw-r--r--  doc/CNN_TEST_TOOL.md              | 244
-rw-r--r--  doc/CNN_V2.md                     | 813
-rw-r--r--  doc/CNN_V2_BINARY_FORMAT.md       | 235
-rw-r--r--  doc/CNN_V2_DEBUG_TOOLS.md         | 143
-rw-r--r--  doc/CNN_V2_WEB_TOOL.md            | 348
-rw-r--r--  doc/COMPLETED.md                  |   6
-rw-r--r--  doc/HOWTO.md                      |  44
17 files changed, 63 insertions, 2751 deletions
diff --git a/doc/AUDIO_WAV_DRIFT_BUG.md b/doc/AUDIO_WAV_DRIFT_BUG.md
index e22f4fa..050dd49 100644
--- a/doc/AUDIO_WAV_DRIFT_BUG.md
+++ b/doc/AUDIO_WAV_DRIFT_BUG.md
@@ -1,7 +1,8 @@
# Audio WAV Drift Bug Investigation
**Date:** 2026-02-15
-**Status:** ROOT CAUSE IDENTIFIED
+**Status:** ACCEPTABLE (to be continued)
+**Current State:** -150ms drift at beat 64b, no glitches
## Problem Statement
@@ -163,8 +164,18 @@ Eliminates cumulative truncation error.
1. ✅ Measure WAV sample positions directly (Python script)
2. ✅ Add render tracking debug output
3. ✅ Confirm over-rendering (366ms per 10s)
-4. ⏳ Implement fix
-5. ⏳ Verify corrected WAV alignment in viewer
+4. ✅ Implement partial fix (bypass ring buffer, direct render)
+5. ⚠️ Current result: -150ms drift at beat 64b (acceptable, needs further work)
+
+## Current Implementation (main.cc:286-308)
+
+**WAV dump now bypasses ring buffer entirely:**
+1. **Frame accumulator**: Calculates exact frames per update (no truncation)
+2. **Direct render**: Calls `synth_render()` directly with exact frame count
+3. **No ring buffer**: Eliminates buffer management complexity
+4. **Result**: No glitches, but -150ms drift remains
+
+**Remaining issue:** Drift persists despite direct rendering. Likely related to tempo scaling or audio engine state management. Acceptable for now.
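+
+A minimal sketch of the accumulator idea (illustrative only; not the actual main.cc:286-308 code, and the `synth_render()` call shape and 48 kHz rate are assumptions):
+
+```cpp
+// Hypothetical fractional-frame accumulator: carry the remainder between
+// updates so no frames are lost to integer truncation.
+struct FrameAccumulator {
+  double carry = 0.0;  // fractional frames left over from the previous update
+  // dt: elapsed seconds this update, rate: output sample rate
+  int frames_for(double dt, double rate) {
+    carry += dt * rate;
+    const int frames = static_cast<int>(carry);  // whole frames to render now
+    carry -= frames;                              // keep the fraction for later
+    return frames;
+  }
+};
+
+// Usage (names hypothetical): render exactly `frames` samples straight into
+// the WAV buffer, bypassing the ring buffer.
+//   int frames = accum.frames_for(dt, 48000.0);
+//   synth_render(wav_cursor, frames);
+```
+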
## Notes
diff --git a/doc/AUXILIARY_TEXTURE_INIT.md b/doc/AUXILIARY_TEXTURE_INIT.md
index 9cac70b..036cbf7 100644
--- a/doc/AUXILIARY_TEXTURE_INIT.md
+++ b/doc/AUXILIARY_TEXTURE_INIT.md
@@ -18,7 +18,7 @@ entry.seq->resize(width, height); // Too late - textures already created
**Affected:**
- CircleMaskEffect (circle_mask texture)
-- CNNEffect (captured_frame texture)
+- CNNv1Effect (captured_frame texture)
- RotatingCubeEffect (consumer, hardcoded resolution in uniforms)
---
diff --git a/doc/BUILD.md b/doc/BUILD.md
index d3434f4..fd0c3d9 100644
--- a/doc/BUILD.md
+++ b/doc/BUILD.md
@@ -95,9 +95,11 @@ Use Xcode Metal debugger for shader performance analysis.
## Build System Internals
**Asset Dependency Tracking:**
-- CMake tracks 42 demo + 17 test assets
-- Editing shaders/audio/sequences auto-triggers rebuild
-- Asset lists parsed to extract individual file dependencies
+- CMake tracks 42 demo + 17 test assets split into 4 categories
+- **Granular rebuilds:** Changing a shader only rebuilds shader-dependent targets
+- **Categories:** `shaders` (.wgsl), `audio` (.spec, .track), `models` (.obj), `data` (.bin, .png, PROC)
+- Asset lists parsed at configure time to extract category-specific file dependencies
+- Unified output (`assets_data.cc`) avoids duplicate symbols while preserving granular tracking
**Header Organization:**
- `asset_manager_dcl.h`: Forward declarations
diff --git a/doc/CMAKE_MODULES.md b/doc/CMAKE_MODULES.md
index 2ea7d00..9f71d91 100644
--- a/doc/CMAKE_MODULES.md
+++ b/doc/CMAKE_MODULES.md
@@ -90,6 +90,18 @@ Creates an executable for the demo (legacy macro).
### `add_demo_test(NAME TEST_NAME LABEL SOURCES...)`
Creates a test executable and registers it with CTest (legacy macro).
+### `demo_add_asset_deps(TARGET CATEGORY)`
+Adds asset category dependencies to a target for granular rebuilds.
+
+**Categories:** `shaders`, `audio`, `models`, `data`, `all`, `test`
+
+**Example:**
+```cmake
+demo_add_asset_deps(test_synth audio) # Only depends on audio assets
+demo_add_asset_deps(test_shader_compilation shaders) # Only depends on shaders
+demo_add_asset_deps(demo64k all) # Depends on all asset categories
+```
+
---
## Conditional Inclusion
@@ -107,12 +119,13 @@ This reduces parse time when building without tests/tools.
## Adding New Components
### New Effect
-- Add sources to `cmake/DemoSourceLists.cmake` (GPU_SOURCES list)
-- No other CMake changes needed
+- Add sources to `cmake/DemoSourceLists.cmake` (`COMMON_GPU_EFFECTS` list)
+- No other CMake changes needed (automatically included in headless and normal modes)
### New Test
-- Add to `cmake/DemoTests.cmake` using `demo_add_test_with_deps()`
-- Use LINK and DEPENDS parameters for libraries/assets
+- Add to `cmake/DemoTests.cmake` using `add_demo_test()`
+- Use `demo_add_asset_deps()` to specify asset category dependencies (e.g., `shaders`, `audio`)
+- This enables granular rebuilds: only changed asset categories trigger test recompilation (see the example below)
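+
+For example (hypothetical test name, label and source path; macro signatures as documented above):
+
+```cmake
+# Test that only consumes audio assets: editing shaders will not rebuild it.
+add_demo_test(test_sequencer sequencer_basic audio src/tests/test_sequencer.cc)
+demo_add_asset_deps(test_sequencer audio)  # rebuild only when .spec/.track change
+```
+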
### New Library
- Add to `cmake/DemoLibraries.cmake` with appropriate dependencies
@@ -132,6 +145,7 @@ This reduces parse time when building without tests/tools.
4. **Reusability:** Shared macros eliminate 200+ lines of repetition
5. **Clarity:** Top-level CMakeLists.txt is a 54-line roadmap
6. **Scalability:** Easy to add new tests/tools/libraries without bloating main file
+7. **Granular Rebuilds:** Asset categories enable 3-5× faster incremental builds for typical changes
---
diff --git a/doc/CNN.md b/doc/CNN.md
deleted file mode 100644
index 2dc3362..0000000
--- a/doc/CNN.md
+++ /dev/null
@@ -1,79 +0,0 @@
-# Convolutional Neural Net Shader (CNN) post-processing
-
-**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass)
-
-## Idea
-
-Have the input 3d scene be processed by a multi-layer CNN trained on the side.
-Input: some rendered scene.
-Output: 'stylized' scene with CNN post-processing.
-
-**See `doc/CNN_EFFECT.md` for implementation details, usage, and API reference.**
-
-## Shader implementation
-
-### input / output
-
-Need 1 texture buffer per CNN layer.
-Input: (r,g,b,1/z) for layer 0 (the rendered 3D scene), or the output of layer N-1 for layer N.
-Output: (r,g,b,alpha). The 1/z information is not needed (it can be fetched from the input).
-
-### size of one layer
-
-Notation:
-S: the number of input samples from layer N-1.
-Example: 3x3 input -> S = 3x3 = 9.
-
-Each of the S samples is 4 values (r,g,b, w=1/z).
-
-Each sample is processed by a mat4 matrix: 4 inputs => 4 outputs.
-
-Weight matrix = S x mat4
-
-Final bias: 4 values.
-
-WGSL code example: See file CNN.shader
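-
-A minimal WGSL-style sketch of that per-layer computation (illustrative; the actual code lives in CNN.shader):
-
-```wgsl
-// Illustrative only: one output texel of a layer with S = 9 (3x3) input samples.
-// `w` holds one mat4 per sample position, `bias` holds the final 4 values.
-fn cnn_layer_texel(samples: array<vec4<f32>, 9>,
-                   w: array<mat4x4<f32>, 9>,
-                   bias: vec4<f32>) -> vec4<f32> {
-  var acc = bias;
-  for (var i = 0; i < 9; i++) {
-    acc += w[i] * samples[i];  // mat4 maps 4 inputs (r,g,b,1/z) to 4 outputs
-  }
-  return acc;
-}
-```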
-
-### Layers
-
-Do we need 3 or 4 layers?
-Several different shaders, one per layer.
-Ping-pong the input/output texture buffers between layers?
-
-## Implementation Status
-
-**Completed:**
-- ✅ Modular WGSL shader architecture (6 snippet files)
-- ✅ CNNEffect C++ class (single-layer rendering)
-- ✅ ShaderComposer integration (#include resolution)
-- ✅ Asset registration (7 new shader assets)
-- ✅ Test coverage (test_demo_effects.cc)
-- ✅ Placeholder identity weights for testing
-
-**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total**
-
-**Pending:**
-- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights
-- ⏳ Multi-layer rendering with ping-pong textures
-- ⏳ Weight quantization for size optimization
-
----
-
-## Training (To Be Implemented)
-
-The layer weight/bias data are hard-coded in the shaders.
-Training workflow:
-
-1. Prepare image pairs (before: raw render, after: target style)
-2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png`
-3. Script generates `cnn_weights_generated.wgsl`
-4. Rebuild: `cmake --build build -j4`
-
-**Reference:** File `CNN.py` contains training example (needs adaptation).
-
-Need a repository of reference image pairs (before/after) for training and validation.
-Each input image is randomly sampled into 3×3 patches of (r,g,b,1/z) input samples,
-which are trained to match the (r,g,b,a) output.
-
-Training generates the .wgsl code for layers' shaders.
-
diff --git a/doc/CNN_BIAS_FIX_2026-02.md b/doc/CNN_BIAS_FIX_2026-02.md
deleted file mode 100644
index 26db8eb..0000000
--- a/doc/CNN_BIAS_FIX_2026-02.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# CNN Bias Accumulation Fix (2026-02-11)
-
-## Problem
-Bias was being added multiple times in shader convolution loops (once per kernel position), causing mismatch between PyTorch training and WGSL inference.
-
-## Root Cause
-**Location**: `training/train_cnn.py:381, 398`
-
-When exporting weights to WGSL, bias was replicated for every kernel position. The shader loops through positions doing:
-```wgsl
-sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0
-```
-
-For 3×3 kernel (9 positions), bias added 9×. For 5×5, added 25×.
-
-## Fix
-Divide bias by `num_positions` during export:
-```python
-# Final layer (7→1)
-v1.append(f"{bias[0] / num_positions:.6f}")
-
-# Inner layers (7→4)
-v1.append(f"{bias[out_c] / num_positions:.6f}")
-```
-
-The shader then accumulates (bias / num_positions) × num_positions = the original bias (correct).
-
----
-
-## Additional Improvements
-
-### 1. RGBA Output Support
-**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input:
-```python
-alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy()
-output_rgba = np.concatenate([output, alpha], axis=2)
-Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA')
-```
-
-Intermediate layers also save RGBA if 4-channel.
-
-### 2. Debug Hex Output
-**Both tools** support `--debug-hex` to print first 8 pixels as hex:
-```bash
-./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex
-./build/cnn_test input.png output.png --debug-hex
-```
-
-Output format: `[0] 0xRRGGBBAA` for pixel-level comparison.
-
-### 3. Cleanup
-Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving).
-
----
-
-## Files Modified
-- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex
-- `tools/cnn_test.cc`: --debug-hex, remove linear_png
-- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias
-
-## Testing
-```bash
-# Train with fixed export
-./training/train_cnn.py --input training/input/ --target training/output/ \
- --layers 3 --kernel_sizes 3,3,3 --epochs 5000
-
-# Generate ground truth
-./training/train_cnn.py --infer input.png --export-only checkpoint.pth \
- --output ground_truth.png --debug-hex
-
-# Run GPU tool
-./build/cnn_test input.png tool_output.png --debug-hex
-
-# Compare hex output for first 8 pixels
-```
-
----
-
-## Status
-✅ Bias accumulation bug fixed
-✅ RGBA output with alpha preservation
-✅ Debug hex comparison tool
-✅ Weights regenerated
-
-Commit: `8ff8c56`
diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md
deleted file mode 100644
index ba220a0..0000000
--- a/doc/CNN_DEBUG.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# CNN Effect Black Screen Bug - Resolution (2026-02)
-
-## Problem
-CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started.
-
-## Root Causes
-
-### Bug 1: Framebuffer Capture Timing
-**Location**: `src/gpu/effect.cc`
-**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene).
-**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run.
-
-### Bug 2: Missing Uniforms Update ⚠️ CRITICAL
-**Location**: `src/effects/cnn_effect.cc`
-**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black.
-**Fix**: Added uniforms update before bind group creation (lines 132-142):
-```cpp
-const CommonPostProcessUniforms u = {
- .resolution = {(float)width_, (float)height_},
- .aspect_ratio = (float)width_ / (float)height_,
- .time = 0.0f,
- .beat = 0.0f,
- .audio_intensity = 0.0f,
-};
-uniforms_.update(ctx_.queue, u);
-```
-
-## Key Lessons
-
-1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters
-2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts
-3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors
-4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes
-
-## Files Modified
-- `src/gpu/effect.cc`: Lines 308-346 (capture timing)
-- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update)
-
-## Verification
-Test: `demo64k --seek 11.5`
-- ✅ Scene visible with RotatingCube
-- ✅ CNN stylization applied
-- ✅ All 3 layers process with correct original texture reference
diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md
deleted file mode 100644
index 40f095e..0000000
--- a/doc/CNN_EFFECT.md
+++ /dev/null
@@ -1,400 +0,0 @@
-# CNN Post-Processing Effect
-
-Neural network-based stylization for rendered scenes.
-
----
-
-## Overview
-
-Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.
-
-**Key Features:**
-- Position-aware layer 0 (coordinate input for vignetting, edge effects)
-- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
-- Original input available to all layers via framebuffer capture
-- Configurable final blend with original scene
-- Modular WGSL shader architecture
-- Hardcoded weights (trained offline via PyTorch)
-- ~5-8 KB binary footprint
-
----
-
-## Architecture
-
-### RGBD → Grayscale Pipeline
-
-**Input:** RGBD (RGB + inverse depth D=1/z)
-**Output:** Grayscale (1 channel)
-**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
-
-**Architecture:**
-- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
-- **Final layer (N-1):** Conv2d(7→1) - output grayscale
-
-```wgsl
-// Inner layers: 7→4 (RGBD output, vec4-optimized)
-fn cnn_conv3x3_7to4(
- tex: texture_2d<f32>,
- samp: sampler,
- uv: vec2<f32>,
- resolution: vec2<f32>,
-    gray: f32,                      // Grayscale [-1,1]
-    weights: array<vec4<f32>, 72>   // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
-) -> vec4<f32>
-
-// Final layer: 7→1 (grayscale output, vec4-optimized)
-fn cnn_conv3x3_7to1(
- tex: texture_2d<f32>,
- samp: sampler,
- uv: vec2<f32>,
- resolution: vec2<f32>,
- gray: f32,
-    weights: array<vec4<f32>, 18>   // 9 pos × 2 vec4 (8 floats per filter)
-) -> f32
-```
-
-**Input normalization:**
-- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
-- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
-- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
-- **Inter-layer data** stays in [-1,1] (no denormalization)
-- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]
-
-**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
-
-### Multi-Layer Architecture
-
-CNNEffect supports multi-layer networks via automatic effect chaining:
-
-1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
-2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
-3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
-4. **Original input binding**: All layers access original via `@binding(4)`
-5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`
-
-**Framebuffer Capture API:**
-- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
-- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
-- Generic mechanism usable by any effect
-
-### File Structure
-
-```
-src/effects/
- cnn_effect.h/cc # CNNEffect class + framebuffer capture
-
-workspaces/main/shaders/cnn/
- cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu
- cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware)
- cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware)
- cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware)
- cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py)
- cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py)
-```
-
----
-
-## Training Workflow
-
-### 1. Prepare Training Data
-
-Input/target image pairs:
-```
-training/input/img_000.png # RGBA (RGB + alpha)
-training/output/img_000.png # Grayscale target
-```
-
-**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily.
-
-### 2. Train Network
-
-**Patch-based (Recommended)** - Preserves natural pixel scale:
-```bash
-python3 training/train_cnn.py \
- --input training/input --target training/output \
- --patch-size 32 --patches-per-image 64 --detector harris \
- --layers 3 --kernel-sizes 3,5,3 \
- --epochs 5000 --batch-size 16 --checkpoint-every 1000
-```
-
-**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)
-
-**Full-image (Legacy)** - Resizes to 256×256:
-```bash
-python3 training/train_cnn.py \
- --input training/input --target training/output \
- --layers 3 --kernel-sizes 3,5,3 \
- --epochs 10000 --batch-size 8 --checkpoint-every 1000
-```
-
-**Auto-generates:**
-- `cnn_weights_generated.wgsl` - Weight arrays
-- `cnn_layer.wgsl` - Layer shader
-
-### 3. Export & Validate
-
-```bash
-# Export shaders
-./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth
-
-# Generate ground truth
-./training/train_cnn.py --infer input.png \
- --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
-```
-
-### 4. Rebuild Demo
-
-```bash
-cmake --build build -j4 && ./build/demo64k
-```
-
----
-
-## Usage
-
-### C++ Integration
-
-**Single layer (manual):**
-```cpp
-#include "effects/cnn_effect.h"
-
-CNNEffectParams p;
-p.layer_index = 0;
-p.total_layers = 1;
-p.blend_amount = 1.0f;
-auto cnn = std::make_shared<CNNEffect>(ctx, p);
-timeline.add_effect(cnn, start_time, end_time);
-```
-
-**Multi-layer (automatic via timeline compiler):**
-
-Use timeline syntax - `seq_compiler` expands to multiple instances.
-
-### Timeline Examples
-
-**Single-layer CNN (full stylization):**
-```
-SEQUENCE 10.0 0
- EFFECT + Hybrid3DEffect 0.00 5.00
- EFFECT + CNNEffect 0.50 5.00 layers=1
-```
-
-**Multi-layer CNN with blend:**
-```
-SEQUENCE 10.0 0
- EFFECT + Hybrid3DEffect 0.00 5.00
- EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
-```
-
-Expands to:
-```cpp
-// Layer 0 (captures original, blend=1.0)
-{
- CNNEffectParams p;
- p.layer_index = 0;
- p.total_layers = 3;
- p.blend_amount = 1.0f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
-}
-// Layer 1 (blend=1.0)
-{
- CNNEffectParams p;
- p.layer_index = 1;
- p.total_layers = 3;
- p.blend_amount = 1.0f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
-}
-// Layer 2 (final blend=0.7)
-{
- CNNEffectParams p;
- p.layer_index = 2;
- p.total_layers = 3;
- p.blend_amount = 0.7f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
-}
-```
-
----
-
-## Shader Structure
-
-**Bindings:**
-```wgsl
-@group(0) @binding(0) var smplr: sampler;
-@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input
-@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
-@group(0) @binding(3) var<uniform> params: CNNLayerParams;
-@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured)
-```
-
-**Fragment shader logic:**
-```wgsl
-@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
- let uv = p.xy / uniforms.resolution;
- let original_raw = textureSample(original_input, smplr, uv);
- let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1]
- let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
- var result = vec4<f32>(0.0);
-
- if (params.layer_index == 0) {
- result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
- weights_layer0);
- result = cnn_tanh(result);
- }
- else if (params.layer_index == 1) {
- result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
- gray, weights_layer1);
- result = cnn_tanh(result);
- }
- // ... other layers
-
- // Blend with ORIGINAL input (not previous layer)
- return mix(original_raw, result, params.blend_amount);
-}
-```
-
-**Weight Storage (vec4-optimized):**
-
-**Inner layers (7→4 RGBD output):**
-```wgsl
-// Structure: array<vec4<f32>, 72>
-// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
-const weights_layer0: array<vec4<f32>, 72> = array(
- vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights)
- vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias)
- vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights)
- vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias)
- // ... 68 more vec4s
-);
-```
-
-**Final layer (7→1 grayscale output):**
-```wgsl
-// Structure: array<vec4<f32>, 18>
-// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
-const weights_layerN: array<vec4<f32>, 18> = array(
- vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights)
- vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias)
- // ... 16 more vec4s
-);
-```
-
-**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs.
-
----
-
-## Size Budget
-
-| Component | Size | Notes |
-|-----------|------|-------|
-| Activation functions | ~200 B | 4 functions |
-| Conv3x3 (standard + coord) | ~500 B | Both variants |
-| Conv5x5 (standard + coord) | ~700 B | Both variants |
-| Conv7x7 (standard + coord) | ~900 B | Both variants |
-| Main shader | ~800 B | Layer composition |
-| C++ implementation | ~300 B | Effect class |
-| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
-| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
-| **Total** | **5-9 KB** | Acceptable for 64k |
-
-**Optimization strategies:**
-- Quantize weights (float32 → int8)
-- Prune near-zero weights
-- Use separable convolutions
-
----
-
-## Testing
-
-```bash
-./build/test_demo_effects # CNN construction/shader tests
-./build/demo64k # Visual test
-```
-
----
-
-## Blend Parameter Behavior
-
-**blend_amount** controls final compositing with original:
-- `blend=0.0`: Pure original (no CNN effect)
-- `blend=0.5`: 50% original + 50% CNN
-- `blend=1.0`: Pure CNN output (full stylization)
-
-**Important:** Blend uses captured layer 0 input, not previous layer output.
-
-**Example use cases:**
-- `blend=1.0`: Full stylization (default)
-- `blend=0.7`: Subtle effect preserving original details
-- `blend=0.3`: Light artistic touch
-
-## Troubleshooting
-
-**Shader compilation fails:**
-- Check `cnn_weights_generated.wgsl` syntax
-- Verify snippets registered in `shaders.cc::InitShaderComposer()`
-- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)
-
-**Black/corrupted output:**
-- Weights untrained (identity placeholder)
-- Check `captured_frame` auxiliary texture is registered
-- Verify layer priorities in timeline are sequential
-
-**Wrong blend result:**
-- Ensure layer 0 has `needs_framebuffer_capture() == true`
-- Check MainSequence framebuffer capture logic
-- Verify `original_input` binding is populated
-
-**Training loss not decreasing:**
-- Lower learning rate (`--learning-rate 0.0001`)
-- More epochs (`--epochs 1000`)
-- Check input/target image alignment
-
----
-
-## Vec4 Optimization
-
-**Architecture:** Weights stored as vec4 pairs for SIMD efficiency.
-
-**Input representation:**
-```wgsl
-let rgbd = textureSample(...); // vec4: [r, g, b, d]
-let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0]
-```
-
-**Weight indexing:**
-```wgsl
-var pos = 0; // Direct weight array index
-for (var dy = -1; dy <= 1; dy++) {
- for (var dx = -1; dx <= 1; dx++) {
- // Unrolled channel loop (4 output channels)
- sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1);
- sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1);
- sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1);
- sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1);
- pos += 8; // 4 channels × 2 vec4s per channel
- }
-}
-```
-
-**Benefits:**
-- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs)
-- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
-- **Bias integration:** Free via `[..., 1.0]` component (no separate add)
-- **Code simplicity:** Eliminates inner loop, direct indexing with `pos`
-- **Performance:** 2-3× GPU throughput improvement over scalar version
-
-**Weight layout per filter (8 floats):**
-- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights)
-- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)
-
-**3×3 kernel sizes:**
-- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
-- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)
-
----
-
-## References
-
-- **Training Script:** `training/train_cnn.py`
-- **Shader Composition:** `doc/SEQUENCE.md`
-- **Effect System:** `src/gpu/effect.h`
diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
deleted file mode 100644
index bf63c5d..0000000
--- a/doc/CNN_FLATTEN_ANALYSIS.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# CNN Shader Flatten Mode - Technical Analysis
-
-**Status:** Analysis complete - flatten mode NOT RECOMMENDED
-
-**Date:** February 2026
-
----
-
-## Context
-
-Current CNN architecture uses **3 sequential render passes** (linear chaining):
-- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
-- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
-- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
-
-Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
-
----
-
-## Current Architecture
-
-**Shader Structure:**
-- 1 pipeline with layer branching (`layer_index` uniform)
-- 5 bindings: sampler, input texture, uniforms, layer params, original capture
-- Total shader size: ~8 KB (snippets + weights)
-
-**Performance Profile:**
-- 3 render pass dispatches
-- 2 framebuffer writes + reads between layers
-- Memory bandwidth: ~2× framebuffer size per layer
-- Register pressure: Low (per-layer isolation)
-
-**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
-
----
-
-## Flatten Approaches Evaluated
-
-### Option A: Full Flatten (All 3 Layers)
-
-**Cascading Receptive Field:**
-
-To compute final output at position (x, y):
-- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
-- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
-- Each Layer 0 output needs 5×5 neighborhood of input samples
-
-**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
-
-**Intermediate Storage (per thread/pixel):**
-```
-Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
-Layer 1 outputs: 3×3 positions × 4 channels = 36 floats
- TOTAL = 136 floats (544 bytes)
-```
-
-**GPU Register Pressure:**
-- Modern GPUs: 32-64 KB registers per SM, shared across warps
-- 544 bytes/thread → max 64 threads/SM (**low occupancy**)
-- Current multi-pass: ~4-8 bytes/thread (high occupancy)
-
-**Pros:**
-- 1 dispatch vs 3 (reduce CPU overhead)
-- Zero framebuffer bandwidth between layers
-
-**Cons:**
-- **Severe register pressure** (10-20× increase)
-- Reduced occupancy → potential performance loss
-- Complex shader (harder debug, larger binary)
-- 9×9 input sampling
-
-**Assessment:** ❌ **Not Recommended**
-Register cost outweighs bandwidth savings.
-
----
-
-### Option B: Partial Flatten (Layers 1 + 2)
-
-Keep Layer 0 separate, flatten only Layers 1 and 2.
-
-**Pass Structure:**
-1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
-2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader
-
-**Intermediate Storage:**
-```
-Layer 0 samples: 3×3 × 4 = 36 floats (read once)
-Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
- TOTAL = 72 floats (288 bytes)
-```
-
-**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs
-
-**Pros:**
-- 2 passes vs 3 (33% reduction)
-- 1 framebuffer write saved
-- More manageable register usage
-
-**Cons:**
-- Still significant register pressure (288 bytes vs ~8 bytes baseline)
-- Medium complexity increase
-- Layer 0 (heaviest kernel) still separate
-
-**Assessment:** ⚠️ **Marginal Benefit**
-Saves 1 pass but register cost still high.
-
----
-
-### Option C: Keep Current Multi-Pass ✅
-
-**Rationale:**
-- Current architecture well-suited to GPU design (high throughput via parallelism)
-- Minimal register usage → high occupancy → hides memory latency
-- Framebuffer bandwidth cost < register pressure cost
-- Clean separation aids debugging/iteration
-- Modular (easy to add/remove layers)
-
-**Alternative Optimizations (if bandwidth critical):**
-1. Merge passes via render pass load/store ops (Vulkan subpasses)
-2. Reduce intermediate channel count (4→3 or 2)
-3. Hybrid: Compute shaders + workgroup shared memory
-4. Layer pruning (2-layer vs 3-layer quality comparison)
-
----
-
-## Recommendation
-
-**✅ Keep current multi-pass architecture**
-
-### Decision Matrix
-
-| Factor | Multi-Pass | Partial Flatten | Full Flatten |
-|--------|-----------|----------------|--------------|
-| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
-| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
-| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
-| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
-| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
-| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
-
-**Modern GPU Architecture Favors:**
-- High parallelism (many small threads) over complex threads
-- Hiding latency via occupancy over minimizing operations
-- Memory bandwidth via caching, not elimination
-
----
-
-## Alternative: Compute Shader + Shared Memory
-
-**If bandwidth becomes critical:**
-- Use compute shader with workgroup shared memory
-- Load tile + halos into shared memory (9×9 input samples)
-- Compute all 3 layers for tile interior (avoids redundant sampling)
-- Requires explicit synchronization (`workgroupBarrier`); see the sketch below
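-
-A rough WGSL sketch of that tiling idea (hypothetical names; the layer math and weight arrays are omitted):
-
-```wgsl
-// Each 8x8 workgroup loads a 16x16 tile (8x8 interior + 4-pixel halo, covering
-// the 9x9 receptive field) into shared memory once, then evaluates all layers
-// from `tile` without re-reading the framebuffer.
-@group(0) @binding(0) var scene_tex: texture_2d<f32>;
-
-var<workgroup> tile: array<vec4<f32>, 256>;  // 16x16 RGBD samples
-
-@compute @workgroup_size(8, 8)
-fn flatten_main(@builtin(workgroup_id) wg: vec3<u32>,
-                @builtin(local_invocation_index) li: u32) {
-  let dims = vec2<i32>(textureDimensions(scene_tex));
-  let origin = vec2<i32>(wg.xy) * 8 - vec2<i32>(4, 4);  // tile origin incl. halo
-  // 64 threads x 4 loads each = 256 shared samples.
-  for (var k = li * 4u; k < li * 4u + 4u; k++) {
-    let local = vec2<i32>(i32(k % 16u), i32(k / 16u));
-    let src = clamp(origin + local, vec2<i32>(0), dims - vec2<i32>(1));
-    tile[k] = textureLoad(scene_tex, src, 0);
-  }
-  workgroupBarrier();
-  // ... compute layers 0/1/2 for this thread's interior pixel from `tile`
-  // (existing weight arrays, activation and blend omitted).
-}
-```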
-
-**Trade-offs:**
-- ✅ Low register pressure + low bandwidth
-- ❌ Compute pipeline complexity (no render pass integration)
-- ❌ Tile edge handling
-- ❌ Larger code size
-
----
-
-## Conclusion
-
-Current 3-pass architecture is **appropriate for demo64k**:
-- Size-efficient (modular shaders)
-- Performance adequate (bandwidth not bottleneck)
-- Maintainable (clean layer isolation)
-
-**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint.
-
-### Size Optimization Alternatives (Better ROI)
-
-If size optimization critical, focus on:
-1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
-2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
-3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
-
-These yield better size/performance than shader architecture changes.
-
----
-
-## References
-
-- `doc/CNN_EFFECT.md` - CNN implementation details
-- `doc/CNN.md` - High-level CNN design
-- `src/effects/cnn_effect.cc` - Current implementation
-- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
deleted file mode 100644
index 3439f2c..0000000
--- a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# CNN RGBD→Grayscale Architecture Implementation
-
-## Summary
-
-Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input.
-
-## Changes Made
-
-### Architecture
-
-**Input:** RGBD (4 channels: RGB + inverse depth D=1/z)
-**Output:** Grayscale (1 channel)
-**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
-
-**Layer Configuration:**
-- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation
-- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation
-
-### Input Normalization (all to [-1,1])
-
-- **RGBD:** `(rgbd - 0.5) * 2`
-- **UV coords:** `(uv - 0.5) * 2`
-- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter)
-
-**Rationale:** Zero-centered inputs for tanh activation, better gradient flow.
-
-### Modified Files
-
-**Training (`/Users/skal/demo/training/train_cnn.py`):**
-1. Removed `CoordConv2d` class
-2. Updated `SimpleCNN`:
- - Inner layers: `Conv2d(7, 4)` - RGBD output
- - Final layer: `Conv2d(7, 1)` - grayscale output
-3. Updated `forward()`:
- - Normalize RGBD/coords/gray to [-1,1]
- - Concatenate 7-channel input for each layer
- - Apply tanh (inner) or none (final)
- - Denormalize final output
-4. Updated `export_weights_to_wgsl()`:
- - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values)
- - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values)
-5. Updated `generate_layer_shader()`:
- - Use `cnn_conv3x3_7to4` for inner layers
- - Use `cnn_conv3x3_7to1` for final layer
- - Denormalize outputs from [-1,1] to [0,1]
-6. Updated `ImagePairDataset`:
- - Load RGBA input (was RGB)
-
-**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):**
-1. Added `cnn_conv3x3_7to4()`:
- - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
- - 4-channel output: RGBD
- - Weights: `array<array<f32, 8>, 36>`
-2. Added `cnn_conv3x3_7to1()`:
- - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
- - 1-channel output: grayscale
- - Weights: `array<array<f32, 8>, 9>`
-3. Optimized: gray computed once in caller using `dot()`, not per-function
-
-**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):**
-1. Updated architecture section with RGBD→grayscale pipeline
-2. Updated training data requirements (RGBA input)
-3. Updated weight storage format
-
-### No C++ Changes
-
-CNNLayerParams and bind groups remain unchanged.
-
-## Data Flow
-
-1. Layer 0 captures original RGBD to `captured_frame`
-2. Each layer:
- - Samples previous layer output (RGBD in [0,1])
- - Normalizes RGBD to [-1,1]
- - Computes gray once using `dot()` (fs_main level)
- - Normalizes UV coords to [-1,1] (inside conv functions)
- - Concatenates 7-channel input
- - Applies convolution with layer-specific weights
- - Outputs RGBD (inner) or grayscale (final) in [-1,1]
- - Applies tanh (inner only)
- - Denormalizes to [0,1] for texture storage
- - Blends with original
-
-## Next Steps
-
-1. **Prepare RGBD training data:**
- - Input: RGBA images (RGB + depth in alpha)
- - Target: Grayscale stylized output
-
-2. **Train network:**
- ```bash
- python3 training/train_cnn.py \
- --input training/input \
- --target training/output \
- --layers 3 \
- --epochs 1000
- ```
-
-3. **Verify generated shaders:**
- - Check `cnn_weights_generated.wgsl` structure
- - Check `cnn_layer.wgsl` uses new conv functions
-
-4. **Test in demo:**
- ```bash
- cmake --build build -j4
- ./build/demo64k
- ```
-
-## Design Rationale
-
-**Why [-1,1] normalization?**
-- Centered inputs for tanh (operates best around 0)
-- Better gradient flow
-- Standard ML practice for normalized data
-
-**Why RGBD throughout vs RGB?**
-- Depth information propagates through network
-- Enables depth-aware stylization
-- Consistent 4-channel processing
-
-**Why 7-channel input?**
-- Coordinates: position-dependent effects (vignettes)
-- Grayscale: luminance-aware processing
-- RGBD: full color+depth information
-- Enables richer feature learning
-
-## Testing Checklist
-
-- [ ] Train network with RGBD input data
-- [ ] Verify `cnn_weights_generated.wgsl` structure
-- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions
-- [ ] Build demo without errors
-- [ ] Visual test: inner layers show RGBD evolution
-- [ ] Visual test: final layer produces grayscale
-- [ ] Visual test: blending works correctly
-- [ ] Compare quality with previous RGB→RGB architecture
diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md
deleted file mode 100644
index 4307894..0000000
--- a/doc/CNN_TEST_TOOL.md
+++ /dev/null
@@ -1,244 +0,0 @@
-# CNN Shader Testing Tool
-
-Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).
-
----
-
-## Purpose
-
-- Validate trained weights against ground truth
-- Debug CNN layer behavior in isolation
-- Generate test outputs for training workflow
-- Match Python training script's inference mode
-
----
-
-## Architecture
-
-**Two implementations:**
-
-1. **CNN v1** (render pipeline, texture atlas weights)
- - 3 fixed layers
- - RGBA16Float intermediates
- - BGRA8Unorm final output
-
-2. **CNN v2** (compute shaders, storage buffer weights)
- - Dynamic layer count from binary
- - 7D static features (RGBD + UV + sin + bias)
- - RGBA32Uint packed f16 intermediates
- - Storage buffer: ~3-5 KB weights
-
-**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
-- Synchronous texture-to-CPU readback
-- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
-- Protected with STRIP_ALL (0 bytes in release)
-
----
-
-## Usage
-
-```bash
-cnn_test input.png output.png [OPTIONS]
-
-OPTIONS:
- --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights)
- --weights PATH Load weights from .bin (forces CNN v2, overrides layer config)
- --blend F Final blend amount (0.0-1.0, default: 1.0)
- --format ppm|png Output format (default: png)
- --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights)
- --save-intermediates DIR Save intermediate layers to directory
- --debug-hex Print first 8 pixels as hex (debug)
- --help Show usage
-```
-
-**Examples:**
-```bash
-# CNN v1 (render pipeline, 3 layers)
-./build/cnn_test input.png output.png --cnn-version 1
-
-# CNN v2 (compute, storage buffer, uses asset system weights)
-./build/cnn_test input.png output.png --cnn-version 2
-
-# CNN v2 with runtime weight loading (loads layer config from .bin)
-./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin
-
-# 50% blend with original (v2)
-./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5
-
-# Debug hex dump
-./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
-```
-
-**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments.
-
----
-
-## Implementation Details
-
-### Core Readback Utility
-
-**File:** `src/gpu/texture_readback.{h,cc}`
-
-**Function:**
-```cpp
-std::vector<uint8_t> read_texture_pixels(
- WGPUInstance instance,
- WGPUDevice device,
- WGPUTexture texture,
- int width,
- int height);
-```
-
-**Features:**
-- Returns BGRA8 format (4 bytes per pixel)
-- Synchronous blocking operation
-- Cross-platform async callback handling (Win32 vs Native API)
-- Automatic staging buffer creation and cleanup
-
-**Refactored OffscreenRenderTarget:**
-```cpp
-std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
-#if !defined(STRIP_ALL)
- return read_texture_pixels(instance_, device_, texture_, width_, height_);
-#else
- return std::vector<uint8_t>();
-#endif
-}
-```
-
-### CNN v1 Pipeline (Render)
-
-**Fixed 3-layer architecture:**
-- Ping-pong RGBA16Float textures
-- CNNLayerParams (binding 3): layer_index, blend_amount
-- Shader composer resolves #include directives
-
-### CNN v2 Pipeline (Compute)
-
-**Dynamic layer architecture:**
-1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
-2. **Layer computes:** N layers from binary weights (3-5 typically)
- - Storage buffer weights (read-only)
- - RGBA32Uint packed f16 textures (ping-pong)
- - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
-3. **Readback:** RGBA32Uint → f16 decode → u8 clamp
-
-**Binary format:** Header (20B) + layer info (20B×N) + f16 weights
-
-**Weight Loading:**
-- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`)
-- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports)
- - Layer count and kernel sizes parsed from binary header
- - Overrides any `--layers` or `--cnn-version` arguments
- - Enables runtime testing of training checkpoints without rebuild
-
----
-
-## Build Integration
-
-**CMakeLists.txt:**
-
-1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
-2. Tool target:
-```cmake
-add_executable(cnn_test
- tools/cnn_test.cc
- src/tests/common/webgpu_test_fixture.cc
- src/tests/common/offscreen_render_target.cc
- ${PLATFORM_SOURCES}
- ${GEN_DEMO_CC})
-
-target_link_libraries(cnn_test PRIVATE
- gpu util procedural ${DEMO_LIBS})
-
-add_dependencies(cnn_test generate_demo_assets)
-
-target_compile_definitions(cnn_test PRIVATE
- STB_IMAGE_IMPLEMENTATION
- STB_IMAGE_WRITE_IMPLEMENTATION)
-```
-
-**Build:**
-```bash
-cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
-cmake --build build -j4
-```
-
----
-
-## Validation Workflow (CNN v2)
-
-### 1. Train and Export
-```bash
-# Train and export weights
-./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
-```
-
-### 2. Tool Inference
-```bash
-# Run tool with v2
-./build/cnn_test training/input/img_000.png output.png --cnn-version 2
-```
-
-### 3. Visual Comparison
-Compare output.png with training/target_X/img_000.png
-
----
-
-## Status
-
-**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation.
-
-**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool.
-- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
-- Matches CNNv2Effect architecture
-- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code
-- Root cause under investigation (weight indexing? texture sampling? activation clamping?)
-- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation
-
----
-
-## Technical Notes (Readback Fix)
-
-**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5)
-
-**Root Cause:** Callback mode mismatch
-- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`)
-- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode)
-- Callback never fired → timeout → empty buffer
-
-**Fix Applied:**
-1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents`
-2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)`
-3. Added pre-mapping device poll to ensure copy completes
-
-**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110
-
-**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes
-
----
-
-## Limitations
-
-- **CNN v1:** Produces incorrect output; use for debugging only
-- **Single image:** Batch processing requires shell loop
-- **No real-time preview:** Offline processing only
-- **PNG input:** stb_image (JPEG/PNG/BMP/TGA also supported)
-
----
-
-## Technical Notes
-
-**CNN v2 f16 decoding:**
-- RGBA32Uint texture stores 8×f16 as 4×u32
-- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
-- Handles denormals, infinity, NaN (see the sketch below)
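-
-A sketch of that decode path (illustrative only; the tool's actual decoder is authoritative):
-
-```cpp
-#include <cstdint>
-#include <cstring>
-
-// IEEE half (f16) -> f32, then clamp [0,1] -> u8.
-static float f16_to_f32(uint16_t h) {
-  const uint32_t sign = uint32_t(h & 0x8000u) << 16;
-  uint32_t exp = (h >> 10) & 0x1Fu;
-  uint32_t mant = h & 0x3FFu;
-  uint32_t bits;
-  if (exp == 0u) {                     // zero or denormal: renormalize
-    if (mant == 0u) {
-      bits = sign;
-    } else {
-      exp = 113u;                      // 127 - 15 + 1
-      while ((mant & 0x400u) == 0u) { mant <<= 1; --exp; }
-      bits = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
-    }
-  } else if (exp == 31u) {             // infinity / NaN
-    bits = sign | 0x7F800000u | (mant << 13);
-  } else {                             // normal number
-    bits = sign | ((exp - 15u + 127u) << 23) | (mant << 13);
-  }
-  float f;
-  std::memcpy(&f, &bits, sizeof(f));
-  return f;
-}
-
-static uint8_t f16_to_u8(uint16_t h) {
-  float v = f16_to_f32(h);
-  if (!(v > 0.0f)) v = 0.0f;           // also catches NaN
-  if (v > 1.0f) v = 1.0f;
-  return uint8_t(v * 255.0f + 0.5f);
-}
-```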
-
-**Cross-platform:**
-- macOS, Linux (native WebGPU)
-- Windows (mingw-w64 cross-compile)
-
-**Size impact:**
-- Debug/STRIP_ALL=OFF: compiled
-- STRIP_ALL=ON: 0 bytes (compiled out)
-- FINAL_STRIP=ON: tool not built
diff --git a/doc/CNN_V2.md b/doc/CNN_V2.md
deleted file mode 100644
index b7fd6f8..0000000
--- a/doc/CNN_V2.md
+++ /dev/null
@@ -1,813 +0,0 @@
-# CNN v2: Parametric Static Features
-
-**Technical Design Document**
-
----
-
-## Overview
-
-CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality.
-
-**Key improvements over v1:**
-- 7D static feature input (vs 4D RGB)
-- Multi-frequency position encoding (NeRF-style)
-- Configurable mip-level for p0-p3 parametric features (0-3)
-- Per-layer configurable kernel sizes (1×1, 3×3, 5×5)
-- Variable channel counts per layer
-- Float16 weight storage (~3.2 KB for 3-layer model)
-- Bias integrated as static feature dimension
-- Storage buffer architecture (dynamic layer count)
-- Binary weight format v2 for runtime loading
-- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping)
-
-**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational.
-
-**Breaking Change:**
-- Models trained with `clamp()` incompatible. Retrain required.
-
-**TODO:**
-- 8-bit quantization with QAT for 2× size reduction (~1.6 KB)
-
----
-
-## Architecture
-
-### Pipeline Overview
-
-```
-Input RGBD → Static Features Compute → CNN Layers → Output RGBA
- └─ computed once/frame ─┘ └─ multi-pass ─┘
-```
-
-**Detailed Data Flow:**
-
-```
- ┌─────────────────────────────────────────┐
- │ Static Features (computed once) │
- │ 8D: p0,p1,p2,p3,uv_x,uv_y,sin10x,bias │
- └──────────────┬──────────────────────────┘
- │
- │ 8D (broadcast to all layers)
- ├───────────────────────────┐
- │ │
- ┌──────────────┐ │ │
- │ Input RGBD │──────────────┤ │
- │ 4D │ 4D │ │
- └──────────────┘ │ │
- ▼ │
- ┌────────────┐ │
- │ Layer 0 │ (12D input) │
- │ (CNN) │ = 4D + 8D │
- │ 12D → 4D │ │
- └─────┬──────┘ │
- │ 4D output │
- │ │
- ├───────────────────────────┘
- │ │
- ▼ │
- ┌────────────┐ │
- │ Layer 1 │ (12D input) │
- │ (CNN) │ = 4D + 8D │
- │ 12D → 4D │ │
- └─────┬──────┘ │
- │ 4D output │
- │ │
- ├───────────────────────────┘
- ▼ │
- ... │
- │ │
- ▼ │
- ┌────────────┐ │
- │ Layer N │ (12D input) │
- │ (output) │◄──────────────────┘
- │ 12D → 4D │
- └─────┬──────┘
- │ 4D (RGBA)
- ▼
- Output
-```
-
-**Key Points:**
-- Static features computed once, broadcast to all CNN layers
-- Each layer: previous 4D output + 8D static → 12D input → 4D output
-- Ping-pong buffering between layers
-- Layer 0 special case: uses input RGBD instead of previous layer output
-
-**Static Features Texture:**
-- Name: `static_features`
-- Format: `texture_storage_2d<rgba32uint, write>` (4×u32)
-- Data: 8 float16 values packed via `pack2x16float()`
-- Computed once per frame, read by all CNN layers
-- Lifetime: Entire frame (all CNN layer passes)
-
-**CNN Layers:**
-- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels
-- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels
-- All layers: uniform 12D input, 4D output (ping-pong buffer)
-- Storage: `texture_storage_2d<rgba32uint>` (4 channels as 2×f16 pairs)
-
-**Activation Functions:**
-- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping
-- Middle layers: `ReLU` (max(0, x))
-- Rationale: Sigmoid prevents gradient blocking at boundaries, enabling better convergence
-- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible, retrain required
-
----
-
-## Static Features (7D + 1 bias)
-
-### Feature Layout
-
-**8 float16 values per pixel:**
-
-```wgsl
-// Slot 0-3: Parametric features (p0, p1, p2, p3)
-// Sampled from configurable mip level (0=original, 1=half, 2=quarter, 3=eighth)
-// Training sets mip_level via --mip-level flag, stored in binary format v2
-let p0 = ...; // RGB.r from selected mip level
-let p1 = ...; // RGB.g from selected mip level
-let p2 = ...; // RGB.b from selected mip level
-let p3 = ...; // Depth or RGB channel from mip level
-
-// Slot 4-5: UV coordinates (normalized screen space)
-let uv_x = coord.x / resolution.x; // Horizontal position [0,1]
-let uv_y = coord.y / resolution.y; // Vertical position [0,1]
-
-// Slot 6: Multi-frequency position encoding
-let sin20_y = sin(20.0 * uv_y); // Periodic feature (frequency=20, vertical)
-
-// Slot 7: Bias dimension (always 1.0)
-let bias = 1.0; // Learned bias per output channel
-
-// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0]
-```
-
-### Input Channel Mapping
-
-**Weight tensor layout (12 input channels per layer):**
-
-| Input Channel | Feature | Description |
-|--------------|---------|-------------|
-| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) |
-| 4-11 | Static features | 8D: p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias |
-
-**Static feature channel details:**
-- Channel 4 → p0 (RGB.r from mip level)
-- Channel 5 → p1 (RGB.g from mip level)
-- Channel 6 → p2 (RGB.b from mip level)
-- Channel 7 → p3 (depth or RGB channel from mip level)
-- Channel 8 → p4 (uv_x: normalized horizontal position)
-- Channel 9 → p5 (uv_y: normalized vertical position)
-- Channel 10 → p6 (sin(20*uv_y): periodic encoding)
-- Channel 11 → p7 (bias: constant 1.0)
-
-**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7.
-
-### Feature Rationale
-
-| Feature | Dimension | Purpose | Priority |
-|---------|-----------|---------|----------|
-| p0-p3 | 4D | Parametric auxiliary features (mips, gradients, etc.) | Essential |
-| UV coords | 2D | Spatial position awareness | Essential |
-| sin(20\*uv.y) | 1D | Periodic position encoding (vertical) | Medium |
-| Bias | 1D | Learned bias (standard NN) | Essential |
-
-**Note:** Input image RGBD (mip 0) fed only to Layer 0. Subsequent layers see static features + previous layer output.
-
-**Why bias as static feature:**
-- Simpler shader code (single weight array)
-- Standard NN formulation: y = Wx (x includes bias term)
-- Saves 56-112 bytes (no separate bias buffer)
-- 7 features sufficient for initial implementation
-
-### Future Feature Extensions
-
-**Option: Additional encodings:**
-- `sin(40*uv.y)` - Higher frequency encoding
-- `gray_mip1` - Multi-scale luminance
-- `dx`, `dy` - Sobel gradients
-- `variance` - Local texture measure
-- `laplacian` - Edge detection
-
-**Option: uint8 packing (16+ features):**
-```wgsl
-// texture_storage_2d<rgba8unorm> stores 16 uint8 values
-// Trade precision for feature count
-// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
-// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias]
-```
-Requires quantization-aware training.
-
----
-
-## Layer Structure
-
-### Example 3-Layer Network
-
-```
-Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel)
-Layer 1: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel)
-Layer 2: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA)
-```
-
-**Output:** 4 channels (RGBA). Training targets preserve alpha from target images.
-
-### Weight Calculations
-
-**Per-layer weights (uniform 12D→4D, 3×3 kernels):**
-```
-Layer 0: 12 × 3 × 3 × 4 = 432 weights
-Layer 1: 12 × 3 × 3 × 4 = 432 weights
-Layer 2: 12 × 3 × 3 × 4 = 432 weights
-Total: 1296 weights
-```
-
-**Storage sizes:**
-- f32: 1296 × 4 = 5,184 bytes (~5.1 KB)
-- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended**
-
-**Comparison to v1:**
-- v1: ~800 weights (3.2 KB f32)
-- v2: ~1296 weights (2.5 KB f16)
-- **Uniform architecture, smaller than v1 f32**
-
-### Kernel Size Guidelines
-
-**1×1 kernel (pointwise):**
-- No spatial context, channel mixing only
-- Weights: `12 × 4 = 48` per layer
-- Use for: Fast inference, channel remapping
-
-**3×3 kernel (standard conv):**
-- Local spatial context (recommended)
-- Weights: `12 × 9 × 4 = 432` per layer
-- Use for: Most layers (balanced quality/size)
-
-**5×5 kernel (large receptive field):**
-- Wide spatial context
-- Weights: `12 × 25 × 4 = 1200` per layer
-- Use for: Output layer, fine detail enhancement
-
-### Channel Storage (4×f16 per texel)
-
-```wgsl
-@group(0) @binding(1) var layer_input: texture_2d<u32>;
-
-fn unpack_channels(coord: vec2<i32>) -> vec4<f32> {
- let packed = textureLoad(layer_input, coord, 0);
- let v0 = unpack2x16float(packed.x); // [ch0, ch1]
- let v1 = unpack2x16float(packed.y); // [ch2, ch3]
- return vec4<f32>(v0.x, v0.y, v1.x, v1.y);
-}
-
-fn pack_channels(values: vec4<f32>) -> vec4<u32> {
- return vec4<u32>(
- pack2x16float(vec2(values.x, values.y)),
- pack2x16float(vec2(values.z, values.w)),
- 0u, // Unused
- 0u // Unused
- );
-}
-```
-
----
-
-## Training Workflow
-
-### Script: `training/train_cnn_v2.py`
-
-**Static Feature Extraction:**
-
-```python
-def compute_static_features(rgb, depth, mip_level=0):
- """Generate parametric features (8D: p0-p3 + spatial).
-
- Args:
- mip_level: 0=original, 1=half res, 2=quarter res, 3=eighth res
- """
- h, w = rgb.shape[:2]
-
- # Generate mip level for p0-p3 (downsample then upsample)
- if mip_level > 0:
- mip_rgb = rgb.copy()
- for _ in range(mip_level):
- mip_rgb = cv2.pyrDown(mip_rgb)
- for _ in range(mip_level):
- mip_rgb = cv2.pyrUp(mip_rgb)
- if mip_rgb.shape[:2] != (h, w):
- mip_rgb = cv2.resize(mip_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
- else:
- mip_rgb = rgb
-
- # Parametric features from mip level
- p0, p1, p2, p3 = mip_rgb[..., 0], mip_rgb[..., 1], mip_rgb[..., 2], depth
-
- # UV coordinates (normalized)
- uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0)
- uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1)
-
- # Multi-frequency position encoding
- sin10_x = np.sin(10.0 * uv_x)
-
- # Bias dimension (always 1.0)
- bias = np.ones_like(p0)
-
- # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin10_x, bias]
- return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias], axis=-1)
-```
-
-**Network Definition:**
-
-```python
-class CNNv2(nn.Module):
- def __init__(self, kernel_sizes, num_layers=3):
- super().__init__()
- if isinstance(kernel_sizes, int):
- kernel_sizes = [kernel_sizes] * num_layers
- self.kernel_sizes = kernel_sizes
- self.layers = nn.ModuleList()
-
- # All layers: 12D input (4 prev + 8 static) → 4D output
- for kernel_size in kernel_sizes:
- self.layers.append(
- nn.Conv2d(12, 4, kernel_size=kernel_size,
- padding=kernel_size//2, bias=False)
- )
-
- def forward(self, input_rgbd, static_features):
- # Layer 0: input RGBD (4D) + static (8D) = 12D
- x = torch.cat([input_rgbd, static_features], dim=1)
- x = self.layers[0](x)
- x = torch.sigmoid(x) # Soft [0,1] for layer 0
-
- # Layer 1+: previous output (4D) + static (8D) = 12D
- for i in range(1, len(self.layers)):
- x_input = torch.cat([x, static_features], dim=1)
- x = self.layers[i](x_input)
- if i < len(self.layers) - 1:
- x = F.relu(x)
- else:
- x = torch.sigmoid(x) # Soft [0,1] for final layer
-
- return x # RGBA output
-```
-
-**Training Configuration:**
-
-```python
-# Hyperparameters
-kernel_sizes = [3, 3, 3] # Per-layer kernel sizes (e.g., [1,3,5])
-num_layers = 3 # Number of CNN layers
-mip_level = 0 # Mip level for p0-p3: 0=orig, 1=half, 2=quarter, 3=eighth
-grayscale_loss = False # Compute loss on grayscale (Y) instead of RGBA
-learning_rate = 1e-3
-batch_size = 16
-epochs = 5000
-
-# Dataset: Input RGB, Target RGBA (preserves alpha channel from image)
-# Model outputs RGBA, loss compares all 4 channels (or grayscale if --grayscale-loss)
-
-# Training loop (standard PyTorch f32)
-for epoch in range(epochs):
- for rgb_batch, depth_batch, target_batch in dataloader:
- # Compute static features (8D) with mip level
- static_feat = compute_static_features(rgb_batch, depth_batch, mip_level)
-
- # Input RGBD (4D)
- input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1)
-
- # Forward pass
- output = model(input_rgbd, static_feat)
-
- # Loss computation (grayscale or RGBA)
- if grayscale_loss:
- # Convert RGBA to grayscale: Y = 0.299*R + 0.587*G + 0.114*B
- output_gray = 0.299 * output[:, 0:1] + 0.587 * output[:, 1:2] + 0.114 * output[:, 2:3]
- target_gray = 0.299 * target[:, 0:1] + 0.587 * target[:, 1:2] + 0.114 * target[:, 2:3]
- loss = criterion(output_gray, target_gray)
- else:
- loss = criterion(output, target_batch)
-
- # Backward pass
- optimizer.zero_grad()
- loss.backward()
- optimizer.step()
-```
-
-**Checkpoint Format:**
-
-```python
-torch.save({
- 'state_dict': model.state_dict(), # f32 weights
- 'config': {
- 'kernel_sizes': [3, 3, 3], # Per-layer kernel sizes
- 'num_layers': 3,
- 'mip_level': 0, # Mip level used for p0-p3
- 'grayscale_loss': False, # Whether grayscale loss was used
- 'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin10_x', 'bias']
- },
- 'epoch': epoch,
- 'loss': loss.item()
-}, f'checkpoints/checkpoint_epoch_{epoch}.pth')
-```
-
----
-
-## Export Workflow
-
-### Script: `training/export_cnn_v2_shader.py`
-
-**Process:**
-1. Load checkpoint (f32 PyTorch weights)
-2. Extract layer configs (kernels, channels)
-3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)`
-4. Generate WGSL shader per layer
-5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl`
-
-**Example Generated Shader:**
-
-```wgsl
-// cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth
-
-const KERNEL_SIZE: u32 = 1u;
-const IN_CHANNELS: u32 = 8u; // 7 features + bias
-const OUT_CHANNELS: u32 = 16u;
-
-// Weights quantized to float16 (stored as f32 in shader)
-const weights: array<f32, 128> = array(
- 0.123047, -0.089844, 0.234375, 0.456055, ...
-);
-
-@group(0) @binding(0) var static_features: texture_2d<u32>;
-@group(0) @binding(1) var output_texture: texture_storage_2d<rgba32uint, write>;
-
-@compute @workgroup_size(8, 8)
-fn main(@builtin(global_invocation_id) id: vec3<u32>) {
- // Load static features (8D)
- let static_feat = get_static_features(vec2<i32>(id.xy));
-
- // Convolution (1×1 kernel = pointwise)
- var output: array<f32, OUT_CHANNELS>;
- for (var c: u32 = 0u; c < OUT_CHANNELS; c++) {
- var sum: f32 = 0.0;
- for (var k: u32 = 0u; k < IN_CHANNELS; k++) {
- sum += weights[c * IN_CHANNELS + k] * static_feat[k];
- }
- output[c] = max(0.0, sum); // ReLU activation
- }
-
- // Pack and store (8×f16 per texel)
- textureStore(output_texture, vec2<i32>(id.xy), pack_f16x8(output));
-}
-```
-
-**Float16 Quantization:**
-- Training uses f32 throughout (PyTorch standard)
-- Export converts to np.float16, then back to f32 for WGSL literals
-- **Expected discrepancy:** <0.1% MSE (acceptable)
-- Validation via HTML tool (see below)
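-
-As an illustration (not the actual export script), the f16 round trip can be reproduced with a few lines of numpy:
-
-```python
-import numpy as np
-
-def quantize_for_wgsl(weights_f32):
-    """Quantize to f16 precision and emit the f32 literals written into the shader."""
-    w_f16 = weights_f32.astype(np.float16)            # quantization step
-    w_back = w_f16.astype(np.float32)                 # value the WGSL literal encodes
-    mse = float(np.mean((weights_f32 - w_back) ** 2))
-    literals = ", ".join(f"{v:.6f}" for v in w_back.ravel())
-    return literals, mse
-```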
-
----
-
-## Validation Workflow
-
-### HTML Tool: `tools/cnn_v2_test/index.html`
-
-**WebGPU-based testing tool** with layer visualization.
-
-**Usage:**
-1. Open `tools/cnn_v2_test/index.html` in browser
-2. Drop `.bin` weights file (from `export_cnn_v2_weights.py`)
-3. Drop PNG test image
-4. View results with layer inspection
-
-**Features:**
-- Live CNN inference with WebGPU
-- Layer-by-layer visualization (static features + all CNN layers)
-- Weight visualization (per-layer kernels)
-- View modes: CNN output, original, diff (×10)
-- Blend control for comparing with original
-
-**Export weights:**
-```bash
-./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \
- --output-weights workspaces/main/cnn_v2_weights.bin
-```
-
-See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation
-
----
-
-## Implementation Checklist
-
-### Phase 1: Shaders (Core Infrastructure)
-
-- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute
- - [ ] RGBD sampling from framebuffer
- - [ ] UV coordinate calculation
- - [ ] sin(10\*uv.x) computation
- - [ ] Bias dimension (constant 1.0)
- - [ ] Float16 packing via `pack2x16float()`
- - [ ] Output to `texture_storage_2d<rgba32uint>`
-
-- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template
- - [ ] Static features unpacking
- - [ ] Previous layer unpacking (8×f16)
- - [ ] Convolution implementation (1×1, 3×3, 5×5)
- - [ ] ReLU activation
- - [ ] Output packing (8×f16)
- - [ ] Proper padding handling
-
-### Phase 2: C++ Effect Class
-
-- [ ] `src/effects/cnn_v2_effect.h` - Header
- - [ ] Class declaration inheriting from `PostProcessEffect`
- - [ ] Static features texture member
- - [ ] Layer textures vector
- - [ ] Pipeline and bind group members
-
-- [ ] `src/effects/cnn_v2_effect.cc` - Implementation
- - [ ] Constructor: Load shaders, create textures
- - [ ] `init()`: Create pipelines, bind groups
- - [ ] `render()`: Multi-pass execution
- - [ ] Pass 0: Compute static features
- - [ ] Pass 1-N: CNN layers
- - [ ] Final: Composite to output
- - [ ] Proper resource cleanup
-
-- [ ] Integration
- - [ ] Add to `src/gpu/demo_effects.h` includes
- - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal)
- - [ ] Add shaders to `workspaces/main/assets.txt`
- - [ ] Add to `src/tests/gpu/test_demo_effects.cc`
-
-### Phase 3: Training Pipeline
-
-- [ ] `training/train_cnn_v2.py` - Training script
- - [ ] Static feature extraction function
- - [ ] CNNv2 PyTorch model class
- - [ ] Patch-based dataloader
- - [ ] Training loop with checkpointing
- - [ ] Command-line argument parsing
- - [ ] Inference mode (ground truth generation)
-
-- [ ] `training/export_cnn_v2_shader.py` - Export script
- - [ ] Checkpoint loading
- - [ ] Weight extraction and f16 quantization
- - [ ] Per-layer WGSL generation
- - [ ] File output to workspace shaders/
- - [ ] Metadata preservation
-
-### Phase 4: Tools & Validation
-
-- [x] HTML validation tool - WebGPU inference with layer visualization
- - [ ] Command-line argument parsing
- - [ ] Shader export orchestration
- - [ ] Build orchestration
- - [ ] Batch image processing
- - [ ] Results display
-
-- [ ] `src/tools/cnn_test_main.cc` - Tool updates
- - [ ] Add `--cnn-version v2` flag
- - [ ] CNNv2Effect instantiation path
- - [ ] Static features pass execution
- - [ ] Multi-layer processing
-
-### Phase 5: Documentation
-
-- [ ] `doc/HOWTO.md` - Usage guide
- - [ ] Training section (CNN v2)
- - [ ] Export section
- - [ ] Validation section
- - [ ] Examples
-
-- [ ] `README.md` - Project overview update
- - [ ] Mention CNN v2 capability
-
----
-
-## File Structure
-
-### New Files
-
-```
-# Shaders (generated by export script)
-workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl # Static features compute
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl # Input layer (generated)
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl # Inner layer (generated)
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl # Output layer (generated)
-
-# C++ implementation
-src/effects/cnn_v2_effect.h # Effect class header
-src/effects/cnn_v2_effect.cc # Effect implementation
-
-# Python training/export
-training/train_cnn_v2.py # Training script
-training/export_cnn_v2_shader.py # Shader generator
-training/validation/ # Test images directory
-
-# Validation
-tools/cnn_v2_test/index.html # WebGPU validation tool
-
-# Documentation
-doc/CNN_V2.md # This file
-```
-
-### Modified Files
-
-```
-src/gpu/demo_effects.h # Add CNNv2Effect include
-CMakeLists.txt # Add cnn_v2_effect.cc
-workspaces/main/assets.txt # Add cnn_v2 shaders
-workspaces/main/timeline.seq # Optional: add CNNv2Effect
-src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case
-src/tools/cnn_test_main.cc # Add --cnn-version v2
-doc/HOWTO.md # Add CNN v2 sections
-TODO.md # Add CNN v2 task
-```
-
-### Unchanged (v1 Preserved)
-
-```
-training/train_cnn.py # Original training
-src/effects/cnn_effect.* # Original effect
-workspaces/main/shaders/cnn_*.wgsl # Original v1 shaders
-```
-
----
-
-## Performance Characteristics
-
-### Static Features Compute
-- **Cost:** ~0.1ms @ 1080p
-- **Frequency:** Once per frame
-- **Operations:** sin(), texture sampling, packing
-
-### CNN Layers (Example 3-layer)
-- **Layer0 (1×1, 8→16):** ~0.3ms
-- **Layer1 (3×3, 23→8):** ~0.8ms
-- **Layer2 (5×5, 15→4):** ~1.2ms
-- **Total:** ~2.4ms @ 1080p
-
-### Memory Usage
-- Static features: 1920×1080×8×2 = 33 MB (f16)
-- Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels)
-- Weights: ~6.4 KB (f16, in shader code)
-- **Total GPU memory:** ~100 MB
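-
-The figures follow directly from width × height × channels × 2 bytes (f16), e.g.:
-
-```python
-def texture_bytes(width, height, channels, bytes_per_value=2):
-    """Footprint of a packed-f16 feature texture."""
-    return width * height * channels * bytes_per_value
-
-print(texture_bytes(1920, 1080, 8) / 1e6)    # ~33 MB static features
-print(texture_bytes(1920, 1080, 16) / 1e6)   # ~66 MB widest layer buffer
-```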
-
----
-
-## Size Budget
-
-### CNN v1 vs v2
-
-| Metric | v1 | v2 | Delta |
-|--------|----|----|-------|
-| Weights (count) | 800 | 3268 | +2468 |
-| Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB |
-| Storage (f16) | N/A | 6.5 KB | +6.5 KB |
-| Shader code | ~500 lines | ~800 lines | +300 lines |
-
-### Mitigation Strategies
-
-**Reduce channels:**
-- [16,8,4] → [8,4,4] saves ~50% weights
-- [16,8,4] → [4,4,4] saves ~60% weights
-
-**Smaller kernels:**
-- [1,3,5] → [1,3,3] saves ~30% weights
-- [1,3,5] → [1,1,3] saves ~50% weights
-
-**Quantization:**
-- int8 weights: saves 75% (requires QAT training)
-- 4-bit weights: saves 87.5% (extreme, needs research)
-
-**Target:** Keep CNN v2 under 10 KB for 64k demo constraint
-
----
-
-## Future Extensions
-
-### Flexible Feature Layout (Binary Format v3)
-
-**TODO:** Support arbitrary feature vector layouts and ordering in binary format.
-
-**Current Limitation:**
-- Feature layout hardcoded: `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]`
-- Shader must match training script exactly
-- Experimentation requires shader recompilation
-
-**Proposed Enhancement:**
-- Add feature descriptor to binary format header
-- Specify feature types, sources, and ordering
-- Runtime shader generation or dynamic feature indexing
-- Examples: `[R, G, B, dx, dy, uv_x, bias]` or `[mip1.r, mip2.g, laplacian, uv_x, sin20_x, bias]`
-
-**Benefits:**
-- Training experiments without C++/shader changes
-- A/B test different feature combinations
-- Single binary format, multiple architectures
-- Faster iteration on feature engineering
-
-**Implementation Options:**
-1. **Static approach:** Generate shader code from descriptor at load time
-2. **Dynamic approach:** Array-based indexing with feature map uniform
-3. **Hybrid:** Precompile common layouts, fallback to dynamic
-
-See `doc/CNN_V2_BINARY_FORMAT.md` for proposed descriptor format.
-
----
-
-### More Features (uint8 Packing)
-
-```wgsl
-// 16 uint8 features per texel (texture_storage_2d<rgba8unorm>)
-// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
-// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias]
-```
-- Trade precision for quantity
-- Requires quantization-aware training
-
-### Temporal Features
-
-- Previous frame RGBA (motion awareness)
-- Optical flow vectors
-- Requires multi-frame buffer
-
-### Learned Position Encodings
-
-- Replace hand-crafted sin(10\*uv) with learned embeddings
-- Requires separate embedding network
-- Similar to NeRF position encoding
-
-### Dynamic Architecture
-
-- Runtime kernel size selection based on scene
-- Conditional layer execution (skip connections)
-- Layer pruning for performance
-
----
-
-## References
-
-- **v1 Implementation:** `src/effects/cnn_effect.*`
-- **Training Guide:** `doc/HOWTO.md` (CNN Training section)
-- **Test Tool:** `doc/CNN_TEST_TOOL.md`
-- **Shader System:** `doc/SEQUENCE.md`
-- **Size Measurement:** `doc/SIZE_MEASUREMENT.md`
-
----
-
-## Appendix: Design Decisions
-
-### Why Bias as Static Feature?
-
-**Alternatives considered:**
-1. Separate bias array per layer (Option B)
-2. Bias as static feature = 1.0 (Option A, chosen)
-
-**Decision rationale:**
-- Simpler shader code (fewer bindings)
-- Standard NN formulation (augmented input)
-- Saves 56-112 bytes per model
-- 7 features sufficient for v1 implementation
-- Can extend to uint8 packing if >7 features needed
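-
-The augmented-input formulation is easy to verify with a small PyTorch check (illustrative only, 1×1 kernel):
-
-```python
-import torch
-import torch.nn as nn
-
-x = torch.rand(1, 7, 16, 16)
-ones = torch.ones(1, 1, 16, 16)                   # bias folded in as a constant 1.0 channel
-
-conv_bias = nn.Conv2d(7, 4, kernel_size=1, bias=True)
-conv_aug = nn.Conv2d(8, 4, kernel_size=1, bias=False)
-with torch.no_grad():
-    conv_aug.weight[:, :7] = conv_bias.weight
-    conv_aug.weight[:, 7:] = conv_bias.bias.view(4, 1, 1, 1)
-
-assert torch.allclose(conv_bias(x), conv_aug(torch.cat([x, ones], dim=1)), atol=1e-5)
-```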
-
-### Why Float16 for Weights?
-
-**Alternatives considered:**
-1. Keep f32 (larger, more accurate)
-2. Use f16 (smaller, GPU-native)
-3. Use int8 (smallest, needs QAT)
-
-**Decision rationale:**
-- f16 saves 50% vs f32 (critical for 64k target)
-- GPU-native support (pack2x16float in WGSL)
-- <0.1% accuracy loss (acceptable)
-- Simpler than int8 quantization
-
-### Why Multi-Frequency Position Encoding?
-
-**Inspiration:** NeRF (Neural Radiance Fields)
-
-**Benefits:**
-- Helps network learn high-frequency details
-- Better than raw UV coordinates
-- Small footprint (1D per frequency)
-
-**Future:** Add sin(20\*uv), sin(40\*uv) if >7 features available
-
----
-
-## Related Documentation
-
-- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format)
-- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization
-- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated)
-- `doc/HOWTO.md` - Training and validation workflows
-
----
-
-**Document Version:** 1.0
-**Last Updated:** 2026-02-12
-**Status:** Design approved, ready for implementation
diff --git a/doc/CNN_V2_BINARY_FORMAT.md b/doc/CNN_V2_BINARY_FORMAT.md
deleted file mode 100644
index 59c859d..0000000
--- a/doc/CNN_V2_BINARY_FORMAT.md
+++ /dev/null
@@ -1,235 +0,0 @@
-# CNN v2 Binary Weight Format Specification
-
-Binary format for storing trained CNN v2 weights with static feature architecture.
-
-**File Extension:** `.bin`
-**Byte Order:** Little-endian
-**Version:** 2.0 (supports mip-level for parametric features)
-**Backward Compatible:** Version 1.0 files supported (mip_level=0)
-
----
-
-## File Structure
-
-**Version 2 (current):**
-```
-┌─────────────────────┐
-│ Header (20 bytes) │
-├─────────────────────┤
-│ Layer Info │
-│ (20 bytes × N) │
-├─────────────────────┤
-│ Weight Data │
-│ (variable size) │
-└─────────────────────┘
-```
-
-**Version 1 (legacy):**
-```
-┌─────────────────────┐
-│ Header (16 bytes) │
-├─────────────────────┤
-│ Layer Info │
-│ (20 bytes × N) │
-├─────────────────────┤
-│ Weight Data │
-│ (variable size) │
-└─────────────────────┘
-```
-
----
-
-## Header
-
-**Version 2 (20 bytes):**
-
-| Offset | Type | Field | Description |
-|--------|------|----------------|--------------------------------------|
-| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") |
-| 0x04 | u32 | version | Format version (2 for current) |
-| 0x08 | u32 | num_layers | Number of CNN layers (excludes static features) |
-| 0x0C | u32 | total_weights | Total f16 weight count across all layers |
-| 0x10 | u32 | mip_level | Mip level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) |
-
-**Version 1 (16 bytes) - Legacy:**
-
-| Offset | Type | Field | Description |
-|--------|------|----------------|--------------------------------------|
-| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") |
-| 0x04 | u32 | version | Format version (1) |
-| 0x08 | u32 | num_layers | Number of CNN layers |
-| 0x0C | u32 | total_weights | Total f16 weight count |
-
-**Note:** Loaders should check version field and handle both formats. Version 1 files treated as mip_level=0.
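-
-A minimal Python reader for the two header variants (sketch only; the real loaders live in `src/effects/cnn_v2_effect.cc` and the HTML tool):
-
-```python
-import struct
-
-def read_cnn_v2_header(fp):
-    """Read the v1 (16-byte) or v2 (20-byte) header; v1 files default to mip_level=0."""
-    magic, version, num_layers, total_weights = struct.unpack('<4I', fp.read(16))
-    assert magic == 0x324E4E43, "not a CNN v2 weight file"
-    mip_level = struct.unpack('<I', fp.read(4))[0] if version >= 2 else 0
-    return version, num_layers, total_weights, mip_level
-```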
-
----
-
-## Layer Info (20 bytes per layer)
-
-Repeated `num_layers` times:
-- **Version 2:** Starting at offset 0x14 (20 bytes)
-- **Version 1:** Starting at offset 0x10 (16 bytes)
-
-| Offset | Type | Field | Description |
-|-------------|------|----------------|--------------------------------------|
-| 0x00 | u32 | kernel_size | Convolution kernel dimension (3, 5, 7, etc.) |
-| 0x04 | u32 | in_channels | Input channel count (includes 8 static features for Layer 1) |
-| 0x08 | u32 | out_channels | Output channel count (max 8) |
-| 0x0C | u32 | weight_offset | Weight array start index (f16 units, relative to weight data section) |
-| 0x10 | u32 | weight_count | Number of f16 weights for this layer |
-
-**Layer Order:** Sequential (Layer 1, Layer 2, Layer 3, ...)
-
----
-
-## Weight Data (variable size)
-
-Starts at offset:
-- **Version 2:** `20 + (num_layers × 20)`
-- **Version 1:** `16 + (num_layers × 20)`
-
-**Format:** Packed f16 pairs stored as u32
-**Packing:** `u32 = (f16_hi << 16) | f16_lo`
-**Storage:** Sequential by layer, then by output channel, input channel, spatial position
-
-**Weight Indexing:**
-```
-weight_idx = output_ch × (in_channels × kernel_size²) +
- input_ch × kernel_size² +
- (ky × kernel_size + kx)
-```
-
-Where:
-- `output_ch` ∈ [0, out_channels)
-- `input_ch` ∈ [0, in_channels)
-- `ky`, `kx` ∈ [0, kernel_size)
-
-**Unpacking f16 from u32:**
-```c
-uint32_t packed = weights_buffer[weight_idx / 2];
-uint16_t f16_bits = (weight_idx % 2 == 0) ? (packed & 0xFFFF) : (packed >> 16);
-```
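-
-On the export side, the same packing rule can be expressed in a few lines of numpy (a sketch, not the actual exporter):
-
-```python
-import numpy as np
-
-def pack_f16_pairs(weights_f32):
-    """Pack f16 weights into u32 words: even index -> low 16 bits, odd -> high 16 bits."""
-    w = weights_f32.astype(np.float16).ravel()
-    if w.size % 2:
-        w = np.concatenate([w, np.zeros(1, dtype=np.float16)])   # pad to an even count
-    bits = w.view(np.uint16).astype(np.uint32)
-    return bits[0::2] | (bits[1::2] << 16)                       # u32 = (f16_hi << 16) | f16_lo
-```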
-
----
-
-## Example: 3-Layer Network (Version 2)
-
-**Configuration:**
-- Mip level: 0 (original resolution)
-- Layer 0: 12→4, kernel 3×3 (432 weights)
-- Layer 1: 12→4, kernel 3×3 (432 weights)
-- Layer 2: 12→4, kernel 3×3 (432 weights)
-
-**File Layout:**
-```
-Offset Size Content
------- ---- -------
-0x00 20 Header (magic, version=2, layers=3, weights=1296, mip_level=0)
-0x14 20 Layer 0 info (kernel=3, in=12, out=4, offset=0, count=432)
-0x28 20 Layer 1 info (kernel=3, in=12, out=4, offset=432, count=432)
-0x3C 20 Layer 2 info (kernel=3, in=12, out=4, offset=864, count=432)
-0x50 2592 Weight data (1296 f16 weights packed as 648 u32 words)
- ----
-Total: 2672 bytes (~2.6 KB)
-```
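-
-The size arithmetic above can be sanity-checked in a few lines of Python (illustrative only):
-
-```python
-def cnn_v2_file_size(layers, version=2):
-    """Expected .bin size from (kernel_size, in_channels, out_channels) per layer."""
-    header = 20 if version == 2 else 16
-    total_weights = sum(k * k * cin * cout for k, cin, cout in layers)
-    return header + 20 * len(layers) + 2 * total_weights
-
-assert cnn_v2_file_size([(3, 12, 4)] * 3) == 2672   # three 3x3 layers, 12 -> 4 channels
-```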
-
----
-
-## Static Features
-
-Not stored in .bin file (computed at runtime):
-
-**8D Input Features:**
-1. **p0** - Parametric feature 0 (from mip level)
-2. **p1** - Parametric feature 1 (from mip level)
-3. **p2** - Parametric feature 2 (from mip level)
-4. **p3** - Parametric feature 3 (depth or from mip level)
-5. **UV_X** - Normalized x coordinate [0,1]
-6. **UV_Y** - Normalized y coordinate [0,1]
-7. **sin(20 × UV_Y)** - Spatial frequency encoding (vertical, frequency=20)
-8. **1.0** - Bias term
-
-**Mip Level Usage (p0-p3):**
-- `mip_level=0`: RGB from original resolution (mip 0)
-- `mip_level=1`: RGB from half resolution (mip 1), upsampled
-- `mip_level=2`: RGB from quarter resolution (mip 2), upsampled
-- `mip_level=3`: RGB from eighth resolution (mip 3), upsampled
-
-**Layer 0** receives input RGBD (4D) + static features (8D) = 12D input → 4D output.
-**Layer 1+** receive previous layer output (4D) + static features (8D) = 12D input → 4D output.
-
----
-
-## Validation
-
-**Magic Check:**
-```c
-uint32_t magic;
-fread(&magic, 4, 1, fp);
-if (magic != 0x324E4E43) { error("Invalid CNN v2 file"); }  /* "CNN2" */
-```
-
-**Version Check:**
-```c
-uint32_t version;
-fread(&version, 4, 1, fp);
-if (version != 1 && version != 2) { error("Unsupported version"); }
-uint32_t header_size = (version == 1) ? 16 : 20;
-```
-
-**Size Check:**
-```c
-expected_size = header_size + (num_layers * 20) + (total_weights * 2);
-if (file_size != expected_size) { error("Size mismatch"); }
-```
-
-**Weight Offset Sanity:**
-```c
-// Each layer's offset should match cumulative count
-uint32_t cumulative = 0;
-for (int i = 0; i < num_layers; i++) {
- if (layers[i].weight_offset != cumulative) { error("Invalid offset"); }
- cumulative += layers[i].weight_count;
-}
-if (cumulative != total_weights) { error("Total mismatch"); }
-```
-
----
-
-## Future Extensions
-
-**TODO: Flexible Feature Layout**
-
-Current limitation: Feature vector layout is hardcoded as `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]`.
-
-Proposed enhancement for version 3:
-- Add feature descriptor section to header
-- Specify feature count, types, and ordering
-- Support arbitrary 7D feature combinations (e.g., `[R, G, B, dx, dy, uv_x, bias]`)
-- Allow runtime shader generation based on descriptor
-- Enable experimentation without recompiling shaders
-
-Example descriptor format:
-```
-struct FeatureDescriptor {
- u32 feature_count; // Number of features (typically 7-8)
- u32 feature_types[8]; // Type enum per feature
- u32 feature_sources[8]; // Source enum (mip0, mip1, gradient, etc.)
- u32 reserved[8]; // Future use
-}
-```
-
-Benefits:
-- Training can experiment with different feature combinations
-- No shader recompilation needed
-- Single binary format supports multiple architectures
-- Easier A/B testing of feature effectiveness
-
----
-
-## Related Files
-
-- `training/export_cnn_v2_weights.py` - Binary export tool
-- `src/effects/cnn_v2_effect.cc` - C++ loader
-- `tools/cnn_v2_test/index.html` - WebGPU validator
-- `doc/CNN_V2.md` - Architecture design
diff --git a/doc/CNN_V2_DEBUG_TOOLS.md b/doc/CNN_V2_DEBUG_TOOLS.md
deleted file mode 100644
index 8d1289a..0000000
--- a/doc/CNN_V2_DEBUG_TOOLS.md
+++ /dev/null
@@ -1,143 +0,0 @@
-# CNN v2 Debugging Tools
-
-Tools for investigating CNN v2 mismatch between HTML tool and cnn_test.
-
----
-
-## Identity Weight Generator
-
-**Purpose:** Generate trivial .bin files with identity passthrough for debugging.
-
-**Script:** `training/gen_identity_weights.py`
-
-**Usage:**
-```bash
-# 1×1 identity (default)
-./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity.bin
-
-# 3×3 identity
-./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3
-
-# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc)
-./training/gen_identity_weights.py output.bin --mix
-
-# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3
-./training/gen_identity_weights.py output.bin --p47
-
-# Custom mip level
-./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2
-```
-
-**Output:**
-- Single layer, 12D→4D (4 input channels + 8 static features)
-- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3}
-- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow)
-- Static mode (--p47): Output Ch{i} = Input Ch{i+4} (static features only, visualizes p4-p7)
-- Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3)
-
-**Validation:**
-Load in HTML tool or cnn_test - output should match input (RGB only, ignoring static features).
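-
-Conceptually, the 1×1 identity layer is a 4×12 weight matrix whose first four columns form an identity and whose static-feature columns are zero (a sketch of the idea, not the generator script):
-
-```python
-import numpy as np
-
-def identity_weights_1x1(in_channels=12, out_channels=4):
-    """Output channel i copies input channel i; the 8 static-feature inputs get zero weight."""
-    w = np.zeros((out_channels, in_channels, 1, 1), dtype=np.float32)
-    for i in range(out_channels):
-        w[i, i, 0, 0] = 1.0
-    return w
-```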
-
----
-
-## Composited Layer Visualization
-
-**Purpose:** Save current layer view as single composited image (4 channels side-by-side, grayscale).
-
-**Location:** HTML tool - "Layer Visualization" panel
-
-**Usage:**
-1. Load image + weights in HTML tool
-2. Select layer to visualize (Static 0-3, Static 4-7, Layer 0, Layer 1, etc.)
-3. Click "Save Composited" button
-4. Downloads PNG: `composited_layer{N}_{W}x{H}.png`
-
-**Output:**
-- 4 channels stacked horizontally
-- Grayscale representation
-- Useful for comparing layer activations across tools
-
----
-
-## Debugging Strategy
-
-### Track a) Binary Conversion Chain
-
-**Hypothesis:** Conversion error in .bin ↔ base64 ↔ Float32Array
-
-**Test:**
-1. Generate identity weights:
- ```bash
- ./training/gen_identity_weights.py workspaces/main/weights/test_identity.bin
- ```
-
-2. Load in HTML tool - output should match input RGB
-
-3. If mismatch:
- - Check Python export: f16 packing in `export_cnn_v2_weights.py` line 105
- - Check HTML parsing: `unpackF16()` in `index.html` line 805-815
- - Check weight indexing: `get_weight()` shader function
-
-**Key locations:**
-- Python: `np.float16` → `view(np.uint32)` (line 105 of export script)
-- JS: `DataView` → `unpackF16()` → manual f16 decode (line 773-803)
-- WGSL: `unpack2x16float()` built-in (line 492 of shader)
-
-### Track b) Layer Visualization
-
-**Purpose:** Confirm layer outputs match between HTML and C++
-
-**Method:**
-1. Run identical input through both tools
-2. Save composited layers from HTML tool
-3. Compare with cnn_test output
-4. Use identity weights to isolate weight loading from computation
-
-### Track c) Trivial Test Case
-
-**Use identity weights to test:**
-- Weight loading (binary parsing)
-- Feature generation (static features)
-- Convolution (should be passthrough)
-- Output packing
-
-**Expected behavior:**
-- Input RGB → Output RGB (exact match)
-- Static features ignored (their weight entries are zero in the identity matrix)
-
----
-
-## Known Issues
-
-### ~~Layer 0 Visualization Scale~~ [FIXED]
-
-**Issue:** Layer 0 output displayed at 0.5× brightness (divided by 2).
-
-**Cause:** Line 1530 used `vizScale = 0.5` for all CNN layers, but Layer 0 is clamped [0,1] and doesn't need dimming.
-
-**Fix:** Use scale 1.0 for Layer 0 output (layerIdx=1), 0.5 only for middle layers (ReLU, unbounded).
-
-### Remaining Mismatch
-
-**Current:** HTML tool and cnn_test produce different outputs for same input/weights.
-
-**Suspects:**
-1. F16 unpacking difference (CPU vs GPU vs JS)
-2. Static feature generation (RGBD, UV, sin encoding)
-3. Convolution kernel iteration order
-4. Output packing/unpacking
-
-**Next steps:**
-1. Test with identity weights (eliminates weight loading)
-2. Compare composited layer outputs
-3. Add debug visualization for static features
-4. Hex dump comparison (first 8 pixels) - use `--debug-hex` flag in cnn_test
-
----
-
-## Related Documentation
-
-- `doc/CNN_V2.md` - CNN v2 architecture
-- `doc/CNN_V2_WEB_TOOL.md` - HTML tool documentation
-- `doc/CNN_TEST_TOOL.md` - cnn_test CLI tool
-- `training/export_cnn_v2_weights.py` - Binary export format
diff --git a/doc/CNN_V2_WEB_TOOL.md b/doc/CNN_V2_WEB_TOOL.md
deleted file mode 100644
index b6f5b0b..0000000
--- a/doc/CNN_V2_WEB_TOOL.md
+++ /dev/null
@@ -1,348 +0,0 @@
-# CNN v2 Web Testing Tool
-
-Browser-based WebGPU tool for validating CNN v2 inference with layer visualization and weight inspection.
-
-**Location:** `tools/cnn_v2_test/index.html`
-
----
-
-## Status (2026-02-13)
-
-**Working:**
-- ✅ WebGPU initialization and device setup
-- ✅ Binary weight file parsing (v1 and v2 formats)
-- ✅ Automatic mip-level detection from binary format v2
-- ✅ Weight statistics (min/max per layer)
-- ✅ UI layout with collapsible panels
-- ✅ Mode switching (Activations/Weights tabs)
-- ✅ Canvas context management (2D for weights, WebGPU for activations)
-- ✅ Weight visualization infrastructure (layer selection, grid layout)
-- ✅ Layer naming matches codebase convention (Layer 0, Layer 1, Layer 2)
-- ✅ Static features split visualization (Static 0-3, Static 4-7)
-- ✅ All layers visible including output layer (Layer 2)
-- ✅ Video playback support (MP4, WebM) with frame-by-frame controls
-- ✅ Video looping (automatic continuous playback)
-- ✅ Mip level selection (p0-p3 features at different resolutions)
-
-**Recent Changes (Latest):**
-- Binary format v2 support: Reads mip_level from 20-byte header
-- Backward compatible: v1 (16-byte header) → mip_level=0
-- Auto-update UI dropdown when loading weights with mip_level
-- Display mip_level in metadata panel
-- Code refactoring: Extracted FULLSCREEN_QUAD_VS shader (reused 3× across pipelines)
-- Added helper methods: `getDimensions()`, `setVideoControlsEnabled()`
-- Improved code organization with section headers and comments
-- Moved Mip Level selector to bottom of left sidebar (removed "Features (p0-p3)" label)
-- Added `loop` attribute to video element for automatic continuous playback
-
-**Previous Fixes:**
-- Fixed Layer 2 not appearing (was excluded from layerOutputs due to isOutput check)
-- Fixed canvas context switching (force clear before recreation)
-- Added Static 0-3 / Static 4-7 buttons to view all 8 static feature channels
-- Aligned naming with train_cnn_v2.py/.wgsl: Layer 0, Layer 1, Layer 2 (not Layer 1, 2, 3)
-- Disabled Static buttons in weights mode (no learnable weights)
-
-**Known Issues:**
-- Layer activation visualization may show black if texture data not properly unpacked
-- Weight kernel display depends on correct 2D context creation after canvas recreation
-
----
-
-## Architecture
-
-### File Structure
-- Single-file HTML tool (~1100 lines)
-- Embedded shaders: STATIC_SHADER, CNN_SHADER, DISPLAY_SHADER, LAYER_VIZ_SHADER
-- Shared WGSL component: FULLSCREEN_QUAD_VS (reused across render pipelines)
-- **Embedded default weights:** DEFAULT_WEIGHTS_B64 (base64-encoded binary v2)
- - Current: 4 layers (3×3, 5×5, 3×3, 3×3), 2496 f16 weights, mip_level=2
- - Source: `workspaces/main/weights/cnn_v2_weights.bin`
- - Updates: Re-encode binary with `base64 -i <file>` and update constant
-- Pure WebGPU (no external dependencies)
-
-### Code Organization
-
-**Recent Refactoring (2026-02-13):**
-- Extracted `FULLSCREEN_QUAD_VS` constant: Reused fullscreen quad vertex shader (2 triangles covering NDC)
-- Added helper methods to CNNTester class:
- - `getDimensions()`: Returns current source dimensions (video or image)
- - `setVideoControlsEnabled(enabled)`: Centralized video control enable/disable
-- Consolidated duplicate vertex shader code (used in mipmap generation, display, layer visualization)
-- Added section headers in JavaScript for better navigation
-- Improved inline comments explaining shader architecture
-
-**Benefits:**
-- Reduced code duplication (~40 lines saved)
-- Easier maintenance (single source of truth for fullscreen quad)
-- Clearer separation of concerns
-
-### Key Components
-
-**1. Weight Parsing**
-- Reads binary format v2: header (20B) + layer info (20B×N) + f16 weights
-- Backward compatible with v1: header (16B), mip_level defaults to 0
-- Computes min/max per layer via f16 unpacking
-- Stores `{ layers[], weights[], mipLevel, fileSize }`
-- Auto-sets UI mip-level dropdown from loaded weights
-
-**2. CNN Pipeline**
-- Static features computation (RGBD + UV + sin + bias → 8D packed)
-- Layer-by-layer convolution with storage buffer weights
-- Ping-pong buffers for intermediate results
-- Copy to persistent textures for visualization
-
-**3. Visualization Modes**
-
-**Activations Mode:**
-- 4 grayscale views per layer (channels 0-3 of up to 8 total)
-- WebGPU compute → unpack f16 → scale → grayscale
-- Auto-scale: Static features = 1.0, CNN layers = 0.2
-- Static features: Shows R,G,B,D (first 4 of 8: RGBD+UV+sin+bias)
-- CNN layers: Shows first 4 output channels
-
-**Weights Mode:**
-- 2D canvas rendering per output channel
-- Shows all input kernels horizontally
-- Normalized by layer min/max → [0, 1] → grayscale
-- 20px cells, 2px padding between kernels
-
-### Texture Management
-
-**Persistent Storage (layerTextures[]):**
-- One texture per layer output (static + all CNN layers)
-- `rgba32uint` format (packed f16 data)
-- `COPY_DST` usage for storing results
-
-**Compute Buffers (computeTextures[]):**
-- 2 textures for ping-pong computation
-- Reused across all layers
-- `COPY_SRC` usage for copying to persistent storage
-
-**Pipeline:**
-```
-Static pass → copy to layerTextures[0]
-For each CNN layer i:
- Compute (ping-pong) → copy to layerTextures[i+1]
-```
-
-### Layer Indexing
-
-**UI Layer Buttons:**
-- "Static" → layerOutputs[0] (7D input features)
-- "Layer 1" → layerOutputs[1] (CNN layer 1 output, uses weights.layers[0])
-- "Layer 2" → layerOutputs[2] (CNN layer 2 output, uses weights.layers[1])
-- "Layer N" → layerOutputs[N] (CNN layer N output, uses weights.layers[N-1])
-
-**Weights Table:**
-- "Layer 1" → weights.layers[0] (first CNN layer weights)
-- "Layer 2" → weights.layers[1] (second CNN layer weights)
-- "Layer N" → weights.layers[N-1]
-
-**Consistency:** Both UI and weights table use same numbering (1, 2, 3...) for CNN layers.
-
----
-
-## Known Issues
-
-### Issue #1: Layer Activations Show Black
-
-**Symptom:**
-- All 4 channel canvases render black
-- UV gradient test (debug mode 10) works
-- Raw packed data test (mode 11) shows black
-- Unpacked f16 test (mode 12) shows black
-
-**Diagnosis:**
-- Texture access works (UV gradient visible)
-- Texture data is all zeros (packed.x = 0)
-- Textures being read are empty
-
-**Root Cause:**
-- `copyTextureToTexture` operations may not be executing
-- Possible ordering issue (copies not submitted before visualization)
-- Alternative: textures created with wrong usage flags
-
-**Investigation Steps Taken:**
-1. Added `onSubmittedWorkDone()` wait before visualization
-2. Verified texture creation with `COPY_SRC` and `COPY_DST` flags
-3. Confirmed separate texture allocation per layer (no aliasing)
-4. Added debug shader modes to isolate issue
-
-**Next Steps:**
-- Verify encoder contains copy commands (add debug logging)
-- Check if compute passes actually write data (add known-value test)
-- Test copyTextureToTexture in isolation
-- Consider CPU readback to verify texture contents
-
-### Issue #2: Weight Visualization Empty
-
-**Symptom:**
-- Canvases created with correct dimensions (logged)
-- No visual output (black canvases)
-- Console logs show method execution
-
-**Potential Causes:**
-1. Weight indexing calculation incorrect
-2. Canvas not properly attached to DOM when rendering
-3. 2D context operations not flushing
-4. Min/max normalization producing black (all values equal?)
-
-**Debug Added:**
-- Comprehensive logging of dimensions, indices, ranges
-- Canvas context check before rendering
-
-**Next Steps:**
-- Add test rendering (fixed gradient) to verify 2D context works
-- Log sample weight values to verify data access
-- Check if canvas is visible in DOM inspector
-- Verify min/max calculation produces valid range
-
----
-
-## UI Layout
-
-### Header
-- Controls: Blend slider, Depth input, View mode display
-- Drop zone for .bin weight files
-
-### Content Area
-
-**Left Sidebar (300px):**
-1. Drop zone for .bin weight files
-2. Weights Info panel (file size, layer table with min/max)
-3. Weights Visualization panel (per-layer kernel display)
-4. **Mip Level selector** (bottom) - Select p0/p1/p2 for static features
-
-**Main Canvas (center):**
-- CNN output display with video controls (Play/Pause, Frame ◄/►)
-- Supports both PNG images and video files (MP4, WebM)
-- Video loops automatically for continuous playback
-
-**Right Sidebar (panels):**
-1. **Layer Visualization Panel** (top, flex: 1)
- - Layer selection buttons (Static 0-3, Static 4-7, Layer 0, Layer 1, ...)
- - 2×2 grid of channel views (grayscale activations)
- - 4× zoom view at bottom
-
-### Footer
-- Status line (GPU timing, dimensions, mode)
-- Console log (scrollable, color-coded)
-
----
-
-## Shader Details
-
-### LAYER_VIZ_SHADER
-
-**Purpose:** Display single channel from packed layer texture
-
-**Inputs:**
-- `@binding(0) layer_tex: texture_2d<u32>` - Packed f16 layer data
-- `@binding(1) viz_params: vec2<f32>` - (channel_idx, scale)
-
-**Debug Modes:**
-- Channel 10: UV gradient (texture coordinate test)
-- Channel 11: Raw packed u32 data
-- Channel 12: First unpacked f16 value
-
-**Normal Operation:**
-- Unpack all 8 f16 channels from rgba32uint
-- Select channel by index (0-7)
-- Apply scale factor (1.0 for static, 0.2 for CNN)
-- Clamp to [0, 1] and output grayscale
-
-**Scale Rationale:**
-- Static features (RGBD, UV): already in [0, 1] range
-- CNN activations: post-ReLU [0, ~5], need scaling for visibility
-
----
-
-## Binary Weight Format
-
-See `doc/CNN_V2_BINARY_FORMAT.md` for complete specification.
-
-**Quick Summary:**
-- Header: 16 bytes (v1) or 20 bytes (v2, adds mip_level): magic, version, layer count, total weights
-- Layer info: 20 bytes × N (kernel size, channels, offsets)
-- Weights: Packed f16 pairs as u32
-
----
-
-## Testing Workflow
-
-### Load & Parse
-1. Drop PNG image → displays original
-2. Drop .bin weights → parses and shows info table
-3. Auto-runs CNN pipeline
-
-### Verify Pipeline
-1. Check console for "Running CNN pipeline"
-2. Verify "Completed in Xms"
-3. Check "Layer visualization ready: N layers"
-
-### Debug Activations
-1. Select "Activations" tab
-2. Click layer buttons to switch
-3. Check console for texture/canvas logs
-4. If black: note which debug modes work (UV vs data)
-
-### Debug Weights
-1. Select "Weights" tab
-2. Click Layer 0 or Layer 1 (Static features have no learnable weights)
-3. Check console for "Visualizing Layer N weights"
-4. Check canvas dimensions logged
-5. Verify weight range is non-trivial (not [0, 0])
-
----
-
-## Integration with Main Project
-
-**Training Pipeline:**
-```bash
-# Generate weights
-./training/train_cnn_v2.py --export-binary
-
-# Test in browser
-open tools/cnn_v2_test/index.html
-# Drop: workspaces/main/cnn_v2_weights.bin
-# Drop: training/input/test.png
-```
-
-**Validation:**
-- Compare against demo CNNv2Effect (visual check)
-- Verify layer count matches binary file
-- Check weight ranges match training logs
-
----
-
-## Future Enhancements
-
-- [ ] Fix layer activation visualization (black texture issue)
-- [ ] Fix weight kernel display (empty canvas issue)
-- [ ] Add per-channel auto-scaling (compute min/max from visible data)
-- [ ] Export rendered outputs (download PNG)
-- [ ] Side-by-side comparison with original
-- [ ] Heatmap mode (color-coded activations)
-- [ ] Weight statistics overlay (mean, std, sparsity)
-- [ ] Batch processing (multiple images in sequence)
-- [ ] Integration with Python training (live reload)
-
----
-
-## Code Metrics
-
-- Total lines: ~1100
-- JavaScript: ~700 lines
-- WGSL shaders: ~300 lines
-- HTML/CSS: ~100 lines
-
-**Dependencies:** None (pure WebGPU + HTML5)
-
----
-
-## Related Files
-
-- `doc/CNN_V2.md` - CNN v2 architecture and design
-- `doc/CNN_TEST_TOOL.md` - C++ offline testing tool (deprecated)
-- `training/train_cnn_v2.py` - Training script with binary export
-- `workspaces/main/cnn_v2_weights.bin` - Trained weights
diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md
index 55fac50..8d30cca 100644
--- a/doc/COMPLETED.md
+++ b/doc/COMPLETED.md
@@ -67,7 +67,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents.
- **Changes**:
- Added `get_common_uniforms()` helper to Effect base class
- Refactored all render()/compute() signatures from 5 parameters to single `CommonPostProcessUniforms&`
- - Fixed uninitialized uniforms in CircleMaskEffect and CNNEffect
+ - Fixed uninitialized uniforms in CircleMaskEffect and CNNv1Effect
- Updated 19 effect implementations + headers
- Fixed WGSL syntax error in FlashEffect (u.audio_intensity → audio_intensity)
- **Impact**:
@@ -93,7 +93,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents.
- All 36 tests pass (100%)
- Processes 64×64 test image successfully
- Ready for ground-truth validation vs Python training script
- - Documented in `doc/CNN_TEST_TOOL.md`
+ - Documented in `cnn_v1/docs/CNN_TEST_TOOL.md`
## Recently Completed (February 10, 2026)
@@ -103,7 +103,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents.
- Created `BindGroupLayoutBuilder` and `BindGroupBuilder` for declarative bind group creation
- Created `RenderPipelineBuilder` to simplify pipeline setup with ShaderComposer integration
- Created `SamplerCache` singleton to deduplicate sampler instances
- - Refactored `post_process_helper.cc`, `cnn_effect.cc`, `rotating_cube_effect.cc`
+ - Refactored `post_process_helper.cc`, `cnn_v1_effect.cc`, `rotating_cube_effect.cc`
- **Result**:
- Bind group creation: 19 instances reduced from 14→4 lines each
- Pipeline creation: 30-50 lines reduced to 8 lines
diff --git a/doc/HOWTO.md b/doc/HOWTO.md
index 0dc9ec7..4cafaa2 100644
--- a/doc/HOWTO.md
+++ b/doc/HOWTO.md
@@ -100,7 +100,7 @@ make run_util_tests # Utility tests
Extracts patches at salient points, trains on center pixels only (matches WGSL sliding window):
```bash
# Train with 32×32 patches at detected corners/edges
-./training/train_cnn.py \
+./cnn_v1/training/train_cnn.py \
--input training/input/ --target training/output/ \
--patch-size 32 --patches-per-image 64 --detector harris \
--layers 3 --kernel_sizes 3,5,3 --epochs 5000 --batch_size 16 \
@@ -117,7 +117,7 @@ Extracts patches at salient points, trains on center pixels only (matches WGSL s
### Full-Image
Processes entire image with sliding window (matches WGSL):
```bash
-./training/train_cnn.py \
+./cnn_v1/training/train_cnn.py \
--input training/input/ --target training/output/ \
--layers 3 --kernel_sizes 3,5,3 --epochs 10000 --batch_size 8 \
--checkpoint-every 1000
@@ -126,10 +126,10 @@ Processes entire image with sliding window (matches WGSL):
### Export & Validation
```bash
# Generate shaders from checkpoint
-./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth
+./cnn_v1/training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth
# Generate ground truth (sliding window, no tiling)
-./training/train_cnn.py --infer input.png \
+./cnn_v1/training/train_cnn.py --infer input.png \
--export-only checkpoints/checkpoint_epoch_5000.pth \
--output ground_truth.png
```
@@ -145,31 +145,31 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
**Complete Pipeline** (recommended):
```bash
# Train → Export → Build → Validate (default config)
-./scripts/train_cnn_v2_full.sh
+./cnn_v2/scripts/train_cnn_v2_full.sh
# Rapid debug (1 layer, 3×3, 5 epochs)
-./scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin
+./cnn_v2/scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin
# Custom training parameters
-./scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100
+./cnn_v2/scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100
# Custom architecture
-./scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1
+./cnn_v2/scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1
# Custom output path
-./scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin
+./cnn_v2/scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin
# Grayscale loss (compute loss on luminance instead of RGBA)
-./scripts/train_cnn_v2_full.sh --grayscale-loss
+./cnn_v2/scripts/train_cnn_v2_full.sh --grayscale-loss
# Custom directories
-./scripts/train_cnn_v2_full.sh --input training/input --target training/target_2
+./cnn_v2/scripts/train_cnn_v2_full.sh --input training/input --target training/target_2
# Full-image mode (instead of patch-based)
-./scripts/train_cnn_v2_full.sh --full-image --image-size 256
+./cnn_v2/scripts/train_cnn_v2_full.sh --full-image --image-size 256
# See all options
-./scripts/train_cnn_v2_full.sh --help
+./cnn_v2/scripts/train_cnn_v2_full.sh --help
```
**Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector).
@@ -184,33 +184,33 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
**Validation Only** (skip training):
```bash
# Use latest checkpoint
-./scripts/train_cnn_v2_full.sh --validate
+./cnn_v2/scripts/train_cnn_v2_full.sh --validate
# Use specific checkpoint
-./scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth
+./cnn_v2/scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth
```
**Manual Training:**
```bash
# Default config
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--epochs 100 --batch-size 16 --checkpoint-every 5
# Custom architecture (per-layer kernel sizes)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--kernel-sizes 1,3,5 \
--epochs 5000 --batch-size 16
# Mip-level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--mip-level 1 \
--epochs 100 --batch-size 16
# Grayscale loss (compute loss on luminance Y = 0.299*R + 0.587*G + 0.114*B)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--grayscale-loss \
--epochs 100 --batch-size 16
@@ -236,7 +236,7 @@ Use `--quiet` for streamlined output in scripts (used automatically by train_cnn
```
-**Validation:** Use HTML tool (`tools/cnn_v2_test/index.html`) for CNN v2 validation. See `doc/CNN_V2_WEB_TOOL.md`.
+**Validation:** Use HTML tool (`cnn_v2/tools/cnn_v2_test/index.html`) for CNN v2 validation. See `cnn_v2/docs/CNN_V2_WEB_TOOL.md`.
---
@@ -323,11 +323,11 @@ See `doc/ASSET_SYSTEM.md` and `doc/WORKSPACE_SYSTEM.md`.
**Status:**
- **CNN v2:** ✅ Fully functional, matches CNNv2Effect
-- **CNN v1:** ⚠️ Produces incorrect output, use CNNEffect in demo for validation
+- **CNN v1:** ⚠️ Produces incorrect output, use CNNv1Effect in demo for validation
**Note:** `--weights` loads layer count and kernel sizes from the binary file, overriding `--layers` and forcing CNN v2.
-See `doc/CNN_TEST_TOOL.md` for full documentation.
+See `cnn_v1/docs/CNN_TEST_TOOL.md` for full documentation.
---