Diffstat (limited to 'doc')
| -rw-r--r-- | doc/AUDIO_WAV_DRIFT_BUG.md | 185 |
| -rw-r--r-- | doc/AUXILIARY_TEXTURE_INIT.md | 2 |
| -rw-r--r-- | doc/BUILD.md | 8 |
| -rw-r--r-- | doc/CMAKE_MODULES.md | 22 |
| -rw-r--r-- | doc/CNN.md | 79 |
| -rw-r--r-- | doc/CNN_BIAS_FIX_2026-02.md | 85 |
| -rw-r--r-- | doc/CNN_DEBUG.md | 43 |
| -rw-r--r-- | doc/CNN_EFFECT.md | 400 |
| -rw-r--r-- | doc/CNN_FLATTEN_ANALYSIS.md | 189 |
| -rw-r--r-- | doc/CNN_RGBD_GRAYSCALE_SUMMARY.md | 136 |
| -rw-r--r-- | doc/CNN_TEST_TOOL.md | 244 |
| -rw-r--r-- | doc/CNN_V2.md | 813 |
| -rw-r--r-- | doc/CNN_V2_BINARY_FORMAT.md | 235 |
| -rw-r--r-- | doc/CNN_V2_DEBUG_TOOLS.md | 143 |
| -rw-r--r-- | doc/CNN_V2_WEB_TOOL.md | 348 |
| -rw-r--r-- | doc/COMPLETED.md | 6 |
| -rw-r--r-- | doc/HEADLESS_MODE.md | 10 |
| -rw-r--r-- | doc/HOWTO.md | 70 |
18 files changed, 268 insertions, 2750 deletions
diff --git a/doc/AUDIO_WAV_DRIFT_BUG.md b/doc/AUDIO_WAV_DRIFT_BUG.md new file mode 100644 index 0000000..050dd49 --- /dev/null +++ b/doc/AUDIO_WAV_DRIFT_BUG.md @@ -0,0 +1,185 @@ +# Audio WAV Drift Bug Investigation + +**Date:** 2026-02-15 +**Status:** ACCEPTABLE (to be continued) +**Current State:** -150ms drift at beat 64b, no glitches + +## Problem Statement + +Timeline viewer shows progressive visual drift between audio waveform and beat grid markers: +- At beat 8 (5.33s @ 90 BPM): kick waveform appears **-30ms early** (left of grid line) +- At beat 60 (40.0s @ 90 BPM): kick waveform appears **-180ms early** (left of grid line) + +Progressive drift rate: ~4.3ms/second + +## Initial Hypotheses (Ruled Out) + +### 1. ❌ Viewer Display Bug +- **Tested:** Sample rate detection in viewer (32kHz correctly detected) +- **Result:** Viewer BPM = 90 (correct), `pixelsPerSecond` mapping correct +- **Conclusion:** Not a viewer rendering issue + +### 2. ❌ WAV File Content Error +- **Tested:** Direct WAV sample position analysis via Python +- **Result:** Actual kick positions in WAV file: + ``` + Beat | Expected(s) | WAV(s) | Drift + -----|-------------|----------|------- + 8 | 5.3333 | 5.3526 | +19ms (LATE) + 60 | 40.0000 | 39.9980 | -2ms (nearly perfect) + ``` +- **Conclusion:** WAV file samples are at correct positions; visual drift not in WAV content + +### 3. ❌ Frame Truncation (Partial Cause) +- **Issue:** `frames_per_update = (int)(32000 * (1/60))` = 533 frames (truncates 0.333) +- **Impact:** Loses 0.333 frames/update = 10.4μs/frame +- **Total drift over 40s:** 2400 frames × 10.4μs = **25ms** +- **Conclusion:** Explains 25ms of 180ms, but not sufficient + +## Root Cause Discovery + +### Investigation Method +Added debug tracking to `audio_render_ahead()` (audio.cc:115): +```cpp +static int64_t g_total_render_calls = 0; +static int64_t g_total_frames_rendered = 0; + +// Track actual frames rendered vs expected +const int64_t actual_rendered = frames_after - frames_before; +g_total_render_calls++; +g_total_frames_rendered += actual_rendered; +``` + +### Critical Finding: Over-Rendering + +**WAV dump @ 40s (2400 iterations):** +``` +Expected frames: 1,279,200 (2400 × 533) +Actual rendered: 1,290,933 +Difference: +11,733 frames = +366.66ms EXTRA audio +``` + +**Pattern observed every 10s (600 calls @ 60fps):** +``` +[RENDER_DRIFT] calls=600 expect=319800 actual=331533 drift=-366.66ms +[RENDER_DRIFT] calls=1200 expect=639600 actual=651333 drift=-366.66ms +[RENDER_DRIFT] calls=1800 expect=959400 actual=971133 drift=-366.66ms +[RENDER_DRIFT] calls=2400 expect=1279200 actual=1290933 drift=-366.66ms +``` + +### Why This Causes Visual Drift + +**WAV Dump Flow (main.cc:289-302):** +1. `fill_audio_buffer(update_dt)` → calls `audio_render_ahead()` + - Renders audio into ring buffer + - **BUG:** Renders MORE than `chunk_frames` due to buffer management loop +2. `ring_buffer->read(chunk_buffer, samples_per_update)` + - Reads exactly 533 frames from ring buffer +3. `wav_backend.write_audio(chunk_buffer, samples_per_update)` + - Writes exactly 533 frames to WAV + +**Result:** Ring buffer accumulates 11,733 extra frames over 40s. 
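For reference, the direct WAV sample-position analysis mentioned under hypothesis 2 above can be reproduced with a short script along the following lines. This is a hypothetical sketch, not the original analysis script: the file name, the 16-bit PCM assumption, and the crude onset threshold are placeholders.

```python
# Hypothetical sketch: measure where kick transients sit in the dumped WAV
# versus the nominal 90 BPM beat grid (assumes 16-bit PCM at 32 kHz).
import wave
import numpy as np

BPM, SAMPLE_RATE = 90.0, 32000

with wave.open("dump.wav", "rb") as f:
    assert f.getframerate() == SAMPLE_RATE
    raw = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    mono = raw.reshape(-1, f.getnchannels()).mean(axis=1)

env = np.abs(mono) / np.abs(mono).max()  # crude amplitude envelope
for beat in (8, 60):
    expected = beat * 60.0 / BPM                      # nominal beat time (s)
    lo = int((expected - 0.25) * SAMPLE_RATE)         # search +/- 250 ms around grid line
    hi = int((expected + 0.25) * SAMPLE_RATE)
    window = env[lo:hi]
    onset = lo + int(np.argmax(window > 0.5 * window.max()))  # first strong transient
    drift_ms = (onset / SAMPLE_RATE - expected) * 1000.0
    print(f"beat {beat}: expected {expected:.4f}s  found {onset / SAMPLE_RATE:.4f}s  drift {drift_ms:+.1f} ms")
```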
+ +### Timing Shift Mechanism + +Ring buffer acts as FIFO queue with 400ms lookahead: +- Initially fills to 400ms (12,800 frames) +- Each iteration: renders 533.333 (actual: ~536) frames, reads 533 frames +- Net accumulation: ~3 frames/iteration +- After 2400 iterations: 12,800 + (2400 × 3) = 20,000 frames buffer size + +Events trigger at correct `music_time` but get written to ring buffer position that's ahead. When WAV reads from buffer, it reads from older position, causing events to appear EARLIER in WAV file than their nominal music_time. + +## Technical Details + +### Code Locations + +**Truncation point 1:** `main.cc:282` +```cpp +const int frames_per_update = (int)(32000 * update_dt); // 533.333 → 533 +``` + +**Truncation point 2:** `audio.cc:105` +```cpp +const int chunk_frames = (int)(dt * RING_BUFFER_SAMPLE_RATE); // 533.333 → 533 +``` + +**Over-render loop:** `audio.cc:112-229` +```cpp +while (true) { + // Keeps rendering until buffer >= target_lookahead + // Renders MORE than chunk_frames due to buffer management + ... +} +``` + +### Why 366ms Per 10s? + +At 60fps, 10s = 600 iterations: +- Expected: 600 × 533 = 319,800 frames +- Actual: 331,533 frames +- Extra: 11,733 frames ÷ 600 = **19.55 frames extra per iteration** + +But `chunk_frames = 533`, so we render 533 + 19.55 = **~552.55 frames per call** on average. + +Discrepancy from 533.333 expected: 552.55 - 533.333 = **19.22 frames/call over-render** + +This 19.22 frames = 0.6ms per iteration accumulates to 366ms per 10s. + +## Proposed Fix + +### Option 1: Match Render to Read (Recommended) +In WAV dump mode, ensure `audio_render_ahead()` renders exactly `frames_per_update`: +```cpp +// main.cc WAV dump loop +const int frames_per_update = (int)(32000 * update_dt); +audio_render_ahead(g_music_time, update_dt, /* force_exact_amount */ frames_per_update); +``` + +Modify `audio_render_ahead()` to accept optional exact frame count and render precisely that amount instead of filling to target lookahead. + +### Option 2: Round Instead of Truncate +```cpp +const int frames_per_update = (int)(32000 * update_dt + 0.5f); // Round: 533.333 → 533 +``` +Reduces truncation error but doesn't solve over-rendering. + +### Option 3: Use Double Precision + Accumulator +```cpp +static double accumulated_time = 0.0; +accumulated_time += update_dt; +const int frames_to_render = (int)(accumulated_time * 32000); +accumulated_time -= frames_to_render / 32000.0; +``` +Eliminates cumulative truncation error. + +## Related Issues + +- `tracker.cc:237` TODO comment mentions "180ms drift over 63 beats" - this is the same bug +- Ring buffer lookahead (400ms) is separate from drift (not the cause) +- Web Audio API `outputLatency` in viewer is unrelated (affects playback, not waveform display) + +## Verification Steps + +1. ✅ Measure WAV sample positions directly (Python script) +2. ✅ Add render tracking debug output +3. ✅ Confirm over-rendering (366ms per 10s) +4. ✅ Implement partial fix (bypass ring buffer, direct render) +5. ⚠️ Current result: -150ms drift at beat 64b (acceptable, needs further work) + +## Current Implementation (main.cc:286-308) + +**WAV dump now bypasses ring buffer entirely:** +1. **Frame accumulator**: Calculates exact frames per update (no truncation) +2. **Direct render**: Calls `synth_render()` directly with exact frame count +3. **No ring buffer**: Eliminates buffer management complexity +4. **Result**: No glitches, but -150ms drift remains + +**Remaining issue:** Drift persists despite direct rendering. 
Likely related to tempo scaling or audio engine state management. Acceptable for now. + +## Notes + +- Viewer waveform rendering is CORRECT - displays WAV content accurately +- Bug is in demo's WAV generation, specifically ring buffer management in `audio_render_ahead()` +- Progressive nature of drift (30ms → 180ms) indicates accumulation, not one-time offset +- Fix must ensure rendered frames = read frames in WAV dump mode diff --git a/doc/AUXILIARY_TEXTURE_INIT.md b/doc/AUXILIARY_TEXTURE_INIT.md index 9cac70b..036cbf7 100644 --- a/doc/AUXILIARY_TEXTURE_INIT.md +++ b/doc/AUXILIARY_TEXTURE_INIT.md @@ -18,7 +18,7 @@ entry.seq->resize(width, height); // Too late - textures already created **Affected:** - CircleMaskEffect (circle_mask texture) -- CNNEffect (captured_frame texture) +- CNNv1Effect (captured_frame texture) - RotatingCubeEffect (consumer, hardcoded resolution in uniforms) --- diff --git a/doc/BUILD.md b/doc/BUILD.md index d3434f4..fd0c3d9 100644 --- a/doc/BUILD.md +++ b/doc/BUILD.md @@ -95,9 +95,11 @@ Use Xcode Metal debugger for shader performance analysis. ## Build System Internals **Asset Dependency Tracking:** -- CMake tracks 42 demo + 17 test assets -- Editing shaders/audio/sequences auto-triggers rebuild -- Asset lists parsed to extract individual file dependencies +- CMake tracks 42 demo + 17 test assets split into 4 categories +- **Granular rebuilds:** Changing a shader only rebuilds shader-dependent targets +- **Categories:** `shaders` (.wgsl), `audio` (.spec, .track), `models` (.obj), `data` (.bin, .png, PROC) +- Asset lists parsed at configure time to extract category-specific file dependencies +- Unified output (`assets_data.cc`) avoids duplicate symbols while preserving granular tracking **Header Organization:** - `asset_manager_dcl.h`: Forward declarations diff --git a/doc/CMAKE_MODULES.md b/doc/CMAKE_MODULES.md index 2ea7d00..9f71d91 100644 --- a/doc/CMAKE_MODULES.md +++ b/doc/CMAKE_MODULES.md @@ -90,6 +90,18 @@ Creates an executable for the demo (legacy macro). ### `add_demo_test(NAME TEST_NAME LABEL SOURCES...)` Creates a test executable and registers it with CTest (legacy macro). +### `demo_add_asset_deps(TARGET CATEGORY)` +Adds asset category dependencies to a target for granular rebuilds. + +**Categories:** `shaders`, `audio`, `models`, `data`, `all`, `test` + +**Example:** +```cmake +demo_add_asset_deps(test_synth audio) # Only depends on audio assets +demo_add_asset_deps(test_shader_compilation shaders) # Only depends on shaders +demo_add_asset_deps(demo64k all) # Depends on all asset categories +``` + --- ## Conditional Inclusion @@ -107,12 +119,13 @@ This reduces parse time when building without tests/tools. ## Adding New Components ### New Effect -- Add sources to `cmake/DemoSourceLists.cmake` (GPU_SOURCES list) -- No other CMake changes needed +- Add sources to `cmake/DemoSourceLists.cmake` (`COMMON_GPU_EFFECTS` list) +- No other CMake changes needed (automatically included in headless and normal modes) ### New Test -- Add to `cmake/DemoTests.cmake` using `demo_add_test_with_deps()` -- Use LINK and DEPENDS parameters for libraries/assets +- Add to `cmake/DemoTests.cmake` using `add_demo_test()` +- Use `demo_add_asset_deps()` to specify asset category dependencies (e.g., `shaders`, `audio`) +- This enables granular rebuilds—only changed asset categories trigger test recompilation ### New Library - Add to `cmake/DemoLibraries.cmake` with appropriate dependencies @@ -132,6 +145,7 @@ This reduces parse time when building without tests/tools. 4. 
**Reusability:** Shared macros—eliminate 200+ lines of repetition 5. **Clarity:** Top-level CMakeLists.txt is 54-line roadmap 6. **Scalability:** Easy to add new tests/tools/libraries without bloating main file +7. **Granular Rebuilds:** Asset categories enable 3-5× faster incremental builds for typical changes --- diff --git a/doc/CNN.md b/doc/CNN.md deleted file mode 100644 index 2dc3362..0000000 --- a/doc/CNN.md +++ /dev/null @@ -1,79 +0,0 @@ -# Convolutional Neural Net Shader (CNN) post-processing - -**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass) - -## Idea - -Have the input 3d scene be processed by a multi-layer CNN trained on the side. -Input: some rendered scene. -Output: 'stylized' scene with CNN post-processing. - -**See `doc/CNN_EFFECT.md` for implementation details, usage, and API reference.** - -## Shader implementation - -### input / output - -Need 1 texture buffer per CNN layer. -Input (r,g,b,1/z) for layer 0 (render 3d scene), or output from layer N-1 for layer N. -output: (r,g,b, alpha). Don't need the 1/z information (can be fetched from input) - -### size of one layer - -Notation: -S: the number of input samples from layer N-1. -Example: 3x3 input -> S = 3x3 = 9. - -Each S samples is 4 values (r,g,b, w=1/z). - -Each sample is processed by a mat4 matrix. 4 input => 4 output. - -Weight matrix = S x mat4 - -Final bias: 4 values. - -WGSL code example: See file CNN.shader - -### Layers - -we need 3 or 4 layer ? -Several different shaders for each layer. -Ping-pong for input/output texture buffer between each layers? - -## Implementation Status - -**Completed:** -- ✅ Modular WGSL shader architecture (6 snippet files) -- ✅ CNNEffect C++ class (single-layer rendering) -- ✅ ShaderComposer integration (#include resolution) -- ✅ Asset registration (7 new shader assets) -- ✅ Test coverage (test_demo_effects.cc) -- ✅ Placeholder identity weights for testing - -**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total** - -**Pending:** -- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights -- ⏳ Multi-layer rendering with ping-pong textures -- ⏳ Weight quantization for size optimization - ---- - -## Training (To Be Implemented) - -The layer weight/bias data are hard-coded in the shaders. -Training workflow: - -1. Prepare image pairs (before: raw render, after: target style) -2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png` -3. Script generates `cnn_weights_generated.wgsl` -4. Rebuild: `cmake --build build -j4` - -**Reference:** File `CNN.py` contains training example (needs adaptation). - -Need a repository of reference image pairs (before/after) for training and validation. -Each input image is randomly sampled into 3×3 patch of (r,g,b,1/z) input samples. -And trained to match the (r,g,b,a) output. - -Training generates the .wgsl code for layers' shaders. - diff --git a/doc/CNN_BIAS_FIX_2026-02.md b/doc/CNN_BIAS_FIX_2026-02.md deleted file mode 100644 index 26db8eb..0000000 --- a/doc/CNN_BIAS_FIX_2026-02.md +++ /dev/null @@ -1,85 +0,0 @@ -# CNN Bias Accumulation Fix (2026-02-11) - -## Problem -Bias was being added multiple times in shader convolution loops (once per kernel position), causing mismatch between PyTorch training and WGSL inference. - -## Root Cause -**Location**: `training/train_cnn.py:381, 398` - -When exporting weights to WGSL, bias was replicated for every kernel position. 
The shader loops through positions doing: -```wgsl -sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0 -``` - -For 3×3 kernel (9 positions), bias added 9×. For 5×5, added 25×. - -## Fix -Divide bias by `num_positions` during export: -```python -# Final layer (7→1) -v1.append(f"{bias[0] / num_positions:.6f}") - -# Inner layers (7→4) -v1.append(f"{bias[out_c] / num_positions:.6f}") -``` - -Shader accumulates bias × num_positions = original bias (correct). - ---- - -## Additional Improvements - -### 1. RGBA Output Support -**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input: -```python -alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy() -output_rgba = np.concatenate([output, alpha], axis=2) -Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA') -``` - -Intermediate layers also save RGBA if 4-channel. - -### 2. Debug Hex Output -**Both tools** support `--debug-hex` to print first 8 pixels as hex: -```bash -./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex -./build/cnn_test input.png output.png --debug-hex -``` - -Output format: `[0] 0xRRGGBBAA` for pixel-level comparison. - -### 3. Cleanup -Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving). - ---- - -## Files Modified -- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex -- `tools/cnn_test.cc`: --debug-hex, remove linear_png -- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias - -## Testing -```bash -# Train with fixed export -./training/train_cnn.py --input training/input/ --target training/output/ \ - --layers 3 --kernel_sizes 3,3,3 --epochs 5000 - -# Generate ground truth -./training/train_cnn.py --infer input.png --export-only checkpoint.pth \ - --output ground_truth.png --debug-hex - -# Run GPU tool -./build/cnn_test input.png tool_output.png --debug-hex - -# Compare hex output for first 8 pixels -``` - ---- - -## Status -✅ Bias accumulation bug fixed -✅ RGBA output with alpha preservation -✅ Debug hex comparison tool -✅ Weights regenerated - -Commit: `8ff8c56` diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md deleted file mode 100644 index ba220a0..0000000 --- a/doc/CNN_DEBUG.md +++ /dev/null @@ -1,43 +0,0 @@ -# CNN Effect Black Screen Bug - Resolution (2026-02) - -## Problem -CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started. - -## Root Causes - -### Bug 1: Framebuffer Capture Timing -**Location**: `src/gpu/effect.cc` -**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene). -**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run. - -### Bug 2: Missing Uniforms Update ⚠️ CRITICAL -**Location**: `src/effects/cnn_effect.cc` -**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black. -**Fix**: Added uniforms update before bind group creation (lines 132-142): -```cpp -const CommonPostProcessUniforms u = { - .resolution = {(float)width_, (float)height_}, - .aspect_ratio = (float)width_ / (float)height_, - .time = 0.0f, - .beat = 0.0f, - .audio_intensity = 0.0f, -}; -uniforms_.update(ctx_.queue, u); -``` - -## Key Lessons - -1. 
**All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters -2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts -3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors -4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes - -## Files Modified -- `src/gpu/effect.cc`: Lines 308-346 (capture timing) -- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update) - -## Verification -Test: `demo64k --seek 11.5` -- ✅ Scene visible with RotatingCube -- ✅ CNN stylization applied -- ✅ All 3 layers process with correct original texture reference diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md deleted file mode 100644 index 40f095e..0000000 --- a/doc/CNN_EFFECT.md +++ /dev/null @@ -1,400 +0,0 @@ -# CNN Post-Processing Effect - -Neural network-based stylization for rendered scenes. - ---- - -## Overview - -Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead. - -**Key Features:** -- Position-aware layer 0 (coordinate input for vignetting, edge effects) -- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining -- Original input available to all layers via framebuffer capture -- Configurable final blend with original scene -- Modular WGSL shader architecture -- Hardcoded weights (trained offline via PyTorch) -- ~5-8 KB binary footprint - ---- - -## Architecture - -### RGBD → Grayscale Pipeline - -**Input:** RGBD (RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Architecture:** -- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD -- **Final layer (N-1):** Conv2d(7→1) - output grayscale - -```wgsl -// Inner layers: 7→4 (RGBD output, vec4-optimized) -fn cnn_conv3x3_7to4( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, # Grayscale [-1,1] - weights: array<vec4<f32>, 72> # 9 pos × 4 ch × 2 vec4 (8 floats per filter) -) -> vec4<f32> - -// Final layer: 7→1 (grayscale output, vec4-optimized) -fn cnn_conv3x3_7to1( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, - weights: array<vec4<f32>, 18> # 9 pos × 2 vec4 (8 floats per filter) -) -> f32 -``` - -**Input normalization:** -- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1] -- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1] -- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))` -- **Inter-layer data** stays in [-1,1] (no denormalization) -- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1] - -**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer - -### Multi-Layer Architecture - -CNNEffect supports multi-layer networks via automatic effect chaining: - -1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7` -2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2) -3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"` -4. **Original input binding**: All layers access original via `@binding(4)` -5. 
**Final blend**: Last layer blends result with original: `mix(original, result, 0.7)` - -**Framebuffer Capture API:** -- `Effect::needs_framebuffer_capture()` - effect requests pre-capture -- MainSequence automatically blits input → `"captured_frame"` auxiliary texture -- Generic mechanism usable by any effect - -### File Structure - -``` -src/effects/ - cnn_effect.h/cc # CNNEffect class + framebuffer capture - -workspaces/main/shaders/cnn/ - cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu - cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware) - cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware) - cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware) - cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py) - cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py) -``` - ---- - -## Training Workflow - -### 1. Prepare Training Data - -Input/target image pairs: -``` -training/input/img_000.png # RGBA (RGB + alpha) -training/output/img_000.png # Grayscale target -``` - -**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily. - -### 2. Train Network - -**Patch-based (Recommended)** - Preserves natural pixel scale: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --patch-size 32 --patches-per-image 64 --detector harris \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 5000 --batch-size 16 --checkpoint-every 1000 -``` - -**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges) - -**Full-image (Legacy)** - Resizes to 256×256: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 10000 --batch-size 8 --checkpoint-every 1000 -``` - -**Auto-generates:** -- `cnn_weights_generated.wgsl` - Weight arrays -- `cnn_layer.wgsl` - Layer shader - -### 3. Export & Validate - -```bash -# Export shaders -./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth - -# Generate ground truth -./training/train_cnn.py --infer input.png \ - --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png -``` - -### 4. Rebuild Demo - -```bash -cmake --build build -j4 && ./build/demo64k -``` - ---- - -## Usage - -### C++ Integration - -**Single layer (manual):** -```cpp -#include "effects/cnn_effect.h" - -CNNEffectParams p; -p.layer_index = 0; -p.total_layers = 1; -p.blend_amount = 1.0f; -auto cnn = std::make_shared<CNNEffect>(ctx, p); -timeline.add_effect(cnn, start_time, end_time); -``` - -**Multi-layer (automatic via timeline compiler):** - -Use timeline syntax - `seq_compiler` expands to multiple instances. 
- -### Timeline Examples - -**Single-layer CNN (full stylization):** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=1 -``` - -**Multi-layer CNN with blend:** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7 -``` - -Expands to: -```cpp -// Layer 0 (captures original, blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 0; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1); -} -// Layer 1 (blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 1; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2); -} -// Layer 2 (final blend=0.7) -{ - CNNEffectParams p; - p.layer_index = 2; - p.total_layers = 3; - p.blend_amount = 0.7f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3); -} -``` - ---- - -## Shader Structure - -**Bindings:** -```wgsl -@group(0) @binding(0) var smplr: sampler; -@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input -@group(0) @binding(2) var<uniform> uniforms: CommonUniforms; -@group(0) @binding(3) var<uniform> params: CNNLayerParams; -@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured) -``` - -**Fragment shader logic:** -```wgsl -@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> { - let uv = p.xy / uniforms.resolution; - let original_raw = textureSample(original_input, smplr, uv); - let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1] - let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722)); - var result = vec4<f32>(0.0); - - if (params.layer_index == 0) { - result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution, - weights_layer0); - result = cnn_tanh(result); - } - else if (params.layer_index == 1) { - result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution, - gray, weights_layer1); - result = cnn_tanh(result); - } - // ... other layers - - // Blend with ORIGINAL input (not previous layer) - return mix(original_raw, result, params.blend_amount); -} -``` - -**Weight Storage (vec4-optimized):** - -**Inner layers (7→4 RGBD output):** -```wgsl -// Structure: array<vec4<f32>, 72> -// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layer0: array<vec4<f32>, 72> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias) - vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights) - vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias) - // ... 68 more vec4s -); -``` - -**Final layer (7→1 grayscale output):** -```wgsl -// Structure: array<vec4<f32>, 18> -// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layerN: array<vec4<f32>, 18> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias) - // ... 16 more vec4s -); -``` - -**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs. 
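For clarity, the vec4-pair weight layout can be restated as a small NumPy reference model. This is an illustration only, not the demo's code path (the authoritative version is the unrolled WGSL loop shown under Vec4 Optimization below); the array shapes and the direct 3×3 patch input are simplifications of the texture sampling.

```python
# Illustrative NumPy model of the inner-layer (7->4) weight indexing:
# 9 positions x 4 output channels x 2 vec4s = 72 vec4s, bias folded into in1.w.
import numpy as np

def conv3x3_7to4_reference(rgbd_patch, uv, gray, weights):
    """rgbd_patch: (3, 3, 4) neighborhood in [-1, 1]; uv, gray in [-1, 1];
    weights: (72, 4) array matching the WGSL weight constant."""
    in1 = np.array([uv[0], uv[1], gray, 1.0])  # second vec4: uv, gray, bias slot
    out = np.zeros(4)
    pos = 0
    for dy in range(3):
        for dx in range(3):
            rgbd = rgbd_patch[dy, dx]
            for ch in range(4):
                # two dot4 ops per output channel, as in the shader
                out[ch] += weights[pos + 2 * ch] @ rgbd + weights[pos + 2 * ch + 1] @ in1
            pos += 8
    return np.tanh(out)  # inner layers apply tanh
```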
- ---- - -## Size Budget - -| Component | Size | Notes | -|-----------|------|-------| -| Activation functions | ~200 B | 4 functions | -| Conv3x3 (standard + coord) | ~500 B | Both variants | -| Conv5x5 (standard + coord) | ~700 B | Both variants | -| Conv7x7 (standard + coord) | ~900 B | Both variants | -| Main shader | ~800 B | Layer composition | -| C++ implementation | ~300 B | Effect class | -| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) | -| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes | -| **Total** | **5-9 KB** | Acceptable for 64k | - -**Optimization strategies:** -- Quantize weights (float32 → int8) -- Prune near-zero weights -- Use separable convolutions - ---- - -## Testing - -```bash -./build/test_demo_effects # CNN construction/shader tests -./build/demo64k # Visual test -``` - ---- - -## Blend Parameter Behavior - -**blend_amount** controls final compositing with original: -- `blend=0.0`: Pure original (no CNN effect) -- `blend=0.5`: 50% original + 50% CNN -- `blend=1.0`: Pure CNN output (full stylization) - -**Important:** Blend uses captured layer 0 input, not previous layer output. - -**Example use cases:** -- `blend=1.0`: Full stylization (default) -- `blend=0.7`: Subtle effect preserving original details -- `blend=0.3`: Light artistic touch - -## Troubleshooting - -**Shader compilation fails:** -- Check `cnn_weights_generated.wgsl` syntax -- Verify snippets registered in `shaders.cc::InitShaderComposer()` -- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`) - -**Black/corrupted output:** -- Weights untrained (identity placeholder) -- Check `captured_frame` auxiliary texture is registered -- Verify layer priorities in timeline are sequential - -**Wrong blend result:** -- Ensure layer 0 has `needs_framebuffer_capture() == true` -- Check MainSequence framebuffer capture logic -- Verify `original_input` binding is populated - -**Training loss not decreasing:** -- Lower learning rate (`--learning-rate 0.0001`) -- More epochs (`--epochs 1000`) -- Check input/target image alignment - ---- - -## Vec4 Optimization - -**Architecture:** Weights stored as vec4 pairs for SIMD efficiency. 
- -**Input representation:** -```wgsl -let rgbd = textureSample(...); // vec4: [r, g, b, d] -let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0] -``` - -**Weight indexing:** -```wgsl -var pos = 0; // Direct weight array index -for (var dy = -1; dy <= 1; dy++) { - for (var dx = -1; dx <= 1; dx++) { - // Unrolled channel loop (4 output channels) - sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1); - sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1); - sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1); - sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1); - pos += 8; // 4 channels × 2 vec4s per channel - } -} -``` - -**Benefits:** -- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs) -- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment) -- **Bias integration:** Free via `[..., 1.0]` component (no separate add) -- **Code simplicity:** Eliminates inner loop, direct indexing with `pos` -- **Performance:** 2-3× GPU throughput improvement over scalar version - -**Weight layout per filter (8 floats):** -- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights) -- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias) - -**3×3 kernel sizes:** -- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 2304 bytes) -- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes) - ---- - -## References - -- **Training Script:** `training/train_cnn.py` -- **Shader Composition:** `doc/SEQUENCE.md` -- **Effect System:** `src/gpu/effect.h` diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md deleted file mode 100644 index bf63c5d..0000000 --- a/doc/CNN_FLATTEN_ANALYSIS.md +++ /dev/null @@ -1,189 +0,0 @@ -# CNN Shader Flatten Mode - Technical Analysis - -**Status:** Analysis complete - flatten mode NOT RECOMMENDED - -**Date:** February 2026 - ---- - -## Context - -Current CNN architecture uses **3 sequential render passes** (linear chaining): -- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer -- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer -- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original - -Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers. 
- ---- - -## Current Architecture - -**Shader Structure:** -- 1 pipeline with layer branching (`layer_index` uniform) -- 5 bindings: sampler, input texture, uniforms, layer params, original capture -- Total shader size: ~8 KB (snippets + weights) - -**Performance Profile:** -- 3 render pass dispatches -- 2 framebuffer writes + reads between layers -- Memory bandwidth: ~2× framebuffer size per layer -- Register pressure: Low (per-layer isolation) - -**Weight Buffer:** 290 vec4s (4.6 KB) - already unified - ---- - -## Flatten Approaches Evaluated - -### Option A: Full Flatten (All 3 Layers) - -**Cascading Receptive Field:** - -To compute final output at position (x, y): -- Layer 2 needs 3×3 neighborhood of Layer 1 outputs -- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs -- Each Layer 0 output needs 5×5 neighborhood of input samples - -**Effective input sampling:** 9×9 pixels (vs current 5×5 max) - -**Intermediate Storage (per thread/pixel):** -``` -Layer 0 outputs: 5×5 positions × 4 channels = 100 floats -Layer 1 outputs: 3×3 positions × 4 channels = 36 floats - TOTAL = 136 floats (544 bytes) -``` - -**GPU Register Pressure:** -- Modern GPUs: 32-64 KB registers per SM, shared across warps -- 544 bytes/thread → max 64 threads/SM (**low occupancy**) -- Current multi-pass: ~4-8 bytes/thread (high occupancy) - -**Pros:** -- 1 dispatch vs 3 (reduce CPU overhead) -- Zero framebuffer bandwidth between layers - -**Cons:** -- **Severe register pressure** (10-20× increase) -- Reduced occupancy → potential performance loss -- Complex shader (harder debug, larger binary) -- 9×9 input sampling - -**Assessment:** ❌ **Not Recommended** -Register cost outweighs bandwidth savings. - ---- - -### Option B: Partial Flatten (Layers 1 + 2) - -Keep Layer 0 separate, flatten only Layers 1 and 2. - -**Pass Structure:** -1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer -2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader - -**Intermediate Storage:** -``` -Layer 0 samples: 3×3 × 4 = 36 floats (read once) -Layer 1 outputs: 3×3 × 4 = 36 floats (computed) - TOTAL = 72 floats (288 bytes) -``` - -**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs - -**Pros:** -- 2 passes vs 3 (33% reduction) -- 1 framebuffer write saved -- More manageable register usage - -**Cons:** -- Still significant register pressure (288 bytes vs ~8 bytes baseline) -- Medium complexity increase -- Layer 0 (heaviest kernel) still separate - -**Assessment:** ⚠️ **Marginal Benefit** -Saves 1 pass but register cost still high. - ---- - -### Option C: Keep Current Multi-Pass ✅ - -**Rationale:** -- Current architecture well-suited to GPU design (high throughput via parallelism) -- Minimal register usage → high occupancy → hides memory latency -- Framebuffer bandwidth cost < register pressure cost -- Clean separation aids debugging/iteration -- Modular (easy to add/remove layers) - -**Alternative Optimizations (if bandwidth critical):** -1. Merge passes via render pass load/store ops (Vulkan subpasses) -2. Reduce intermediate channel count (4→3 or 2) -3. Hybrid: Compute shaders + workgroup shared memory -4. 
Layer pruning (2-layer vs 3-layer quality comparison) - ---- - -## Recommendation - -**✅ Keep current multi-pass architecture** - -### Decision Matrix - -| Factor | Multi-Pass | Partial Flatten | Full Flatten | -|--------|-----------|----------------|--------------| -| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme | -| Occupancy | ✅ High | ⚠️ Medium | ❌ Low | -| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest | -| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High | -| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard | -| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest | - -**Modern GPU Architecture Favors:** -- High parallelism (many small threads) over complex threads -- Hiding latency via occupancy over minimizing operations -- Memory bandwidth via caching, not elimination - ---- - -## Alternative: Compute Shader + Shared Memory - -**If bandwidth becomes critical:** -- Use compute shader with workgroup shared memory -- Load tile + halos into shared memory (9×9 input samples) -- Compute all 3 layers for tile interior (avoids redundant sampling) -- Requires explicit synchronization (`workgroupBarrier`) - -**Trade-offs:** -- ✅ Low register pressure + low bandwidth -- ❌ Compute pipeline complexity (no render pass integration) -- ❌ Tile edge handling -- ❌ Larger code size - ---- - -## Conclusion - -Current 3-pass architecture is **appropriate for demo64k**: -- Size-efficient (modular shaders) -- Performance adequate (bandwidth not bottleneck) -- Maintainable (clean layer isolation) - -**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint. - -### Size Optimization Alternatives (Better ROI) - -If size optimization critical, focus on: -1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization) -2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s) -3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels) - -These yield better size/performance than shader architecture changes. - ---- - -## References - -- `doc/CNN_EFFECT.md` - CNN implementation details -- `doc/CNN.md` - High-level CNN design -- `src/effects/cnn_effect.cc` - Current implementation -- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md deleted file mode 100644 index 3439f2c..0000000 --- a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md +++ /dev/null @@ -1,136 +0,0 @@ -# CNN RGBD→Grayscale Architecture Implementation - -## Summary - -Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input. - -## Changes Made - -### Architecture - -**Input:** RGBD (4 channels: RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Layer Configuration:** -- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation -- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation - -### Input Normalization (all to [-1,1]) - -- **RGBD:** `(rgbd - 0.5) * 2` -- **UV coords:** `(uv - 0.5) * 2` -- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter) - -**Rationale:** Zero-centered inputs for tanh activation, better gradient flow. - -### Modified Files - -**Training (`/Users/skal/demo/training/train_cnn.py`):** -1. Removed `CoordConv2d` class -2. Updated `SimpleCNN`: - - Inner layers: `Conv2d(7, 4)` - RGBD output - - Final layer: `Conv2d(7, 1)` - grayscale output -3. 
Updated `forward()`: - - Normalize RGBD/coords/gray to [-1,1] - - Concatenate 7-channel input for each layer - - Apply tanh (inner) or none (final) - - Denormalize final output -4. Updated `export_weights_to_wgsl()`: - - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values) - - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values) -5. Updated `generate_layer_shader()`: - - Use `cnn_conv3x3_7to4` for inner layers - - Use `cnn_conv3x3_7to1` for final layer - - Denormalize outputs from [-1,1] to [0,1] -6. Updated `ImagePairDataset`: - - Load RGBA input (was RGB) - -**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):** -1. Added `cnn_conv3x3_7to4()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 4-channel output: RGBD - - Weights: `array<array<f32, 8>, 36>` -2. Added `cnn_conv3x3_7to1()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 1-channel output: grayscale - - Weights: `array<array<f32, 8>, 9>` -3. Optimized: gray computed once in caller using `dot()`, not per-function - -**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):** -1. Updated architecture section with RGBD→grayscale pipeline -2. Updated training data requirements (RGBA input) -3. Updated weight storage format - -### No C++ Changes - -CNNLayerParams and bind groups remain unchanged. - -## Data Flow - -1. Layer 0 captures original RGBD to `captured_frame` -2. Each layer: - - Samples previous layer output (RGBD in [0,1]) - - Normalizes RGBD to [-1,1] - - Computes gray once using `dot()` (fs_main level) - - Normalizes UV coords to [-1,1] (inside conv functions) - - Concatenates 7-channel input - - Applies convolution with layer-specific weights - - Outputs RGBD (inner) or grayscale (final) in [-1,1] - - Applies tanh (inner only) - - Denormalizes to [0,1] for texture storage - - Blends with original - -## Next Steps - -1. **Prepare RGBD training data:** - - Input: RGBA images (RGB + depth in alpha) - - Target: Grayscale stylized output - -2. **Train network:** - ```bash - python3 training/train_cnn.py \ - --input training/input \ - --target training/output \ - --layers 3 \ - --epochs 1000 - ``` - -3. **Verify generated shaders:** - - Check `cnn_weights_generated.wgsl` structure - - Check `cnn_layer.wgsl` uses new conv functions - -4. 
**Test in demo:** - ```bash - cmake --build build -j4 - ./build/demo64k - ``` - -## Design Rationale - -**Why [-1,1] normalization?** -- Centered inputs for tanh (operates best around 0) -- Better gradient flow -- Standard ML practice for normalized data - -**Why RGBD throughout vs RGB?** -- Depth information propagates through network -- Enables depth-aware stylization -- Consistent 4-channel processing - -**Why 7-channel input?** -- Coordinates: position-dependent effects (vignettes) -- Grayscale: luminance-aware processing -- RGBD: full color+depth information -- Enables richer feature learning - -## Testing Checklist - -- [ ] Train network with RGBD input data -- [ ] Verify `cnn_weights_generated.wgsl` structure -- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions -- [ ] Build demo without errors -- [ ] Visual test: inner layers show RGBD evolution -- [ ] Visual test: final layer produces grayscale -- [ ] Visual test: blending works correctly -- [ ] Compare quality with previous RGB→RGB architecture diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md deleted file mode 100644 index 4307894..0000000 --- a/doc/CNN_TEST_TOOL.md +++ /dev/null @@ -1,244 +0,0 @@ -# CNN Shader Testing Tool - -Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer). - ---- - -## Purpose - -- Validate trained weights against ground truth -- Debug CNN layer behavior in isolation -- Generate test outputs for training workflow -- Match Python training script's inference mode - ---- - -## Architecture - -**Two implementations:** - -1. **CNN v1** (render pipeline, texture atlas weights) - - 3 fixed layers - - RGBA16Float intermediates - - BGRA8Unorm final output - -2. **CNN v2** (compute shaders, storage buffer weights) - - Dynamic layer count from binary - - 7D static features (RGBD + UV + sin + bias) - - RGBA32Uint packed f16 intermediates - - Storage buffer: ~3-5 KB weights - -**Core GPU utility:** `src/gpu/texture_readback.{h,cc}` -- Synchronous texture-to-CPU readback -- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm -- Protected with STRIP_ALL (0 bytes in release) - ---- - -## Usage - -```bash -cnn_test input.png output.png [OPTIONS] - -OPTIONS: - --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights) - --weights PATH Load weights from .bin (forces CNN v2, overrides layer config) - --blend F Final blend amount (0.0-1.0, default: 1.0) - --format ppm|png Output format (default: png) - --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights) - --save-intermediates DIR Save intermediate layers to directory - --debug-hex Print first 8 pixels as hex (debug) - --help Show usage -``` - -**Examples:** -```bash -# CNN v1 (render pipeline, 3 layers) -./build/cnn_test input.png output.png --cnn-version 1 - -# CNN v2 (compute, storage buffer, uses asset system weights) -./build/cnn_test input.png output.png --cnn-version 2 - -# CNN v2 with runtime weight loading (loads layer config from .bin) -./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin - -# 50% blend with original (v2) -./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5 - -# Debug hex dump -./build/cnn_test input.png output.png --cnn-version 2 --debug-hex -``` - -**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments. 
- ---- - -## Implementation Details - -### Core Readback Utility - -**File:** `src/gpu/texture_readback.{h,cc}` - -**Function:** -```cpp -std::vector<uint8_t> read_texture_pixels( - WGPUInstance instance, - WGPUDevice device, - WGPUTexture texture, - int width, - int height); -``` - -**Features:** -- Returns BGRA8 format (4 bytes per pixel) -- Synchronous blocking operation -- Cross-platform async callback handling (Win32 vs Native API) -- Automatic staging buffer creation and cleanup - -**Refactored OffscreenRenderTarget:** -```cpp -std::vector<uint8_t> OffscreenRenderTarget::read_pixels() { -#if !defined(STRIP_ALL) - return read_texture_pixels(instance_, device_, texture_, width_, height_); -#else - return std::vector<uint8_t>(); -#endif -} -``` - -### CNN v1 Pipeline (Render) - -**Fixed 3-layer architecture:** -- Ping-pong RGBA16Float textures -- CNNLayerParams (binding 3): layer_index, blend_amount -- Shader composer resolves #include directives - -### CNN v2 Pipeline (Compute) - -**Dynamic layer architecture:** -1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias) -2. **Layer computes:** N layers from binary weights (3-5 typically) - - Storage buffer weights (read-only) - - RGBA32Uint packed f16 textures (ping-pong) - - CNNv2LayerParams: kernel_size, channels, weight_offset, blend -3. **Readback:** RGBA32Uint → f16 decode → u8 clamp - -**Binary format:** Header (20B) + layer info (20B×N) + f16 weights - -**Weight Loading:** -- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`) -- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports) - - Layer count and kernel sizes parsed from binary header - - Overrides any `--layers` or `--cnn-version` arguments - - Enables runtime testing of training checkpoints without rebuild - ---- - -## Build Integration - -**CMakeLists.txt:** - -1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections) -2. Tool target: -```cmake -add_executable(cnn_test - tools/cnn_test.cc - src/tests/common/webgpu_test_fixture.cc - src/tests/common/offscreen_render_target.cc - ${PLATFORM_SOURCES} - ${GEN_DEMO_CC}) - -target_link_libraries(cnn_test PRIVATE - gpu util procedural ${DEMO_LIBS}) - -add_dependencies(cnn_test generate_demo_assets) - -target_compile_definitions(cnn_test PRIVATE - STB_IMAGE_IMPLEMENTATION - STB_IMAGE_WRITE_IMPLEMENTATION) -``` - -**Build:** -```bash -cmake -S . -B build -DDEMO_BUILD_TOOLS=ON -cmake --build build -j4 -``` - ---- - -## Validation Workflow (CNN v2) - -### 1. Train and Export -```bash -# Train and export weights -./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16 -``` - -### 2. Tool Inference -```bash -# Run tool with v2 -./build/cnn_test training/input/img_000.png output.png --cnn-version 2 -``` - -### 3. Visual Comparison -Compare output.png with training/target_X/img_000.png - ---- - -## Status - -**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation. - -**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool. -- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin` -- Matches CNNv2Effect architecture -- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code -- Root cause under investigation (weight indexing? texture sampling? activation clamping?) 
-- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation - ---- - -## Technical Notes (Readback Fix) - -**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5) - -**Root Cause:** Callback mode mismatch -- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`) -- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode) -- Callback never fired → timeout → empty buffer - -**Fix Applied:** -1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents` -2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)` -3. Added pre-mapping device poll to ensure copy completes - -**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110 - -**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes - ---- - -## Limitations - -- **CNN v1:** Produces incorrect output, use for debugging only -- **Single image:** Batch processing requires shell loop -- **No real-time preview:** Offline processing only -- **PNG input:** stb_image (JPEG/PNG/BMP/TGA also supported) - ---- - -## Technical Notes - -**CNN v2 f16 decoding:** -- RGBA32Uint texture stores 8×f16 as 4×u32 -- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8 -- Handles denormals, infinity, NaN - -**Cross-platform:** -- macOS, Linux (native WebGPU) -- Windows (mingw-w64 cross-compile) - -**Size impact:** -- Debug/STRIP_ALL=OFF: compiled -- STRIP_ALL=ON: 0 bytes (compiled out) -- FINAL_STRIP=ON: tool not built diff --git a/doc/CNN_V2.md b/doc/CNN_V2.md deleted file mode 100644 index b7fd6f8..0000000 --- a/doc/CNN_V2.md +++ /dev/null @@ -1,813 +0,0 @@ -# CNN v2: Parametric Static Features - -**Technical Design Document** - ---- - -## Overview - -CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality. - -**Key improvements over v1:** -- 7D static feature input (vs 4D RGB) -- Multi-frequency position encoding (NeRF-style) -- Configurable mip-level for p0-p3 parametric features (0-3) -- Per-layer configurable kernel sizes (1×1, 3×3, 5×5) -- Variable channel counts per layer -- Float16 weight storage (~3.2 KB for 3-layer model) -- Bias integrated as static feature dimension -- Storage buffer architecture (dynamic layer count) -- Binary weight format v2 for runtime loading -- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping) - -**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational. - -**Breaking Change:** -- Models trained with `clamp()` incompatible. Retrain required. 
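The clamp→sigmoid change matters because `clamp()` has zero gradient once a pre-activation leaves [0, 1], so those pixels stop contributing to training, whereas sigmoid keeps a usable gradient everywhere. A minimal PyTorch illustration (not taken from the training script):

```python
# Minimal illustration: gradient through clamp() vs sigmoid() for a
# pre-activation that already sits above 1.0.
import torch

x = torch.tensor(1.5, requires_grad=True)

torch.clamp(x, 0.0, 1.0).backward()
print(x.grad)   # tensor(0.) -- clamp blocks the learning signal here

x.grad = None
torch.sigmoid(x).backward()
print(x.grad)   # ~0.149 -- sigmoid still propagates a gradient
```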
- -**TODO:** -- 8-bit quantization with QAT for 2× size reduction (~1.6 KB) - ---- - -## Architecture - -### Pipeline Overview - -``` -Input RGBD → Static Features Compute → CNN Layers → Output RGBA - └─ computed once/frame ─┘ └─ multi-pass ─┘ -``` - -**Detailed Data Flow:** - -``` - ┌─────────────────────────────────────────┐ - │ Static Features (computed once) │ - │ 8D: p0,p1,p2,p3,uv_x,uv_y,sin10x,bias │ - └──────────────┬──────────────────────────┘ - │ - │ 8D (broadcast to all layers) - ├───────────────────────────┐ - │ │ - ┌──────────────┐ │ │ - │ Input RGBD │──────────────┤ │ - │ 4D │ 4D │ │ - └──────────────┘ │ │ - ▼ │ - ┌────────────┐ │ - │ Layer 0 │ (12D input) │ - │ (CNN) │ = 4D + 8D │ - │ 12D → 4D │ │ - └─────┬──────┘ │ - │ 4D output │ - │ │ - ├───────────────────────────┘ - │ │ - ▼ │ - ┌────────────┐ │ - │ Layer 1 │ (12D input) │ - │ (CNN) │ = 4D + 8D │ - │ 12D → 4D │ │ - └─────┬──────┘ │ - │ 4D output │ - │ │ - ├───────────────────────────┘ - ▼ │ - ... │ - │ │ - ▼ │ - ┌────────────┐ │ - │ Layer N │ (12D input) │ - │ (output) │◄──────────────────┘ - │ 12D → 4D │ - └─────┬──────┘ - │ 4D (RGBA) - ▼ - Output -``` - -**Key Points:** -- Static features computed once, broadcast to all CNN layers -- Each layer: previous 4D output + 8D static → 12D input → 4D output -- Ping-pong buffering between layers -- Layer 0 special case: uses input RGBD instead of previous layer output - -**Static Features Texture:** -- Name: `static_features` -- Format: `texture_storage_2d<rgba32uint, write>` (4×u32) -- Data: 8 float16 values packed via `pack2x16float()` -- Computed once per frame, read by all CNN layers -- Lifetime: Entire frame (all CNN layer passes) - -**CNN Layers:** -- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels -- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels -- All layers: uniform 12D input, 4D output (ping-pong buffer) -- Storage: `texture_storage_2d<rgba32uint>` (4 channels as 2×f16 pairs) - -**Activation Functions:** -- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping -- Middle layers: `ReLU` (max(0, x)) -- Rationale: Sigmoid prevents gradient blocking at boundaries, enabling better convergence -- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible, retrain required - ---- - -## Static Features (7D + 1 bias) - -### Feature Layout - -**8 float16 values per pixel:** - -```wgsl -// Slot 0-3: Parametric features (p0, p1, p2, p3) -// Sampled from configurable mip level (0=original, 1=half, 2=quarter, 3=eighth) -// Training sets mip_level via --mip-level flag, stored in binary format v2 -let p0 = ...; // RGB.r from selected mip level -let p1 = ...; // RGB.g from selected mip level -let p2 = ...; // RGB.b from selected mip level -let p3 = ...; // Depth or RGB channel from mip level - -// Slot 4-5: UV coordinates (normalized screen space) -let uv_x = coord.x / resolution.x; // Horizontal position [0,1] -let uv_y = coord.y / resolution.y; // Vertical position [0,1] - -// Slot 6: Multi-frequency position encoding -let sin20_y = sin(20.0 * uv_y); // Periodic feature (frequency=20, vertical) - -// Slot 7: Bias dimension (always 1.0) -let bias = 1.0; // Learned bias per output channel - -// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0] -``` - -### Input Channel Mapping - -**Weight tensor layout (12 input channels per layer):** - -| Input Channel | Feature | Description | -|--------------|---------|-------------| -| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) | -| 4-11 | 
Static features | 8D: p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias | - -**Static feature channel details:** -- Channel 4 → p0 (RGB.r from mip level) -- Channel 5 → p1 (RGB.g from mip level) -- Channel 6 → p2 (RGB.b from mip level) -- Channel 7 → p3 (depth or RGB channel from mip level) -- Channel 8 → p4 (uv_x: normalized horizontal position) -- Channel 9 → p5 (uv_y: normalized vertical position) -- Channel 10 → p6 (sin(20*uv_y): periodic encoding) -- Channel 11 → p7 (bias: constant 1.0) - -**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7. - -### Feature Rationale - -| Feature | Dimension | Purpose | Priority | -|---------|-----------|---------|----------| -| p0-p3 | 4D | Parametric auxiliary features (mips, gradients, etc.) | Essential | -| UV coords | 2D | Spatial position awareness | Essential | -| sin(20\*uv.y) | 1D | Periodic position encoding (vertical) | Medium | -| Bias | 1D | Learned bias (standard NN) | Essential | - -**Note:** Input image RGBD (mip 0) fed only to Layer 0. Subsequent layers see static features + previous layer output. - -**Why bias as static feature:** -- Simpler shader code (single weight array) -- Standard NN formulation: y = Wx (x includes bias term) -- Saves 56-112 bytes (no separate bias buffer) -- 7 features sufficient for initial implementation - -### Future Feature Extensions - -**Option: Additional encodings:** -- `sin(40*uv.y)` - Higher frequency encoding -- `gray_mip1` - Multi-scale luminance -- `dx`, `dy` - Sobel gradients -- `variance` - Local texture measure -- `laplacian` - Edge detection - -**Option: uint8 packing (16+ features):** -```wgsl -// texture_storage_2d<rgba8unorm> stores 16 uint8 values -// Trade precision for feature count -// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, -// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias] -``` -Requires quantization-aware training. - ---- - -## Layer Structure - -### Example 3-Layer Network - -``` -Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel) -Layer 1: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel) -Layer 2: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA) -``` - -**Output:** 4 channels (RGBA). Training targets preserve alpha from target images. 
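The per-layer weight counts in the next section follow directly from in_channels × kernel² × out_channels; a quick back-of-the-envelope helper (illustrative only, values match the 3-layer example):

```python
# Quick size estimate for a CNN v2 configuration: 12 input channels
# (4 from the previous layer + 8 static), 4 output channels, f16 storage.
def cnn_v2_weight_bytes(kernel_sizes, in_ch=12, out_ch=4, bytes_per_weight=2):
    per_layer = [in_ch * k * k * out_ch for k in kernel_sizes]
    return per_layer, sum(per_layer) * bytes_per_weight

per_layer, total_bytes = cnn_v2_weight_bytes([3, 3, 3])
print(per_layer, total_bytes)  # [432, 432, 432] 2592 -> ~2.5 KB of f16 weights
```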
- -### Weight Calculations - -**Per-layer weights (uniform 12D→4D, 3×3 kernels):** -``` -Layer 0: 12 × 3 × 3 × 4 = 432 weights -Layer 1: 12 × 3 × 3 × 4 = 432 weights -Layer 2: 12 × 3 × 3 × 4 = 432 weights -Total: 1296 weights -``` - -**Storage sizes:** -- f32: 1296 × 4 = 5,184 bytes (~5.1 KB) -- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended** - -**Comparison to v1:** -- v1: ~800 weights (3.2 KB f32) -- v2: ~1296 weights (2.5 KB f16) -- **Uniform architecture, smaller than v1 f32** - -### Kernel Size Guidelines - -**1×1 kernel (pointwise):** -- No spatial context, channel mixing only -- Weights: `12 × 4 = 48` per layer -- Use for: Fast inference, channel remapping - -**3×3 kernel (standard conv):** -- Local spatial context (recommended) -- Weights: `12 × 9 × 4 = 432` per layer -- Use for: Most layers (balanced quality/size) - -**5×5 kernel (large receptive field):** -- Wide spatial context -- Weights: `12 × 25 × 4 = 1200` per layer -- Use for: Output layer, fine detail enhancement - -### Channel Storage (4×f16 per texel) - -```wgsl -@group(0) @binding(1) var layer_input: texture_2d<u32>; - -fn unpack_channels(coord: vec2<i32>) -> vec4<f32> { - let packed = textureLoad(layer_input, coord, 0); - let v0 = unpack2x16float(packed.x); // [ch0, ch1] - let v1 = unpack2x16float(packed.y); // [ch2, ch3] - return vec4<f32>(v0.x, v0.y, v1.x, v1.y); -} - -fn pack_channels(values: vec4<f32>) -> vec4<u32> { - return vec4<u32>( - pack2x16float(vec2(values.x, values.y)), - pack2x16float(vec2(values.z, values.w)), - 0u, // Unused - 0u // Unused - ); -} -``` - ---- - -## Training Workflow - -### Script: `training/train_cnn_v2.py` - -**Static Feature Extraction:** - -```python -def compute_static_features(rgb, depth, mip_level=0): - """Generate parametric features (8D: p0-p3 + spatial). 
-
-    Args:
-        mip_level: 0=original, 1=half res, 2=quarter res, 3=eighth res
-    """
-    h, w = rgb.shape[:2]
-
-    # Generate mip level for p0-p3 (downsample then upsample)
-    if mip_level > 0:
-        mip_rgb = rgb.copy()
-        for _ in range(mip_level):
-            mip_rgb = cv2.pyrDown(mip_rgb)
-        for _ in range(mip_level):
-            mip_rgb = cv2.pyrUp(mip_rgb)
-        if mip_rgb.shape[:2] != (h, w):
-            mip_rgb = cv2.resize(mip_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
-    else:
-        mip_rgb = rgb
-
-    # Parametric features from mip level
-    p0, p1, p2, p3 = mip_rgb[..., 0], mip_rgb[..., 1], mip_rgb[..., 2], depth
-
-    # UV coordinates (normalized)
-    uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0)
-    uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1)
-
-    # Sinusoidal position encoding
-    sin10_x = np.sin(10.0 * uv_x)
-
-    # Bias dimension (always 1.0)
-    bias = np.ones_like(p0)
-
-    # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin10_x, bias]
-    return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias], axis=-1)
-```
-
-**Network Definition:**
-
-```python
-class CNNv2(nn.Module):
-    def __init__(self, kernel_sizes, num_layers=3):
-        super().__init__()
-        if isinstance(kernel_sizes, int):
-            kernel_sizes = [kernel_sizes] * num_layers
-        self.kernel_sizes = kernel_sizes
-        self.layers = nn.ModuleList()
-
-        # All layers: 12D input (4 prev + 8 static) → 4D output
-        for kernel_size in kernel_sizes:
-            self.layers.append(
-                nn.Conv2d(12, 4, kernel_size=kernel_size,
-                          padding=kernel_size//2, bias=False)
-            )
-
-    def forward(self, input_rgbd, static_features):
-        # Layer 0: input RGBD (4D) + static (8D) = 12D
-        x = torch.cat([input_rgbd, static_features], dim=1)
-        x = self.layers[0](x)
-        x = torch.sigmoid(x)  # Soft [0,1] for layer 0
-
-        # Layer 1+: previous output (4D) + static (8D) = 12D
-        for i in range(1, len(self.layers)):
-            x_input = torch.cat([x, static_features], dim=1)
-            x = self.layers[i](x_input)
-            if i < len(self.layers) - 1:
-                x = F.relu(x)
-            else:
-                x = torch.sigmoid(x)  # Soft [0,1] for final layer
-
-        return x  # RGBA output
-```
-
-**Training Configuration:**
-
-```python
-# Hyperparameters
-kernel_sizes = [3, 3, 3]   # Per-layer kernel sizes (e.g., [1,3,5])
-num_layers = 3             # Number of CNN layers
-mip_level = 0              # Mip level for p0-p3: 0=orig, 1=half, 2=quarter, 3=eighth
-grayscale_loss = False     # Compute loss on grayscale (Y) instead of RGBA
-learning_rate = 1e-3
-batch_size = 16
-epochs = 5000
-
-# Dataset: Input RGB, Target RGBA (preserves alpha channel from image)
-# Model outputs RGBA, loss compares all 4 channels (or grayscale if --grayscale-loss)
-
-# Training loop (standard PyTorch f32)
-for epoch in range(epochs):
-    for rgb_batch, depth_batch, target_batch in dataloader:
-        # Compute static features (8D) with mip level
-        static_feat = compute_static_features(rgb_batch, depth_batch, mip_level)
-
-        # Input RGBD (4D)
-        input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1)
-
-        # Forward pass
-        output = model(input_rgbd, static_feat)
-
-        # Loss computation (grayscale or RGBA)
-        if grayscale_loss:
-            # Convert RGBA to grayscale: Y = 0.299*R + 0.587*G + 0.114*B
-            output_gray = 0.299 * output[:, 0:1] + 0.587 * output[:, 1:2] + 0.114 * output[:, 2:3]
-            target_gray = 0.299 * target_batch[:, 0:1] + 0.587 * target_batch[:, 1:2] + 0.114 * target_batch[:, 2:3]
-            loss = criterion(output_gray, target_gray)
-        else:
-            loss = criterion(output, target_batch)
-
-        # Backward pass
-        optimizer.zero_grad()
-        loss.backward()
-        optimizer.step()
-```
-
-**Checkpoint Format:**
-
-```python
-torch.save({
-    'state_dict': model.state_dict(),  # f32 weights
'config': { - 'kernel_sizes': [3, 3, 3], # Per-layer kernel sizes - 'num_layers': 3, - 'mip_level': 0, # Mip level used for p0-p3 - 'grayscale_loss': False, # Whether grayscale loss was used - 'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin10_x', 'bias'] - }, - 'epoch': epoch, - 'loss': loss.item() -}, f'checkpoints/checkpoint_epoch_{epoch}.pth') -``` - ---- - -## Export Workflow - -### Script: `training/export_cnn_v2_shader.py` - -**Process:** -1. Load checkpoint (f32 PyTorch weights) -2. Extract layer configs (kernels, channels) -3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)` -4. Generate WGSL shader per layer -5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl` - -**Example Generated Shader:** - -```wgsl -// cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth - -const KERNEL_SIZE: u32 = 1u; -const IN_CHANNELS: u32 = 8u; // 7 features + bias -const OUT_CHANNELS: u32 = 16u; - -// Weights quantized to float16 (stored as f32 in shader) -const weights: array<f32, 128> = array( - 0.123047, -0.089844, 0.234375, 0.456055, ... -); - -@group(0) @binding(0) var static_features: texture_2d<u32>; -@group(0) @binding(1) var output_texture: texture_storage_2d<rgba32uint, write>; - -@compute @workgroup_size(8, 8) -fn main(@builtin(global_invocation_id) id: vec3<u32>) { - // Load static features (8D) - let static_feat = get_static_features(vec2<i32>(id.xy)); - - // Convolution (1×1 kernel = pointwise) - var output: array<f32, OUT_CHANNELS>; - for (var c: u32 = 0u; c < OUT_CHANNELS; c++) { - var sum: f32 = 0.0; - for (var k: u32 = 0u; k < IN_CHANNELS; k++) { - sum += weights[c * IN_CHANNELS + k] * static_feat[k]; - } - output[c] = max(0.0, sum); // ReLU activation - } - - // Pack and store (8×f16 per texel) - textureStore(output_texture, vec2<i32>(id.xy), pack_f16x8(output)); -} -``` - -**Float16 Quantization:** -- Training uses f32 throughout (PyTorch standard) -- Export converts to np.float16, then back to f32 for WGSL literals -- **Expected discrepancy:** <0.1% MSE (acceptable) -- Validation via HTML tool (see below) - ---- - -## Validation Workflow - -### HTML Tool: `tools/cnn_v2_test/index.html` - -**WebGPU-based testing tool** with layer visualization. - -**Usage:** -1. Open `tools/cnn_v2_test/index.html` in browser -2. Drop `.bin` weights file (from `export_cnn_v2_weights.py`) -3. Drop PNG test image -4. 
View results with layer inspection - -**Features:** -- Live CNN inference with WebGPU -- Layer-by-layer visualization (static features + all CNN layers) -- Weight visualization (per-layer kernels) -- View modes: CNN output, original, diff (×10) -- Blend control for comparing with original - -**Export weights:** -```bash -./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \ - --output-weights workspaces/main/cnn_v2_weights.bin -``` - -See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation - ---- - -## Implementation Checklist - -### Phase 1: Shaders (Core Infrastructure) - -- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute - - [ ] RGBD sampling from framebuffer - - [ ] UV coordinate calculation - - [ ] sin(10\*uv.x) computation - - [ ] Bias dimension (constant 1.0) - - [ ] Float16 packing via `pack2x16float()` - - [ ] Output to `texture_storage_2d<rgba32uint>` - -- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template - - [ ] Static features unpacking - - [ ] Previous layer unpacking (8×f16) - - [ ] Convolution implementation (1×1, 3×3, 5×5) - - [ ] ReLU activation - - [ ] Output packing (8×f16) - - [ ] Proper padding handling - -### Phase 2: C++ Effect Class - -- [ ] `src/effects/cnn_v2_effect.h` - Header - - [ ] Class declaration inheriting from `PostProcessEffect` - - [ ] Static features texture member - - [ ] Layer textures vector - - [ ] Pipeline and bind group members - -- [ ] `src/effects/cnn_v2_effect.cc` - Implementation - - [ ] Constructor: Load shaders, create textures - - [ ] `init()`: Create pipelines, bind groups - - [ ] `render()`: Multi-pass execution - - [ ] Pass 0: Compute static features - - [ ] Pass 1-N: CNN layers - - [ ] Final: Composite to output - - [ ] Proper resource cleanup - -- [ ] Integration - - [ ] Add to `src/gpu/demo_effects.h` includes - - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal) - - [ ] Add shaders to `workspaces/main/assets.txt` - - [ ] Add to `src/tests/gpu/test_demo_effects.cc` - -### Phase 3: Training Pipeline - -- [ ] `training/train_cnn_v2.py` - Training script - - [ ] Static feature extraction function - - [ ] CNNv2 PyTorch model class - - [ ] Patch-based dataloader - - [ ] Training loop with checkpointing - - [ ] Command-line argument parsing - - [ ] Inference mode (ground truth generation) - -- [ ] `training/export_cnn_v2_shader.py` - Export script - - [ ] Checkpoint loading - - [ ] Weight extraction and f16 quantization - - [ ] Per-layer WGSL generation - - [ ] File output to workspace shaders/ - - [ ] Metadata preservation - -### Phase 4: Tools & Validation - -- [x] HTML validation tool - WebGPU inference with layer visualization - - [ ] Command-line argument parsing - - [ ] Shader export orchestration - - [ ] Build orchestration - - [ ] Batch image processing - - [ ] Results display - -- [ ] `src/tools/cnn_test_main.cc` - Tool updates - - [ ] Add `--cnn-version v2` flag - - [ ] CNNv2Effect instantiation path - - [ ] Static features pass execution - - [ ] Multi-layer processing - -### Phase 5: Documentation - -- [ ] `doc/HOWTO.md` - Usage guide - - [ ] Training section (CNN v2) - - [ ] Export section - - [ ] Validation section - - [ ] Examples - -- [ ] `README.md` - Project overview update - - [ ] Mention CNN v2 capability - ---- - -## File Structure - -### New Files - -``` -# Shaders (generated by export script) -workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl # Static features compute 
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl # Input layer (generated) -workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl # Inner layer (generated) -workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl # Output layer (generated) - -# C++ implementation -src/effects/cnn_v2_effect.h # Effect class header -src/effects/cnn_v2_effect.cc # Effect implementation - -# Python training/export -training/train_cnn_v2.py # Training script -training/export_cnn_v2_shader.py # Shader generator -training/validation/ # Test images directory - -# Validation -tools/cnn_v2_test/index.html # WebGPU validation tool - -# Documentation -doc/CNN_V2.md # This file -``` - -### Modified Files - -``` -src/gpu/demo_effects.h # Add CNNv2Effect include -CMakeLists.txt # Add cnn_v2_effect.cc -workspaces/main/assets.txt # Add cnn_v2 shaders -workspaces/main/timeline.seq # Optional: add CNNv2Effect -src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case -src/tools/cnn_test_main.cc # Add --cnn-version v2 -doc/HOWTO.md # Add CNN v2 sections -TODO.md # Add CNN v2 task -``` - -### Unchanged (v1 Preserved) - -``` -training/train_cnn.py # Original training -src/effects/cnn_effect.* # Original effect -workspaces/main/shaders/cnn_*.wgsl # Original v1 shaders -``` - ---- - -## Performance Characteristics - -### Static Features Compute -- **Cost:** ~0.1ms @ 1080p -- **Frequency:** Once per frame -- **Operations:** sin(), texture sampling, packing - -### CNN Layers (Example 3-layer) -- **Layer0 (1×1, 8→16):** ~0.3ms -- **Layer1 (3×3, 23→8):** ~0.8ms -- **Layer2 (5×5, 15→4):** ~1.2ms -- **Total:** ~2.4ms @ 1080p - -### Memory Usage -- Static features: 1920×1080×8×2 = 33 MB (f16) -- Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels) -- Weights: ~6.4 KB (f16, in shader code) -- **Total GPU memory:** ~100 MB - ---- - -## Size Budget - -### CNN v1 vs v2 - -| Metric | v1 | v2 | Delta | -|--------|----|----|-------| -| Weights (count) | 800 | 3268 | +2468 | -| Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB | -| Storage (f16) | N/A | 6.5 KB | +6.5 KB | -| Shader code | ~500 lines | ~800 lines | +300 lines | - -### Mitigation Strategies - -**Reduce channels:** -- [16,8,4] → [8,4,4] saves ~50% weights -- [16,8,4] → [4,4,4] saves ~60% weights - -**Smaller kernels:** -- [1,3,5] → [1,3,3] saves ~30% weights -- [1,3,5] → [1,1,3] saves ~50% weights - -**Quantization:** -- int8 weights: saves 75% (requires QAT training) -- 4-bit weights: saves 87.5% (extreme, needs research) - -**Target:** Keep CNN v2 under 10 KB for 64k demo constraint - ---- - -## Future Extensions - -### Flexible Feature Layout (Binary Format v3) - -**TODO:** Support arbitrary feature vector layouts and ordering in binary format. - -**Current Limitation:** -- Feature layout hardcoded: `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]` -- Shader must match training script exactly -- Experimentation requires shader recompilation - -**Proposed Enhancement:** -- Add feature descriptor to binary format header -- Specify feature types, sources, and ordering -- Runtime shader generation or dynamic feature indexing -- Examples: `[R, G, B, dx, dy, uv_x, bias]` or `[mip1.r, mip2.g, laplacian, uv_x, sin20_x, bias]` - -**Benefits:** -- Training experiments without C++/shader changes -- A/B test different feature combinations -- Single binary format, multiple architectures -- Faster iteration on feature engineering - -**Implementation Options:** -1. **Static approach:** Generate shader code from descriptor at load time -2. 
**Dynamic approach:** Array-based indexing with feature map uniform -3. **Hybrid:** Precompile common layouts, fallback to dynamic - -See `doc/CNN_V2_BINARY_FORMAT.md` for proposed descriptor format. - ---- - -### More Features (uint8 Packing) - -```wgsl -// 16 uint8 features per texel (texture_storage_2d<rgba8unorm>) -// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, -// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias] -``` -- Trade precision for quantity -- Requires quantization-aware training - -### Temporal Features - -- Previous frame RGBA (motion awareness) -- Optical flow vectors -- Requires multi-frame buffer - -### Learned Position Encodings - -- Replace hand-crafted sin(10\*uv) with learned embeddings -- Requires separate embedding network -- Similar to NeRF position encoding - -### Dynamic Architecture - -- Runtime kernel size selection based on scene -- Conditional layer execution (skip connections) -- Layer pruning for performance - ---- - -## References - -- **v1 Implementation:** `src/effects/cnn_effect.*` -- **Training Guide:** `doc/HOWTO.md` (CNN Training section) -- **Test Tool:** `doc/CNN_TEST_TOOL.md` -- **Shader System:** `doc/SEQUENCE.md` -- **Size Measurement:** `doc/SIZE_MEASUREMENT.md` - ---- - -## Appendix: Design Decisions - -### Why Bias as Static Feature? - -**Alternatives considered:** -1. Separate bias array per layer (Option B) -2. Bias as static feature = 1.0 (Option A, chosen) - -**Decision rationale:** -- Simpler shader code (fewer bindings) -- Standard NN formulation (augmented input) -- Saves 56-112 bytes per model -- 7 features sufficient for v1 implementation -- Can extend to uint8 packing if >7 features needed - -### Why Float16 for Weights? - -**Alternatives considered:** -1. Keep f32 (larger, more accurate) -2. Use f16 (smaller, GPU-native) -3. Use int8 (smallest, needs QAT) - -**Decision rationale:** -- f16 saves 50% vs f32 (critical for 64k target) -- GPU-native support (pack2x16float in WGSL) -- <0.1% accuracy loss (acceptable) -- Simpler than int8 quantization - -### Why Multi-Frequency Position Encoding? - -**Inspiration:** NeRF (Neural Radiance Fields) - -**Benefits:** -- Helps network learn high-frequency details -- Better than raw UV coordinates -- Small footprint (1D per frequency) - -**Future:** Add sin(20\*uv), sin(40\*uv) if >7 features available - ---- - -## Related Documentation - -- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format) -- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization -- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated) -- `doc/HOWTO.md` - Training and validation workflows - ---- - -**Document Version:** 1.0 -**Last Updated:** 2026-02-12 -**Status:** Design approved, ready for implementation diff --git a/doc/CNN_V2_BINARY_FORMAT.md b/doc/CNN_V2_BINARY_FORMAT.md deleted file mode 100644 index 59c859d..0000000 --- a/doc/CNN_V2_BINARY_FORMAT.md +++ /dev/null @@ -1,235 +0,0 @@ -# CNN v2 Binary Weight Format Specification - -Binary format for storing trained CNN v2 weights with static feature architecture. 
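
As a quick orientation for the layout specified in the sections below, here is a minimal Python reader sketch. It is illustrative only — the project's actual loader is `src/effects/cnn_v2_effect.cc`, and `training/export_cnn_v2_weights.py` is what writes the file:

```python
import struct

def read_cnn_v2_bin(path):
    """Parse header + layer table of a .bin weight file (v1 or v2, little-endian)."""
    with open(path, "rb") as f:
        magic, version, num_layers, total_weights = struct.unpack("<4I", f.read(16))
        assert magic == 0x324E4E43, "not a CNN v2 weight file ('CNN2' magic missing)"
        mip_level = struct.unpack("<I", f.read(4))[0] if version >= 2 else 0
        fields = ("kernel_size", "in_channels", "out_channels", "weight_offset", "weight_count")
        layers = [dict(zip(fields, struct.unpack("<5I", f.read(20)))) for _ in range(num_layers)]
        packed = f.read()  # f16 weights packed pairwise into u32 words
    return {"version": version, "mip_level": mip_level,
            "total_weights": total_weights, "layers": layers, "packed": packed}
```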
- -**File Extension:** `.bin` -**Byte Order:** Little-endian -**Version:** 2.0 (supports mip-level for parametric features) -**Backward Compatible:** Version 1.0 files supported (mip_level=0) - ---- - -## File Structure - -**Version 2 (current):** -``` -┌─────────────────────┐ -│ Header (20 bytes) │ -├─────────────────────┤ -│ Layer Info │ -│ (20 bytes × N) │ -├─────────────────────┤ -│ Weight Data │ -│ (variable size) │ -└─────────────────────┘ -``` - -**Version 1 (legacy):** -``` -┌─────────────────────┐ -│ Header (16 bytes) │ -├─────────────────────┤ -│ Layer Info │ -│ (20 bytes × N) │ -├─────────────────────┤ -│ Weight Data │ -│ (variable size) │ -└─────────────────────┘ -``` - ---- - -## Header - -**Version 2 (20 bytes):** - -| Offset | Type | Field | Description | -|--------|------|----------------|--------------------------------------| -| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | -| 0x04 | u32 | version | Format version (2 for current) | -| 0x08 | u32 | num_layers | Number of CNN layers (excludes static features) | -| 0x0C | u32 | total_weights | Total f16 weight count across all layers | -| 0x10 | u32 | mip_level | Mip level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) | - -**Version 1 (16 bytes) - Legacy:** - -| Offset | Type | Field | Description | -|--------|------|----------------|--------------------------------------| -| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | -| 0x04 | u32 | version | Format version (1) | -| 0x08 | u32 | num_layers | Number of CNN layers | -| 0x0C | u32 | total_weights | Total f16 weight count | - -**Note:** Loaders should check version field and handle both formats. Version 1 files treated as mip_level=0. - ---- - -## Layer Info (20 bytes per layer) - -Repeated `num_layers` times: -- **Version 2:** Starting at offset 0x14 (20 bytes) -- **Version 1:** Starting at offset 0x10 (16 bytes) - -| Offset | Type | Field | Description | -|-------------|------|----------------|--------------------------------------| -| 0x00 | u32 | kernel_size | Convolution kernel dimension (3, 5, 7, etc.) | -| 0x04 | u32 | in_channels | Input channel count (includes 8 static features for Layer 1) | -| 0x08 | u32 | out_channels | Output channel count (max 8) | -| 0x0C | u32 | weight_offset | Weight array start index (f16 units, relative to weight data section) | -| 0x10 | u32 | weight_count | Number of f16 weights for this layer | - -**Layer Order:** Sequential (Layer 1, Layer 2, Layer 3, ...) - ---- - -## Weight Data (variable size) - -Starts at offset: -- **Version 2:** `20 + (num_layers × 20)` -- **Version 1:** `16 + (num_layers × 20)` - -**Format:** Packed f16 pairs stored as u32 -**Packing:** `u32 = (f16_hi << 16) | f16_lo` -**Storage:** Sequential by layer, then by output channel, input channel, spatial position - -**Weight Indexing:** -``` -weight_idx = output_ch × (in_channels × kernel_size²) + - input_ch × kernel_size² + - (ky × kernel_size + kx) -``` - -Where: -- `output_ch` ∈ [0, out_channels) -- `input_ch` ∈ [0, in_channels) -- `ky`, `kx` ∈ [0, kernel_size) - -**Unpacking f16 from u32:** -```c -uint32_t packed = weights_buffer[weight_idx / 2]; -uint16_t f16_bits = (weight_idx % 2 == 0) ? 
(packed & 0xFFFF) : (packed >> 16); -``` - ---- - -## Example: 3-Layer Network (Version 2) - -**Configuration:** -- Mip level: 0 (original resolution) -- Layer 0: 12→4, kernel 3×3 (432 weights) -- Layer 1: 12→4, kernel 3×3 (432 weights) -- Layer 2: 12→4, kernel 3×3 (432 weights) - -**File Layout:** -``` -Offset Size Content ------- ---- ------- -0x00 20 Header (magic, version=2, layers=3, weights=1296, mip_level=0) -0x14 20 Layer 0 info (kernel=3, in=12, out=4, offset=0, count=432) -0x28 20 Layer 1 info (kernel=3, in=12, out=4, offset=432, count=432) -0x3C 20 Layer 2 info (kernel=3, in=12, out=4, offset=864, count=432) -0x50 2592 Weight data (1296 u32 packed f16 pairs) - ---- -Total: 2672 bytes (~2.6 KB) -``` - ---- - -## Static Features - -Not stored in .bin file (computed at runtime): - -**8D Input Features:** -1. **p0** - Parametric feature 0 (from mip level) -2. **p1** - Parametric feature 1 (from mip level) -3. **p2** - Parametric feature 2 (from mip level) -4. **p3** - Parametric feature 3 (depth or from mip level) -5. **UV_X** - Normalized x coordinate [0,1] -6. **UV_Y** - Normalized y coordinate [0,1] -7. **sin(20 × UV_Y)** - Spatial frequency encoding (vertical, frequency=20) -8. **1.0** - Bias term - -**Mip Level Usage (p0-p3):** -- `mip_level=0`: RGB from original resolution (mip 0) -- `mip_level=1`: RGB from half resolution (mip 1), upsampled -- `mip_level=2`: RGB from quarter resolution (mip 2), upsampled -- `mip_level=3`: RGB from eighth resolution (mip 3), upsampled - -**Layer 0** receives input RGBD (4D) + static features (8D) = 12D input → 4D output. -**Layer 1+** receive previous layer output (4D) + static features (8D) = 12D input → 4D output. - ---- - -## Validation - -**Magic Check:** -```c -uint32_t magic; -fread(&magic, 4, 1, fp); -if (magic != 0x32_4E_4E_43) { error("Invalid CNN v2 file"); } -``` - -**Version Check:** -```c -uint32_t version; -fread(&version, 4, 1, fp); -if (version != 1 && version != 2) { error("Unsupported version"); } -uint32_t header_size = (version == 1) ? 16 : 20; -``` - -**Size Check:** -```c -expected_size = header_size + (num_layers × 20) + (total_weights × 2); -if (file_size != expected_size) { error("Size mismatch"); } -``` - -**Weight Offset Sanity:** -```c -// Each layer's offset should match cumulative count -uint32_t cumulative = 0; -for (int i = 0; i < num_layers; i++) { - if (layers[i].weight_offset != cumulative) { error("Invalid offset"); } - cumulative += layers[i].weight_count; -} -if (cumulative != total_weights) { error("Total mismatch"); } -``` - ---- - -## Future Extensions - -**TODO: Flexible Feature Layout** - -Current limitation: Feature vector layout is hardcoded as `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]`. - -Proposed enhancement for version 3: -- Add feature descriptor section to header -- Specify feature count, types, and ordering -- Support arbitrary 7D feature combinations (e.g., `[R, G, B, dx, dy, uv_x, bias]`) -- Allow runtime shader generation based on descriptor -- Enable experimentation without recompiling shaders - -Example descriptor format: -``` -struct FeatureDescriptor { - u32 feature_count; // Number of features (typically 7-8) - u32 feature_types[8]; // Type enum per feature - u32 feature_sources[8]; // Source enum (mip0, mip1, gradient, etc.) 
- u32 reserved[8]; // Future use -} -``` - -Benefits: -- Training can experiment with different feature combinations -- No shader recompilation needed -- Single binary format supports multiple architectures -- Easier A/B testing of feature effectiveness - ---- - -## Related Files - -- `training/export_cnn_v2_weights.py` - Binary export tool -- `src/effects/cnn_v2_effect.cc` - C++ loader -- `tools/cnn_v2_test/index.html` - WebGPU validator -- `doc/CNN_V2.md` - Architecture design diff --git a/doc/CNN_V2_DEBUG_TOOLS.md b/doc/CNN_V2_DEBUG_TOOLS.md deleted file mode 100644 index 8d1289a..0000000 --- a/doc/CNN_V2_DEBUG_TOOLS.md +++ /dev/null @@ -1,143 +0,0 @@ -# CNN v2 Debugging Tools - -Tools for investigating CNN v2 mismatch between HTML tool and cnn_test. - ---- - -## Identity Weight Generator - -**Purpose:** Generate trivial .bin files with identity passthrough for debugging. - -**Script:** `training/gen_identity_weights.py` - -**Usage:** -```bash -# 1×1 identity (default) -./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity.bin - -# 3×3 identity -./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3 - -# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc) -./training/gen_identity_weights.py output.bin --mix - -# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3 -./training/gen_identity_weights.py output.bin --p47 - -# Custom mip level -./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2 -``` - -**Output:** -- Single layer, 12D→4D (4 input channels + 8 static features) -- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3} -- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow) -- Static mode (--p47): Output Ch{i} = Input Ch{i+4} (static features only, visualizes p4-p7) -- Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3) - -**Validation:** -Load in HTML tool or cnn_test - output should match input (RGB only, ignoring static features). - ---- - -## Composited Layer Visualization - -**Purpose:** Save current layer view as single composited image (4 channels side-by-side, grayscale). - -**Location:** HTML tool - "Layer Visualization" panel - -**Usage:** -1. Load image + weights in HTML tool -2. Select layer to visualize (Static 0-3, Static 4-7, Layer 0, Layer 1, etc.) -3. Click "Save Composited" button -4. Downloads PNG: `composited_layer{N}_{W}x{H}.png` - -**Output:** -- 4 channels stacked horizontally -- Grayscale representation -- Useful for comparing layer activations across tools - ---- - -## Debugging Strategy - -### Track a) Binary Conversion Chain - -**Hypothesis:** Conversion error in .bin ↔ base64 ↔ Float32Array - -**Test:** -1. Generate identity weights: - ```bash - ./training/gen_identity_weights.py workspaces/main/weights/test_identity.bin - ``` - -2. Load in HTML tool - output should match input RGB - -3. If mismatch: - - Check Python export: f16 packing in `export_cnn_v2_weights.py` line 105 - - Check HTML parsing: `unpackF16()` in `index.html` line 805-815 - - Check weight indexing: `get_weight()` shader function - -**Key locations:** -- Python: `np.float16` → `view(np.uint32)` (line 105 of export script) -- JS: `DataView` → `unpackF16()` → manual f16 decode (line 773-803) -- WGSL: `unpack2x16float()` built-in (line 492 of shader) - -### Track b) Layer Visualization - -**Purpose:** Confirm layer outputs match between HTML and C++ - -**Method:** -1. Run identical input through both tools -2. Save composited layers from HTML tool -3. 
Compare with cnn_test output -4. Use identity weights to isolate weight loading from computation - -### Track c) Trivial Test Case - -**Use identity weights to test:** -- Weight loading (binary parsing) -- Feature generation (static features) -- Convolution (should be passthrough) -- Output packing - -**Expected behavior:** -- Input RGB → Output RGB (exact match) -- Static features ignored (all zeros in identity matrix) - ---- - -## Known Issues - -### ~~Layer 0 Visualization Scale~~ [FIXED] - -**Issue:** Layer 0 output displayed at 0.5× brightness (divided by 2). - -**Cause:** Line 1530 used `vizScale = 0.5` for all CNN layers, but Layer 0 is clamped [0,1] and doesn't need dimming. - -**Fix:** Use scale 1.0 for Layer 0 output (layerIdx=1), 0.5 only for middle layers (ReLU, unbounded). - -### Remaining Mismatch - -**Current:** HTML tool and cnn_test produce different outputs for same input/weights. - -**Suspects:** -1. F16 unpacking difference (CPU vs GPU vs JS) -2. Static feature generation (RGBD, UV, sin encoding) -3. Convolution kernel iteration order -4. Output packing/unpacking - -**Next steps:** -1. Test with identity weights (eliminates weight loading) -2. Compare composited layer outputs -3. Add debug visualization for static features -4. Hex dump comparison (first 8 pixels) - use `--debug-hex` flag in cnn_test - ---- - -## Related Documentation - -- `doc/CNN_V2.md` - CNN v2 architecture -- `doc/CNN_V2_WEB_TOOL.md` - HTML tool documentation -- `doc/CNN_TEST_TOOL.md` - cnn_test CLI tool -- `training/export_cnn_v2_weights.py` - Binary export format diff --git a/doc/CNN_V2_WEB_TOOL.md b/doc/CNN_V2_WEB_TOOL.md deleted file mode 100644 index b6f5b0b..0000000 --- a/doc/CNN_V2_WEB_TOOL.md +++ /dev/null @@ -1,348 +0,0 @@ -# CNN v2 Web Testing Tool - -Browser-based WebGPU tool for validating CNN v2 inference with layer visualization and weight inspection. 
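
The tool's weight parsing and the hex-dump comparisons suggested above both hinge on decoding packed f16 pairs consistently. A NumPy reference decoder — a sketch, assuming the `u32 = (f16_hi << 16) | f16_lo` packing from the binary format spec — is handy when cross-checking the browser's `unpackF16()` and the WGSL `unpack2x16float()` path against the Python export:

```python
import numpy as np

def unpack_f16_pair(word):
    """Decode one packed u32 into its (low, high) float16 halves."""
    halves = np.array([word & 0xFFFF, (word >> 16) & 0xFFFF], dtype=np.uint16)
    lo, hi = halves.view(np.float16)
    return float(lo), float(hi)
```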
- -**Location:** `tools/cnn_v2_test/index.html` - ---- - -## Status (2026-02-13) - -**Working:** -- ✅ WebGPU initialization and device setup -- ✅ Binary weight file parsing (v1 and v2 formats) -- ✅ Automatic mip-level detection from binary format v2 -- ✅ Weight statistics (min/max per layer) -- ✅ UI layout with collapsible panels -- ✅ Mode switching (Activations/Weights tabs) -- ✅ Canvas context management (2D for weights, WebGPU for activations) -- ✅ Weight visualization infrastructure (layer selection, grid layout) -- ✅ Layer naming matches codebase convention (Layer 0, Layer 1, Layer 2) -- ✅ Static features split visualization (Static 0-3, Static 4-7) -- ✅ All layers visible including output layer (Layer 2) -- ✅ Video playback support (MP4, WebM) with frame-by-frame controls -- ✅ Video looping (automatic continuous playback) -- ✅ Mip level selection (p0-p3 features at different resolutions) - -**Recent Changes (Latest):** -- Binary format v2 support: Reads mip_level from 20-byte header -- Backward compatible: v1 (16-byte header) → mip_level=0 -- Auto-update UI dropdown when loading weights with mip_level -- Display mip_level in metadata panel -- Code refactoring: Extracted FULLSCREEN_QUAD_VS shader (reused 3× across pipelines) -- Added helper methods: `getDimensions()`, `setVideoControlsEnabled()` -- Improved code organization with section headers and comments -- Moved Mip Level selector to bottom of left sidebar (removed "Features (p0-p3)" label) -- Added `loop` attribute to video element for automatic continuous playback - -**Previous Fixes:** -- Fixed Layer 2 not appearing (was excluded from layerOutputs due to isOutput check) -- Fixed canvas context switching (force clear before recreation) -- Added Static 0-3 / Static 4-7 buttons to view all 8 static feature channels -- Aligned naming with train_cnn_v2.py/.wgsl: Layer 0, Layer 1, Layer 2 (not Layer 1, 2, 3) -- Disabled Static buttons in weights mode (no learnable weights) - -**Known Issues:** -- Layer activation visualization may show black if texture data not properly unpacked -- Weight kernel display depends on correct 2D context creation after canvas recreation - ---- - -## Architecture - -### File Structure -- Single-file HTML tool (~1100 lines) -- Embedded shaders: STATIC_SHADER, CNN_SHADER, DISPLAY_SHADER, LAYER_VIZ_SHADER -- Shared WGSL component: FULLSCREEN_QUAD_VS (reused across render pipelines) -- **Embedded default weights:** DEFAULT_WEIGHTS_B64 (base64-encoded binary v2) - - Current: 4 layers (3×3, 5×5, 3×3, 3×3), 2496 f16 weights, mip_level=2 - - Source: `workspaces/main/weights/cnn_v2_weights.bin` - - Updates: Re-encode binary with `base64 -i <file>` and update constant -- Pure WebGPU (no external dependencies) - -### Code Organization - -**Recent Refactoring (2026-02-13):** -- Extracted `FULLSCREEN_QUAD_VS` constant: Reused fullscreen quad vertex shader (2 triangles covering NDC) -- Added helper methods to CNNTester class: - - `getDimensions()`: Returns current source dimensions (video or image) - - `setVideoControlsEnabled(enabled)`: Centralized video control enable/disable -- Consolidated duplicate vertex shader code (used in mipmap generation, display, layer visualization) -- Added section headers in JavaScript for better navigation -- Improved inline comments explaining shader architecture - -**Benefits:** -- Reduced code duplication (~40 lines saved) -- Easier maintenance (single source of truth for fullscreen quad) -- Clearer separation of concerns - -### Key Components - -**1. 
Weight Parsing** -- Reads binary format v2: header (20B) + layer info (20B×N) + f16 weights -- Backward compatible with v1: header (16B), mip_level defaults to 0 -- Computes min/max per layer via f16 unpacking -- Stores `{ layers[], weights[], mipLevel, fileSize }` -- Auto-sets UI mip-level dropdown from loaded weights - -**2. CNN Pipeline** -- Static features computation (RGBD + UV + sin + bias → 7D packed) -- Layer-by-layer convolution with storage buffer weights -- Ping-pong buffers for intermediate results -- Copy to persistent textures for visualization - -**3. Visualization Modes** - -**Activations Mode:** -- 4 grayscale views per layer (channels 0-3 of up to 8 total) -- WebGPU compute → unpack f16 → scale → grayscale -- Auto-scale: Static features = 1.0, CNN layers = 0.2 -- Static features: Shows R,G,B,D (first 4 of 8: RGBD+UV+sin+bias) -- CNN layers: Shows first 4 output channels - -**Weights Mode:** -- 2D canvas rendering per output channel -- Shows all input kernels horizontally -- Normalized by layer min/max → [0, 1] → grayscale -- 20px cells, 2px padding between kernels - -### Texture Management - -**Persistent Storage (layerTextures[]):** -- One texture per layer output (static + all CNN layers) -- `rgba32uint` format (packed f16 data) -- `COPY_DST` usage for storing results - -**Compute Buffers (computeTextures[]):** -- 2 textures for ping-pong computation -- Reused across all layers -- `COPY_SRC` usage for copying to persistent storage - -**Pipeline:** -``` -Static pass → copy to layerTextures[0] -For each CNN layer i: - Compute (ping-pong) → copy to layerTextures[i+1] -``` - -### Layer Indexing - -**UI Layer Buttons:** -- "Static" → layerOutputs[0] (7D input features) -- "Layer 1" → layerOutputs[1] (CNN layer 1 output, uses weights.layers[0]) -- "Layer 2" → layerOutputs[2] (CNN layer 2 output, uses weights.layers[1]) -- "Layer N" → layerOutputs[N] (CNN layer N output, uses weights.layers[N-1]) - -**Weights Table:** -- "Layer 1" → weights.layers[0] (first CNN layer weights) -- "Layer 2" → weights.layers[1] (second CNN layer weights) -- "Layer N" → weights.layers[N-1] - -**Consistency:** Both UI and weights table use same numbering (1, 2, 3...) for CNN layers. - ---- - -## Known Issues - -### Issue #1: Layer Activations Show Black - -**Symptom:** -- All 4 channel canvases render black -- UV gradient test (debug mode 10) works -- Raw packed data test (mode 11) shows black -- Unpacked f16 test (mode 12) shows black - -**Diagnosis:** -- Texture access works (UV gradient visible) -- Texture data is all zeros (packed.x = 0) -- Textures being read are empty - -**Root Cause:** -- `copyTextureToTexture` operations may not be executing -- Possible ordering issue (copies not submitted before visualization) -- Alternative: textures created with wrong usage flags - -**Investigation Steps Taken:** -1. Added `onSubmittedWorkDone()` wait before visualization -2. Verified texture creation with `COPY_SRC` and `COPY_DST` flags -3. Confirmed separate texture allocation per layer (no aliasing) -4. Added debug shader modes to isolate issue - -**Next Steps:** -- Verify encoder contains copy commands (add debug logging) -- Check if compute passes actually write data (add known-value test) -- Test copyTextureToTexture in isolation -- Consider CPU readback to verify texture contents - -### Issue #2: Weight Visualization Empty - -**Symptom:** -- Canvases created with correct dimensions (logged) -- No visual output (black canvases) -- Console logs show method execution - -**Potential Causes:** -1. 
Weight indexing calculation incorrect -2. Canvas not properly attached to DOM when rendering -3. 2D context operations not flushing -4. Min/max normalization producing black (all values equal?) - -**Debug Added:** -- Comprehensive logging of dimensions, indices, ranges -- Canvas context check before rendering - -**Next Steps:** -- Add test rendering (fixed gradient) to verify 2D context works -- Log sample weight values to verify data access -- Check if canvas is visible in DOM inspector -- Verify min/max calculation produces valid range - ---- - -## UI Layout - -### Header -- Controls: Blend slider, Depth input, View mode display -- Drop zone for .bin weight files - -### Content Area - -**Left Sidebar (300px):** -1. Drop zone for .bin weight files -2. Weights Info panel (file size, layer table with min/max) -3. Weights Visualization panel (per-layer kernel display) -4. **Mip Level selector** (bottom) - Select p0/p1/p2 for static features - -**Main Canvas (center):** -- CNN output display with video controls (Play/Pause, Frame ◄/►) -- Supports both PNG images and video files (MP4, WebM) -- Video loops automatically for continuous playback - -**Right Sidebar (panels):** -1. **Layer Visualization Panel** (top, flex: 1) - - Layer selection buttons (Static 0-3, Static 4-7, Layer 0, Layer 1, ...) - - 2×2 grid of channel views (grayscale activations) - - 4× zoom view at bottom - -### Footer -- Status line (GPU timing, dimensions, mode) -- Console log (scrollable, color-coded) - ---- - -## Shader Details - -### LAYER_VIZ_SHADER - -**Purpose:** Display single channel from packed layer texture - -**Inputs:** -- `@binding(0) layer_tex: texture_2d<u32>` - Packed f16 layer data -- `@binding(1) viz_params: vec2<f32>` - (channel_idx, scale) - -**Debug Modes:** -- Channel 10: UV gradient (texture coordinate test) -- Channel 11: Raw packed u32 data -- Channel 12: First unpacked f16 value - -**Normal Operation:** -- Unpack all 8 f16 channels from rgba32uint -- Select channel by index (0-7) -- Apply scale factor (1.0 for static, 0.2 for CNN) -- Clamp to [0, 1] and output grayscale - -**Scale Rationale:** -- Static features (RGBD, UV): already in [0, 1] range -- CNN activations: post-ReLU [0, ~5], need scaling for visibility - ---- - -## Binary Weight Format - -See `doc/CNN_V2_BINARY_FORMAT.md` for complete specification. - -**Quick Summary:** -- Header: 16 bytes (magic, version, layer count, total weights) -- Layer info: 20 bytes × N (kernel size, channels, offsets) -- Weights: Packed f16 pairs as u32 - ---- - -## Testing Workflow - -### Load & Parse -1. Drop PNG image → displays original -2. Drop .bin weights → parses and shows info table -3. Auto-runs CNN pipeline - -### Verify Pipeline -1. Check console for "Running CNN pipeline" -2. Verify "Completed in Xms" -3. Check "Layer visualization ready: N layers" - -### Debug Activations -1. Select "Activations" tab -2. Click layer buttons to switch -3. Check console for texture/canvas logs -4. If black: note which debug modes work (UV vs data) - -### Debug Weights -1. Select "Weights" tab -2. Click Layer 1 or Layer 2 (Layer 0 has no weights) -3. Check console for "Visualizing Layer N weights" -4. Check canvas dimensions logged -5. 
Verify weight range is non-trivial (not [0, 0]) - ---- - -## Integration with Main Project - -**Training Pipeline:** -```bash -# Generate weights -./training/train_cnn_v2.py --export-binary - -# Test in browser -open tools/cnn_v2_test/index.html -# Drop: workspaces/main/cnn_v2_weights.bin -# Drop: training/input/test.png -``` - -**Validation:** -- Compare against demo CNNv2Effect (visual check) -- Verify layer count matches binary file -- Check weight ranges match training logs - ---- - -## Future Enhancements - -- [ ] Fix layer activation visualization (black texture issue) -- [ ] Fix weight kernel display (empty canvas issue) -- [ ] Add per-channel auto-scaling (compute min/max from visible data) -- [ ] Export rendered outputs (download PNG) -- [ ] Side-by-side comparison with original -- [ ] Heatmap mode (color-coded activations) -- [ ] Weight statistics overlay (mean, std, sparsity) -- [ ] Batch processing (multiple images in sequence) -- [ ] Integration with Python training (live reload) - ---- - -## Code Metrics - -- Total lines: ~1100 -- JavaScript: ~700 lines -- WGSL shaders: ~300 lines -- HTML/CSS: ~100 lines - -**Dependencies:** None (pure WebGPU + HTML5) - ---- - -## Related Files - -- `doc/CNN_V2.md` - CNN v2 architecture and design -- `doc/CNN_TEST_TOOL.md` - C++ offline testing tool (deprecated) -- `training/train_cnn_v2.py` - Training script with binary export -- `workspaces/main/cnn_v2_weights.bin` - Trained weights diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md index 55fac50..8d30cca 100644 --- a/doc/COMPLETED.md +++ b/doc/COMPLETED.md @@ -67,7 +67,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. - **Changes**: - Added `get_common_uniforms()` helper to Effect base class - Refactored all render()/compute() signatures from 5 parameters to single `CommonPostProcessUniforms&` - - Fixed uninitialized uniforms in CircleMaskEffect and CNNEffect + - Fixed uninitialized uniforms in CircleMaskEffect and CNNv1Effect - Updated 19 effect implementations + headers - Fixed WGSL syntax error in FlashEffect (u.audio_intensity → audio_intensity) - **Impact**: @@ -93,7 +93,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. - All 36 tests pass (100%) - Processes 64×64 test image successfully - Ready for ground-truth validation vs Python training script - - Documented in `doc/CNN_TEST_TOOL.md` + - Documented in `cnn_v1/docs/CNN_TEST_TOOL.md` ## Recently Completed (February 10, 2026) @@ -103,7 +103,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. 
- Created `BindGroupLayoutBuilder` and `BindGroupBuilder` for declarative bind group creation - Created `RenderPipelineBuilder` to simplify pipeline setup with ShaderComposer integration - Created `SamplerCache` singleton to deduplicate sampler instances - - Refactored `post_process_helper.cc`, `cnn_effect.cc`, `rotating_cube_effect.cc` + - Refactored `post_process_helper.cc`, `cnn_v1_effect.cc`, `rotating_cube_effect.cc` - **Result**: - Bind group creation: 19 instances reduced from 14→4 lines each - Pipeline creation: 30-50 lines reduced to 8 lines diff --git a/doc/HEADLESS_MODE.md b/doc/HEADLESS_MODE.md index f139317..85abbaf 100644 --- a/doc/HEADLESS_MODE.md +++ b/doc/HEADLESS_MODE.md @@ -17,10 +17,18 @@ cmake --build build_headless -j4 # Custom duration ./build_headless/demo64k --headless --duration 60 -# Audio validation +# Audio validation (full demo or 60s) ./build_headless/demo64k --dump-wav test.wav + +# Render specific time range +./build_headless/demo64k --dump-wav test.wav --dump-wav-start 10 --dump-wav-duration 5 ``` +**WAV Dump Options:** +- `--dump-wav [FILE]` - Output filename (default: audio_dump.wav) +- `--dump-wav-start TIME` - Start at time (seeks first, default: 0) +- `--dump-wav-duration TIME` - Duration limit (default: demo length or 60s) + Test script: `./scripts/test_headless.sh` ## vs STRIP_EXTERNAL_LIBS diff --git a/doc/HOWTO.md b/doc/HOWTO.md index 506bf0a..4cafaa2 100644 --- a/doc/HOWTO.md +++ b/doc/HOWTO.md @@ -25,7 +25,15 @@ cmake -S . -B build cmake --build build -j4 ./build/demo64k ``` -Options: `--fullscreen`, `--resolution WxH`, `--seek TIME`, `--hot-reload` + +**CLI Options:** +- `--fullscreen` - Fullscreen mode +- `--resolution WxH` - Window resolution (e.g., 1920x1080) +- `--seek TIME` - Start at time (seconds) +- `--hot-reload` - Watch config files for changes +- `--dump-wav [FILE]` - Render audio to WAV file +- `--dump-wav-start TIME` - Start WAV dump at time (seeks first) +- `--dump-wav-duration TIME` - Limit WAV dump duration ### Production Builds ```bash @@ -92,7 +100,7 @@ make run_util_tests # Utility tests Extracts patches at salient points, trains on center pixels only (matches WGSL sliding window): ```bash # Train with 32×32 patches at detected corners/edges -./training/train_cnn.py \ +./cnn_v1/training/train_cnn.py \ --input training/input/ --target training/output/ \ --patch-size 32 --patches-per-image 64 --detector harris \ --layers 3 --kernel_sizes 3,5,3 --epochs 5000 --batch_size 16 \ @@ -109,7 +117,7 @@ Extracts patches at salient points, trains on center pixels only (matches WGSL s ### Full-Image Processes entire image with sliding window (matches WGSL): ```bash -./training/train_cnn.py \ +./cnn_v1/training/train_cnn.py \ --input training/input/ --target training/output/ \ --layers 3 --kernel_sizes 3,5,3 --epochs 10000 --batch_size 8 \ --checkpoint-every 1000 @@ -118,10 +126,10 @@ Processes entire image with sliding window (matches WGSL): ### Export & Validation ```bash # Generate shaders from checkpoint -./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth +./cnn_v1/training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth # Generate ground truth (sliding window, no tiling) -./training/train_cnn.py --infer input.png \ +./cnn_v1/training/train_cnn.py --infer input.png \ --export-only checkpoints/checkpoint_epoch_5000.pth \ --output ground_truth.png ``` @@ -137,31 +145,31 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Complete Pipeline** (recommended): ```bash 
# Train → Export → Build → Validate (default config) -./scripts/train_cnn_v2_full.sh +./cnn_v2/scripts/train_cnn_v2_full.sh # Rapid debug (1 layer, 3×3, 5 epochs) -./scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin +./cnn_v2/scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin # Custom training parameters -./scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100 +./cnn_v2/scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100 # Custom architecture -./scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1 +./cnn_v2/scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1 # Custom output path -./scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin +./cnn_v2/scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin # Grayscale loss (compute loss on luminance instead of RGBA) -./scripts/train_cnn_v2_full.sh --grayscale-loss +./cnn_v2/scripts/train_cnn_v2_full.sh --grayscale-loss # Custom directories -./scripts/train_cnn_v2_full.sh --input training/input --target training/target_2 +./cnn_v2/scripts/train_cnn_v2_full.sh --input training/input --target training/target_2 # Full-image mode (instead of patch-based) -./scripts/train_cnn_v2_full.sh --full-image --image-size 256 +./cnn_v2/scripts/train_cnn_v2_full.sh --full-image --image-size 256 # See all options -./scripts/train_cnn_v2_full.sh --help +./cnn_v2/scripts/train_cnn_v2_full.sh --help ``` **Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector). @@ -176,33 +184,33 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Validation Only** (skip training): ```bash # Use latest checkpoint -./scripts/train_cnn_v2_full.sh --validate +./cnn_v2/scripts/train_cnn_v2_full.sh --validate # Use specific checkpoint -./scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth +./cnn_v2/scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth ``` **Manual Training:** ```bash # Default config -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --epochs 100 --batch-size 16 --checkpoint-every 5 # Custom architecture (per-layer kernel sizes) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --kernel-sizes 1,3,5 \ --epochs 5000 --batch-size 16 # Mip-level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --mip-level 1 \ --epochs 100 --batch-size 16 # Grayscale loss (compute loss on luminance Y = 0.299*R + 0.587*G + 0.114*B) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --grayscale-loss \ --epochs 100 --batch-size 16 @@ -228,7 +236,7 @@ Use `--quiet` for streamlined output in scripts (used automatically by train_cnn ``` -**Validation:** Use HTML tool (`tools/cnn_v2_test/index.html`) for CNN v2 validation. See `doc/CNN_V2_WEB_TOOL.md`. +**Validation:** Use HTML tool (`cnn_v2/tools/cnn_v2_test/index.html`) for CNN v2 validation. See `cnn_v2/docs/CNN_V2_WEB_TOOL.md`. 
--- @@ -252,6 +260,21 @@ Features: Drag/drop, beat-based editing, audio playback, waveform visualization, ## Audio +### Rendering Audio to WAV + +```bash +# Render full demo +./build/demo64k --dump-wav output.wav + +# Render specific time range +./build/demo64k --dump-wav output.wav --dump-wav-start 10 --dump-wav-duration 5 + +# Render first 30 seconds +./build/demo64k --dump-wav output.wav --dump-wav-duration 30 +``` + +### API Usage + ```cpp #include "audio/audio_engine.h" @@ -262,6 +285,7 @@ g_audio_engine.update(music_time); g_audio_engine.shutdown(); audio_shutdown(); ``` + See `doc/TRACKER.md` for music system. --- @@ -299,11 +323,11 @@ See `doc/ASSET_SYSTEM.md` and `doc/WORKSPACE_SYSTEM.md`. **Status:** - **CNN v2:** ✅ Fully functional, matches CNNv2Effect -- **CNN v1:** ⚠️ Produces incorrect output, use CNNEffect in demo for validation +- **CNN v1:** ⚠️ Produces incorrect output, use CNNv1Effect in demo for validation **Note:** `--weights` loads layer count and kernel sizes from the binary file, overriding `--layers` and forcing CNN v2. -See `doc/CNN_TEST_TOOL.md` for full documentation. +See `cnn_v1/docs/CNN_TEST_TOOL.md` for full documentation. --- |
