path: root/doc/CNN_TEST_TOOL.md
Diffstat (limited to 'doc/CNN_TEST_TOOL.md')
-rw-r--r--  doc/CNN_TEST_TOOL.md  174
1 file changed, 75 insertions(+), 99 deletions(-)
diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md
index e7d679e..ee0d9c5 100644
--- a/doc/CNN_TEST_TOOL.md
+++ b/doc/CNN_TEST_TOOL.md
@@ -1,31 +1,37 @@
# CNN Shader Testing Tool
-Standalone tool for validating trained CNN shaders with GPU-to-CPU readback.
+Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).
---
## Purpose
-- Validate trained weights (`cnn_weights_generated.wgsl`) against ground truth
+- Validate trained weights against ground truth
- Debug CNN layer behavior in isolation
-- Generate test outputs for patch-based training workflow
-- Match Python training script's inference mode (`train_cnn.py --infer`)
+- Generate test outputs for training workflow
+- Match Python training script's inference mode
---
## Architecture
-**Two-part implementation:**
+**Two implementations:**
-1. **Core GPU utility:** `src/gpu/texture_readback.{h,cc}` (~150 lines)
- - Synchronous texture-to-CPU readback
- - Reusable for screenshots, validation, video export
- - Protected with STRIP_ALL (0 bytes in release builds)
+1. **CNN v1** (render pipeline, texture atlas weights)
+ - 3 fixed layers
+ - RGBA16Float intermediates
+ - BGRA8Unorm final output
-2. **Standalone tool:** `tools/cnn_test.cc` (~450 lines)
- - Custom CNN inference pipeline
- - No MainSequence dependency
- - Asset-based shader loading with automatic include resolution
+2. **CNN v2** (compute shaders, storage buffer weights)
+ - Dynamic layer count from binary
+ - 7D static features (RGBD + UV + sin + bias)
+ - RGBA32Uint packed f16 intermediates
+ - Storage buffer: ~3-5 KB weights
+
+**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
+- Synchronous texture-to-CPU readback
+- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
+- Protected with STRIP_ALL (0 bytes in release)
---
@@ -35,24 +41,28 @@ Standalone tool for validating trained CNN shaders with GPU-to-CPU readback.
cnn_test input.png output.png [OPTIONS]
OPTIONS:
- --blend F Final blend amount (0.0-1.0, default: 1.0)
- --format ppm|png Output format (default: png)
- --help Show usage
+ --cnn-version N CNN version: 1 (default) or 2
+ --blend F Final blend amount (0.0-1.0, default: 1.0)
+ --format ppm|png Output format (default: png)
+ --layers N Number of CNN layers (1-10, v1 only, default: 3)
+ --save-intermediates DIR Save intermediate layers to directory
+ --debug-hex Print first 8 pixels as hex (debug)
+ --help Show usage
```
**Examples:**
```bash
-# Full CNN processing
-./build/cnn_test input.png output.png
+# CNN v1 (render pipeline, 3 layers)
+./build/cnn_test input.png output.png --cnn-version 1
-# 50% blend with original
-./build/cnn_test input.png output.png --blend 0.5
+# CNN v2 (compute, storage buffer, dynamic layers)
+./build/cnn_test input.png output.png --cnn-version 2
-# No CNN effect (original passthrough)
-./build/cnn_test input.png output.png --blend 0.0
+# 50% blend with original (v2)
+./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5
-# PPM output format
-./build/cnn_test input.png output.ppm --format ppm
+# Debug hex dump
+./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
```
---
@@ -90,25 +100,24 @@ std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
}
```
-### CNN Processing Pipeline
+### CNN v1 Pipeline (Render)
-**Fixed 3-layer architecture** (matches trained CNN):
-1. Layer 0: Initial convolution
-2. Layer 1: Intermediate convolution
-3. Layer 2: Final convolution + blend with original
+**Fixed 3-layer architecture:**
+- Ping-pong RGBA16Float textures
+- CNNLayerParams (binding 3): layer_index, blend_amount
+- Shader composer resolves #include directives
-**Ping-pong textures:**
-- 2 intermediate render targets
-- 1 original input reference (binding 4)
+### CNN v2 Pipeline (Compute)
-**Uniforms:**
-- `CommonPostProcessUniforms` (binding 2): resolution, aspect_ratio, time, beat, audio_intensity
-- `CNNLayerParams` (binding 3): layer_index, blend_amount
+**Dynamic layer architecture:**
+1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
+2. **Layer computes:** N layers from binary weights (3-5 typically)
+ - Storage buffer weights (read-only)
+ - RGBA32Uint packed f16 textures (ping-pong)
+ - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
+3. **Readback:** RGBA32Uint → f16 decode → u8 clamp
-**Shader composition:**
-- Uses `ShaderComposer::Get()` via `RenderPipelineBuilder`
-- Automatically resolves `#include` directives
-- Registers CNN snippets: activation, conv3×3, conv5×5, weights
+**Binary format:** Header (20B) + layer info (20B×N) + f16 weights
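The binary layout above (20-byte header, 20 bytes of info per layer, then f16 weights) could be parsed with a sketch like the following. The individual field names (`magic`, `version`, and the per-layer fields mirroring `CNNv2LayerParams`) are assumptions for illustration; only the section sizes come from this doc:

```python
import struct

def parse_cnn_v2_weights(blob):
    """Parse the assumed layout: 20 B header + 20 B per layer + packed f16 weights.

    Field names are hypothetical; only the 20-byte sizes come from the doc.
    """
    # Header: assumed as five little-endian u32 fields (20 bytes total)
    magic, version, layer_count, total_weights, _reserved = struct.unpack_from("<5I", blob, 0)
    layers = []
    off = 20
    for _ in range(layer_count):
        # Per-layer info: assumed kernel_size/channels/weight_offset (u32),
        # blend (f32), plus one reserved u32 to fill 20 bytes
        kernel_size, channels, weight_offset, blend, _pad = struct.unpack_from("<3IfI", blob, off)
        layers.append({
            "kernel_size": kernel_size,
            "channels": channels,
            "weight_offset": weight_offset,
            "blend": blend,
        })
        off += 20
    # Everything after the tables is raw little-endian f16 data (2 bytes each)
    weight_bytes = blob[off:]
    return layers, weight_bytes

# Exercise the parser on a synthetic blob: 1 layer, 4 f16 weights
header = struct.pack("<5I", 0x32764E43, 2, 1, 4, 0)
layer = struct.pack("<3IfI", 3, 8, 0, 1.0, 0)
weights = struct.pack("<4H", 0x3C00, 0x0000, 0xC000, 0x3800)
layers, wb = parse_cnn_v2_weights(header + layer + weights)
print(len(layers), layers[0]["kernel_size"], len(wb))  # 1 3 8
```

A real loader would additionally validate the magic/version fields and bounds-check each `weight_offset` against the weight blob.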
---
@@ -144,51 +153,34 @@ cmake --build build -j4
---
-## Validation Workflow
+## Validation Workflow (CNN v2)
-### 1. Ground Truth Generation
+### 1. Train and Export
```bash
-# Generate ground truth from Python
-./training/train_cnn.py --infer test.png \
- --export-only training/checkpoints/checkpoint_epoch_5000.pth \
- --output ground_truth.png
+# Train and export weights
+./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
```
### 2. Tool Inference
```bash
-# Run tool (always 3 layers, matching trained CNN)
-./build/cnn_test test.png tool_output.png --blend 1.0
+# Run tool with v2
+./build/cnn_test training/input/img_000.png output.png --cnn-version 2
```
-### 3. Comparison
-```bash
-# Compare (MSE should be low)
-python -c "
-import numpy as np
-from PIL import Image
-gt = np.array(Image.open('ground_truth.png'))
-out = np.array(Image.open('tool_output.png'))
-mse = np.mean((gt.astype(float) - out.astype(float)) ** 2)
-print(f'MSE: {mse:.4f}')
-assert mse < 10.0, f'MSE too high: {mse}'
-"
-```
+### 3. Visual Comparison
+Compare `output.png` against the corresponding target, e.g. `training/target_X/img_000.png`.
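If a numeric check is wanted on top of eyeballing the two images, a small MSE comparison works. This is a sketch, not part of the tool; the file names and any acceptance threshold are illustrative:

```python
def mse(a, b):
    """Mean squared error over two equal-length flat pixel sequences."""
    assert len(a) == len(b), "images must have identical dimensions"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic 4-pixel grayscale data to show the arithmetic
print(mse([0, 64, 128, 255], [0, 64, 128, 255]))  # 0.0
print(mse([10, 10], [13, 14]))                    # (9 + 16) / 2 = 12.5
```

In practice, load both PNGs (e.g. with Pillow), flatten each to a byte sequence, and compare `mse(tool_output, target)` against a threshold chosen for your training run.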
---
-## Known Issues
+## Status
-**BUG: CNN produces incorrect output (all white)**
-- Readback works correctly (see Technical Notes below)
-- Shader compiles and executes without errors
-- Output is all white (255) regardless of input or blend setting
-- **Likely causes:**
- - Uniform buffer layout mismatch between C++ and WGSL
- - Texture binding issue (input not sampled correctly)
- - Weight matrix initialization problem
-- CNNEffect works correctly in demo (visual validation confirms)
-- **Status:** Under investigation - rendering pipeline differs from demo's CNNEffect
-- **Workaround:** Use CNNEffect visual validation in demo until tool fixed
+**CNN v1:** Builds and runs, but produces incorrect output (all white). Use CNNEffect in the demo for visual validation.
+
+**CNN v2:** ✅ Fully functional. Tested and working.
+- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
+- Matches CNNv2Effect architecture
+- Produces correct output
+- Recommended for validation
---
@@ -214,41 +206,25 @@ assert mse < 10.0, f'MSE too high: {mse}'
## Limitations
-- **Fixed layer count:** Cannot run partial networks (3 layers hardcoded)
+- **CNN v1:** Produces incorrect output, use for debugging only
- **Single image:** Batch processing requires shell loop
- **No real-time preview:** Offline processing only
-- **PNG input only:** Uses stb_image (JPEG/PNG/BMP/TGA supported)
-
----
-
-## Future Enhancements
-
-- Batch processing (directory input)
-- Interactive preview mode
-- Per-layer weight inspection
-- Checksum validation against training checkpoints
-- CUDA/Metal direct backends (bypass WebGPU overhead)
+- **Input formats:** decoded with stb_image (PNG/JPEG/BMP/TGA)
---
## Technical Notes
-**Number of layers is fixed by trained CNN architecture:**
-- Defined in `cnn_weights_generated.wgsl`
-- Cannot meaningfully run partial networks (layer outputs have different formats/ranges)
-- Tool always processes full 3-layer stack
-
-**Blend parameter:**
-- Applied only to final layer (layer 2)
-- Intermediate layers always use blend=1.0
-- `mix(input, cnn_output, blend_amount)` in shader
+**CNN v2 f16 decoding:**
+- RGBA32Uint texture stores 8×f16 as 4×u32
+- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
+- Handles denormals, infinity, NaN
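The decode path listed above can be sketched bit-for-bit in Python. This mirrors the described steps (extract each u16 half from a u32, decode IEEE 754 half to float, clamp [0,1] and quantize to u8) but is illustrative, not the tool's actual C++ decoder; the low-half-first packing order is an assumption:

```python
def f16_to_f32(bits):
    """Decode one IEEE 754 half-float from its 16-bit pattern."""
    sign = -1.0 if bits & 0x8000 else 1.0
    exp = (bits >> 10) & 0x1F
    mant = bits & 0x3FF
    if exp == 0:                      # zero and subnormals
        return sign * mant * 2.0 ** -24
    if exp == 0x1F:                   # infinity and NaN
        return float("nan") if mant else sign * float("inf")
    return sign * (1.0 + mant / 1024.0) * 2.0 ** (exp - 15)

def u32_to_two_f16(word):
    """One RGBA32Uint channel packs two f16 values (low half first, assumed)."""
    return f16_to_f32(word & 0xFFFF), f16_to_f32(word >> 16)

def to_u8(v):
    """Clamp to [0,1] and quantize to u8; NaN maps to 0."""
    if v != v:                        # NaN compares unequal to itself
        return 0
    return int(round(min(max(v, 0.0), 1.0) * 255.0))

lo, hi = u32_to_two_f16(0x38003C00)   # low half 0x3C00 = 1.0, high half 0x3800 = 0.5
print(to_u8(lo), to_u8(hi))           # 255 128
```

Each RGBA32Uint texel thus yields 8 f16 values (2 per channel × 4 channels), matching the 8-channel packed intermediates described above.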
**Cross-platform:**
-- Tested on macOS (native WebGPU)
-- Builds on Windows via mingw-w64 cross-compile
-- Linux support via native WebGPU
+- macOS, Linux (native WebGPU)
+- Windows (mingw-w64 cross-compile)
**Size impact:**
-- Debug/STRIP_ALL=OFF: ~150 lines compiled
-- STRIP_ALL=ON: 0 bytes (entirely compiled out)
-- FINAL_STRIP=ON: 0 bytes (tool not built)
+- Debug/STRIP_ALL=OFF: readback utility compiled in
+- STRIP_ALL=ON: 0 bytes (compiled out)
+- FINAL_STRIP=ON: tool not built