author     skal <pascal.massimino@gmail.com>   2026-02-15 18:52:48 +0100
committer  skal <pascal.massimino@gmail.com>   2026-02-15 18:52:48 +0100
commit     d4b67e2f6ab48ab9ec658140be4f1999f604559a (patch)
tree       2502b0dc89748f7cfe674d3c177bd1528ce1c231 /doc
parent     161a59fa50bb92e3664c389fa03b95aefe349b3f (diff)
archive(cnn): move CNN v1 to cnn_v1/ subdirectory
Consolidate CNN v1 (CNNEffect) into dedicated directory:
- C++ effect: src/effects → cnn_v1/src/
- Shaders: workspaces/main/shaders/cnn → cnn_v1/shaders/
- Training: training/train_cnn.py → cnn_v1/training/
- Docs: doc/CNN*.md → cnn_v1/docs/

Updated all references:
- CMake source list
- C++ includes (relative paths: ../../cnn_v1/src/)
- Asset paths (../../cnn_v1/shaders/)
- Documentation cross-references

CNN v1 remains active in timeline. For new work, use CNN v2 with enhanced features
(7D static, storage buffer, sigmoid activation).

Tests: 34/34 passing (100%)
Diffstat (limited to 'doc')
-rw-r--r--   doc/CNN.md                          79
-rw-r--r--   doc/CNN_BIAS_FIX_2026-02.md         85
-rw-r--r--   doc/CNN_DEBUG.md                    43
-rw-r--r--   doc/CNN_EFFECT.md                  400
-rw-r--r--   doc/CNN_FLATTEN_ANALYSIS.md        189
-rw-r--r--   doc/CNN_RGBD_GRAYSCALE_SUMMARY.md  136
-rw-r--r--   doc/CNN_TEST_TOOL.md               244
7 files changed, 0 insertions, 1176 deletions
diff --git a/doc/CNN.md b/doc/CNN.md
deleted file mode 100644
index 2dc3362..0000000
--- a/doc/CNN.md
+++ /dev/null
@@ -1,79 +0,0 @@
-# Convolutional Neural Net Shader (CNN) post-processing
-
-**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass)
-
-## Idea
-
-Process the rendered 3D scene with a multi-layer CNN trained offline.
-Input: the rendered scene.
-Output: a 'stylized' scene produced by CNN post-processing.
-
-**See `doc/CNN_EFFECT.md` for implementation details, usage, and API reference.**
-
-## Shader implementation
-
-### input / output
-
-One texture buffer is needed per CNN layer.
-Input: (r, g, b, 1/z) for layer 0 (the rendered 3D scene), or the output of layer N-1 for layer N.
-Output: (r, g, b, alpha). The 1/z information is not needed here (it can be fetched from the input).
-
-### size of one layer
-
-Notation:
-S: the number of input samples from layer N-1.
-Example: 3x3 input -> S = 3x3 = 9.
-
-Each of the S samples is 4 values (r, g, b, w = 1/z).
-
-Each sample is processed by a mat4 matrix: 4 inputs => 4 outputs.
-
-Weight matrix = S × mat4.
-
-Final bias: 4 values.
-
-WGSL code example: See file CNN.shader
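-
-A quick sketch of the parameter count this notation implies for one layer (hypothetical
-helper, not part of the repository):
-
-```python
-# One layer = S samples, one mat4 (16 floats) per sample, plus a 4-value bias.
-def layer_param_count(kernel_w, kernel_h):
-    S = kernel_w * kernel_h           # e.g. 3x3 -> S = 9
-    return S * 16 + 4                 # mat4 weights + bias
-
-print(layer_param_count(3, 3))        # 148 floats = 592 bytes as f32
-```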
-
-### Layers
-
-Do we need 3 or 4 layers?
-Several different shaders, one per layer.
-Ping-pong the input/output texture buffers between layers?
-
-## Implementation Status
-
-**Completed:**
-- ✅ Modular WGSL shader architecture (6 snippet files)
-- ✅ CNNEffect C++ class (single-layer rendering)
-- ✅ ShaderComposer integration (#include resolution)
-- ✅ Asset registration (7 new shader assets)
-- ✅ Test coverage (test_demo_effects.cc)
-- ✅ Placeholder identity weights for testing
-
-**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total**
-
-**Pending:**
-- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights
-- ⏳ Multi-layer rendering with ping-pong textures
-- ⏳ Weight quantization for size optimization
-
----
-
-## Training (To Be Implemented)
-
-The layer weight/bias data are hard-coded in the shaders.
-Training workflow:
-
-1. Prepare image pairs (before: raw render, after: target style)
-2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png`
-3. Script generates `cnn_weights_generated.wgsl`
-4. Rebuild: `cmake --build build -j4`
-
-**Reference:** File `CNN.py` contains training example (needs adaptation).
-
-We need a repository of reference image pairs (before/after) for training and validation.
-Each input image is randomly sampled into 3×3 patches of (r, g, b, 1/z) input samples
-and trained to match the (r, g, b, a) output (see the sketch below).
-
-Training generates the .wgsl code for layers' shaders.
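-
-A minimal sketch of the random patch sampling described above (hypothetical helper, not
-the actual train_cnn.py implementation):
-
-```python
-import numpy as np
-
-# Draw 3x3 (r,g,b,1/z) input patches and the matching (r,g,b,a) target pixels.
-def sample_patches(input_rgbd, target_rgba, count, rng=np.random.default_rng()):
-    h, w, _ = input_rgbd.shape
-    patches, targets = [], []
-    for _ in range(count):
-        y = int(rng.integers(1, h - 1))
-        x = int(rng.integers(1, w - 1))
-        patches.append(input_rgbd[y-1:y+2, x-1:x+2, :])   # 3x3x4 input patch
-        targets.append(target_rgba[y, x, :])              # output pixel to match
-    return np.stack(patches), np.stack(targets)
-```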
-
diff --git a/doc/CNN_BIAS_FIX_2026-02.md b/doc/CNN_BIAS_FIX_2026-02.md
deleted file mode 100644
index 26db8eb..0000000
--- a/doc/CNN_BIAS_FIX_2026-02.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# CNN Bias Accumulation Fix (2026-02-11)
-
-## Problem
-Bias was being added multiple times in shader convolution loops (once per kernel position), causing mismatch between PyTorch training and WGSL inference.
-
-## Root Cause
-**Location**: `training/train_cnn.py:381, 398`
-
-When exporting weights to WGSL, bias was replicated for every kernel position. The shader loops through positions doing:
-```wgsl
-sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0
-```
-
-For a 3×3 kernel (9 positions), the bias was added 9 times; for a 5×5 kernel, 25 times.
-
-## Fix
-Divide bias by `num_positions` during export:
-```python
-# Final layer (7→1)
-v1.append(f"{bias[0] / num_positions:.6f}")
-
-# Inner layers (7→4)
-v1.append(f"{bias[out_c] / num_positions:.6f}")
-```
-
-The shader then accumulates (bias / num_positions) × num_positions = the original bias, which is correct.
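-
-A quick numeric check of this equivalence (illustrative only):
-
-```python
-bias = 0.25
-num_positions = 9                        # 3x3 kernel
-exported = bias / num_positions          # value written at every kernel position
-accumulated = sum(exported for _ in range(num_positions))
-assert abs(accumulated - bias) < 1e-6    # shader accumulation == original bias
-```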
-
----
-
-## Additional Improvements
-
-### 1. RGBA Output Support
-**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input:
-```python
-alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy()
-output_rgba = np.concatenate([output, alpha], axis=2)
-Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA')
-```
-
-Intermediate layers also save RGBA if 4-channel.
-
-### 2. Debug Hex Output
-**Both tools** support `--debug-hex` to print first 8 pixels as hex:
-```bash
-./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex
-./build/cnn_test input.png output.png --debug-hex
-```
-
-Output format: `[0] 0xRRGGBBAA` for pixel-level comparison.
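-
-Illustrative only (byte order assumed RGBA): how a pixel maps onto the hex format above,
-useful when diffing the two tools' dumps by hand:
-
-```python
-def pixel_hex(index, r, g, b, a):
-    return f"[{index}] 0x{r:02X}{g:02X}{b:02X}{a:02X}"
-
-print(pixel_hex(0, 255, 128, 0, 255))    # [0] 0xFF8000FF
-```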
-
-### 3. Cleanup
-Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving).
-
----
-
-## Files Modified
-- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex
-- `tools/cnn_test.cc`: --debug-hex, remove linear_png
-- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias
-
-## Testing
-```bash
-# Train with fixed export
-./training/train_cnn.py --input training/input/ --target training/output/ \
- --layers 3 --kernel_sizes 3,3,3 --epochs 5000
-
-# Generate ground truth
-./training/train_cnn.py --infer input.png --export-only checkpoint.pth \
- --output ground_truth.png --debug-hex
-
-# Run GPU tool
-./build/cnn_test input.png tool_output.png --debug-hex
-
-# Compare hex output for first 8 pixels
-```
-
----
-
-## Status
-✅ Bias accumulation bug fixed
-✅ RGBA output with alpha preservation
-✅ Debug hex comparison tool
-✅ Weights regenerated
-
-Commit: `8ff8c56`
diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md
deleted file mode 100644
index ba220a0..0000000
--- a/doc/CNN_DEBUG.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# CNN Effect Black Screen Bug - Resolution (2026-02)
-
-## Problem
-CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started.
-
-## Root Causes
-
-### Bug 1: Framebuffer Capture Timing
-**Location**: `src/gpu/effect.cc`
-**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene).
-**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run.
-
-### Bug 2: Missing Uniforms Update ⚠️ CRITICAL
-**Location**: `src/effects/cnn_effect.cc`
-**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black.
-**Fix**: Added uniforms update before bind group creation (lines 132-142):
-```cpp
-const CommonPostProcessUniforms u = {
- .resolution = {(float)width_, (float)height_},
- .aspect_ratio = (float)width_ / (float)height_,
- .time = 0.0f,
- .beat = 0.0f,
- .audio_intensity = 0.0f,
-};
-uniforms_.update(ctx_.queue, u);
-```
-
-## Key Lessons
-
-1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters
-2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts
-3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors
-4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes
-
-## Files Modified
-- `src/gpu/effect.cc`: Lines 308-346 (capture timing)
-- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update)
-
-## Verification
-Test: `demo64k --seek 11.5`
-- ✅ Scene visible with RotatingCube
-- ✅ CNN stylization applied
-- ✅ All 3 layers process with correct original texture reference
diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md
deleted file mode 100644
index 40f095e..0000000
--- a/doc/CNN_EFFECT.md
+++ /dev/null
@@ -1,400 +0,0 @@
-# CNN Post-Processing Effect
-
-Neural network-based stylization for rendered scenes.
-
----
-
-## Overview
-
-Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.
-
-**Key Features:**
-- Position-aware layer 0 (coordinate input for vignetting, edge effects)
-- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
-- Original input available to all layers via framebuffer capture
-- Configurable final blend with original scene
-- Modular WGSL shader architecture
-- Hardcoded weights (trained offline via PyTorch)
-- ~5-8 KB binary footprint
-
----
-
-## Architecture
-
-### RGBD → Grayscale Pipeline
-
-**Input:** RGBD (RGB + inverse depth D=1/z)
-**Output:** Grayscale (1 channel)
-**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
-
-**Architecture:**
-- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
-- **Final layer (N-1):** Conv2d(7→1) - output grayscale
-
-```wgsl
-// Inner layers: 7→4 (RGBD output, vec4-optimized)
-fn cnn_conv3x3_7to4(
- tex: texture_2d<f32>,
- samp: sampler,
- uv: vec2<f32>,
- resolution: vec2<f32>,
- gray: f32, // Grayscale [-1,1]
- weights: array<vec4<f32>, 72> // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
-) -> vec4<f32>
-
-// Final layer: 7→1 (grayscale output, vec4-optimized)
-fn cnn_conv3x3_7to1(
- tex: texture_2d<f32>,
- samp: sampler,
- uv: vec2<f32>,
- resolution: vec2<f32>,
- gray: f32,
- weights: array<vec4<f32>, 18> // 9 pos × 2 vec4 (8 floats per filter)
-) -> f32
-```
-
-**Input normalization:**
-- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
-- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
-- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
-- **Inter-layer data** stays in [-1,1] (no denormalization)
-- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]
-
-**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
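-
-A small NumPy sketch of this scheme (illustrative; the shader does the same math in WGSL,
-and the sample values here are made up):
-
-```python
-import numpy as np
-
-def normalize(x):                                        # [0,1] -> [-1,1]
-    return (x - 0.5) * 2.0
-
-rgbd = normalize(np.array([0.8, 0.4, 0.2, 0.5]))         # sampled texel
-uv   = normalize(np.array([0.25, 0.75]))                 # pixel coords in [0,1]
-gray = rgbd[:3] @ np.array([0.2126, 0.7152, 0.0722])     # luminance of normalized RGB
-layer_input = np.concatenate([rgbd, uv, [gray]])         # the 7-channel layer input
-# the final layer's result is denormalized for display as (result + 1.0) * 0.5
-```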
-
-### Multi-Layer Architecture
-
-CNNEffect supports multi-layer networks via automatic effect chaining:
-
-1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
-2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
-3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
-4. **Original input binding**: All layers access original via `@binding(4)`
-5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`
-
-**Framebuffer Capture API:**
-- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
-- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
-- Generic mechanism usable by any effect
-
-### File Structure
-
-```
-src/effects/
- cnn_effect.h/cc # CNNEffect class + framebuffer capture
-
-workspaces/main/shaders/cnn/
- cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu
- cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware)
- cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware)
- cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware)
- cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py)
- cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py)
-```
-
----
-
-## Training Workflow
-
-### 1. Prepare Training Data
-
-Input/target image pairs:
-```
-training/input/img_000.png # RGBA (RGB + alpha)
-training/output/img_000.png # Grayscale target
-```
-
-**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily.
-
-### 2. Train Network
-
-**Patch-based (Recommended)** - Preserves natural pixel scale:
-```bash
-python3 training/train_cnn.py \
- --input training/input --target training/output \
- --patch-size 32 --patches-per-image 64 --detector harris \
- --layers 3 --kernel-sizes 3,5,3 \
- --epochs 5000 --batch-size 16 --checkpoint-every 1000
-```
-
-**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)
-
-**Full-image (Legacy)** - Resizes to 256×256:
-```bash
-python3 training/train_cnn.py \
- --input training/input --target training/output \
- --layers 3 --kernel-sizes 3,5,3 \
- --epochs 10000 --batch-size 8 --checkpoint-every 1000
-```
-
-**Auto-generates:**
-- `cnn_weights_generated.wgsl` - Weight arrays
-- `cnn_layer.wgsl` - Layer shader
-
-### 3. Export & Validate
-
-```bash
-# Export shaders
-./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth
-
-# Generate ground truth
-./training/train_cnn.py --infer input.png \
- --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
-```
-
-### 4. Rebuild Demo
-
-```bash
-cmake --build build -j4 && ./build/demo64k
-```
-
----
-
-## Usage
-
-### C++ Integration
-
-**Single layer (manual):**
-```cpp
-#include "effects/cnn_effect.h"
-
-CNNEffectParams p;
-p.layer_index = 0;
-p.total_layers = 1;
-p.blend_amount = 1.0f;
-auto cnn = std::make_shared<CNNEffect>(ctx, p);
-timeline.add_effect(cnn, start_time, end_time);
-```
-
-**Multi-layer (automatic via timeline compiler):**
-
-Use timeline syntax - `seq_compiler` expands to multiple instances.
-
-### Timeline Examples
-
-**Single-layer CNN (full stylization):**
-```
-SEQUENCE 10.0 0
- EFFECT + Hybrid3DEffect 0.00 5.00
- EFFECT + CNNEffect 0.50 5.00 layers=1
-```
-
-**Multi-layer CNN with blend:**
-```
-SEQUENCE 10.0 0
- EFFECT + Hybrid3DEffect 0.00 5.00
- EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
-```
-
-Expands to:
-```cpp
-// Layer 0 (captures original, blend=1.0)
-{
- CNNEffectParams p;
- p.layer_index = 0;
- p.total_layers = 3;
- p.blend_amount = 1.0f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
-}
-// Layer 1 (blend=1.0)
-{
- CNNEffectParams p;
- p.layer_index = 1;
- p.total_layers = 3;
- p.blend_amount = 1.0f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
-}
-// Layer 2 (final blend=0.7)
-{
- CNNEffectParams p;
- p.layer_index = 2;
- p.total_layers = 3;
- p.blend_amount = 0.7f;
- seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
-}
-```
-
----
-
-## Shader Structure
-
-**Bindings:**
-```wgsl
-@group(0) @binding(0) var smplr: sampler;
-@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input
-@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
-@group(0) @binding(3) var<uniform> params: CNNLayerParams;
-@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured)
-```
-
-**Fragment shader logic:**
-```wgsl
-@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
- let uv = p.xy / uniforms.resolution;
- let original_raw = textureSample(original_input, smplr, uv);
- let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1]
- let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
- var result = vec4<f32>(0.0);
-
- if (params.layer_index == 0) {
- result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
- weights_layer0);
- result = cnn_tanh(result);
- }
- else if (params.layer_index == 1) {
- result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
- gray, weights_layer1);
- result = cnn_tanh(result);
- }
- // ... other layers
-
- // Blend with ORIGINAL input (not previous layer)
- return mix(original_raw, result, params.blend_amount);
-}
-```
-
-**Weight Storage (vec4-optimized):**
-
-**Inner layers (7→4 RGBD output):**
-```wgsl
-// Structure: array<vec4<f32>, 72>
-// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
-const weights_layer0: array<vec4<f32>, 72> = array(
- vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights)
- vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias)
- vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights)
- vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias)
- // ... 68 more vec4s
-);
-```
-
-**Final layer (7→1 grayscale output):**
-```wgsl
-// Structure: array<vec4<f32>, 18>
-// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
-const weights_layerN: array<vec4<f32>, 18> = array(
- vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights)
- vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias)
- // ... 16 more vec4s
-);
-```
-
-**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs.
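-
-A sketch of how a PyTorch `Conv2d(7, 4, 3)` weight tensor could map onto this layout
-(the input-channel order RGBD, U, V, gray and the bias split across positions are
-assumptions based on this document and CNN_BIAS_FIX_2026-02.md; the actual exporter
-lives in training/train_cnn.py):
-
-```python
-# weight shape: [out_ch=4, in_ch=7, ky=3, kx=3]; bias shape: [4]
-def pack_inner_layer(weight, bias):
-    vec4s = []
-    num_positions = weight.shape[2] * weight.shape[3]       # 9 for a 3x3 kernel
-    for ky in range(weight.shape[2]):
-        for kx in range(weight.shape[3]):
-            for out_c in range(weight.shape[0]):
-                w = weight[out_c, :, ky, kx]
-                vec4s.append([w[0], w[1], w[2], w[3]])       # rgbd weights
-                vec4s.append([w[4], w[5], w[6],
-                              bias[out_c] / num_positions])  # uv, gray, bias share
-    return vec4s                                             # 9 pos x 4 ch x 2 = 72 vec4s
-```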
-
----
-
-## Size Budget
-
-| Component | Size | Notes |
-|-----------|------|-------|
-| Activation functions | ~200 B | 4 functions |
-| Conv3x3 (standard + coord) | ~500 B | Both variants |
-| Conv5x5 (standard + coord) | ~700 B | Both variants |
-| Conv7x7 (standard + coord) | ~900 B | Both variants |
-| Main shader | ~800 B | Layer composition |
-| C++ implementation | ~300 B | Effect class |
-| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
-| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
-| **Total** | **5-9 KB** | Acceptable for 64k |
-
-**Optimization strategies:**
-- Quantize weights (float32 → int8; see the sketch below)
-- Prune near-zero weights
-- Use separable convolutions
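-
-A minimal sketch of the first strategy, symmetric per-layer int8 quantization
-(illustrative; not implemented in the repository):
-
-```python
-import numpy as np
-
-def quantize_int8(weights):                       # weights: flat float32 array
-    scale = max(np.abs(weights).max() / 127.0, 1e-12)
-    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
-    return q, scale                               # store int8 values + one f32 scale
-
-def dequantize(q, scale):                         # done at export / load time
-    return q.astype(np.float32) * scale
-```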
-
----
-
-## Testing
-
-```bash
-./build/test_demo_effects # CNN construction/shader tests
-./build/demo64k # Visual test
-```
-
----
-
-## Blend Parameter Behavior
-
-**blend_amount** controls final compositing with original:
-- `blend=0.0`: Pure original (no CNN effect)
-- `blend=0.5`: 50% original + 50% CNN
-- `blend=1.0`: Pure CNN output (full stylization)
-
-**Important:** Blend uses captured layer 0 input, not previous layer output.
-
-**Example use cases:**
-- `blend=1.0`: Full stylization (default)
-- `blend=0.7`: Subtle effect preserving original details
-- `blend=0.3`: Light artistic touch
-
-## Troubleshooting
-
-**Shader compilation fails:**
-- Check `cnn_weights_generated.wgsl` syntax
-- Verify snippets registered in `shaders.cc::InitShaderComposer()`
-- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)
-
-**Black/corrupted output:**
-- Weights untrained (identity placeholder)
-- Check `captured_frame` auxiliary texture is registered
-- Verify layer priorities in timeline are sequential
-
-**Wrong blend result:**
-- Ensure layer 0 has `needs_framebuffer_capture() == true`
-- Check MainSequence framebuffer capture logic
-- Verify `original_input` binding is populated
-
-**Training loss not decreasing:**
-- Lower learning rate (`--learning-rate 0.0001`)
-- More epochs (`--epochs 1000`)
-- Check input/target image alignment
-
----
-
-## Vec4 Optimization
-
-**Architecture:** Weights stored as vec4 pairs for SIMD efficiency.
-
-**Input representation:**
-```wgsl
-let rgbd = textureSample(...); // vec4: [r, g, b, d]
-let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0]
-```
-
-**Weight indexing:**
-```wgsl
-var pos = 0; // Direct weight array index
-for (var dy = -1; dy <= 1; dy++) {
- for (var dx = -1; dx <= 1; dx++) {
- // Unrolled channel loop (4 output channels)
- sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1);
- sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1);
- sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1);
- sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1);
- pos += 8; // 4 channels × 2 vec4s per channel
- }
-}
-```
-
-**Benefits:**
-- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs)
-- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
-- **Bias integration:** Free via `[..., 1.0]` component (no separate add)
-- **Code simplicity:** Eliminates inner loop, direct indexing with `pos`
-- **Performance:** 2-3× GPU throughput improvement over scalar version
-
-**Weight layout per filter (8 floats):**
-- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights)
-- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)
-
-**3×3 kernel sizes:**
-- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
-- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)
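-
-The same arithmetic for other kernel sizes, as a small sketch (illustrative only):
-
-```python
-# vec4 count and byte size of a weight array (16 bytes per f32 vec4).
-def weight_buffer_size(kernel, out_channels):
-    vec4s = kernel * kernel * out_channels * 2    # 2 vec4s per filter position
-    return vec4s, vec4s * 16
-
-print(weight_buffer_size(3, 4))   # (72, 1152)   inner 3x3 layer
-print(weight_buffer_size(3, 1))   # (18, 288)    final 3x3 layer
-print(weight_buffer_size(5, 4))   # (200, 3200)  inner 5x5 layer
-```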
-
----
-
-## References
-
-- **Training Script:** `training/train_cnn.py`
-- **Shader Composition:** `doc/SEQUENCE.md`
-- **Effect System:** `src/gpu/effect.h`
diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
deleted file mode 100644
index bf63c5d..0000000
--- a/doc/CNN_FLATTEN_ANALYSIS.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# CNN Shader Flatten Mode - Technical Analysis
-
-**Status:** Analysis complete - flatten mode NOT RECOMMENDED
-
-**Date:** February 2026
-
----
-
-## Context
-
-Current CNN architecture uses **3 sequential render passes** (linear chaining):
-- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
-- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
-- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
-
-Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
-
----
-
-## Current Architecture
-
-**Shader Structure:**
-- 1 pipeline with layer branching (`layer_index` uniform)
-- 5 bindings: sampler, input texture, uniforms, layer params, original capture
-- Total shader size: ~8 KB (snippets + weights)
-
-**Performance Profile:**
-- 3 render pass dispatches
-- 2 framebuffer writes + reads between layers
-- Memory bandwidth: ~2× framebuffer size per layer
-- Register pressure: Low (per-layer isolation)
-
-**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
-
----
-
-## Flatten Approaches Evaluated
-
-### Option A: Full Flatten (All 3 Layers)
-
-**Cascading Receptive Field:**
-
-To compute final output at position (x, y):
-- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
-- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
-- Each Layer 0 output needs 5×5 neighborhood of input samples
-
-**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
-
-**Intermediate Storage (per thread/pixel):**
-```
-Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
-Layer 1 outputs: 3×3 positions × 4 channels = 36 floats
- TOTAL = 136 floats (544 bytes)
-```
-
-**GPU Register Pressure:**
-- Modern GPUs: 32-64 KB registers per SM, shared across warps
-- 544 bytes/thread → max 64 threads/SM (**low occupancy**)
-- Current multi-pass: ~4-8 bytes/thread (high occupancy)
-
-**Pros:**
-- 1 dispatch vs 3 (reduce CPU overhead)
-- Zero framebuffer bandwidth between layers
-
-**Cons:**
-- **Severe register pressure** (10-20× increase)
-- Reduced occupancy → potential performance loss
-- Complex shader (harder debug, larger binary)
-- 9×9 input sampling
-
-**Assessment:** ❌ **Not Recommended**
-Register cost outweighs bandwidth savings.
-
----
-
-### Option B: Partial Flatten (Layers 1 + 2)
-
-Keep Layer 0 separate, flatten only Layers 1 and 2.
-
-**Pass Structure:**
-1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
-2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader
-
-**Intermediate Storage:**
-```
-Layer 0 samples: 3×3 × 4 = 36 floats (read once)
-Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
- TOTAL = 72 floats (288 bytes)
-```
-
-**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs
-
-**Pros:**
-- 2 passes vs 3 (33% reduction)
-- 1 framebuffer write saved
-- More manageable register usage
-
-**Cons:**
-- Still significant register pressure (288 bytes vs ~8 bytes baseline)
-- Medium complexity increase
-- Layer 0 (heaviest kernel) still separate
-
-**Assessment:** ⚠️ **Marginal Benefit**
-Saves 1 pass but register cost still high.
-
----
-
-### Option C: Keep Current Multi-Pass ✅
-
-**Rationale:**
-- Current architecture well-suited to GPU design (high throughput via parallelism)
-- Minimal register usage → high occupancy → hides memory latency
-- Framebuffer bandwidth cost < register pressure cost
-- Clean separation aids debugging/iteration
-- Modular (easy to add/remove layers)
-
-**Alternative Optimizations (if bandwidth critical):**
-1. Merge passes via render pass load/store ops (Vulkan subpasses)
-2. Reduce intermediate channel count (4→3 or 2)
-3. Hybrid: Compute shaders + workgroup shared memory
-4. Layer pruning (2-layer vs 3-layer quality comparison)
-
----
-
-## Recommendation
-
-**✅ Keep current multi-pass architecture**
-
-### Decision Matrix
-
-| Factor | Multi-Pass | Partial Flatten | Full Flatten |
-|--------|-----------|----------------|--------------|
-| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
-| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
-| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
-| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
-| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
-| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
-
-**Modern GPU Architecture Favors:**
-- High parallelism (many small threads) over complex threads
-- Hiding latency via occupancy over minimizing operations
-- Memory bandwidth via caching, not elimination
-
----
-
-## Alternative: Compute Shader + Shared Memory
-
-**If bandwidth becomes critical:**
-- Use compute shader with workgroup shared memory
-- Load tile + halos into shared memory (9×9 input samples)
-- Compute all 3 layers for tile interior (avoids redundant sampling)
-- Requires explicit synchronization (`workgroupBarrier`)
-
-**Trade-offs:**
-- ✅ Low register pressure + low bandwidth
-- ❌ Compute pipeline complexity (no render pass integration)
-- ❌ Tile edge handling
-- ❌ Larger code size
-
----
-
-## Conclusion
-
-Current 3-pass architecture is **appropriate for demo64k**:
-- Size-efficient (modular shaders)
-- Performance adequate (bandwidth not bottleneck)
-- Maintainable (clean layer isolation)
-
-**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint.
-
-### Size Optimization Alternatives (Better ROI)
-
-If size optimization critical, focus on:
-1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
-2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
-3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
-
-These yield better size/performance than shader architecture changes.
-
----
-
-## References
-
-- `doc/CNN_EFFECT.md` - CNN implementation details
-- `doc/CNN.md` - High-level CNN design
-- `src/effects/cnn_effect.cc` - Current implementation
-- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
deleted file mode 100644
index 3439f2c..0000000
--- a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# CNN RGBD→Grayscale Architecture Implementation
-
-## Summary
-
-Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input.
-
-## Changes Made
-
-### Architecture
-
-**Input:** RGBD (4 channels: RGB + inverse depth D=1/z)
-**Output:** Grayscale (1 channel)
-**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1]
-
-**Layer Configuration:**
-- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation
-- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation
-
-### Input Normalization (all to [-1,1])
-
-- **RGBD:** `(rgbd - 0.5) * 2`
-- **UV coords:** `(uv - 0.5) * 2`
-- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter)
-
-**Rationale:** Zero-centered inputs for tanh activation, better gradient flow.
-
-### Modified Files
-
-**Training (`/Users/skal/demo/training/train_cnn.py`):**
-1. Removed `CoordConv2d` class
-2. Updated `SimpleCNN`:
- - Inner layers: `Conv2d(7, 4)` - RGBD output
- - Final layer: `Conv2d(7, 1)` - grayscale output
-3. Updated `forward()`:
- - Normalize RGBD/coords/gray to [-1,1]
- - Concatenate 7-channel input for each layer
- - Apply tanh (inner) or none (final)
- - Denormalize final output
-4. Updated `export_weights_to_wgsl()`:
- - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values)
- - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values)
-5. Updated `generate_layer_shader()`:
- - Use `cnn_conv3x3_7to4` for inner layers
- - Use `cnn_conv3x3_7to1` for final layer
- - Denormalize outputs from [-1,1] to [0,1]
-6. Updated `ImagePairDataset`:
- - Load RGBA input (was RGB)
-
-**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):**
-1. Added `cnn_conv3x3_7to4()`:
- - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
- - 4-channel output: RGBD
- - Weights: `array<array<f32, 8>, 36>`
-2. Added `cnn_conv3x3_7to1()`:
- - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter)
- - 1-channel output: grayscale
- - Weights: `array<array<f32, 8>, 9>`
-3. Optimized: gray computed once in caller using `dot()`, not per-function
-
-**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):**
-1. Updated architecture section with RGBD→grayscale pipeline
-2. Updated training data requirements (RGBA input)
-3. Updated weight storage format
-
-### No C++ Changes
-
-CNNLayerParams and bind groups remain unchanged.
-
-## Data Flow
-
-1. Layer 0 captures original RGBD to `captured_frame`
-2. Each layer:
- - Samples previous layer output (RGBD in [0,1])
- - Normalizes RGBD to [-1,1]
- - Computes gray once using `dot()` (fs_main level)
- - Normalizes UV coords to [-1,1] (inside conv functions)
- - Concatenates 7-channel input
- - Applies convolution with layer-specific weights
- - Outputs RGBD (inner) or grayscale (final) in [-1,1]
- - Applies tanh (inner only)
- - Denormalizes to [0,1] for texture storage
- - Blends with original
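-
-A condensed PyTorch sketch of this flow (illustrative; the reference implementation is
-`SimpleCNN.forward()` in training/train_cnn.py):
-
-```python
-import torch
-import torch.nn.functional as F
-
-def cnn_layer(prev_rgbd01, original_rgbd01, weight, bias, final=False):
-    # prev_rgbd01 / original_rgbd01: [N,4,H,W] tensors in [0,1]
-    n, _, h, w = prev_rgbd01.shape
-    rgbd = prev_rgbd01 * 2.0 - 1.0                                  # normalize to [-1,1]
-    orig = original_rgbd01 * 2.0 - 1.0
-    luma = torch.tensor([0.2126, 0.7152, 0.0722]).view(1, 3, 1, 1)
-    gray = (orig[:, :3] * luma).sum(1, keepdim=True)                # computed once
-    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1).expand(n, 1, h, w)
-    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w).expand(n, 1, h, w)
-    x = torch.cat([rgbd, xs, ys, gray], dim=1)                      # 7-channel input
-    out = F.conv2d(x, weight, bias, padding=weight.shape[-1] // 2)
-    if not final:
-        out = torch.tanh(out)                                       # inner layers only
-    return (out + 1.0) * 0.5                                        # back to [0,1] storage
-```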
-
-## Next Steps
-
-1. **Prepare RGBD training data:**
- - Input: RGBA images (RGB + depth in alpha)
- - Target: Grayscale stylized output
-
-2. **Train network:**
- ```bash
- python3 training/train_cnn.py \
- --input training/input \
- --target training/output \
- --layers 3 \
- --epochs 1000
- ```
-
-3. **Verify generated shaders:**
- - Check `cnn_weights_generated.wgsl` structure
- - Check `cnn_layer.wgsl` uses new conv functions
-
-4. **Test in demo:**
- ```bash
- cmake --build build -j4
- ./build/demo64k
- ```
-
-## Design Rationale
-
-**Why [-1,1] normalization?**
-- Centered inputs for tanh (operates best around 0)
-- Better gradient flow
-- Standard ML practice for normalized data
-
-**Why RGBD throughout vs RGB?**
-- Depth information propagates through network
-- Enables depth-aware stylization
-- Consistent 4-channel processing
-
-**Why 7-channel input?**
-- Coordinates: position-dependent effects (vignettes)
-- Grayscale: luminance-aware processing
-- RGBD: full color+depth information
-- Enables richer feature learning
-
-## Testing Checklist
-
-- [ ] Train network with RGBD input data
-- [ ] Verify `cnn_weights_generated.wgsl` structure
-- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions
-- [ ] Build demo without errors
-- [ ] Visual test: inner layers show RGBD evolution
-- [ ] Visual test: final layer produces grayscale
-- [ ] Visual test: blending works correctly
-- [ ] Compare quality with previous RGB→RGB architecture
diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md
deleted file mode 100644
index 4307894..0000000
--- a/doc/CNN_TEST_TOOL.md
+++ /dev/null
@@ -1,244 +0,0 @@
-# CNN Shader Testing Tool
-
-Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).
-
----
-
-## Purpose
-
-- Validate trained weights against ground truth
-- Debug CNN layer behavior in isolation
-- Generate test outputs for training workflow
-- Match Python training script's inference mode
-
----
-
-## Architecture
-
-**Two implementations:**
-
-1. **CNN v1** (render pipeline, texture atlas weights)
- - 3 fixed layers
- - RGBA16Float intermediates
- - BGRA8Unorm final output
-
-2. **CNN v2** (compute shaders, storage buffer weights)
- - Dynamic layer count from binary
- - 7D static features (RGBD + UV + sin + bias)
- - RGBA32Uint packed f16 intermediates
- - Storage buffer: ~3-5 KB weights
-
-**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
-- Synchronous texture-to-CPU readback
-- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
-- Protected with STRIP_ALL (0 bytes in release)
-
----
-
-## Usage
-
-```bash
-cnn_test input.png output.png [OPTIONS]
-
-OPTIONS:
- --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights)
- --weights PATH Load weights from .bin (forces CNN v2, overrides layer config)
- --blend F Final blend amount (0.0-1.0, default: 1.0)
- --format ppm|png Output format (default: png)
- --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights)
- --save-intermediates DIR Save intermediate layers to directory
- --debug-hex Print first 8 pixels as hex (debug)
- --help Show usage
-```
-
-**Examples:**
-```bash
-# CNN v1 (render pipeline, 3 layers)
-./build/cnn_test input.png output.png --cnn-version 1
-
-# CNN v2 (compute, storage buffer, uses asset system weights)
-./build/cnn_test input.png output.png --cnn-version 2
-
-# CNN v2 with runtime weight loading (loads layer config from .bin)
-./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin
-
-# 50% blend with original (v2)
-./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5
-
-# Debug hex dump
-./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
-```
-
-**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments.
-
----
-
-## Implementation Details
-
-### Core Readback Utility
-
-**File:** `src/gpu/texture_readback.{h,cc}`
-
-**Function:**
-```cpp
-std::vector<uint8_t> read_texture_pixels(
- WGPUInstance instance,
- WGPUDevice device,
- WGPUTexture texture,
- int width,
- int height);
-```
-
-**Features:**
-- Returns BGRA8 format (4 bytes per pixel)
-- Synchronous blocking operation
-- Cross-platform async callback handling (Win32 vs Native API)
-- Automatic staging buffer creation and cleanup
-
-**Refactored OffscreenRenderTarget:**
-```cpp
-std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
-#if !defined(STRIP_ALL)
- return read_texture_pixels(instance_, device_, texture_, width_, height_);
-#else
- return std::vector<uint8_t>();
-#endif
-}
-```
-
-### CNN v1 Pipeline (Render)
-
-**Fixed 3-layer architecture:**
-- Ping-pong RGBA16Float textures
-- CNNLayerParams (binding 3): layer_index, blend_amount
-- Shader composer resolves #include directives
-
-### CNN v2 Pipeline (Compute)
-
-**Dynamic layer architecture:**
-1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
-2. **Layer computes:** N layers from binary weights (3-5 typically)
- - Storage buffer weights (read-only)
- - RGBA32Uint packed f16 textures (ping-pong)
- - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
-3. **Readback:** RGBA32Uint → f16 decode → u8 clamp
-
-**Binary format:** Header (20B) + layer info (20B×N) + f16 weights
-
-**Weight Loading:**
-- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`)
-- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports)
- - Layer count and kernel sizes parsed from binary header
- - Overrides any `--layers` or `--cnn-version` arguments
- - Enables runtime testing of training checkpoints without rebuild
-
----
-
-## Build Integration
-
-**CMakeLists.txt:**
-
-1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
-2. Tool target:
-```cmake
-add_executable(cnn_test
- tools/cnn_test.cc
- src/tests/common/webgpu_test_fixture.cc
- src/tests/common/offscreen_render_target.cc
- ${PLATFORM_SOURCES}
- ${GEN_DEMO_CC})
-
-target_link_libraries(cnn_test PRIVATE
- gpu util procedural ${DEMO_LIBS})
-
-add_dependencies(cnn_test generate_demo_assets)
-
-target_compile_definitions(cnn_test PRIVATE
- STB_IMAGE_IMPLEMENTATION
- STB_IMAGE_WRITE_IMPLEMENTATION)
-```
-
-**Build:**
-```bash
-cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
-cmake --build build -j4
-```
-
----
-
-## Validation Workflow (CNN v2)
-
-### 1. Train and Export
-```bash
-# Train and export weights
-./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
-```
-
-### 2. Tool Inference
-```bash
-# Run tool with v2
-./build/cnn_test training/input/img_000.png output.png --cnn-version 2
-```
-
-### 3. Visual Comparison
-Compare output.png with training/target_X/img_000.png
-
----
-
-## Status
-
-**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation.
-
-**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool.
-- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
-- Matches CNNv2Effect architecture
-- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code
-- Root cause under investigation (weight indexing? texture sampling? activation clamping?)
-- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation
-
----
-
-## Technical Notes (Readback Fix)
-
-**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5)
-
-**Root Cause:** Callback mode mismatch
-- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`)
-- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode)
-- Callback never fired → timeout → empty buffer
-
-**Fix Applied:**
-1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents`
-2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)`
-3. Added pre-mapping device poll to ensure copy completes
-
-**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110
-
-**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes
-
----
-
-## Limitations
-
-- **CNN v1:** Produces incorrect output, use for debugging only
-- **Single image:** Batch processing requires shell loop
-- **No real-time preview:** Offline processing only
-- **Input formats:** stb_image loader (PNG/JPEG/BMP/TGA supported)
-
----
-
-## Technical Notes
-
-**CNN v2 f16 decoding:**
-- RGBA32Uint texture stores 8×f16 as 4×u32
-- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
-- Handles denormals, infinity, NaN
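-
-A NumPy sketch of that decode (illustrative; the tool does this in C++ in
-tools/cnn_test.cc, and the low-half-first word order is an assumption):
-
-```python
-import numpy as np
-
-def decode_rgba32uint(words):
-    # words: uint32 array [H, W, 4]; each u32 packs two f16 values.
-    lo = (words & 0xFFFF).astype(np.uint16)
-    hi = (words >> 16).astype(np.uint16)
-    halves = np.stack([lo, hi], axis=-1).reshape(*words.shape[:-1], 8)
-    values = halves.view(np.float16).astype(np.float32)   # f16 -> f32 (denorm/inf/NaN ok)
-    return np.clip(np.nan_to_num(values), 0.0, 1.0)       # clamp before the u8 convert
-```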
-
-**Cross-platform:**
-- macOS, Linux (native WebGPU)
-- Windows (mingw-w64 cross-compile)
-
-**Size impact:**
-- Debug/STRIP_ALL=OFF: compiled
-- STRIP_ALL=ON: 0 bytes (compiled out)
-- FINAL_STRIP=ON: tool not built