| | | |
|---|---|---|
| author | skal <pascal.massimino@gmail.com> | 2026-02-15 18:52:48 +0100 |
| committer | skal <pascal.massimino@gmail.com> | 2026-02-15 18:52:48 +0100 |
| commit | d4b67e2f6ab48ab9ec658140be4f1999f604559a (patch) | |
| tree | 2502b0dc89748f7cfe674d3c177bd1528ce1c231 /doc | |
| parent | 161a59fa50bb92e3664c389fa03b95aefe349b3f (diff) | |
archive(cnn): move CNN v1 to cnn_v1/ subdirectory
Consolidate CNN v1 (CNNEffect) into dedicated directory:
- C++ effect: src/effects → cnn_v1/src/
- Shaders: workspaces/main/shaders/cnn → cnn_v1/shaders/
- Training: training/train_cnn.py → cnn_v1/training/
- Docs: doc/CNN*.md → cnn_v1/docs/
Updated all references:
- CMake source list
- C++ includes (relative paths: ../../cnn_v1/src/)
- Asset paths (../../cnn_v1/shaders/)
- Documentation cross-references
CNN v1 remains active in timeline. For new work, use CNN v2 with
enhanced features (7D static, storage buffer, sigmoid activation).
Tests: 34/34 passing (100%)
Diffstat (limited to 'doc')
| Mode | File | Deleted lines |
|---|---|---|
| -rw-r--r-- | doc/CNN.md | 79 |
| -rw-r--r-- | doc/CNN_BIAS_FIX_2026-02.md | 85 |
| -rw-r--r-- | doc/CNN_DEBUG.md | 43 |
| -rw-r--r-- | doc/CNN_EFFECT.md | 400 |
| -rw-r--r-- | doc/CNN_FLATTEN_ANALYSIS.md | 189 |
| -rw-r--r-- | doc/CNN_RGBD_GRAYSCALE_SUMMARY.md | 136 |
| -rw-r--r-- | doc/CNN_TEST_TOOL.md | 244 |
7 files changed, 0 insertions, 1176 deletions
diff --git a/doc/CNN.md b/doc/CNN.md deleted file mode 100644 index 2dc3362..0000000 --- a/doc/CNN.md +++ /dev/null @@ -1,79 +0,0 @@ -# Convolutional Neural Net Shader (CNN) post-processing - -**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass) - -## Idea - -Have the input 3d scene be processed by a multi-layer CNN trained on the side. -Input: some rendered scene. -Output: 'stylized' scene with CNN post-processing. - -**See `doc/CNN_EFFECT.md` for implementation details, usage, and API reference.** - -## Shader implementation - -### input / output - -Need 1 texture buffer per CNN layer. -Input (r,g,b,1/z) for layer 0 (render 3d scene), or output from layer N-1 for layer N. -output: (r,g,b, alpha). Don't need the 1/z information (can be fetched from input) - -### size of one layer - -Notation: -S: the number of input samples from layer N-1. -Example: 3x3 input -> S = 3x3 = 9. - -Each S samples is 4 values (r,g,b, w=1/z). - -Each sample is processed by a mat4 matrix. 4 input => 4 output. - -Weight matrix = S x mat4 - -Final bias: 4 values. - -WGSL code example: See file CNN.shader - -### Layers - -we need 3 or 4 layer ? -Several different shaders for each layer. -Ping-pong for input/output texture buffer between each layers? - -## Implementation Status - -**Completed:** -- ✅ Modular WGSL shader architecture (6 snippet files) -- ✅ CNNEffect C++ class (single-layer rendering) -- ✅ ShaderComposer integration (#include resolution) -- ✅ Asset registration (7 new shader assets) -- ✅ Test coverage (test_demo_effects.cc) -- ✅ Placeholder identity weights for testing - -**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total** - -**Pending:** -- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights -- ⏳ Multi-layer rendering with ping-pong textures -- ⏳ Weight quantization for size optimization - ---- - -## Training (To Be Implemented) - -The layer weight/bias data are hard-coded in the shaders. -Training workflow: - -1. Prepare image pairs (before: raw render, after: target style) -2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png` -3. Script generates `cnn_weights_generated.wgsl` -4. Rebuild: `cmake --build build -j4` - -**Reference:** File `CNN.py` contains training example (needs adaptation). - -Need a repository of reference image pairs (before/after) for training and validation. -Each input image is randomly sampled into 3×3 patch of (r,g,b,1/z) input samples. -And trained to match the (r,g,b,a) output. - -Training generates the .wgsl code for layers' shaders. - diff --git a/doc/CNN_BIAS_FIX_2026-02.md b/doc/CNN_BIAS_FIX_2026-02.md deleted file mode 100644 index 26db8eb..0000000 --- a/doc/CNN_BIAS_FIX_2026-02.md +++ /dev/null @@ -1,85 +0,0 @@ -# CNN Bias Accumulation Fix (2026-02-11) - -## Problem -Bias was being added multiple times in shader convolution loops (once per kernel position), causing mismatch between PyTorch training and WGSL inference. - -## Root Cause -**Location**: `training/train_cnn.py:381, 398` - -When exporting weights to WGSL, bias was replicated for every kernel position. The shader loops through positions doing: -```wgsl -sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0 -``` - -For 3×3 kernel (9 positions), bias added 9×. For 5×5, added 25×. 
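A quick numeric check makes the over-accumulation concrete (illustrative Python only, not part of the training script; the constant names simply mirror those used in the fix below):

```python
# Illustrative check of the bias accumulation bug (not the training script).
num_positions = 9      # 3x3 kernel -> 9 positions visited by the shader loop
bias = 0.25            # example bias for one output channel

# Buggy export: the full bias is added at every kernel position.
buggy_total = sum(bias for _ in range(num_positions))                   # 9 * bias = 2.25

# Fixed export: bias / num_positions per position sums back to one bias.
fixed_total = sum(bias / num_positions for _ in range(num_positions))   # 0.25

assert abs(fixed_total - bias) < 1e-9
```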
- -## Fix -Divide bias by `num_positions` during export: -```python -# Final layer (7→1) -v1.append(f"{bias[0] / num_positions:.6f}") - -# Inner layers (7→4) -v1.append(f"{bias[out_c] / num_positions:.6f}") -``` - -Shader accumulates bias × num_positions = original bias (correct). - ---- - -## Additional Improvements - -### 1. RGBA Output Support -**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input: -```python -alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy() -output_rgba = np.concatenate([output, alpha], axis=2) -Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA') -``` - -Intermediate layers also save RGBA if 4-channel. - -### 2. Debug Hex Output -**Both tools** support `--debug-hex` to print first 8 pixels as hex: -```bash -./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex -./build/cnn_test input.png output.png --debug-hex -``` - -Output format: `[0] 0xRRGGBBAA` for pixel-level comparison. - -### 3. Cleanup -Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving). - ---- - -## Files Modified -- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex -- `tools/cnn_test.cc`: --debug-hex, remove linear_png -- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias - -## Testing -```bash -# Train with fixed export -./training/train_cnn.py --input training/input/ --target training/output/ \ - --layers 3 --kernel_sizes 3,3,3 --epochs 5000 - -# Generate ground truth -./training/train_cnn.py --infer input.png --export-only checkpoint.pth \ - --output ground_truth.png --debug-hex - -# Run GPU tool -./build/cnn_test input.png tool_output.png --debug-hex - -# Compare hex output for first 8 pixels -``` - ---- - -## Status -✅ Bias accumulation bug fixed -✅ RGBA output with alpha preservation -✅ Debug hex comparison tool -✅ Weights regenerated - -Commit: `8ff8c56` diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md deleted file mode 100644 index ba220a0..0000000 --- a/doc/CNN_DEBUG.md +++ /dev/null @@ -1,43 +0,0 @@ -# CNN Effect Black Screen Bug - Resolution (2026-02) - -## Problem -CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started. - -## Root Causes - -### Bug 1: Framebuffer Capture Timing -**Location**: `src/gpu/effect.cc` -**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene). -**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run. - -### Bug 2: Missing Uniforms Update ⚠️ CRITICAL -**Location**: `src/effects/cnn_effect.cc` -**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black. -**Fix**: Added uniforms update before bind group creation (lines 132-142): -```cpp -const CommonPostProcessUniforms u = { - .resolution = {(float)width_, (float)height_}, - .aspect_ratio = (float)width_ / (float)height_, - .time = 0.0f, - .beat = 0.0f, - .audio_intensity = 0.0f, -}; -uniforms_.update(ctx_.queue, u); -``` - -## Key Lessons - -1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters -2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts -3. 
**Uninitialized uniforms cause silent failures** - Produces black output without validation errors -4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes - -## Files Modified -- `src/gpu/effect.cc`: Lines 308-346 (capture timing) -- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update) - -## Verification -Test: `demo64k --seek 11.5` -- ✅ Scene visible with RotatingCube -- ✅ CNN stylization applied -- ✅ All 3 layers process with correct original texture reference diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md deleted file mode 100644 index 40f095e..0000000 --- a/doc/CNN_EFFECT.md +++ /dev/null @@ -1,400 +0,0 @@ -# CNN Post-Processing Effect - -Neural network-based stylization for rendered scenes. - ---- - -## Overview - -Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead. - -**Key Features:** -- Position-aware layer 0 (coordinate input for vignetting, edge effects) -- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining -- Original input available to all layers via framebuffer capture -- Configurable final blend with original scene -- Modular WGSL shader architecture -- Hardcoded weights (trained offline via PyTorch) -- ~5-8 KB binary footprint - ---- - -## Architecture - -### RGBD → Grayscale Pipeline - -**Input:** RGBD (RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Architecture:** -- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD -- **Final layer (N-1):** Conv2d(7→1) - output grayscale - -```wgsl -// Inner layers: 7→4 (RGBD output, vec4-optimized) -fn cnn_conv3x3_7to4( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, # Grayscale [-1,1] - weights: array<vec4<f32>, 72> # 9 pos × 4 ch × 2 vec4 (8 floats per filter) -) -> vec4<f32> - -// Final layer: 7→1 (grayscale output, vec4-optimized) -fn cnn_conv3x3_7to1( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, - weights: array<vec4<f32>, 18> # 9 pos × 2 vec4 (8 floats per filter) -) -> f32 -``` - -**Input normalization:** -- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1] -- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1] -- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))` -- **Inter-layer data** stays in [-1,1] (no denormalization) -- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1] - -**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer - -### Multi-Layer Architecture - -CNNEffect supports multi-layer networks via automatic effect chaining: - -1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7` -2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2) -3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"` -4. **Original input binding**: All layers access original via `@binding(4)` -5. 
**Final blend**: Last layer blends result with original: `mix(original, result, 0.7)` - -**Framebuffer Capture API:** -- `Effect::needs_framebuffer_capture()` - effect requests pre-capture -- MainSequence automatically blits input → `"captured_frame"` auxiliary texture -- Generic mechanism usable by any effect - -### File Structure - -``` -src/effects/ - cnn_effect.h/cc # CNNEffect class + framebuffer capture - -workspaces/main/shaders/cnn/ - cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu - cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware) - cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware) - cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware) - cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py) - cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py) -``` - ---- - -## Training Workflow - -### 1. Prepare Training Data - -Input/target image pairs: -``` -training/input/img_000.png # RGBA (RGB + alpha) -training/output/img_000.png # Grayscale target -``` - -**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily. - -### 2. Train Network - -**Patch-based (Recommended)** - Preserves natural pixel scale: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --patch-size 32 --patches-per-image 64 --detector harris \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 5000 --batch-size 16 --checkpoint-every 1000 -``` - -**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges) - -**Full-image (Legacy)** - Resizes to 256×256: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 10000 --batch-size 8 --checkpoint-every 1000 -``` - -**Auto-generates:** -- `cnn_weights_generated.wgsl` - Weight arrays -- `cnn_layer.wgsl` - Layer shader - -### 3. Export & Validate - -```bash -# Export shaders -./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth - -# Generate ground truth -./training/train_cnn.py --infer input.png \ - --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png -``` - -### 4. Rebuild Demo - -```bash -cmake --build build -j4 && ./build/demo64k -``` - ---- - -## Usage - -### C++ Integration - -**Single layer (manual):** -```cpp -#include "effects/cnn_effect.h" - -CNNEffectParams p; -p.layer_index = 0; -p.total_layers = 1; -p.blend_amount = 1.0f; -auto cnn = std::make_shared<CNNEffect>(ctx, p); -timeline.add_effect(cnn, start_time, end_time); -``` - -**Multi-layer (automatic via timeline compiler):** - -Use timeline syntax - `seq_compiler` expands to multiple instances. 
- -### Timeline Examples - -**Single-layer CNN (full stylization):** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=1 -``` - -**Multi-layer CNN with blend:** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7 -``` - -Expands to: -```cpp -// Layer 0 (captures original, blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 0; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1); -} -// Layer 1 (blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 1; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2); -} -// Layer 2 (final blend=0.7) -{ - CNNEffectParams p; - p.layer_index = 2; - p.total_layers = 3; - p.blend_amount = 0.7f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3); -} -``` - ---- - -## Shader Structure - -**Bindings:** -```wgsl -@group(0) @binding(0) var smplr: sampler; -@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input -@group(0) @binding(2) var<uniform> uniforms: CommonUniforms; -@group(0) @binding(3) var<uniform> params: CNNLayerParams; -@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured) -``` - -**Fragment shader logic:** -```wgsl -@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> { - let uv = p.xy / uniforms.resolution; - let original_raw = textureSample(original_input, smplr, uv); - let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1] - let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722)); - var result = vec4<f32>(0.0); - - if (params.layer_index == 0) { - result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution, - weights_layer0); - result = cnn_tanh(result); - } - else if (params.layer_index == 1) { - result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution, - gray, weights_layer1); - result = cnn_tanh(result); - } - // ... other layers - - // Blend with ORIGINAL input (not previous layer) - return mix(original_raw, result, params.blend_amount); -} -``` - -**Weight Storage (vec4-optimized):** - -**Inner layers (7→4 RGBD output):** -```wgsl -// Structure: array<vec4<f32>, 72> -// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layer0: array<vec4<f32>, 72> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias) - vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights) - vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias) - // ... 68 more vec4s -); -``` - -**Final layer (7→1 grayscale output):** -```wgsl -// Structure: array<vec4<f32>, 18> -// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layerN: array<vec4<f32>, 18> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias) - // ... 16 more vec4s -); -``` - -**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs. 
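To make that claim concrete, a single filter tap for one output channel can be written as a tiny helper (hypothetical function, shown only to illustrate the layout; it is not part of the generated shaders):

```wgsl
// Illustration only: one output channel at one kernel position.
// in1 = vec4(uv_norm, gray, 1.0); w_extra.w carries the per-position bias,
// so the "+ bias" term is folded into the second dot().
fn cnn_tap_one_channel(rgbd: vec4<f32>, in1: vec4<f32>,
                       w_rgbd: vec4<f32>, w_extra: vec4<f32>) -> f32 {
  return dot(w_rgbd, rgbd) + dot(w_extra, in1);
}
```

Expanded, this is the eight scalar multiply-adds `w_r*r + w_g*g + w_b*b + w_d*d + w_u*u + w_v*v + w_gray*gray + bias`, collapsed into two hardware dot products.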
- ---- - -## Size Budget - -| Component | Size | Notes | -|-----------|------|-------| -| Activation functions | ~200 B | 4 functions | -| Conv3x3 (standard + coord) | ~500 B | Both variants | -| Conv5x5 (standard + coord) | ~700 B | Both variants | -| Conv7x7 (standard + coord) | ~900 B | Both variants | -| Main shader | ~800 B | Layer composition | -| C++ implementation | ~300 B | Effect class | -| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) | -| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes | -| **Total** | **5-9 KB** | Acceptable for 64k | - -**Optimization strategies:** -- Quantize weights (float32 → int8) -- Prune near-zero weights -- Use separable convolutions - ---- - -## Testing - -```bash -./build/test_demo_effects # CNN construction/shader tests -./build/demo64k # Visual test -``` - ---- - -## Blend Parameter Behavior - -**blend_amount** controls final compositing with original: -- `blend=0.0`: Pure original (no CNN effect) -- `blend=0.5`: 50% original + 50% CNN -- `blend=1.0`: Pure CNN output (full stylization) - -**Important:** Blend uses captured layer 0 input, not previous layer output. - -**Example use cases:** -- `blend=1.0`: Full stylization (default) -- `blend=0.7`: Subtle effect preserving original details -- `blend=0.3`: Light artistic touch - -## Troubleshooting - -**Shader compilation fails:** -- Check `cnn_weights_generated.wgsl` syntax -- Verify snippets registered in `shaders.cc::InitShaderComposer()` -- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`) - -**Black/corrupted output:** -- Weights untrained (identity placeholder) -- Check `captured_frame` auxiliary texture is registered -- Verify layer priorities in timeline are sequential - -**Wrong blend result:** -- Ensure layer 0 has `needs_framebuffer_capture() == true` -- Check MainSequence framebuffer capture logic -- Verify `original_input` binding is populated - -**Training loss not decreasing:** -- Lower learning rate (`--learning-rate 0.0001`) -- More epochs (`--epochs 1000`) -- Check input/target image alignment - ---- - -## Vec4 Optimization - -**Architecture:** Weights stored as vec4 pairs for SIMD efficiency. 
- -**Input representation:** -```wgsl -let rgbd = textureSample(...); // vec4: [r, g, b, d] -let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0] -``` - -**Weight indexing:** -```wgsl -var pos = 0; // Direct weight array index -for (var dy = -1; dy <= 1; dy++) { - for (var dx = -1; dx <= 1; dx++) { - // Unrolled channel loop (4 output channels) - sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1); - sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1); - sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1); - sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1); - pos += 8; // 4 channels × 2 vec4s per channel - } -} -``` - -**Benefits:** -- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs) -- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment) -- **Bias integration:** Free via `[..., 1.0]` component (no separate add) -- **Code simplicity:** Eliminates inner loop, direct indexing with `pos` -- **Performance:** 2-3× GPU throughput improvement over scalar version - -**Weight layout per filter (8 floats):** -- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights) -- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias) - -**3×3 kernel sizes:** -- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 2304 bytes) -- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes) - ---- - -## References - -- **Training Script:** `training/train_cnn.py` -- **Shader Composition:** `doc/SEQUENCE.md` -- **Effect System:** `src/gpu/effect.h` diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md deleted file mode 100644 index bf63c5d..0000000 --- a/doc/CNN_FLATTEN_ANALYSIS.md +++ /dev/null @@ -1,189 +0,0 @@ -# CNN Shader Flatten Mode - Technical Analysis - -**Status:** Analysis complete - flatten mode NOT RECOMMENDED - -**Date:** February 2026 - ---- - -## Context - -Current CNN architecture uses **3 sequential render passes** (linear chaining): -- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer -- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer -- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original - -Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers. 
- ---- - -## Current Architecture - -**Shader Structure:** -- 1 pipeline with layer branching (`layer_index` uniform) -- 5 bindings: sampler, input texture, uniforms, layer params, original capture -- Total shader size: ~8 KB (snippets + weights) - -**Performance Profile:** -- 3 render pass dispatches -- 2 framebuffer writes + reads between layers -- Memory bandwidth: ~2× framebuffer size per layer -- Register pressure: Low (per-layer isolation) - -**Weight Buffer:** 290 vec4s (4.6 KB) - already unified - ---- - -## Flatten Approaches Evaluated - -### Option A: Full Flatten (All 3 Layers) - -**Cascading Receptive Field:** - -To compute final output at position (x, y): -- Layer 2 needs 3×3 neighborhood of Layer 1 outputs -- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs -- Each Layer 0 output needs 5×5 neighborhood of input samples - -**Effective input sampling:** 9×9 pixels (vs current 5×5 max) - -**Intermediate Storage (per thread/pixel):** -``` -Layer 0 outputs: 5×5 positions × 4 channels = 100 floats -Layer 1 outputs: 3×3 positions × 4 channels = 36 floats - TOTAL = 136 floats (544 bytes) -``` - -**GPU Register Pressure:** -- Modern GPUs: 32-64 KB registers per SM, shared across warps -- 544 bytes/thread → max 64 threads/SM (**low occupancy**) -- Current multi-pass: ~4-8 bytes/thread (high occupancy) - -**Pros:** -- 1 dispatch vs 3 (reduce CPU overhead) -- Zero framebuffer bandwidth between layers - -**Cons:** -- **Severe register pressure** (10-20× increase) -- Reduced occupancy → potential performance loss -- Complex shader (harder debug, larger binary) -- 9×9 input sampling - -**Assessment:** ❌ **Not Recommended** -Register cost outweighs bandwidth savings. - ---- - -### Option B: Partial Flatten (Layers 1 + 2) - -Keep Layer 0 separate, flatten only Layers 1 and 2. - -**Pass Structure:** -1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer -2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader - -**Intermediate Storage:** -``` -Layer 0 samples: 3×3 × 4 = 36 floats (read once) -Layer 1 outputs: 3×3 × 4 = 36 floats (computed) - TOTAL = 72 floats (288 bytes) -``` - -**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs - -**Pros:** -- 2 passes vs 3 (33% reduction) -- 1 framebuffer write saved -- More manageable register usage - -**Cons:** -- Still significant register pressure (288 bytes vs ~8 bytes baseline) -- Medium complexity increase -- Layer 0 (heaviest kernel) still separate - -**Assessment:** ⚠️ **Marginal Benefit** -Saves 1 pass but register cost still high. - ---- - -### Option C: Keep Current Multi-Pass ✅ - -**Rationale:** -- Current architecture well-suited to GPU design (high throughput via parallelism) -- Minimal register usage → high occupancy → hides memory latency -- Framebuffer bandwidth cost < register pressure cost -- Clean separation aids debugging/iteration -- Modular (easy to add/remove layers) - -**Alternative Optimizations (if bandwidth critical):** -1. Merge passes via render pass load/store ops (Vulkan subpasses) -2. Reduce intermediate channel count (4→3 or 2) -3. Hybrid: Compute shaders + workgroup shared memory -4. 
Layer pruning (2-layer vs 3-layer quality comparison) - ---- - -## Recommendation - -**✅ Keep current multi-pass architecture** - -### Decision Matrix - -| Factor | Multi-Pass | Partial Flatten | Full Flatten | -|--------|-----------|----------------|--------------| -| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme | -| Occupancy | ✅ High | ⚠️ Medium | ❌ Low | -| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest | -| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High | -| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard | -| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest | - -**Modern GPU Architecture Favors:** -- High parallelism (many small threads) over complex threads -- Hiding latency via occupancy over minimizing operations -- Memory bandwidth via caching, not elimination - ---- - -## Alternative: Compute Shader + Shared Memory - -**If bandwidth becomes critical:** -- Use compute shader with workgroup shared memory -- Load tile + halos into shared memory (9×9 input samples) -- Compute all 3 layers for tile interior (avoids redundant sampling) -- Requires explicit synchronization (`workgroupBarrier`) - -**Trade-offs:** -- ✅ Low register pressure + low bandwidth -- ❌ Compute pipeline complexity (no render pass integration) -- ❌ Tile edge handling -- ❌ Larger code size - ---- - -## Conclusion - -Current 3-pass architecture is **appropriate for demo64k**: -- Size-efficient (modular shaders) -- Performance adequate (bandwidth not bottleneck) -- Maintainable (clean layer isolation) - -**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint. - -### Size Optimization Alternatives (Better ROI) - -If size optimization critical, focus on: -1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization) -2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s) -3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels) - -These yield better size/performance than shader architecture changes. - ---- - -## References - -- `doc/CNN_EFFECT.md` - CNN implementation details -- `doc/CNN.md` - High-level CNN design -- `src/effects/cnn_effect.cc` - Current implementation -- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md deleted file mode 100644 index 3439f2c..0000000 --- a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md +++ /dev/null @@ -1,136 +0,0 @@ -# CNN RGBD→Grayscale Architecture Implementation - -## Summary - -Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input. - -## Changes Made - -### Architecture - -**Input:** RGBD (4 channels: RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Layer Configuration:** -- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation -- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation - -### Input Normalization (all to [-1,1]) - -- **RGBD:** `(rgbd - 0.5) * 2` -- **UV coords:** `(uv - 0.5) * 2` -- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter) - -**Rationale:** Zero-centered inputs for tanh activation, better gradient flow. - -### Modified Files - -**Training (`/Users/skal/demo/training/train_cnn.py`):** -1. Removed `CoordConv2d` class -2. Updated `SimpleCNN`: - - Inner layers: `Conv2d(7, 4)` - RGBD output - - Final layer: `Conv2d(7, 1)` - grayscale output -3. 
Updated `forward()`: - - Normalize RGBD/coords/gray to [-1,1] - - Concatenate 7-channel input for each layer - - Apply tanh (inner) or none (final) - - Denormalize final output -4. Updated `export_weights_to_wgsl()`: - - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values) - - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values) -5. Updated `generate_layer_shader()`: - - Use `cnn_conv3x3_7to4` for inner layers - - Use `cnn_conv3x3_7to1` for final layer - - Denormalize outputs from [-1,1] to [0,1] -6. Updated `ImagePairDataset`: - - Load RGBA input (was RGB) - -**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):** -1. Added `cnn_conv3x3_7to4()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 4-channel output: RGBD - - Weights: `array<array<f32, 8>, 36>` -2. Added `cnn_conv3x3_7to1()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 1-channel output: grayscale - - Weights: `array<array<f32, 8>, 9>` -3. Optimized: gray computed once in caller using `dot()`, not per-function - -**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):** -1. Updated architecture section with RGBD→grayscale pipeline -2. Updated training data requirements (RGBA input) -3. Updated weight storage format - -### No C++ Changes - -CNNLayerParams and bind groups remain unchanged. - -## Data Flow - -1. Layer 0 captures original RGBD to `captured_frame` -2. Each layer: - - Samples previous layer output (RGBD in [0,1]) - - Normalizes RGBD to [-1,1] - - Computes gray once using `dot()` (fs_main level) - - Normalizes UV coords to [-1,1] (inside conv functions) - - Concatenates 7-channel input - - Applies convolution with layer-specific weights - - Outputs RGBD (inner) or grayscale (final) in [-1,1] - - Applies tanh (inner only) - - Denormalizes to [0,1] for texture storage - - Blends with original - -## Next Steps - -1. **Prepare RGBD training data:** - - Input: RGBA images (RGB + depth in alpha) - - Target: Grayscale stylized output - -2. **Train network:** - ```bash - python3 training/train_cnn.py \ - --input training/input \ - --target training/output \ - --layers 3 \ - --epochs 1000 - ``` - -3. **Verify generated shaders:** - - Check `cnn_weights_generated.wgsl` structure - - Check `cnn_layer.wgsl` uses new conv functions - -4. 
**Test in demo:** - ```bash - cmake --build build -j4 - ./build/demo64k - ``` - -## Design Rationale - -**Why [-1,1] normalization?** -- Centered inputs for tanh (operates best around 0) -- Better gradient flow -- Standard ML practice for normalized data - -**Why RGBD throughout vs RGB?** -- Depth information propagates through network -- Enables depth-aware stylization -- Consistent 4-channel processing - -**Why 7-channel input?** -- Coordinates: position-dependent effects (vignettes) -- Grayscale: luminance-aware processing -- RGBD: full color+depth information -- Enables richer feature learning - -## Testing Checklist - -- [ ] Train network with RGBD input data -- [ ] Verify `cnn_weights_generated.wgsl` structure -- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions -- [ ] Build demo without errors -- [ ] Visual test: inner layers show RGBD evolution -- [ ] Visual test: final layer produces grayscale -- [ ] Visual test: blending works correctly -- [ ] Compare quality with previous RGB→RGB architecture diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md deleted file mode 100644 index 4307894..0000000 --- a/doc/CNN_TEST_TOOL.md +++ /dev/null @@ -1,244 +0,0 @@ -# CNN Shader Testing Tool - -Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer). - ---- - -## Purpose - -- Validate trained weights against ground truth -- Debug CNN layer behavior in isolation -- Generate test outputs for training workflow -- Match Python training script's inference mode - ---- - -## Architecture - -**Two implementations:** - -1. **CNN v1** (render pipeline, texture atlas weights) - - 3 fixed layers - - RGBA16Float intermediates - - BGRA8Unorm final output - -2. **CNN v2** (compute shaders, storage buffer weights) - - Dynamic layer count from binary - - 7D static features (RGBD + UV + sin + bias) - - RGBA32Uint packed f16 intermediates - - Storage buffer: ~3-5 KB weights - -**Core GPU utility:** `src/gpu/texture_readback.{h,cc}` -- Synchronous texture-to-CPU readback -- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm -- Protected with STRIP_ALL (0 bytes in release) - ---- - -## Usage - -```bash -cnn_test input.png output.png [OPTIONS] - -OPTIONS: - --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights) - --weights PATH Load weights from .bin (forces CNN v2, overrides layer config) - --blend F Final blend amount (0.0-1.0, default: 1.0) - --format ppm|png Output format (default: png) - --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights) - --save-intermediates DIR Save intermediate layers to directory - --debug-hex Print first 8 pixels as hex (debug) - --help Show usage -``` - -**Examples:** -```bash -# CNN v1 (render pipeline, 3 layers) -./build/cnn_test input.png output.png --cnn-version 1 - -# CNN v2 (compute, storage buffer, uses asset system weights) -./build/cnn_test input.png output.png --cnn-version 2 - -# CNN v2 with runtime weight loading (loads layer config from .bin) -./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin - -# 50% blend with original (v2) -./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5 - -# Debug hex dump -./build/cnn_test input.png output.png --cnn-version 2 --debug-hex -``` - -**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments. 
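The tool processes one image per invocation (see Limitations below), so validating a whole directory is simply a shell loop around the commands above; for example (input/output paths are placeholders):

```bash
# Hypothetical batch run over a training set; adjust paths to your layout.
mkdir -p out
for f in training/input/*.png; do
  ./build/cnn_test "$f" "out/$(basename "$f")" --cnn-version 2
done
```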
- ---- - -## Implementation Details - -### Core Readback Utility - -**File:** `src/gpu/texture_readback.{h,cc}` - -**Function:** -```cpp -std::vector<uint8_t> read_texture_pixels( - WGPUInstance instance, - WGPUDevice device, - WGPUTexture texture, - int width, - int height); -``` - -**Features:** -- Returns BGRA8 format (4 bytes per pixel) -- Synchronous blocking operation -- Cross-platform async callback handling (Win32 vs Native API) -- Automatic staging buffer creation and cleanup - -**Refactored OffscreenRenderTarget:** -```cpp -std::vector<uint8_t> OffscreenRenderTarget::read_pixels() { -#if !defined(STRIP_ALL) - return read_texture_pixels(instance_, device_, texture_, width_, height_); -#else - return std::vector<uint8_t>(); -#endif -} -``` - -### CNN v1 Pipeline (Render) - -**Fixed 3-layer architecture:** -- Ping-pong RGBA16Float textures -- CNNLayerParams (binding 3): layer_index, blend_amount -- Shader composer resolves #include directives - -### CNN v2 Pipeline (Compute) - -**Dynamic layer architecture:** -1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias) -2. **Layer computes:** N layers from binary weights (3-5 typically) - - Storage buffer weights (read-only) - - RGBA32Uint packed f16 textures (ping-pong) - - CNNv2LayerParams: kernel_size, channels, weight_offset, blend -3. **Readback:** RGBA32Uint → f16 decode → u8 clamp - -**Binary format:** Header (20B) + layer info (20B×N) + f16 weights - -**Weight Loading:** -- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`) -- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports) - - Layer count and kernel sizes parsed from binary header - - Overrides any `--layers` or `--cnn-version` arguments - - Enables runtime testing of training checkpoints without rebuild - ---- - -## Build Integration - -**CMakeLists.txt:** - -1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections) -2. Tool target: -```cmake -add_executable(cnn_test - tools/cnn_test.cc - src/tests/common/webgpu_test_fixture.cc - src/tests/common/offscreen_render_target.cc - ${PLATFORM_SOURCES} - ${GEN_DEMO_CC}) - -target_link_libraries(cnn_test PRIVATE - gpu util procedural ${DEMO_LIBS}) - -add_dependencies(cnn_test generate_demo_assets) - -target_compile_definitions(cnn_test PRIVATE - STB_IMAGE_IMPLEMENTATION - STB_IMAGE_WRITE_IMPLEMENTATION) -``` - -**Build:** -```bash -cmake -S . -B build -DDEMO_BUILD_TOOLS=ON -cmake --build build -j4 -``` - ---- - -## Validation Workflow (CNN v2) - -### 1. Train and Export -```bash -# Train and export weights -./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16 -``` - -### 2. Tool Inference -```bash -# Run tool with v2 -./build/cnn_test training/input/img_000.png output.png --cnn-version 2 -``` - -### 3. Visual Comparison -Compare output.png with training/target_X/img_000.png - ---- - -## Status - -**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation. - -**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool. -- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin` -- Matches CNNv2Effect architecture -- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code -- Root cause under investigation (weight indexing? texture sampling? activation clamping?) 
-- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation - ---- - -## Technical Notes (Readback Fix) - -**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5) - -**Root Cause:** Callback mode mismatch -- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`) -- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode) -- Callback never fired → timeout → empty buffer - -**Fix Applied:** -1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents` -2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)` -3. Added pre-mapping device poll to ensure copy completes - -**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110 - -**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes - ---- - -## Limitations - -- **CNN v1:** Produces incorrect output, use for debugging only -- **Single image:** Batch processing requires shell loop -- **No real-time preview:** Offline processing only -- **PNG input:** stb_image (JPEG/PNG/BMP/TGA also supported) - ---- - -## Technical Notes - -**CNN v2 f16 decoding:** -- RGBA32Uint texture stores 8×f16 as 4×u32 -- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8 -- Handles denormals, infinity, NaN - -**Cross-platform:** -- macOS, Linux (native WebGPU) -- Windows (mingw-w64 cross-compile) - -**Size impact:** -- Debug/STRIP_ALL=OFF: compiled -- STRIP_ALL=ON: 0 bytes (compiled out) -- FINAL_STRIP=ON: tool not built |
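For reference, the f16 decode step described in the Technical Notes above can be sketched as standalone C++ (illustrative only; the tool's actual decoder may differ in detail):

```cpp
// Illustrative half-float decoding for readback: each u32 word of an
// RGBA32Uint texel packs two f16 channels (four words = eight channels).
#include <cmath>
#include <cstdint>
#include <limits>

static float f16_to_f32(uint16_t h) {
  const uint32_t sign = h >> 15;
  const uint32_t exp  = (h >> 10) & 0x1f;
  const uint32_t man  = h & 0x3ff;
  float v;
  if (exp == 0)       v = std::ldexp(float(man), -24);                   // zero / denormal
  else if (exp == 31) v = man ? std::numeric_limits<float>::quiet_NaN()  // NaN
                              : std::numeric_limits<float>::infinity();  // +/- inf
  else                v = std::ldexp(float(man | 0x400), int(exp) - 25); // normal
  return sign ? -v : v;
}

static uint8_t clamp_to_u8(float v) {
  if (!(v > 0.0f)) v = 0.0f;   // also maps NaN to 0
  if (v > 1.0f)    v = 1.0f;
  return uint8_t(std::lround(v * 255.0f));
}

// Decode one packed word into two 8-bit channel values.
static void decode_word(uint32_t word, uint8_t out[2]) {
  out[0] = clamp_to_u8(f16_to_f32(uint16_t(word & 0xffffu)));
  out[1] = clamp_to_u8(f16_to_f32(uint16_t(word >> 16)));
}
```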
