# CNN Post-Processing Effect

Neural network-based stylization for rendered scenes.

---

## Overview

Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.

**Key Features:**
- Position-aware layer 0 (coordinate input for vignetting, edge effects)
- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
- Original input available to all layers via framebuffer capture
- Configurable final blend with the original scene
- Modular WGSL shader architecture
- Hardcoded weights (trained offline via PyTorch)
- ~5-9 KB binary footprint (see Size Budget)

---

## Architecture

### RGBD → Grayscale Pipeline

**Input:** RGBD (RGB + inverse depth D=1/z)
**Output:** Grayscale (1 channel)
**Layer input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1,1]

**Layers:**
- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
- **Final layer (N-1):** Conv2d(7→1) - output grayscale

```wgsl
// Inner layers: 7→4 (RGBD output, vec4-optimized)
fn cnn_conv3x3_7to4(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,                      // Grayscale [-1,1]
    weights: array<vec4<f32>, 72>   // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
) -> vec4<f32>

// Final layer: 7→1 (grayscale output, vec4-optimized)
fn cnn_conv3x3_7to1(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,
    weights: array<vec4<f32>, 18>   // 9 pos × 2 vec4 (8 floats per filter)
) -> f32
```

**Input normalization:**
- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
- **Grayscale** is computed once in fs_main via dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
- **Inter-layer data** stays in [-1,1] (no denormalization)
- **Final output** is denormalized for display: `(result + 1.0) * 0.5` → [0,1]

**Activation:** tanh for inner layers (output stays in [-1,1]); none for the final layer.

### Multi-Layer Architecture

CNNEffect supports multi-layer networks via automatic effect chaining:

1. **Timeline specifies total layers:** `CNNEffect layers=3 blend=0.7`
2. **Compiler expands to a chain:** 3 separate CNNEffect instances (layer 0→1→2)
3. **Framebuffer capture:** layer 0 captures the original input to `"captured_frame"`
4. **Original input binding:** all layers access the original via `@binding(4)`
5. **Final blend:** the last layer blends its result with the original: `mix(original, result, 0.7)`

**Framebuffer Capture API** (see the sketch at the end of this section):
- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
- Generic mechanism usable by any effect

### File Structure

```
src/effects/
  cnn_effect.h/cc              # CNNEffect class + framebuffer capture

workspaces/main/shaders/cnn/
  cnn_activation.wgsl          # tanh, ReLU, sigmoid, leaky_relu
  cnn_conv3x3.wgsl             # 3×3 convolution (standard + coord-aware)
  cnn_conv5x5.wgsl             # 5×5 convolution (standard + coord-aware)
  cnn_conv7x7.wgsl             # 7×7 convolution (standard + coord-aware)
  cnn_weights_generated.wgsl   # Weight arrays (auto-generated by train_cnn.py)
  cnn_layer.wgsl               # Main shader with layer switches (auto-generated by train_cnn.py)
```
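As referenced above, an effect opts into the capture mechanism by overriding a query method on `Effect`. A minimal sketch, assuming a virtual `needs_framebuffer_capture()` on the `Effect` base class as named in this document (the exact signature lives in `src/gpu/effect.h`):

```cpp
// Sketch: only the first layer of the chain requests a pre-pass blit of the
// original frame into the "captured_frame" auxiliary texture.
class CNNEffect : public Effect {
public:
    bool needs_framebuffer_capture() const override {
        // Layers 1..N-1 read the capture via @binding(4); they don't re-capture.
        return params_.layer_index == 0;
    }

private:
    CNNEffectParams params_;  // layer_index, total_layers, blend_amount
};
```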
---

## Training Workflow

### 1. Prepare Training Data

Provide input/target image pairs:
```
training/input/img_000.png    # RGBA (RGB + alpha)
training/output/img_000.png   # Grayscale target
```

**Note:** The alpha channel can carry depth (1/z) or a constant (255); the network learns primarily from RGB.

### 2. Train the Network

**Patch-based (recommended)** - preserves natural pixel scale:
```bash
python3 training/train_cnn.py \
    --input training/input --target training/output \
    --patch-size 32 --patches-per-image 64 --detector harris \
    --layers 3 --kernel-sizes 3,5,3 \
    --epochs 5000 --batch-size 16 --checkpoint-every 1000
```

**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)

**Full-image (legacy)** - resizes to 256×256:
```bash
python3 training/train_cnn.py \
    --input training/input --target training/output \
    --layers 3 --kernel-sizes 3,5,3 \
    --epochs 10000 --batch-size 8 --checkpoint-every 1000
```

**Auto-generated outputs:**
- `cnn_weights_generated.wgsl` - weight arrays
- `cnn_layer.wgsl` - layer shader

### 3. Export & Validate

```bash
# Export shaders
./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth

# Generate ground truth
./training/train_cnn.py --infer input.png \
    --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
```

### 4. Rebuild the Demo

```bash
cmake --build build -j4 && ./build/demo64k
```

---

## Usage

### C++ Integration

**Single layer (manual)** - the parameter struct is sketched below:
```cpp
#include "effects/cnn_effect.h"

CNNEffectParams p;
p.layer_index = 0;
p.total_layers = 1;
p.blend_amount = 1.0f;
auto cnn = std::make_shared<CNNEffect>(ctx, p);
timeline.add_effect(cnn, start_time, end_time);
```

**Multi-layer (automatic via the timeline compiler):** use the timeline syntax below; `seq_compiler` expands it into multiple instances.
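For reference, the parameter block set in the example above has this shape. A sketch, listing only the three fields this document uses (the real struct in `src/effects/cnn_effect.h` may carry more; the defaults are assumptions):

```cpp
// Sketch of CNNEffectParams; defaults are assumptions, not confirmed values.
struct CNNEffectParams {
    int   layer_index  = 0;     // Position of this instance in the layer chain
    int   total_layers = 1;     // Chain length; layer 0 captures the framebuffer
    float blend_amount = 1.0f;  // Final mix with the captured original
};
```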
### Timeline Examples

**Single-layer CNN (full stylization):**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=1
```

**Multi-layer CNN with blend:**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
```

This expands to:
```cpp
// Layer 0 (captures original, blend=1.0)
{
    CNNEffectParams p;
    p.layer_index = 0;
    p.total_layers = 3;
    p.blend_amount = 1.0f;
    seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
}
// Layer 1 (blend=1.0)
{
    CNNEffectParams p;
    p.layer_index = 1;
    p.total_layers = 3;
    p.blend_amount = 1.0f;
    seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
}
// Layer 2 (final blend=0.7)
{
    CNNEffectParams p;
    p.layer_index = 2;
    p.total_layers = 3;
    p.blend_amount = 0.7f;
    seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```

---

## Shader Structure

**Bindings:**
```wgsl
@group(0) @binding(0) var smplr: sampler;
@group(0) @binding(1) var txt: texture_2d<f32>;               // Current layer input
@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
@group(0) @binding(3) var<uniform> params: CNNLayerParams;
@group(0) @binding(4) var original_input: texture_2d<f32>;    // Layer 0 input (captured)
```

**Fragment shader logic:**
```wgsl
@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
    let uv = p.xy / uniforms.resolution;
    let original_raw = textureSample(original_input, smplr, uv);
    let original = (original_raw - 0.5) * 2.0;  // Normalize to [-1,1]
    let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
    var result = vec4<f32>(0.0);

    if (params.layer_index == 0) {
        result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
                                      weights_layer0);
        result = cnn_tanh(result);
    }
    else if (params.layer_index == 1) {
        result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
                                  gray, weights_layer1);
        result = cnn_tanh(result);
    }
    // ... other layers

    // Denormalize, then blend with the ORIGINAL input (not the previous layer)
    let display = (result + 1.0) * 0.5;
    return mix(original_raw, display, params.blend_amount);
}
```

**Weight storage (vec4-optimized):**

Inner layers (7→4, RGBD output):
```wgsl
// Structure: array<vec4<f32>, 72>
// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgbd][uv, gray, 1])
const weights_layer0: array<vec4<f32>, 72> = array(
    vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0_ch0 (rgbd weights)
    vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0_ch0 (uv, gray, bias)
    vec4<f32>(w1_r, w1_g, w1_b, w1_d),      // pos0_ch1 (rgbd weights)
    vec4<f32>(w1_u, w1_v, w1_gray, bias1),  // pos0_ch1 (uv, gray, bias)
    // ... 68 more vec4s
);
```

Final layer (7→1, grayscale output):
```wgsl
// Structure: array<vec4<f32>, 18>
// 9 pos × 2 vec4 (8 floats per filter: [rgbd][uv, gray, 1])
const weights_layerN: array<vec4<f32>, 18> = array(
    vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0 (rgbd weights)
    vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0 (uv, gray, bias)
    // ... 16 more vec4s
);
```

**Optimization:** the bias is integrated as the 4th component via the `vec4(uv, gray, 1.0)` input, so two dot4 operations replace eight scalar MADs per filter.
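To make the size budget below concrete, the weight bytes for the `--kernel-sizes 3,5,3` network from the training example work out as follows. A sketch of the arithmetic, assuming 16 bytes per `vec4<f32>` and the 2-vec4-per-filter layout above:

```cpp
// Weight footprint of a 3-layer 3,5,3 network (2 vec4s per filter).
constexpr int kVec4Bytes = 16;
constexpr int kLayer0 = 3 * 3 * 4 * 2;  // 3x3 inner:  9 pos x 4 ch x 2 = 72 vec4s
constexpr int kLayer1 = 5 * 5 * 4 * 2;  // 5x5 inner: 25 pos x 4 ch x 2 = 200 vec4s
constexpr int kLayer2 = 3 * 3 * 1 * 2;  // 3x3 final:  9 pos x 1 ch x 2 = 18 vec4s
constexpr int kTotal  = (kLayer0 + kLayer1 + kLayer2) * kVec4Bytes;
static_assert(kTotal == 4640, "~4.5 KB, inside the 2-6 KB weight budget");
```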
---

## Size Budget

| Component | Size | Notes |
|-----------|------|-------|
| Activation functions | ~200 B | 4 functions |
| Conv3x3 (standard + coord) | ~500 B | Both variants |
| Conv5x5 (standard + coord) | ~700 B | Both variants |
| Conv7x7 (standard + coord) | ~900 B | Both variants |
| Main shader | ~800 B | Layer composition |
| C++ implementation | ~300 B | Effect class |
| **Coord weights** | **+32 B** | Layer 0 only |
| **RGBD weights** | **2-6 KB** | Depends on depth/kernel sizes |
| **Total** | **5-9 KB** | Acceptable for 64k |

**Optimization strategies:**
- Quantize weights (float32 → int8)
- Prune near-zero weights
- Use separable convolutions

---

## Testing

```bash
./build/test_demo_effects   # CNN construction/shader tests
./build/demo64k             # Visual test
```

---

## Blend Parameter Behavior

**blend_amount** controls the final compositing with the original:
- `blend=0.0`: pure original (no CNN effect)
- `blend=0.5`: 50% original + 50% CNN
- `blend=1.0`: pure CNN output (full stylization)

**Important:** the blend uses the captured layer 0 input, not the previous layer's output.

**Example use cases:**
- `blend=1.0`: full stylization (default)
- `blend=0.7`: subtle effect preserving original details
- `blend=0.3`: light artistic touch
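The compositing itself is a plain linear interpolation. A one-line CPU-side sketch of what the final `mix` computes per channel (illustration only, not engine code):

```cpp
// mix(original, cnn, blend) = (1 - blend) * original + blend * cnn.
// With blend = 0.7, 30% of the captured layer-0 input survives.
float blend_channel(float original, float cnn, float blend) {
    return (1.0f - blend) * original + blend * cnn;
}
// blend_channel(0.2f, 0.8f, 0.7f) ~= 0.62f
```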
## Troubleshooting

**Shader compilation fails:**
- Check `cnn_weights_generated.wgsl` syntax
- Verify the snippets are registered in `shaders.cc::InitShaderComposer()`
- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)

**Black/corrupted output:**
- Weights untrained (identity placeholder)
- Check that the `captured_frame` auxiliary texture is registered
- Verify that layer priorities in the timeline are sequential

**Wrong blend result:**
- Ensure layer 0 has `needs_framebuffer_capture() == true`
- Check the MainSequence framebuffer capture logic
- Verify the `original_input` binding is populated

**Training loss not decreasing:**
- Lower the learning rate (`--learning-rate 0.0001`)
- Train for more epochs (increase `--epochs`)
- Check input/target image alignment

---

## Vec4 Optimization

**Architecture:** weights are stored as vec4 pairs for SIMD efficiency.

**Input representation:**
```wgsl
let rgbd = textureSample(...);             // vec4: [r, g, b, d]
let in1 = vec4<f32>(uv_norm, gray, 1.0);   // vec4: [u, v, gray, 1.0]
```

**Weight indexing:**
```wgsl
var pos = 0;  // Direct weight array index
for (var dy = -1; dy <= 1; dy++) {
    for (var dx = -1; dx <= 1; dx++) {
        // Unrolled channel loop (4 output channels)
        sum.r += dot(weights[pos + 0], rgbd) + dot(weights[pos + 1], in1);
        sum.g += dot(weights[pos + 2], rgbd) + dot(weights[pos + 3], in1);
        sum.b += dot(weights[pos + 4], rgbd) + dot(weights[pos + 5], in1);
        sum.a += dot(weights[pos + 6], rgbd) + dot(weights[pos + 7], in1);
        pos += 8;  // 4 channels × 2 vec4s per channel
    }
}
```

**Benefits:**
- **SIMD-native:** the GPU executes `dot(vec4, vec4)` as a single instruction (4 parallel MADs)
- **Memory bandwidth:** 2 vec4 loads instead of 8 scalar loads (better cache alignment)
- **Bias integration:** free via the `[..., 1.0]` component (no separate add)
- **Code simplicity:** eliminates the inner loop; direct indexing with `pos`
- **Performance:** 2-3× GPU throughput improvement over the scalar version

**Weight layout per filter (8 floats):**
- vec4[0]: [w_r, w_g, w_b, w_d] (RGBD input weights)
- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)

**3×3 kernel sizes:**
- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
- Final layer (7→1): 18 vec4s (9 pos × 2 vec4 = 288 bytes)
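Because the per-filter math is just two dot products, exported weights can be verified offline against the PyTorch model. A CPU reference sketch (plain C++; `Vec4`, `dot4`, and `filter_response` are local helpers for illustration, not engine API):

```cpp
struct Vec4 { float x, y, z, w; };

// CPU mirror of WGSL's dot(vec4, vec4).
float dot4(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// One filter response at one kernel position: two dot4s cover all
// 7 inputs plus bias, matching the [rgbd][uv, gray, 1] weight pairs.
float filter_response(const Vec4& w_rgbd, const Vec4& w_aux,
                      const Vec4& rgbd, float u, float v, float gray) {
    const Vec4 in1{u, v, gray, 1.0f};  // Trailing 1.0 turns w_aux.w into the bias
    return dot4(w_rgbd, rgbd) + dot4(w_aux, in1);
}
```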
---

## References

- **Training Script:** `training/train_cnn.py`
- **Shader Composition:** `doc/SEQUENCE.md`
- **Effect System:** `src/gpu/effect.h`