# CNN Post-Processing Effect

Neural network-based stylization for rendered scenes.

---

## Overview

Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.

**Key Features:**
- Position-aware layer 0 (coordinate input for vignetting, edge effects)
- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
- Original input available to all layers via framebuffer capture
- Configurable final blend with the original scene
- Modular WGSL shader architecture
- Hardcoded weights (trained offline via PyTorch)
- ~5-8 KB binary footprint

---

## Architecture

### RGBD → Grayscale Pipeline

**Input:** RGBD (RGB + inverse depth D = 1/z)
**Output:** Grayscale (1 channel)
**Layer input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1,1]

**Architecture:**
- **Inner layers (0..N-2):** Conv2d(7→4), outputting RGBD
- **Final layer (N-1):** Conv2d(7→1), outputting grayscale

```wgsl
// Inner layers: 7→4 (RGBD output, vec4-optimized)
fn cnn_conv3x3_7to4(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,                      // Grayscale [-1,1]
    weights: array<vec4<f32>, 72>   // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
) -> vec4<f32>

// Final layer: 7→1 (grayscale output, vec4-optimized)
fn cnn_conv3x3_7to1(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,
    weights: array<vec4<f32>, 18>   // 9 pos × 2 vec4 (8 floats per filter)
) -> f32
```

**Input normalization:**
- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
- **Grayscale** is computed once in fs_main via a dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
- **Inter-layer data** stays in [-1,1] (no denormalization)
- **Final output** is denormalized for display: `(result + 1.0) * 0.5` → [0,1]

**Activation:** tanh for inner layers (output stays in [-1,1]), none for the final layer.
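The same conventions written out as a minimal WGSL sketch (the helper names are hypothetical; the generated shaders inline these expressions directly):

```wgsl
// Hypothetical helpers illustrating the [-1,1] convention above.
fn cnn_normalize(v: vec4<f32>) -> vec4<f32> {
    return (v - 0.5) * 2.0;    // [0,1] texture data -> [-1,1]
}

fn cnn_denormalize(v: vec4<f32>) -> vec4<f32> {
    return (v + 1.0) * 0.5;    // [-1,1] network output -> [0,1], display only
}

fn cnn_activate_inner(v: vec4<f32>) -> vec4<f32> {
    return tanh(v);            // inner layers only; the final layer is linear
}
```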
### Multi-Layer Architecture

CNNEffect supports multi-layer networks via automatic effect chaining:

1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
3. **Framebuffer capture**: Layer 0 captures the original input to `"captured_frame"`
4. **Original input binding**: All layers access the original via `@binding(4)`
5. **Final blend**: The last layer blends its result with the original: `mix(original, result, 0.7)`

**Framebuffer Capture API:**
- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
- Generic mechanism usable by any effect (see the sketch below)
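From the shader's point of view, the capture mechanism fixes one of the two texture inputs for the whole chain. A minimal excerpt-style sketch, assuming the bindings listed under Shader Structure below:

```wgsl
// Inside fs_main: `txt` advances through the chain (layer k sees the
// output of layer k-1), while `original_input` stays frozen at the
// layer-0 capture.
let chained  = textureSample(txt, smplr, uv);             // previous layer
let original = textureSample(original_input, smplr, uv);  // captured scene
```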
### File Structure

```
src/effects/
  cnn_effect.h/cc             # CNNEffect class + framebuffer capture

workspaces/main/shaders/cnn/
  cnn_activation.wgsl         # tanh, ReLU, sigmoid, leaky_relu
  cnn_conv3x3.wgsl            # 3×3 convolution (standard + coord-aware)
  cnn_conv5x5.wgsl            # 5×5 convolution (standard + coord-aware)
  cnn_conv7x7.wgsl            # 7×7 convolution (standard + coord-aware)
  cnn_weights_generated.wgsl  # Weight arrays (auto-generated by train_cnn.py)
  cnn_layer.wgsl              # Main shader with layer switches (auto-generated by train_cnn.py)
```

---

## Training Workflow

### 1. Prepare Training Data

Input/target image pairs:
```
training/input/img_000.png    # RGBA (RGB + alpha)
training/output/img_000.png   # Grayscale target
```

**Note:** The alpha channel can be depth (1/z) or constant (255). The network learns primarily from RGB.

### 2. Train Network

**Patch-based (recommended)** - preserves natural pixel scale:
```bash
python3 training/train_cnn.py \
  --input training/input --target training/output \
  --patch-size 32 --patches-per-image 64 --detector harris \
  --layers 3 --kernel-sizes 3,5,3 \
  --epochs 5000 --batch-size 16 --checkpoint-every 1000
```

**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)

**Full-image (legacy)** - resizes to 256×256:
```bash
python3 training/train_cnn.py \
  --input training/input --target training/output \
  --layers 3 --kernel-sizes 3,5,3 \
  --epochs 10000 --batch-size 8 --checkpoint-every 1000
```

**Auto-generates:**
- `cnn_weights_generated.wgsl` - weight arrays
- `cnn_layer.wgsl` - layer shader

### 3. Export & Validate

```bash
# Export shaders
./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth

# Generate ground truth
./training/train_cnn.py --infer input.png \
  --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
```

### 4. Rebuild Demo

```bash
cmake --build build -j4 && ./build/demo64k
```

---

## Usage

### C++ Integration

**Single layer (manual):**
```cpp
#include "effects/cnn_effect.h"

CNNEffectParams p;
p.layer_index = 0;
p.total_layers = 1;
p.blend_amount = 1.0f;
auto cnn = std::make_shared<CNNEffect>(ctx, p);
timeline.add_effect(cnn, start_time, end_time);
```

**Multi-layer (automatic via timeline compiler):**

Use the timeline syntax below; `seq_compiler` expands it into multiple instances.

### Timeline Examples

**Single-layer CNN (full stylization):**
```
SEQUENCE 10.0 0
EFFECT
Hybrid3DEffect 0.00 5.00
EFFECT
CNNEffect 0.50 5.00 layers=1
```

**Multi-layer CNN with blend:**
```
SEQUENCE 10.0 0
EFFECT
Hybrid3DEffect 0.00 5.00
EFFECT
CNNEffect 0.50 5.00 layers=3 blend=0.7
```

Expands to:
```cpp
// Layer 0 (captures original, blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 0;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
}
// Layer 1 (blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 1;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
}
// Layer 2 (final blend=0.7)
{
  CNNEffectParams p;
  p.layer_index = 2;
  p.total_layers = 3;
  p.blend_amount = 0.7f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```

---

## Shader Structure

**Bindings:**
```wgsl
@group(0) @binding(0) var smplr: sampler;
@group(0) @binding(1) var txt: texture_2d<f32>;             // Current layer input
@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
@group(0) @binding(3) var<uniform> params: CNNLayerParams;
@group(0) @binding(4) var original_input: texture_2d<f32>;  // Layer 0 input (captured)
```

**Fragment shader logic:**
```wgsl
@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
  let uv = p.xy / uniforms.resolution;
  let original_raw = textureSample(original_input, smplr, uv);
  let original = (original_raw - 0.5) * 2.0;  // Normalize to [-1,1]
  let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
  var result = vec4<f32>(0.0);

  if (params.layer_index == 0) {
    result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
                                  weights_layer0);
    result = cnn_tanh(result);
  }
  else if (params.layer_index == 1) {
    result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
                              gray, weights_layer1);
    result = cnn_tanh(result);
  }
  // ... other layers

  // Blend with ORIGINAL input (not previous layer)
  return mix(original_raw, result, params.blend_amount);
}
```

**Weight Storage (vec4-optimized):**

**Inner layers (7→4 RGBD output):**
```wgsl
// Structure: array<vec4<f32>, 72>
// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
const weights_layer0: array<vec4<f32>, 72> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0_ch0 (rgba weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0_ch0 (uv, gray, bias)
  vec4<f32>(w1_r, w1_g, w1_b, w1_d),      // pos0_ch1 (rgba weights)
  vec4<f32>(w1_u, w1_v, w1_gray, bias1),  // pos0_ch1 (uv, gray, bias)
  // ... 68 more vec4s
);
```

**Final layer (7→1 grayscale output):**
```wgsl
// Structure: array<vec4<f32>, 18>
// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
const weights_layerN: array<vec4<f32>, 18> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0 (rgba weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0 (uv, gray, bias)
  // ... 16 more vec4s
);
```

**Optimization:** The bias is integrated as the 4th component via the `vec4(uv, gray, 1.0)` input, so two dot4 operations replace 8 scalar MADs per filter tap.
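This layout also makes the weight footprint easy to predict. As a quick check before the budget table, with 16 bytes per vec4 and two vec4s per filter tap:

$$\text{bytes}(K, C_{\text{out}}) = K^2 \times C_{\text{out}} \times 2 \times 16$$

A 3×3 inner layer (7→4) costs 9 × 4 × 2 × 16 = 1152 B, a 5×5 inner layer costs 25 × 4 × 2 × 16 = 3200 B, and a 3×3 final layer (7→1) costs 9 × 1 × 2 × 16 = 288 B. The `3,5,3` network trained above therefore carries roughly 1152 + 3200 + 288 ≈ 4.5 KB of weights, consistent with the 2-6 KB row below.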
---

## Size Budget

| Component | Size | Notes |
|-----------|------|-------|
| Activation functions | ~200 B | 4 functions |
| Conv3x3 (standard + coord) | ~500 B | Both variants |
| Conv5x5 (standard + coord) | ~700 B | Both variants |
| Conv7x7 (standard + coord) | ~900 B | Both variants |
| Main shader | ~800 B | Layer composition |
| C++ implementation | ~300 B | Effect class |
| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
| **Total** | **5-9 KB** | Acceptable for 64k |

**Optimization strategies:**
- Quantize weights (float32 → int8); see the sketch below
- Prune near-zero weights
- Use separable convolutions
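A hedged sketch of the quantization idea, assuming weights packed as signed-normalized bytes with a per-layer scale (the constant and function names are hypothetical, not part of the current shaders):

```wgsl
// Hypothetical int8 weight storage: each u32 packs four snorm bytes.
// unpack4x8snorm is a WGSL builtin that maps each byte to [-1,1].
const W_SCALE: f32 = 2.0;  // per-layer dequantization scale (example value)

fn load_weight_q8(packed: u32) -> vec4<f32> {
  return unpack4x8snorm(packed) * W_SCALE;  // 4 bytes -> one weight vec4
}
```

This would cut the weight tables to a quarter of their float32 size at the cost of one unpack per vec4 load.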
---

## Testing

```bash
./build/test_demo_effects   # CNN construction/shader tests
./build/demo64k             # Visual test
```

---

## Blend Parameter Behavior

**blend_amount** controls the final compositing with the original:
- `blend=0.0`: pure original (no CNN effect)
- `blend=0.5`: 50% original + 50% CNN
- `blend=1.0`: pure CNN output (full stylization)

**Important:** The blend uses the captured layer 0 input, not the previous layer's output.

**Example use cases:**
- `blend=1.0`: full stylization (default)
- `blend=0.7`: subtle effect preserving original details
- `blend=0.3`: light artistic touch

## Troubleshooting

**Shader compilation fails:**
- Check `cnn_weights_generated.wgsl` syntax
- Verify snippets are registered in `shaders.cc::InitShaderComposer()`
- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)

**Black/corrupted output:**
- Weights untrained (identity placeholder)
- Check that the `captured_frame` auxiliary texture is registered
- Verify layer priorities in the timeline are sequential

**Wrong blend result:**
- Ensure layer 0 has `needs_framebuffer_capture() == true`
- Check MainSequence framebuffer capture logic
- Verify the `original_input` binding is populated

**Training loss not decreasing:**
- Lower the learning rate (`--learning-rate 0.0001`)
- Train for more epochs (`--epochs 1000`)
- Check input/target image alignment

---

## Vec4 Optimization

**Architecture:** Weights are stored as vec4 pairs for SIMD efficiency.

**Input representation:**
```wgsl
let rgbd = textureSample(...);            // vec4: [r, g, b, d]
let in1 = vec4<f32>(uv_norm, gray, 1.0);  // vec4: [u, v, gray, 1.0]
```

**Weight indexing:**
```wgsl
var pos = 0;  // Direct weight array index
for (var dy = -1; dy <= 1; dy++) {
  for (var dx = -1; dx <= 1; dx++) {
    // Unrolled channel loop (4 output channels)
    sum.r += dot(weights[pos + 0], rgbd) + dot(weights[pos + 1], in1);
    sum.g += dot(weights[pos + 2], rgbd) + dot(weights[pos + 3], in1);
    sum.b += dot(weights[pos + 4], rgbd) + dot(weights[pos + 5], in1);
    sum.a += dot(weights[pos + 6], rgbd) + dot(weights[pos + 7], in1);
    pos += 8;  // 4 channels × 2 vec4s per channel
  }
}
```

**Benefits:**
- **SIMD-native:** the GPU executes `dot(vec4, vec4)` as a single instruction (4 parallel MADs)
- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
- **Bias integration:** free via the `[..., 1.0]` component (no separate add)
- **Code simplicity:** eliminates the inner loop; direct indexing via `pos`
- **Performance:** 2-3× GPU throughput improvement over the scalar version

**Weight layout per filter (8 floats):**
- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights)
- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)

**3×3 kernel sizes:**
- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)

---

## References

- **Training Script:** `training/train_cnn.py`
- **Shader Composition:** `doc/SEQUENCE.md`
- **Effect System:** `src/gpu/effect.h`
