Diffstat (limited to 'cnn_v1/docs')
| -rw-r--r-- | cnn_v1/docs/CNN.md                      |  79 |
| -rw-r--r-- | cnn_v1/docs/CNN_BIAS_FIX_2026-02.md     |  85 |
| -rw-r--r-- | cnn_v1/docs/CNN_DEBUG.md                |  43 |
| -rw-r--r-- | cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md     | 189 |
| -rw-r--r-- | cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md | 136 |
| -rw-r--r-- | cnn_v1/docs/CNN_TEST_TOOL.md            | 244 |
| -rw-r--r-- | cnn_v1/docs/CNN_V1_EFFECT.md            | 400 |
7 files changed, 1176 insertions, 0 deletions
diff --git a/cnn_v1/docs/CNN.md b/cnn_v1/docs/CNN.md
new file mode 100644
index 0000000..5d9a667
--- /dev/null
+++ b/cnn_v1/docs/CNN.md
@@ -0,0 +1,79 @@

# Convolutional Neural Net Shader (CNN) post-processing

**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass)

## Idea

Process the rendered 3D scene with a multi-layer CNN trained offline.
Input: the rendered scene.
Output: a 'stylized' scene produced by CNN post-processing.

**See `CNN_V1_EFFECT.md` for implementation details, usage, and API reference.**

## Shader implementation

### Input / output

One texture buffer is needed per CNN layer.
Input: (r, g, b, 1/z) for layer 0 (the rendered 3D scene), or the output of layer N-1 for layer N.
Output: (r, g, b, alpha). The 1/z information is not needed in the output (it can be fetched from the layer-0 input).

### Size of one layer

Notation:
S: the number of input samples from layer N-1.
Example: 3×3 input → S = 3×3 = 9.

Each of the S samples holds 4 values (r, g, b, w = 1/z).

Each sample is processed by a mat4 matrix: 4 inputs → 4 outputs.

Weight matrix: S × mat4.

Final bias: 4 values.

WGSL code example: see file `CNN.shader`.

### Layers

Do we need 3 or 4 layers?
Several different shaders, one per layer.
Ping-pong the input/output texture buffers between layers?

## Implementation Status

**Completed:**
- ✅ Modular WGSL shader architecture (6 snippet files)
- ✅ CNNEffect C++ class (single-layer rendering)
- ✅ ShaderComposer integration (#include resolution)
- ✅ Asset registration (7 new shader assets)
- ✅ Test coverage (test_demo_effects.cc)
- ✅ Placeholder identity weights for testing

**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total**

**Pending:**
- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights
- ⏳ Multi-layer rendering with ping-pong textures
- ⏳ Weight quantization for size optimization

---

## Training (To Be Implemented)

The layer weight/bias data are hard-coded in the shaders.
Training workflow:

1. Prepare image pairs (before: raw render, after: target style)
2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png`
3. Script generates `cnn_weights_generated.wgsl`
4. Rebuild: `cmake --build build -j4`

**Reference:** File `CNN.py` contains a training example (needs adaptation).

A repository of reference image pairs (before/after) is needed for training and validation.
Each input image is randomly sampled into 3×3 patches of (r, g, b, 1/z) input samples,
and the network is trained to match the (r, g, b, a) output.

Training generates the .wgsl code for the layer shaders.

diff --git a/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md b/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md
new file mode 100644
index 0000000..26db8eb
--- /dev/null
+++ b/cnn_v1/docs/CNN_BIAS_FIX_2026-02.md
@@ -0,0 +1,85 @@

# CNN Bias Accumulation Fix (2026-02-11)

## Problem
The bias was being added multiple times in the shader convolution loops (once per kernel position), causing a mismatch between PyTorch training and WGSL inference.

## Root Cause
**Location**: `training/train_cnn.py:381, 398`

When exporting weights to WGSL, the bias was replicated for every kernel position. The shader loops through positions doing:
```wgsl
sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0
```

For a 3×3 kernel (9 positions), the bias was added 9 times.
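The accumulation is easy to reproduce outside the shader; a minimal Python sketch with a single scalar channel and illustrative values:

```python
# Minimal sketch of the bias-accumulation bug (single channel, illustrative values).
positions = 9                      # 3x3 kernel
weights = [0.1] * positions
samples = [1.0] * positions
bias = 0.5

# Buggy export: the full bias was baked into every per-position weight entry.
buggy = sum(w * x + bias for w, x in zip(weights, samples))      # bias counted 9 times
# Intended result: bias applied once after the spatial sum.
intended = sum(w * x for w, x in zip(weights, samples)) + bias
# Fixed export: bias / num_positions baked in, so the loop re-accumulates exactly one bias.
fixed = sum(w * x + bias / positions for w, x in zip(weights, samples))

print(buggy, intended, fixed)      # 5.4, 1.4, 1.4 (up to float rounding)
```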
## Fix
Divide the bias by `num_positions` during export:
```python
# Final layer (7→1)
v1.append(f"{bias[0] / num_positions:.6f}")

# Inner layers (7→4)
v1.append(f"{bias[out_c] / num_positions:.6f}")
```

The shader then accumulates bias × num_positions = original bias (correct).

---

## Additional Improvements

### 1. RGBA Output Support
**train_cnn.py**: Now saves a 4-channel RGBA PNG, preserving the alpha from the input:
```python
alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy()
output_rgba = np.concatenate([output, alpha], axis=2)
Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA')
```

Intermediate layers are also saved as RGBA when 4-channel.

### 2. Debug Hex Output
**Both tools** support `--debug-hex` to print the first 8 pixels as hex:
```bash
./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex
./build/cnn_test input.png output.png --debug-hex
```

Output format: `[0] 0xRRGGBBAA`, for pixel-level comparison.

### 3. Cleanup
Removed the sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving).

---

## Files Modified
- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex
- `tools/cnn_test.cc`: --debug-hex, remove linear_png
- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias

## Testing
```bash
# Train with fixed export
./training/train_cnn.py --input training/input/ --target training/output/ \
    --layers 3 --kernel_sizes 3,3,3 --epochs 5000

# Generate ground truth
./training/train_cnn.py --infer input.png --export-only checkpoint.pth \
    --output ground_truth.png --debug-hex

# Run GPU tool
./build/cnn_test input.png tool_output.png --debug-hex

# Compare hex output for first 8 pixels
```

---

## Status
✅ Bias accumulation bug fixed
✅ RGBA output with alpha preservation
✅ Debug hex comparison tool
✅ Weights regenerated

Commit: `8ff8c56`

diff --git a/cnn_v1/docs/CNN_DEBUG.md b/cnn_v1/docs/CNN_DEBUG.md
new file mode 100644
index 0000000..ba220a0
--- /dev/null
+++ b/cnn_v1/docs/CNN_DEBUG.md
@@ -0,0 +1,43 @@

# CNN Effect Black Screen Bug - Resolution (2026-02)

## Problem
The CNN post-processing effect showed a black screen when activated at 11.50 s, even though the scene rendered correctly before the CNN started.

## Root Causes

### Bug 1: Framebuffer Capture Timing
**Location**: `src/gpu/effect.cc`
**Issue**: The capture ran INSIDE the post-effect loop, after the ping-pong buffer swaps. CNN layers 1+ therefore captured the wrong buffer (the output being written to, not the scene).
**Fix**: Moved the capture before the loop starts (lines 308-346). The capture now copies `framebuffer_a` to the `captured_frame` auxiliary texture ONCE before any post-effects run.

### Bug 2: Missing Uniforms Update ⚠️ CRITICAL
**Location**: `src/effects/cnn_effect.cc`
**Issue**: `CNNEffect::update_bind_group()` never updated the `uniforms_` buffer. `uniforms.resolution` was uninitialized (0,0 or garbage) → the UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples were black.
**Fix**: Added a uniforms update before bind group creation (lines 132-142):
```cpp
const CommonPostProcessUniforms u = {
    .resolution = {(float)width_, (float)height_},
    .aspect_ratio = (float)width_ / (float)height_,
    .time = 0.0f,
    .beat = 0.0f,
    .audio_intensity = 0.0f,
};
uniforms_.update(ctx_.queue, u);
```

## Key Lessons

1. **All post-process effects MUST update the `uniforms_` buffer** - Required for UV calculations and shader parameters
2. **Framebuffer capture timing is critical** - Must happen before the post-chain ping-pong starts
3. **Uninitialized uniforms cause silent failures** - Produce black output without validation errors
4. **Post-effects must render or the chain breaks** - `loadOp=Load` preserves the previous (black) content if no draw call executes

## Files Modified
- `src/gpu/effect.cc`: Lines 308-346 (capture timing)
- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update)

## Verification
Test: `demo64k --seek 11.5`
- ✅ Scene visible with RotatingCube
- ✅ CNN stylization applied
- ✅ All 3 layers process with the correct original texture reference

diff --git a/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md b/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md
new file mode 100644
index 0000000..8664157
--- /dev/null
+++ b/cnn_v1/docs/CNN_FLATTEN_ANALYSIS.md
@@ -0,0 +1,189 @@

# CNN Shader Flatten Mode - Technical Analysis

**Status:** Analysis complete - flatten mode NOT RECOMMENDED

**Date:** February 2026

---

## Context

The current CNN architecture uses **3 sequential render passes** (linear chaining):
- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with the original

Proposed **"flatten mode"**: Collapse all layers into a **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.

---

## Current Architecture

**Shader Structure:**
- 1 pipeline with layer branching (`layer_index` uniform)
- 5 bindings: sampler, input texture, uniforms, layer params, original capture
- Total shader size: ~8 KB (snippets + weights)

**Performance Profile:**
- 3 render pass dispatches
- 2 framebuffer writes + reads between layers
- Memory bandwidth: ~2× framebuffer size per layer
- Register pressure: Low (per-layer isolation)

**Weight Buffer:** 290 vec4s (4.6 KB) - already unified

---

## Flatten Approaches Evaluated

### Option A: Full Flatten (All 3 Layers)

**Cascading Receptive Field:**

To compute the final output at position (x, y):
- Layer 2 needs a 3×3 neighborhood of Layer 1 outputs
- Each Layer 1 output needs a 3×3 neighborhood of Layer 0 outputs
- Each Layer 0 output needs a 5×5 neighborhood of input samples

**Effective input sampling:** 9×9 pixels (vs current 5×5 max)

**Intermediate Storage (per thread/pixel):**
```
Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
                                       TOTAL = 136 floats (544 bytes)
```

**GPU Register Pressure:**
- Modern GPUs: 32-64 KB registers per SM, shared across warps
- 544 bytes/thread → max 64 threads/SM (**low occupancy**)
- Current multi-pass: ~4-8 bytes/thread (high occupancy)

**Pros:**
- 1 dispatch vs 3 (reduces CPU overhead)
- Zero framebuffer bandwidth between layers

**Cons:**
- **Severe register pressure** (10-20× increase)
- Reduced occupancy → potential performance loss
- Complex shader (harder to debug, larger binary)
- 9×9 input sampling

**Assessment:** ❌ **Not Recommended**
Register cost outweighs bandwidth savings.
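The intermediate-storage figure above follows directly from the kernel sizes and channel counts; a quick check of the arithmetic (assuming 4-byte floats):

```python
# Per-pixel intermediate storage for Option A (full flatten), 4 bytes per f32.
l2_kernel, l1_kernel, channels = 3, 3, 4

# Layer 2 needs a 3x3 grid of Layer 1 outputs; each of those needs a 3x3 grid of
# Layer 0 outputs, so Layer 0 must be evaluated on a (3 + 3 - 1) = 5-wide grid.
l0_grid = l2_kernel + l1_kernel - 1               # 5
l0_floats = l0_grid * l0_grid * channels          # 100
l1_floats = l2_kernel * l2_kernel * channels      # 36

total = l0_floats + l1_floats
print(total, total * 4)                           # 136 floats, 544 bytes
```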
---

### Option B: Partial Flatten (Layers 1 + 2)

Keep Layer 0 separate and flatten only Layers 1 and 2.

**Pass Structure:**
1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in a single shader

**Intermediate Storage:**
```
Layer 0 samples: 3×3 × 4 = 36 floats (read once)
Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
                   TOTAL = 72 floats (288 bytes)
```

**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs

**Pros:**
- 2 passes vs 3 (33% reduction)
- 1 framebuffer write saved
- More manageable register usage

**Cons:**
- Still significant register pressure (288 bytes vs ~8 bytes baseline)
- Medium complexity increase
- Layer 0 (heaviest kernel) still separate

**Assessment:** ⚠️ **Marginal Benefit**
Saves 1 pass, but the register cost remains high.

---

### Option C: Keep Current Multi-Pass ✅

**Rationale:**
- Current architecture is well-suited to GPU design (high throughput via parallelism)
- Minimal register usage → high occupancy → hides memory latency
- Framebuffer bandwidth cost < register pressure cost
- Clean separation aids debugging/iteration
- Modular (easy to add/remove layers)

**Alternative Optimizations (if bandwidth critical):**
1. Merge passes via render pass load/store ops (Vulkan subpasses)
2. Reduce intermediate channel count (4→3 or 2)
3. Hybrid: Compute shaders + workgroup shared memory
4. Layer pruning (2-layer vs 3-layer quality comparison)

---

## Recommendation

**✅ Keep current multi-pass architecture**

### Decision Matrix

| Factor | Multi-Pass | Partial Flatten | Full Flatten |
|--------|-----------|----------------|--------------|
| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |

**Modern GPU architecture favors:**
- High parallelism (many small threads) over complex threads
- Hiding latency via occupancy over minimizing operations
- Managing memory bandwidth via caching, not elimination

---

## Alternative: Compute Shader + Shared Memory

**If bandwidth becomes critical:**
- Use a compute shader with workgroup shared memory
- Load tile + halos into shared memory (9×9 input samples)
- Compute all 3 layers for the tile interior (avoids redundant sampling)
- Requires explicit synchronization (`workgroupBarrier`)

**Trade-offs:**
- ✅ Low register pressure + low bandwidth
- ❌ Compute pipeline complexity (no render pass integration)
- ❌ Tile edge handling
- ❌ Larger code size

---

## Conclusion

The current 3-pass architecture is **appropriate for demo64k**:
- Size-efficient (modular shaders)
- Performance adequate (bandwidth is not the bottleneck)
- Maintainable (clean layer isolation)

**Flatten mode is not recommended** unless profiling reveals a specific bandwidth constraint.

### Size Optimization Alternatives (Better ROI)

If size optimization is critical, focus on:
1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization; see the sketch below)
2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)

These yield better size/performance trade-offs than shader architecture changes.
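As a rough illustration of option 1, a minimal symmetric 8-bit quantization sketch (not the project's exporter; numpy-based, one scale per layer):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-layer 8-bit quantization: w ~ q * scale, q in [-127, 127]."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-8)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(290, 4).astype(np.float32)    # 290 vec4s of f32 = 4640 bytes
q, s = quantize_int8(w)
print(w.nbytes, q.nbytes + 4)                     # 4640 -> 1164 (int8 + one f32 scale)
print(float(np.abs(dequantize(q, s) - w).max()))  # worst-case quantization error
```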
+ +--- + +## References + +- `CNN_V1_EFFECT.md` - CNN implementation details +- `CNN.md` - High-level CNN design +- `../src/cnn_effect.cc` - Current implementation +- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets diff --git a/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md b/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md new file mode 100644 index 0000000..3439f2c --- /dev/null +++ b/cnn_v1/docs/CNN_RGBD_GRAYSCALE_SUMMARY.md @@ -0,0 +1,136 @@ +# CNN RGBD→Grayscale Architecture Implementation + +## Summary + +Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input. + +## Changes Made + +### Architecture + +**Input:** RGBD (4 channels: RGB + inverse depth D=1/z) +**Output:** Grayscale (1 channel) +**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] + +**Layer Configuration:** +- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation +- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation + +### Input Normalization (all to [-1,1]) + +- **RGBD:** `(rgbd - 0.5) * 2` +- **UV coords:** `(uv - 0.5) * 2` +- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter) + +**Rationale:** Zero-centered inputs for tanh activation, better gradient flow. + +### Modified Files + +**Training (`/Users/skal/demo/training/train_cnn.py`):** +1. Removed `CoordConv2d` class +2. Updated `SimpleCNN`: + - Inner layers: `Conv2d(7, 4)` - RGBD output + - Final layer: `Conv2d(7, 1)` - grayscale output +3. Updated `forward()`: + - Normalize RGBD/coords/gray to [-1,1] + - Concatenate 7-channel input for each layer + - Apply tanh (inner) or none (final) + - Denormalize final output +4. Updated `export_weights_to_wgsl()`: + - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values) + - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values) +5. Updated `generate_layer_shader()`: + - Use `cnn_conv3x3_7to4` for inner layers + - Use `cnn_conv3x3_7to1` for final layer + - Denormalize outputs from [-1,1] to [0,1] +6. Updated `ImagePairDataset`: + - Load RGBA input (was RGB) + +**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):** +1. Added `cnn_conv3x3_7to4()`: + - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) + - 4-channel output: RGBD + - Weights: `array<array<f32, 8>, 36>` +2. Added `cnn_conv3x3_7to1()`: + - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) + - 1-channel output: grayscale + - Weights: `array<array<f32, 8>, 9>` +3. Optimized: gray computed once in caller using `dot()`, not per-function + +**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):** +1. Updated architecture section with RGBD→grayscale pipeline +2. Updated training data requirements (RGBA input) +3. Updated weight storage format + +### No C++ Changes + +CNNLayerParams and bind groups remain unchanged. + +## Data Flow + +1. Layer 0 captures original RGBD to `captured_frame` +2. Each layer: + - Samples previous layer output (RGBD in [0,1]) + - Normalizes RGBD to [-1,1] + - Computes gray once using `dot()` (fs_main level) + - Normalizes UV coords to [-1,1] (inside conv functions) + - Concatenates 7-channel input + - Applies convolution with layer-specific weights + - Outputs RGBD (inner) or grayscale (final) in [-1,1] + - Applies tanh (inner only) + - Denormalizes to [0,1] for texture storage + - Blends with original + +## Next Steps + +1. 
## Next Steps

1. **Prepare RGBD training data:**
   - Input: RGBA images (RGB + depth in alpha)
   - Target: Grayscale stylized output

2. **Train the network:**
   ```bash
   python3 training/train_cnn.py \
       --input training/input \
       --target training/output \
       --layers 3 \
       --epochs 1000
   ```

3. **Verify the generated shaders:**
   - Check the `cnn_weights_generated.wgsl` structure
   - Check that `cnn_layer.wgsl` uses the new conv functions

4. **Test in the demo:**
   ```bash
   cmake --build build -j4
   ./build/demo64k
   ```

## Design Rationale

**Why [-1,1] normalization?**
- Centered inputs for tanh (operates best around 0)
- Better gradient flow
- Standard ML practice for normalized data

**Why RGBD throughout vs RGB?**
- Depth information propagates through the network
- Enables depth-aware stylization
- Consistent 4-channel processing

**Why a 7-channel input?**
- Coordinates: position-dependent effects (vignettes)
- Grayscale: luminance-aware processing
- RGBD: full color + depth information
- Enables richer feature learning

## Testing Checklist

- [ ] Train the network with RGBD input data
- [ ] Verify the `cnn_weights_generated.wgsl` structure
- [ ] Verify `cnn_layer.wgsl` uses the `7to4`/`7to1` functions
- [ ] Build the demo without errors
- [ ] Visual test: inner layers show RGBD evolution
- [ ] Visual test: final layer produces grayscale
- [ ] Visual test: blending works correctly
- [ ] Compare quality with the previous RGB→RGB architecture

diff --git a/cnn_v1/docs/CNN_TEST_TOOL.md b/cnn_v1/docs/CNN_TEST_TOOL.md
new file mode 100644
index 0000000..4307894
--- /dev/null
+++ b/cnn_v1/docs/CNN_TEST_TOOL.md
@@ -0,0 +1,244 @@

# CNN Shader Testing Tool

Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).

---

## Purpose

- Validate trained weights against ground truth
- Debug CNN layer behavior in isolation
- Generate test outputs for the training workflow
- Match the Python training script's inference mode

---

## Architecture

**Two implementations:**

1. **CNN v1** (render pipeline, texture atlas weights)
   - 3 fixed layers
   - RGBA16Float intermediates
   - BGRA8Unorm final output
2. **CNN v2** (compute shaders, storage buffer weights)
   - Dynamic layer count from the binary
   - 7D static features (RGBD + UV + sin + bias)
   - RGBA32Uint packed f16 intermediates
   - Storage buffer: ~3-5 KB weights

**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
- Synchronous texture-to-CPU readback
- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
- Protected with STRIP_ALL (0 bytes in release)

---

## Usage

```bash
cnn_test input.png output.png [OPTIONS]

OPTIONS:
  --cnn-version N           CNN version: 1 (default) or 2 (ignored with --weights)
  --weights PATH            Load weights from .bin (forces CNN v2, overrides layer config)
  --blend F                 Final blend amount (0.0-1.0, default: 1.0)
  --format ppm|png          Output format (default: png)
  --layers N                Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights)
  --save-intermediates DIR  Save intermediate layers to directory
  --debug-hex               Print first 8 pixels as hex (debug)
  --help                    Show usage
```

**Examples:**
```bash
# CNN v1 (render pipeline, 3 layers)
./build/cnn_test input.png output.png --cnn-version 1

# CNN v2 (compute, storage buffer, uses asset system weights)
./build/cnn_test input.png output.png --cnn-version 2

# CNN v2 with runtime weight loading (loads layer config from .bin)
./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin

# 50% blend with original (v2)
./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5

# Debug hex dump
./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
```

**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments.

---

## Implementation Details

### Core Readback Utility

**File:** `src/gpu/texture_readback.{h,cc}`

**Function:**
```cpp
std::vector<uint8_t> read_texture_pixels(
    WGPUInstance instance,
    WGPUDevice device,
    WGPUTexture texture,
    int width,
    int height);
```

**Features:**
- Returns BGRA8 format (4 bytes per pixel)
- Synchronous blocking operation
- Cross-platform async callback handling (Win32 vs Native API)
- Automatic staging buffer creation and cleanup

**Refactored OffscreenRenderTarget:**
```cpp
std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
#if !defined(STRIP_ALL)
  return read_texture_pixels(instance_, device_, texture_, width_, height_);
#else
  return std::vector<uint8_t>();
#endif
}
```

### CNN v1 Pipeline (Render)

**Fixed 3-layer architecture:**
- Ping-pong RGBA16Float textures
- CNNLayerParams (binding 3): layer_index, blend_amount
- Shader composer resolves #include directives

### CNN v2 Pipeline (Compute)

**Dynamic layer architecture:**
1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
2. **Layer computes:** N layers from binary weights (3-5 typically)
   - Storage buffer weights (read-only)
   - RGBA32Uint packed f16 textures (ping-pong)
   - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
3. **Readback:** RGBA32Uint → f16 decode → u8 clamp

**Binary format:** Header (20 B) + layer info (20 B × N) + f16 weights

**Weight Loading:**
- **Without `--weights`:** Loads from the asset system (`ASSET_WEIGHTS_CNN_V2`)
- **With `--weights PATH`:** Loads from an external `.bin` file (e.g., checkpoint exports)
  - Layer count and kernel sizes are parsed from the binary header
  - Overrides any `--layers` or `--cnn-version` arguments
  - Enables runtime testing of training checkpoints without a rebuild
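The 20-byte header and 20-byte-per-layer records are small enough to inspect from Python. Only the sizes above come from this document; the field names in the sketch below are assumptions for illustration (the authoritative layout lives in the exporter and the CNN v2 loader):

```python
import struct

def dump_cnn_v2_bin(path: str) -> None:
    """Rough inspection of a CNN v2 weight .bin: 20-byte header, 20 bytes per layer,
    then packed f16 weights. Field names are guesses for illustration only."""
    with open(path, "rb") as f:
        data = f.read()
    header = struct.unpack_from("<5I", data, 0)          # 5 x u32 = 20 bytes (fields assumed)
    num_layers = header[1]                               # assumed position of the layer count
    print("header words:", header)
    for i in range(num_layers):
        rec = struct.unpack_from("<5I", data, 20 + 20 * i)
        print(f"layer {i}: record words = {rec}")        # e.g. kernel size, channels, weight offset
    weight_bytes = len(data) - 20 - 20 * num_layers
    print("f16 weight payload:", weight_bytes // 2, "values")

# dump_cnn_v2_bin("workspaces/main/weights/cnn_v2_weights.bin")
```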
---

## Build Integration

**CMakeLists.txt:**

1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
2. Tool target:
```cmake
add_executable(cnn_test
    tools/cnn_test.cc
    src/tests/common/webgpu_test_fixture.cc
    src/tests/common/offscreen_render_target.cc
    ${PLATFORM_SOURCES}
    ${GEN_DEMO_CC})

target_link_libraries(cnn_test PRIVATE
    gpu util procedural ${DEMO_LIBS})

add_dependencies(cnn_test generate_demo_assets)

target_compile_definitions(cnn_test PRIVATE
    STB_IMAGE_IMPLEMENTATION
    STB_IMAGE_WRITE_IMPLEMENTATION)
```

**Build:**
```bash
cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
cmake --build build -j4
```

---

## Validation Workflow (CNN v2)

### 1. Train and Export
```bash
# Train and export weights
./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
```

### 2. Tool Inference
```bash
# Run tool with v2
./build/cnn_test training/input/img_000.png output.png --cnn-version 2
```

### 3. Visual Comparison
Compare `output.png` with `training/target_X/img_000.png`.

---

## Status

**CNN v1:** Builds and runs, but produces incorrect output (all white). Use CNNEffect in the demo for visual validation.

**CNN v2:** ⚠️ Partially functional. Readback works, but the output differs from the HTML validation tool.
- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
- Matches the CNNv2Effect architecture
- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code
- Root cause under investigation (weight indexing? texture sampling? activation clamping?)
- Use the HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation

---

## Technical Notes (Readback Fix)

**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5)

**Root Cause:** Callback mode mismatch
- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`)
- Called `wgpuInstanceProcessEvents` in the wait loop (wrong API for this mode)
- Callback never fired → timeout → empty buffer

**Fix Applied:**
1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents`
2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)`
3. Added a pre-mapping device poll to ensure the copy completes

**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110

**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes

---

## Limitations

- **CNN v1:** Produces incorrect output; use for debugging only
- **Single image:** Batch processing requires a shell loop
- **No real-time preview:** Offline processing only
- **PNG input:** stb_image (JPEG/PNG/BMP/TGA also supported)

---

## Technical Notes

**CNN v2 f16 decoding:**
- RGBA32Uint texture stores 8×f16 as 4×u32
- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
- Handles denormals, infinity, NaN

**Cross-platform:**
- macOS, Linux (native WebGPU)
- Windows (mingw-w64 cross-compile)

**Size impact:**
- Debug/STRIP_ALL=OFF: compiled in
- STRIP_ALL=ON: 0 bytes (compiled out)
- FINAL_STRIP=ON: tool not built

diff --git a/cnn_v1/docs/CNN_V1_EFFECT.md b/cnn_v1/docs/CNN_V1_EFFECT.md
new file mode 100644
index 0000000..40f095e
--- /dev/null
+++ b/cnn_v1/docs/CNN_V1_EFFECT.md
@@ -0,0 +1,400 @@

# CNN Post-Processing Effect

Neural network-based stylization for rendered scenes.

---

## Overview

Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.

**Key Features:**
- Position-aware layer 0 (coordinate input for vignetting, edge effects)
- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
- Original input available to all layers via framebuffer capture
- Configurable final blend with the original scene
- Modular WGSL shader architecture
- Hardcoded weights (trained offline via PyTorch)
- ~5-8 KB binary footprint

---

## Architecture

### RGBD → Grayscale Pipeline

**Input:** RGBD (RGB + inverse depth D=1/z)
**Output:** Grayscale (1 channel)
**Layer Input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1,1]

**Architecture:**
- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
- **Final layer (N-1):** Conv2d(7→1) - output grayscale

```wgsl
// Inner layers: 7→4 (RGBD output, vec4-optimized)
fn cnn_conv3x3_7to4(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,                      // Grayscale in [-1,1]
    weights: array<vec4<f32>, 72>   // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
) -> vec4<f32>

// Final layer: 7→1 (grayscale output, vec4-optimized)
fn cnn_conv3x3_7to1(
    tex: texture_2d<f32>,
    samp: sampler,
    uv: vec2<f32>,
    resolution: vec2<f32>,
    gray: f32,
    weights: array<vec4<f32>, 18>   // 9 pos × 2 vec4 (8 floats per filter)
) -> f32
```

**Input normalization:**
- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
- **Grayscale** computed once in fs_main using a dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
- **Inter-layer data** stays in [-1,1] (no denormalization)
- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]

**Activation:** tanh for inner layers (output stays in [-1,1]), none for the final layer

### Multi-Layer Architecture

CNNEffect supports multi-layer networks via automatic effect chaining:

1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
3. **Framebuffer capture**: Layer 0 captures the original input to `"captured_frame"`
4. **Original input binding**: All layers access the original via `@binding(4)`
5. **Final blend**: The last layer blends the result with the original: `mix(original, result, 0.7)`

**Framebuffer Capture API:**
- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
- MainSequence automatically blits the input → `"captured_frame"` auxiliary texture
- Generic mechanism usable by any effect

### File Structure

```
src/effects/
  cnn_effect.h/cc            # CNNEffect class + framebuffer capture

workspaces/main/shaders/cnn/
  cnn_activation.wgsl        # tanh, ReLU, sigmoid, leaky_relu
  cnn_conv3x3.wgsl           # 3×3 convolution (standard + coord-aware)
  cnn_conv5x5.wgsl           # 5×5 convolution (standard + coord-aware)
  cnn_conv7x7.wgsl           # 7×7 convolution (standard + coord-aware)
  cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py)
  cnn_layer.wgsl             # Main shader with layer switches (auto-generated by train_cnn.py)
```

---

## Training Workflow

### 1. Prepare Training Data

Input/target image pairs:
```
training/input/img_000.png   # RGBA (RGB + alpha)
training/output/img_000.png  # Grayscale target
```

**Note:** The alpha channel can be depth (1/z) or constant (255). The network learns primarily from RGB.
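If the depth buffer is exported separately, a hypothetical helper like the following (Pillow/numpy; function and file names are illustrative, not part of the tooling) can pack 1/z into the alpha channel of a training input:

```python
import numpy as np
from PIL import Image

def pack_rgbd(rgb_path: str, depth_path: str, out_path: str) -> None:
    """Hypothetical helper: store normalized inverse depth (1/z) in the alpha channel
    of an RGBA training input. File names and normalization are illustrative."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"), dtype=np.uint8)
    z = np.asarray(Image.open(depth_path).convert("F"), dtype=np.float32)  # depth in scene units

    inv_z = 1.0 / np.maximum(z, 1e-6)
    inv_z /= inv_z.max()                                   # normalize 1/z to [0, 1]
    alpha = (inv_z * 255.0).astype(np.uint8)[..., None]

    Image.fromarray(np.concatenate([rgb, alpha], axis=2), mode="RGBA").save(out_path)

# pack_rgbd("render.png", "depth.png", "training/input/img_000.png")
```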
### 2. Train Network

**Patch-based (recommended)** - preserves the natural pixel scale:
```bash
python3 training/train_cnn.py \
    --input training/input --target training/output \
    --patch-size 32 --patches-per-image 64 --detector harris \
    --layers 3 --kernel-sizes 3,5,3 \
    --epochs 5000 --batch-size 16 --checkpoint-every 1000
```

**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)

**Full-image (legacy)** - resizes to 256×256:
```bash
python3 training/train_cnn.py \
    --input training/input --target training/output \
    --layers 3 --kernel-sizes 3,5,3 \
    --epochs 10000 --batch-size 8 --checkpoint-every 1000
```

**Auto-generates:**
- `cnn_weights_generated.wgsl` - Weight arrays
- `cnn_layer.wgsl` - Layer shader

### 3. Export & Validate

```bash
# Export shaders
./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth

# Generate ground truth
./training/train_cnn.py --infer input.png \
    --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
```

### 4. Rebuild Demo

```bash
cmake --build build -j4 && ./build/demo64k
```

---

## Usage

### C++ Integration

**Single layer (manual):**
```cpp
#include "effects/cnn_effect.h"

CNNEffectParams p;
p.layer_index = 0;
p.total_layers = 1;
p.blend_amount = 1.0f;
auto cnn = std::make_shared<CNNEffect>(ctx, p);
timeline.add_effect(cnn, start_time, end_time);
```

**Multi-layer (automatic via timeline compiler):**

Use the timeline syntax; `seq_compiler` expands it into multiple instances.

### Timeline Examples

**Single-layer CNN (full stylization):**
```
SEQUENCE 10.0 0
  EFFECT
    Hybrid3DEffect 0.00 5.00
  EFFECT
    CNNEffect 0.50 5.00 layers=1
```

**Multi-layer CNN with blend:**
```
SEQUENCE 10.0 0
  EFFECT
    Hybrid3DEffect 0.00 5.00
  EFFECT
    CNNEffect 0.50 5.00 layers=3 blend=0.7
```

Expands to:
```cpp
// Layer 0 (captures original, blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 0;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
}
// Layer 1 (blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 1;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
}
// Layer 2 (final blend=0.7)
{
  CNNEffectParams p;
  p.layer_index = 2;
  p.total_layers = 3;
  p.blend_amount = 0.7f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```

---

## Shader Structure

**Bindings:**
```wgsl
@group(0) @binding(0) var smplr: sampler;
@group(0) @binding(1) var txt: texture_2d<f32>;              // Current layer input
@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
@group(0) @binding(3) var<uniform> params: CNNLayerParams;
@group(0) @binding(4) var original_input: texture_2d<f32>;   // Layer 0 input (captured)
```

**Fragment shader logic:**
```wgsl
@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
  let uv = p.xy / uniforms.resolution;
  let original_raw = textureSample(original_input, smplr, uv);
  let original = (original_raw - 0.5) * 2.0;  // Normalize to [-1,1]
  let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
  var result = vec4<f32>(0.0);

  if (params.layer_index == 0) {
    result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
                                  weights_layer0);
    result = cnn_tanh(result);
  }
  else if (params.layer_index == 1) {
    result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
                              gray, weights_layer1);
    result = cnn_tanh(result);
  }
  // ... other layers

  // Blend with the ORIGINAL input (not the previous layer)
  return mix(original_raw, result, params.blend_amount);
}
```

**Weight Storage (vec4-optimized):**

**Inner layers (7→4 RGBD output):**
```wgsl
// Structure: array<vec4<f32>, 72>
// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgbd][uv, gray, 1])
const weights_layer0: array<vec4<f32>, 72> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0_ch0 (rgbd weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0_ch0 (uv, gray, bias)
  vec4<f32>(w1_r, w1_g, w1_b, w1_d),      // pos0_ch1 (rgbd weights)
  vec4<f32>(w1_u, w1_v, w1_gray, bias1),  // pos0_ch1 (uv, gray, bias)
  // ... 68 more vec4s
);
```

**Final layer (7→1 grayscale output):**
```wgsl
// Structure: array<vec4<f32>, 18>
// 9 pos × 2 vec4 (8 floats per filter: [rgbd][uv, gray, 1])
const weights_layerN: array<vec4<f32>, 18> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),      // pos0 (rgbd weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),  // pos0 (uv, gray, bias)
  // ... 16 more vec4s
);
```

**Optimization:** The bias is integrated as the 4th component via the `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs.
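The export side of this layout is straightforward; a hedged sketch of how a PyTorch `Conv2d(7, 4)` inner layer maps onto the 72-vec4 array (position-major ordering, two vec4s per (position, channel), bias divided by the number of kernel taps as documented in the bias fix; the actual exporter is `training/train_cnn.py`):

```python
import torch

def pack_inner_layer(conv: torch.nn.Conv2d) -> list[str]:
    """Sketch of the vec4 packing above for an inner Conv2d(7, 4, 3) layer.
    Ordering and bias handling follow the layout described in this document;
    the real exporter lives in training/train_cnn.py."""
    w, b = conv.weight.detach(), conv.bias.detach()   # w: (4, 7, 3, 3), b: (4,)
    k = conv.kernel_size[0]
    num_positions = k * k
    vec4s = []
    for ky in range(k):
        for kx in range(k):
            for out_c in range(w.shape[0]):
                f = w[out_c, :, ky, kx].tolist()           # 7 input weights for this tap
                bias = b[out_c].item() / num_positions     # re-accumulated once per tap in WGSL
                vec4s.append("vec4<f32>({:.6f}, {:.6f}, {:.6f}, {:.6f})".format(*f[:4]))
                vec4s.append("vec4<f32>({:.6f}, {:.6f}, {:.6f}, {:.6f})".format(f[4], f[5], f[6], bias))
    return vec4s                                           # 9 pos x 4 ch x 2 = 72 entries

print(len(pack_inner_layer(torch.nn.Conv2d(7, 4, kernel_size=3, padding=1))))  # 72
```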
+ +--- + +## Size Budget + +| Component | Size | Notes | +|-----------|------|-------| +| Activation functions | ~200 B | 4 functions | +| Conv3x3 (standard + coord) | ~500 B | Both variants | +| Conv5x5 (standard + coord) | ~700 B | Both variants | +| Conv7x7 (standard + coord) | ~900 B | Both variants | +| Main shader | ~800 B | Layer composition | +| C++ implementation | ~300 B | Effect class | +| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) | +| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes | +| **Total** | **5-9 KB** | Acceptable for 64k | + +**Optimization strategies:** +- Quantize weights (float32 → int8) +- Prune near-zero weights +- Use separable convolutions + +--- + +## Testing + +```bash +./build/test_demo_effects # CNN construction/shader tests +./build/demo64k # Visual test +``` + +--- + +## Blend Parameter Behavior + +**blend_amount** controls final compositing with original: +- `blend=0.0`: Pure original (no CNN effect) +- `blend=0.5`: 50% original + 50% CNN +- `blend=1.0`: Pure CNN output (full stylization) + +**Important:** Blend uses captured layer 0 input, not previous layer output. + +**Example use cases:** +- `blend=1.0`: Full stylization (default) +- `blend=0.7`: Subtle effect preserving original details +- `blend=0.3`: Light artistic touch + +## Troubleshooting + +**Shader compilation fails:** +- Check `cnn_weights_generated.wgsl` syntax +- Verify snippets registered in `shaders.cc::InitShaderComposer()` +- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`) + +**Black/corrupted output:** +- Weights untrained (identity placeholder) +- Check `captured_frame` auxiliary texture is registered +- Verify layer priorities in timeline are sequential + +**Wrong blend result:** +- Ensure layer 0 has `needs_framebuffer_capture() == true` +- Check MainSequence framebuffer capture logic +- Verify `original_input` binding is populated + +**Training loss not decreasing:** +- Lower learning rate (`--learning-rate 0.0001`) +- More epochs (`--epochs 1000`) +- Check input/target image alignment + +--- + +## Vec4 Optimization + +**Architecture:** Weights stored as vec4 pairs for SIMD efficiency. 
**Input representation:**
```wgsl
let rgbd = textureSample(...);            // vec4: [r, g, b, d]
let in1 = vec4<f32>(uv_norm, gray, 1.0);  // vec4: [u, v, gray, 1.0]
```

**Weight indexing:**
```wgsl
var pos = 0;  // Direct weight array index
for (var dy = -1; dy <= 1; dy++) {
  for (var dx = -1; dx <= 1; dx++) {
    // Unrolled channel loop (4 output channels)
    sum.r += dot(weights[pos + 0], rgbd) + dot(weights[pos + 1], in1);
    sum.g += dot(weights[pos + 2], rgbd) + dot(weights[pos + 3], in1);
    sum.b += dot(weights[pos + 4], rgbd) + dot(weights[pos + 5], in1);
    sum.a += dot(weights[pos + 6], rgbd) + dot(weights[pos + 7], in1);
    pos += 8;  // 4 channels × 2 vec4s per channel
  }
}
```

**Benefits:**
- **SIMD-native:** The GPU executes `dot(vec4, vec4)` as a single instruction (4 parallel MADs)
- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
- **Bias integration:** Free via the `[..., 1.0]` component (no separate add)
- **Code simplicity:** Eliminates the inner loop; direct indexing with `pos`
- **Performance:** 2-3× GPU throughput improvement over the scalar version

**Weight layout per filter (8 floats):**
- vec4[0]: [w_r, w_g, w_b, w_d] (rgbd input weights)
- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)

**3×3 kernel sizes:**
- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)

---

## References

- **Training Script:** `training/train_cnn.py`
- **Shader Composition:** `doc/SEQUENCE.md`
- **Effect System:** `src/gpu/effect.h`
