Diffstat (limited to 'doc')
-rw-r--r--  doc/AUDIO_WAV_DRIFT_BUG.md        |  17
-rw-r--r--  doc/AUXILIARY_TEXTURE_INIT.md     |   2
-rw-r--r--  doc/BUILD.md                      |   8
-rw-r--r--  doc/CMAKE_MODULES.md              |  22
-rw-r--r--  doc/CNN.md                        |  79
-rw-r--r--  doc/CNN_BIAS_FIX_2026-02.md       |  85
-rw-r--r--  doc/CNN_DEBUG.md                  |  43
-rw-r--r--  doc/CNN_EFFECT.md                 | 400
-rw-r--r--  doc/CNN_FLATTEN_ANALYSIS.md       | 189
-rw-r--r--  doc/CNN_RGBD_GRAYSCALE_SUMMARY.md | 136
-rw-r--r--  doc/CNN_TEST_TOOL.md              | 244
-rw-r--r--  doc/CNN_V2.md                     | 813
-rw-r--r--  doc/CNN_V2_BINARY_FORMAT.md       | 235
-rw-r--r--  doc/CNN_V2_DEBUG_TOOLS.md         | 143
-rw-r--r--  doc/CNN_V2_WEB_TOOL.md            | 348
-rw-r--r--  doc/COMPLETED.md                  |   6
-rw-r--r--  doc/HOWTO.md                      |  44
17 files changed, 63 insertions, 2751 deletions
diff --git a/doc/AUDIO_WAV_DRIFT_BUG.md b/doc/AUDIO_WAV_DRIFT_BUG.md index e22f4fa..050dd49 100644 --- a/doc/AUDIO_WAV_DRIFT_BUG.md +++ b/doc/AUDIO_WAV_DRIFT_BUG.md @@ -1,7 +1,8 @@ # Audio WAV Drift Bug Investigation **Date:** 2026-02-15 -**Status:** ROOT CAUSE IDENTIFIED +**Status:** ACCEPTABLE (to be continued) +**Current State:** -150ms drift at beat 64b, no glitches ## Problem Statement @@ -163,8 +164,18 @@ Eliminates cumulative truncation error. 1. ✅ Measure WAV sample positions directly (Python script) 2. ✅ Add render tracking debug output 3. ✅ Confirm over-rendering (366ms per 10s) -4. ⏳ Implement fix -5. ⏳ Verify corrected WAV alignment in viewer +4. ✅ Implement partial fix (bypass ring buffer, direct render) +5. ⚠️ Current result: -150ms drift at beat 64b (acceptable, needs further work) + +## Current Implementation (main.cc:286-308) + +**WAV dump now bypasses ring buffer entirely:** +1. **Frame accumulator**: Calculates exact frames per update (no truncation) +2. **Direct render**: Calls `synth_render()` directly with exact frame count +3. **No ring buffer**: Eliminates buffer management complexity +4. **Result**: No glitches, but -150ms drift remains + +**Remaining issue:** Drift persists despite direct rendering. Likely related to tempo scaling or audio engine state management. Acceptable for now. ## Notes diff --git a/doc/AUXILIARY_TEXTURE_INIT.md b/doc/AUXILIARY_TEXTURE_INIT.md index 9cac70b..036cbf7 100644 --- a/doc/AUXILIARY_TEXTURE_INIT.md +++ b/doc/AUXILIARY_TEXTURE_INIT.md @@ -18,7 +18,7 @@ entry.seq->resize(width, height); // Too late - textures already created **Affected:** - CircleMaskEffect (circle_mask texture) -- CNNEffect (captured_frame texture) +- CNNv1Effect (captured_frame texture) - RotatingCubeEffect (consumer, hardcoded resolution in uniforms) --- diff --git a/doc/BUILD.md b/doc/BUILD.md index d3434f4..fd0c3d9 100644 --- a/doc/BUILD.md +++ b/doc/BUILD.md @@ -95,9 +95,11 @@ Use Xcode Metal debugger for shader performance analysis. ## Build System Internals **Asset Dependency Tracking:** -- CMake tracks 42 demo + 17 test assets -- Editing shaders/audio/sequences auto-triggers rebuild -- Asset lists parsed to extract individual file dependencies +- CMake tracks 42 demo + 17 test assets split into 4 categories +- **Granular rebuilds:** Changing a shader only rebuilds shader-dependent targets +- **Categories:** `shaders` (.wgsl), `audio` (.spec, .track), `models` (.obj), `data` (.bin, .png, PROC) +- Asset lists parsed at configure time to extract category-specific file dependencies +- Unified output (`assets_data.cc`) avoids duplicate symbols while preserving granular tracking **Header Organization:** - `asset_manager_dcl.h`: Forward declarations diff --git a/doc/CMAKE_MODULES.md b/doc/CMAKE_MODULES.md index 2ea7d00..9f71d91 100644 --- a/doc/CMAKE_MODULES.md +++ b/doc/CMAKE_MODULES.md @@ -90,6 +90,18 @@ Creates an executable for the demo (legacy macro). ### `add_demo_test(NAME TEST_NAME LABEL SOURCES...)` Creates a test executable and registers it with CTest (legacy macro). +### `demo_add_asset_deps(TARGET CATEGORY)` +Adds asset category dependencies to a target for granular rebuilds. 
+ +**Categories:** `shaders`, `audio`, `models`, `data`, `all`, `test` + +**Example:** +```cmake +demo_add_asset_deps(test_synth audio) # Only depends on audio assets +demo_add_asset_deps(test_shader_compilation shaders) # Only depends on shaders +demo_add_asset_deps(demo64k all) # Depends on all asset categories +``` + --- ## Conditional Inclusion @@ -107,12 +119,13 @@ This reduces parse time when building without tests/tools. ## Adding New Components ### New Effect -- Add sources to `cmake/DemoSourceLists.cmake` (GPU_SOURCES list) -- No other CMake changes needed +- Add sources to `cmake/DemoSourceLists.cmake` (`COMMON_GPU_EFFECTS` list) +- No other CMake changes needed (automatically included in headless and normal modes) ### New Test -- Add to `cmake/DemoTests.cmake` using `demo_add_test_with_deps()` -- Use LINK and DEPENDS parameters for libraries/assets +- Add to `cmake/DemoTests.cmake` using `add_demo_test()` +- Use `demo_add_asset_deps()` to specify asset category dependencies (e.g., `shaders`, `audio`) +- This enables granular rebuilds—only changed asset categories trigger test recompilation ### New Library - Add to `cmake/DemoLibraries.cmake` with appropriate dependencies @@ -132,6 +145,7 @@ This reduces parse time when building without tests/tools. 4. **Reusability:** Shared macros—eliminate 200+ lines of repetition 5. **Clarity:** Top-level CMakeLists.txt is 54-line roadmap 6. **Scalability:** Easy to add new tests/tools/libraries without bloating main file +7. **Granular Rebuilds:** Asset categories enable 3-5× faster incremental builds for typical changes --- diff --git a/doc/CNN.md b/doc/CNN.md deleted file mode 100644 index 2dc3362..0000000 --- a/doc/CNN.md +++ /dev/null @@ -1,79 +0,0 @@ -# Convolutional Neural Net Shader (CNN) post-processing - -**Status:** ✅ Foundation implemented (single-layer, expandable to multi-pass) - -## Idea - -Have the input 3d scene be processed by a multi-layer CNN trained on the side. -Input: some rendered scene. -Output: 'stylized' scene with CNN post-processing. - -**See `doc/CNN_EFFECT.md` for implementation details, usage, and API reference.** - -## Shader implementation - -### input / output - -Need 1 texture buffer per CNN layer. -Input (r,g,b,1/z) for layer 0 (render 3d scene), or output from layer N-1 for layer N. -output: (r,g,b, alpha). Don't need the 1/z information (can be fetched from input) - -### size of one layer - -Notation: -S: the number of input samples from layer N-1. -Example: 3x3 input -> S = 3x3 = 9. - -Each S samples is 4 values (r,g,b, w=1/z). - -Each sample is processed by a mat4 matrix. 4 input => 4 output. - -Weight matrix = S x mat4 - -Final bias: 4 values. - -WGSL code example: See file CNN.shader - -### Layers - -we need 3 or 4 layer ? -Several different shaders for each layer. -Ping-pong for input/output texture buffer between each layers? 
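The layer math described above (S samples of (r,g,b,1/z), one mat4 per sample, plus a 4-value bias) can be checked on the CPU. A minimal NumPy sketch, with illustrative names only (this is not code from the repository):

```python
import numpy as np

def cnn_layer_reference(samples, weights, bias):
    # samples: (S, 4) -- (r, g, b, 1/z) values sampled from the previous layer
    # weights: (S, 4, 4) -- one mat4 per sampled position
    # bias:    (4,) -- final per-layer bias
    out = np.zeros(4)
    for s in range(samples.shape[0]):
        out += weights[s] @ samples[s]   # mat4 * vec4: 4 inputs -> 4 outputs
    return out + bias

S = 9                                    # 3x3 input patch -> S = 9
samples = np.random.rand(S, 4)
weights = np.random.rand(S, 4, 4)
bias = np.zeros(4)
print(cnn_layer_reference(samples, weights, bias))  # one pixel of layer output
```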
- -## Implementation Status - -**Completed:** -- ✅ Modular WGSL shader architecture (6 snippet files) -- ✅ CNNEffect C++ class (single-layer rendering) -- ✅ ShaderComposer integration (#include resolution) -- ✅ Asset registration (7 new shader assets) -- ✅ Test coverage (test_demo_effects.cc) -- ✅ Placeholder identity weights for testing - -**Size:** ~3-4 KB shader code + ~2-4 KB weights = **5-8 KB total** - -**Pending:** -- ⏳ Training script (`scripts/train_cnn.py`) to generate real weights -- ⏳ Multi-layer rendering with ping-pong textures -- ⏳ Weight quantization for size optimization - ---- - -## Training (To Be Implemented) - -The layer weight/bias data are hard-coded in the shaders. -Training workflow: - -1. Prepare image pairs (before: raw render, after: target style) -2. Run `python scripts/train_cnn.py --input scene.png --target stylized.png` -3. Script generates `cnn_weights_generated.wgsl` -4. Rebuild: `cmake --build build -j4` - -**Reference:** File `CNN.py` contains training example (needs adaptation). - -Need a repository of reference image pairs (before/after) for training and validation. -Each input image is randomly sampled into 3×3 patch of (r,g,b,1/z) input samples. -And trained to match the (r,g,b,a) output. - -Training generates the .wgsl code for layers' shaders. - diff --git a/doc/CNN_BIAS_FIX_2026-02.md b/doc/CNN_BIAS_FIX_2026-02.md deleted file mode 100644 index 26db8eb..0000000 --- a/doc/CNN_BIAS_FIX_2026-02.md +++ /dev/null @@ -1,85 +0,0 @@ -# CNN Bias Accumulation Fix (2026-02-11) - -## Problem -Bias was being added multiple times in shader convolution loops (once per kernel position), causing mismatch between PyTorch training and WGSL inference. - -## Root Cause -**Location**: `training/train_cnn.py:381, 398` - -When exporting weights to WGSL, bias was replicated for every kernel position. The shader loops through positions doing: -```wgsl -sum += dot(weights[pos], rgbd) + dot(weights[pos+1], in1); // in1.w = 1.0 -``` - -For 3×3 kernel (9 positions), bias added 9×. For 5×5, added 25×. - -## Fix -Divide bias by `num_positions` during export: -```python -# Final layer (7→1) -v1.append(f"{bias[0] / num_positions:.6f}") - -# Inner layers (7→4) -v1.append(f"{bias[out_c] / num_positions:.6f}") -``` - -Shader accumulates bias × num_positions = original bias (correct). - ---- - -## Additional Improvements - -### 1. RGBA Output Support -**train_cnn.py**: Now saves 4-channel RGBA PNG preserving alpha from input: -```python -alpha = img_tensor[0, 3:4, :, :].permute(1, 2, 0).numpy() -output_rgba = np.concatenate([output, alpha], axis=2) -Image.fromarray((output_rgba * 255).astype(np.uint8), mode='RGBA') -``` - -Intermediate layers also save RGBA if 4-channel. - -### 2. Debug Hex Output -**Both tools** support `--debug-hex` to print first 8 pixels as hex: -```bash -./training/train_cnn.py --infer input.png --export-only checkpoint.pth --debug-hex -./build/cnn_test input.png output.png --debug-hex -``` - -Output format: `[0] 0xRRGGBBAA` for pixel-level comparison. - -### 3. Cleanup -Removed sRGB/linear_png debug code from `cnn_test.cc` (simplified PNG saving). 
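For eyeballing the `--debug-hex` dumps against each other, a small Python helper in the same `[i] 0xRRGGBBAA` format can be handy (illustrative sketch, not the tools' actual code; assumes Pillow is installed):

```python
from PIL import Image
import numpy as np

def debug_hex(path, count=8):
    # Print the first `count` pixels as 0xRRGGBBAA, matching the tools' output.
    px = np.asarray(Image.open(path).convert("RGBA")).reshape(-1, 4)
    for i in range(count):
        r, g, b, a = px[i]
        print(f"[{i}] 0x{r:02X}{g:02X}{b:02X}{a:02X}")

debug_hex("ground_truth.png")   # from train_cnn.py --infer
debug_hex("tool_output.png")    # from ./build/cnn_test --debug-hex
```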
- ---- - -## Files Modified -- `training/train_cnn.py`: Bias fix, RGBA output, --debug-hex -- `tools/cnn_test.cc`: --debug-hex, remove linear_png -- `workspaces/main/shaders/cnn/cnn_weights_generated.wgsl`: Regenerated with fixed bias - -## Testing -```bash -# Train with fixed export -./training/train_cnn.py --input training/input/ --target training/output/ \ - --layers 3 --kernel_sizes 3,3,3 --epochs 5000 - -# Generate ground truth -./training/train_cnn.py --infer input.png --export-only checkpoint.pth \ - --output ground_truth.png --debug-hex - -# Run GPU tool -./build/cnn_test input.png tool_output.png --debug-hex - -# Compare hex output for first 8 pixels -``` - ---- - -## Status -✅ Bias accumulation bug fixed -✅ RGBA output with alpha preservation -✅ Debug hex comparison tool -✅ Weights regenerated - -Commit: `8ff8c56` diff --git a/doc/CNN_DEBUG.md b/doc/CNN_DEBUG.md deleted file mode 100644 index ba220a0..0000000 --- a/doc/CNN_DEBUG.md +++ /dev/null @@ -1,43 +0,0 @@ -# CNN Effect Black Screen Bug - Resolution (2026-02) - -## Problem -CNN post-processing effect showed black screen when activated at 11.50s, despite scene rendering correctly before CNN started. - -## Root Causes - -### Bug 1: Framebuffer Capture Timing -**Location**: `src/gpu/effect.cc` -**Issue**: Capture ran INSIDE post-effect loop after ping-pong buffer swaps. CNN layers 1+ captured wrong buffer (output being written to, not scene). -**Fix**: Moved capture before loop starts (lines 308-346). Capture now copies `framebuffer_a` to `captured_frame` auxiliary texture ONCE before any post-effects run. - -### Bug 2: Missing Uniforms Update ⚠️ CRITICAL -**Location**: `src/effects/cnn_effect.cc` -**Issue**: `CNNEffect::update_bind_group()` never updated `uniforms_` buffer. `uniforms.resolution` uninitialized (0,0 or garbage) → UV calculation `p.xy / uniforms.resolution` produced NaN → all texture samples black. -**Fix**: Added uniforms update before bind group creation (lines 132-142): -```cpp -const CommonPostProcessUniforms u = { - .resolution = {(float)width_, (float)height_}, - .aspect_ratio = (float)width_ / (float)height_, - .time = 0.0f, - .beat = 0.0f, - .audio_intensity = 0.0f, -}; -uniforms_.update(ctx_.queue, u); -``` - -## Key Lessons - -1. **All post-process effects MUST update `uniforms_` buffer** - Required for UV calculations and shader parameters -2. **Framebuffer capture timing is critical** - Must happen before post-chain ping-pong starts -3. **Uninitialized uniforms cause silent failures** - Produces black output without validation errors -4. **Post-effects must render or chain breaks** - `loadOp=Load` preserves previous (black) content if no draw call executes - -## Files Modified -- `src/gpu/effect.cc`: Lines 308-346 (capture timing) -- `src/effects/cnn_effect.cc`: Lines 132-142 (uniforms update) - -## Verification -Test: `demo64k --seek 11.5` -- ✅ Scene visible with RotatingCube -- ✅ CNN stylization applied -- ✅ All 3 layers process with correct original texture reference diff --git a/doc/CNN_EFFECT.md b/doc/CNN_EFFECT.md deleted file mode 100644 index 40f095e..0000000 --- a/doc/CNN_EFFECT.md +++ /dev/null @@ -1,400 +0,0 @@ -# CNN Post-Processing Effect - -Neural network-based stylization for rendered scenes. - ---- - -## Overview - -Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead. 
- -**Key Features:** -- Position-aware layer 0 (coordinate input for vignetting, edge effects) -- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining -- Original input available to all layers via framebuffer capture -- Configurable final blend with original scene -- Modular WGSL shader architecture -- Hardcoded weights (trained offline via PyTorch) -- ~5-8 KB binary footprint - ---- - -## Architecture - -### RGBD → Grayscale Pipeline - -**Input:** RGBD (RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Architecture:** -- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD -- **Final layer (N-1):** Conv2d(7→1) - output grayscale - -```wgsl -// Inner layers: 7→4 (RGBD output, vec4-optimized) -fn cnn_conv3x3_7to4( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, # Grayscale [-1,1] - weights: array<vec4<f32>, 72> # 9 pos × 4 ch × 2 vec4 (8 floats per filter) -) -> vec4<f32> - -// Final layer: 7→1 (grayscale output, vec4-optimized) -fn cnn_conv3x3_7to1( - tex: texture_2d<f32>, - samp: sampler, - uv: vec2<f32>, - resolution: vec2<f32>, - gray: f32, - weights: array<vec4<f32>, 18> # 9 pos × 2 vec4 (8 floats per filter) -) -> f32 -``` - -**Input normalization:** -- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1] -- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1] -- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))` -- **Inter-layer data** stays in [-1,1] (no denormalization) -- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1] - -**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer - -### Multi-Layer Architecture - -CNNEffect supports multi-layer networks via automatic effect chaining: - -1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7` -2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2) -3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"` -4. **Original input binding**: All layers access original via `@binding(4)` -5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)` - -**Framebuffer Capture API:** -- `Effect::needs_framebuffer_capture()` - effect requests pre-capture -- MainSequence automatically blits input → `"captured_frame"` auxiliary texture -- Generic mechanism usable by any effect - -### File Structure - -``` -src/effects/ - cnn_effect.h/cc # CNNEffect class + framebuffer capture - -workspaces/main/shaders/cnn/ - cnn_activation.wgsl # tanh, ReLU, sigmoid, leaky_relu - cnn_conv3x3.wgsl # 3×3 convolution (standard + coord-aware) - cnn_conv5x5.wgsl # 5×5 convolution (standard + coord-aware) - cnn_conv7x7.wgsl # 7×7 convolution (standard + coord-aware) - cnn_weights_generated.wgsl # Weight arrays (auto-generated by train_cnn.py) - cnn_layer.wgsl # Main shader with layer switches (auto-generated by train_cnn.py) -``` - ---- - -## Training Workflow - -### 1. Prepare Training Data - -Input/target image pairs: -``` -training/input/img_000.png # RGBA (RGB + alpha) -training/output/img_000.png # Grayscale target -``` - -**Note:** Alpha channel can be depth (1/z) or constant (255). Network learns from RGB primarily. - -### 2. 
Train Network - -**Patch-based (Recommended)** - Preserves natural pixel scale: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --patch-size 32 --patches-per-image 64 --detector harris \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 5000 --batch-size 16 --checkpoint-every 1000 -``` - -**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges) - -**Full-image (Legacy)** - Resizes to 256×256: -```bash -python3 training/train_cnn.py \ - --input training/input --target training/output \ - --layers 3 --kernel-sizes 3,5,3 \ - --epochs 10000 --batch-size 8 --checkpoint-every 1000 -``` - -**Auto-generates:** -- `cnn_weights_generated.wgsl` - Weight arrays -- `cnn_layer.wgsl` - Layer shader - -### 3. Export & Validate - -```bash -# Export shaders -./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth - -# Generate ground truth -./training/train_cnn.py --infer input.png \ - --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png -``` - -### 4. Rebuild Demo - -```bash -cmake --build build -j4 && ./build/demo64k -``` - ---- - -## Usage - -### C++ Integration - -**Single layer (manual):** -```cpp -#include "effects/cnn_effect.h" - -CNNEffectParams p; -p.layer_index = 0; -p.total_layers = 1; -p.blend_amount = 1.0f; -auto cnn = std::make_shared<CNNEffect>(ctx, p); -timeline.add_effect(cnn, start_time, end_time); -``` - -**Multi-layer (automatic via timeline compiler):** - -Use timeline syntax - `seq_compiler` expands to multiple instances. - -### Timeline Examples - -**Single-layer CNN (full stylization):** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=1 -``` - -**Multi-layer CNN with blend:** -``` -SEQUENCE 10.0 0 - EFFECT + Hybrid3DEffect 0.00 5.00 - EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7 -``` - -Expands to: -```cpp -// Layer 0 (captures original, blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 0; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1); -} -// Layer 1 (blend=1.0) -{ - CNNEffectParams p; - p.layer_index = 1; - p.total_layers = 3; - p.blend_amount = 1.0f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2); -} -// Layer 2 (final blend=0.7) -{ - CNNEffectParams p; - p.layer_index = 2; - p.total_layers = 3; - p.blend_amount = 0.7f; - seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3); -} -``` - ---- - -## Shader Structure - -**Bindings:** -```wgsl -@group(0) @binding(0) var smplr: sampler; -@group(0) @binding(1) var txt: texture_2d<f32>; // Current layer input -@group(0) @binding(2) var<uniform> uniforms: CommonUniforms; -@group(0) @binding(3) var<uniform> params: CNNLayerParams; -@group(0) @binding(4) var original_input: texture_2d<f32>; // Layer 0 input (captured) -``` - -**Fragment shader logic:** -```wgsl -@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> { - let uv = p.xy / uniforms.resolution; - let original_raw = textureSample(original_input, smplr, uv); - let original = (original_raw - 0.5) * 2.0; // Normalize to [-1,1] - let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722)); - var result = vec4<f32>(0.0); - - if (params.layer_index == 0) { - result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution, - weights_layer0); - result = cnn_tanh(result); - } - else if (params.layer_index == 1) { - result = cnn_conv5x5_7to4(txt, smplr, uv, 
uniforms.resolution, - gray, weights_layer1); - result = cnn_tanh(result); - } - // ... other layers - - // Blend with ORIGINAL input (not previous layer) - return mix(original_raw, result, params.blend_amount); -} -``` - -**Weight Storage (vec4-optimized):** - -**Inner layers (7→4 RGBD output):** -```wgsl -// Structure: array<vec4<f32>, 72> -// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layer0: array<vec4<f32>, 72> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0_ch0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0_ch0 (uv, gray, bias) - vec4<f32>(w1_r, w1_g, w1_b, w1_d), // pos0_ch1 (rgba weights) - vec4<f32>(w1_u, w1_v, w1_gray, bias1), // pos0_ch1 (uv, gray, bias) - // ... 68 more vec4s -); -``` - -**Final layer (7→1 grayscale output):** -```wgsl -// Structure: array<vec4<f32>, 18> -// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1]) -const weights_layerN: array<vec4<f32>, 18> = array( - vec4<f32>(w0_r, w0_g, w0_b, w0_d), // pos0 (rgba weights) - vec4<f32>(w0_u, w0_v, w0_gray, bias0), // pos0 (uv, gray, bias) - // ... 16 more vec4s -); -``` - -**Optimization:** Bias integrated as 4th component via `vec4(uv, gray, 1.0)` input. Two dot4 operations replace 8 scalar MADs. - ---- - -## Size Budget - -| Component | Size | Notes | -|-----------|------|-------| -| Activation functions | ~200 B | 4 functions | -| Conv3x3 (standard + coord) | ~500 B | Both variants | -| Conv5x5 (standard + coord) | ~700 B | Both variants | -| Conv7x7 (standard + coord) | ~900 B | Both variants | -| Main shader | ~800 B | Layer composition | -| C++ implementation | ~300 B | Effect class | -| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) | -| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes | -| **Total** | **5-9 KB** | Acceptable for 64k | - -**Optimization strategies:** -- Quantize weights (float32 → int8) -- Prune near-zero weights -- Use separable convolutions - ---- - -## Testing - -```bash -./build/test_demo_effects # CNN construction/shader tests -./build/demo64k # Visual test -``` - ---- - -## Blend Parameter Behavior - -**blend_amount** controls final compositing with original: -- `blend=0.0`: Pure original (no CNN effect) -- `blend=0.5`: 50% original + 50% CNN -- `blend=1.0`: Pure CNN output (full stylization) - -**Important:** Blend uses captured layer 0 input, not previous layer output. - -**Example use cases:** -- `blend=1.0`: Full stylization (default) -- `blend=0.7`: Subtle effect preserving original details -- `blend=0.3`: Light artistic touch - -## Troubleshooting - -**Shader compilation fails:** -- Check `cnn_weights_generated.wgsl` syntax -- Verify snippets registered in `shaders.cc::InitShaderComposer()` -- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`) - -**Black/corrupted output:** -- Weights untrained (identity placeholder) -- Check `captured_frame` auxiliary texture is registered -- Verify layer priorities in timeline are sequential - -**Wrong blend result:** -- Ensure layer 0 has `needs_framebuffer_capture() == true` -- Check MainSequence framebuffer capture logic -- Verify `original_input` binding is populated - -**Training loss not decreasing:** -- Lower learning rate (`--learning-rate 0.0001`) -- More epochs (`--epochs 1000`) -- Check input/target image alignment - ---- - -## Vec4 Optimization - -**Architecture:** Weights stored as vec4 pairs for SIMD efficiency. 
- -**Input representation:** -```wgsl -let rgbd = textureSample(...); // vec4: [r, g, b, d] -let in1 = vec4<f32>(uv_norm, gray, 1.0); // vec4: [u, v, gray, 1.0] -``` - -**Weight indexing:** -```wgsl -var pos = 0; // Direct weight array index -for (var dy = -1; dy <= 1; dy++) { - for (var dx = -1; dx <= 1; dx++) { - // Unrolled channel loop (4 output channels) - sum.r += dot(weights[pos+0], rgbd) + dot(weights[pos+1], in1); - sum.g += dot(weights[pos+2], rgbd) + dot(weights[pos+3], in1); - sum.b += dot(weights[pos+4], rgbd) + dot(weights[pos+5], in1); - sum.a += dot(weights[pos+6], rgbd) + dot(weights[pos+7], in1); - pos += 8; // 4 channels × 2 vec4s per channel - } -} -``` - -**Benefits:** -- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs) -- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment) -- **Bias integration:** Free via `[..., 1.0]` component (no separate add) -- **Code simplicity:** Eliminates inner loop, direct indexing with `pos` -- **Performance:** 2-3× GPU throughput improvement over scalar version - -**Weight layout per filter (8 floats):** -- vec4[0]: [w_r, w_g, w_b, w_d] (rgba input weights) -- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias) - -**3×3 kernel sizes:** -- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 2304 bytes) -- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes) - ---- - -## References - -- **Training Script:** `training/train_cnn.py` -- **Shader Composition:** `doc/SEQUENCE.md` -- **Effect System:** `src/gpu/effect.h` diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md deleted file mode 100644 index bf63c5d..0000000 --- a/doc/CNN_FLATTEN_ANALYSIS.md +++ /dev/null @@ -1,189 +0,0 @@ -# CNN Shader Flatten Mode - Technical Analysis - -**Status:** Analysis complete - flatten mode NOT RECOMMENDED - -**Date:** February 2026 - ---- - -## Context - -Current CNN architecture uses **3 sequential render passes** (linear chaining): -- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer -- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer -- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original - -Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers. 
- ---- - -## Current Architecture - -**Shader Structure:** -- 1 pipeline with layer branching (`layer_index` uniform) -- 5 bindings: sampler, input texture, uniforms, layer params, original capture -- Total shader size: ~8 KB (snippets + weights) - -**Performance Profile:** -- 3 render pass dispatches -- 2 framebuffer writes + reads between layers -- Memory bandwidth: ~2× framebuffer size per layer -- Register pressure: Low (per-layer isolation) - -**Weight Buffer:** 290 vec4s (4.6 KB) - already unified - ---- - -## Flatten Approaches Evaluated - -### Option A: Full Flatten (All 3 Layers) - -**Cascading Receptive Field:** - -To compute final output at position (x, y): -- Layer 2 needs 3×3 neighborhood of Layer 1 outputs -- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs -- Each Layer 0 output needs 5×5 neighborhood of input samples - -**Effective input sampling:** 9×9 pixels (vs current 5×5 max) - -**Intermediate Storage (per thread/pixel):** -``` -Layer 0 outputs: 5×5 positions × 4 channels = 100 floats -Layer 1 outputs: 3×3 positions × 4 channels = 36 floats - TOTAL = 136 floats (544 bytes) -``` - -**GPU Register Pressure:** -- Modern GPUs: 32-64 KB registers per SM, shared across warps -- 544 bytes/thread → max 64 threads/SM (**low occupancy**) -- Current multi-pass: ~4-8 bytes/thread (high occupancy) - -**Pros:** -- 1 dispatch vs 3 (reduce CPU overhead) -- Zero framebuffer bandwidth between layers - -**Cons:** -- **Severe register pressure** (10-20× increase) -- Reduced occupancy → potential performance loss -- Complex shader (harder debug, larger binary) -- 9×9 input sampling - -**Assessment:** ❌ **Not Recommended** -Register cost outweighs bandwidth savings. - ---- - -### Option B: Partial Flatten (Layers 1 + 2) - -Keep Layer 0 separate, flatten only Layers 1 and 2. - -**Pass Structure:** -1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer -2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader - -**Intermediate Storage:** -``` -Layer 0 samples: 3×3 × 4 = 36 floats (read once) -Layer 1 outputs: 3×3 × 4 = 36 floats (computed) - TOTAL = 72 floats (288 bytes) -``` - -**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs - -**Pros:** -- 2 passes vs 3 (33% reduction) -- 1 framebuffer write saved -- More manageable register usage - -**Cons:** -- Still significant register pressure (288 bytes vs ~8 bytes baseline) -- Medium complexity increase -- Layer 0 (heaviest kernel) still separate - -**Assessment:** ⚠️ **Marginal Benefit** -Saves 1 pass but register cost still high. - ---- - -### Option C: Keep Current Multi-Pass ✅ - -**Rationale:** -- Current architecture well-suited to GPU design (high throughput via parallelism) -- Minimal register usage → high occupancy → hides memory latency -- Framebuffer bandwidth cost < register pressure cost -- Clean separation aids debugging/iteration -- Modular (easy to add/remove layers) - -**Alternative Optimizations (if bandwidth critical):** -1. Merge passes via render pass load/store ops (Vulkan subpasses) -2. Reduce intermediate channel count (4→3 or 2) -3. Hybrid: Compute shaders + workgroup shared memory -4. 
Layer pruning (2-layer vs 3-layer quality comparison) - ---- - -## Recommendation - -**✅ Keep current multi-pass architecture** - -### Decision Matrix - -| Factor | Multi-Pass | Partial Flatten | Full Flatten | -|--------|-----------|----------------|--------------| -| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme | -| Occupancy | ✅ High | ⚠️ Medium | ❌ Low | -| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest | -| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High | -| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard | -| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest | - -**Modern GPU Architecture Favors:** -- High parallelism (many small threads) over complex threads -- Hiding latency via occupancy over minimizing operations -- Memory bandwidth via caching, not elimination - ---- - -## Alternative: Compute Shader + Shared Memory - -**If bandwidth becomes critical:** -- Use compute shader with workgroup shared memory -- Load tile + halos into shared memory (9×9 input samples) -- Compute all 3 layers for tile interior (avoids redundant sampling) -- Requires explicit synchronization (`workgroupBarrier`) - -**Trade-offs:** -- ✅ Low register pressure + low bandwidth -- ❌ Compute pipeline complexity (no render pass integration) -- ❌ Tile edge handling -- ❌ Larger code size - ---- - -## Conclusion - -Current 3-pass architecture is **appropriate for demo64k**: -- Size-efficient (modular shaders) -- Performance adequate (bandwidth not bottleneck) -- Maintainable (clean layer isolation) - -**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint. - -### Size Optimization Alternatives (Better ROI) - -If size optimization critical, focus on: -1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization) -2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s) -3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels) - -These yield better size/performance than shader architecture changes. - ---- - -## References - -- `doc/CNN_EFFECT.md` - CNN implementation details -- `doc/CNN.md` - High-level CNN design -- `src/effects/cnn_effect.cc` - Current implementation -- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets diff --git a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md b/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md deleted file mode 100644 index 3439f2c..0000000 --- a/doc/CNN_RGBD_GRAYSCALE_SUMMARY.md +++ /dev/null @@ -1,136 +0,0 @@ -# CNN RGBD→Grayscale Architecture Implementation - -## Summary - -Implemented CNN architecture upgrade: RGBD input → grayscale output with 7-channel augmented input. - -## Changes Made - -### Architecture - -**Input:** RGBD (4 channels: RGB + inverse depth D=1/z) -**Output:** Grayscale (1 channel) -**Layer Input:** 7 channels = [RGBD, UV coords, grayscale] all normalized to [-1,1] - -**Layer Configuration:** -- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation -- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation - -### Input Normalization (all to [-1,1]) - -- **RGBD:** `(rgbd - 0.5) * 2` -- **UV coords:** `(uv - 0.5) * 2` -- **Grayscale:** `dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722))` (computed once, passed as parameter) - -**Rationale:** Zero-centered inputs for tanh activation, better gradient flow. - -### Modified Files - -**Training (`/Users/skal/demo/training/train_cnn.py`):** -1. Removed `CoordConv2d` class -2. Updated `SimpleCNN`: - - Inner layers: `Conv2d(7, 4)` - RGBD output - - Final layer: `Conv2d(7, 1)` - grayscale output -3. 
Updated `forward()`: - - Normalize RGBD/coords/gray to [-1,1] - - Concatenate 7-channel input for each layer - - Apply tanh (inner) or none (final) - - Denormalize final output -4. Updated `export_weights_to_wgsl()`: - - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values) - - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values) -5. Updated `generate_layer_shader()`: - - Use `cnn_conv3x3_7to4` for inner layers - - Use `cnn_conv3x3_7to1` for final layer - - Denormalize outputs from [-1,1] to [0,1] -6. Updated `ImagePairDataset`: - - Load RGBA input (was RGB) - -**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):** -1. Added `cnn_conv3x3_7to4()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 4-channel output: RGBD - - Weights: `array<array<f32, 8>, 36>` -2. Added `cnn_conv3x3_7to1()`: - - 7-channel input: [RGBD, uv_x, uv_y, gray] (gray passed as parameter) - - 1-channel output: grayscale - - Weights: `array<array<f32, 8>, 9>` -3. Optimized: gray computed once in caller using `dot()`, not per-function - -**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):** -1. Updated architecture section with RGBD→grayscale pipeline -2. Updated training data requirements (RGBA input) -3. Updated weight storage format - -### No C++ Changes - -CNNLayerParams and bind groups remain unchanged. - -## Data Flow - -1. Layer 0 captures original RGBD to `captured_frame` -2. Each layer: - - Samples previous layer output (RGBD in [0,1]) - - Normalizes RGBD to [-1,1] - - Computes gray once using `dot()` (fs_main level) - - Normalizes UV coords to [-1,1] (inside conv functions) - - Concatenates 7-channel input - - Applies convolution with layer-specific weights - - Outputs RGBD (inner) or grayscale (final) in [-1,1] - - Applies tanh (inner only) - - Denormalizes to [0,1] for texture storage - - Blends with original - -## Next Steps - -1. **Prepare RGBD training data:** - - Input: RGBA images (RGB + depth in alpha) - - Target: Grayscale stylized output - -2. **Train network:** - ```bash - python3 training/train_cnn.py \ - --input training/input \ - --target training/output \ - --layers 3 \ - --epochs 1000 - ``` - -3. **Verify generated shaders:** - - Check `cnn_weights_generated.wgsl` structure - - Check `cnn_layer.wgsl` uses new conv functions - -4. 
**Test in demo:** - ```bash - cmake --build build -j4 - ./build/demo64k - ``` - -## Design Rationale - -**Why [-1,1] normalization?** -- Centered inputs for tanh (operates best around 0) -- Better gradient flow -- Standard ML practice for normalized data - -**Why RGBD throughout vs RGB?** -- Depth information propagates through network -- Enables depth-aware stylization -- Consistent 4-channel processing - -**Why 7-channel input?** -- Coordinates: position-dependent effects (vignettes) -- Grayscale: luminance-aware processing -- RGBD: full color+depth information -- Enables richer feature learning - -## Testing Checklist - -- [ ] Train network with RGBD input data -- [ ] Verify `cnn_weights_generated.wgsl` structure -- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions -- [ ] Build demo without errors -- [ ] Visual test: inner layers show RGBD evolution -- [ ] Visual test: final layer produces grayscale -- [ ] Visual test: blending works correctly -- [ ] Compare quality with previous RGB→RGB architecture diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md deleted file mode 100644 index 4307894..0000000 --- a/doc/CNN_TEST_TOOL.md +++ /dev/null @@ -1,244 +0,0 @@ -# CNN Shader Testing Tool - -Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer). - ---- - -## Purpose - -- Validate trained weights against ground truth -- Debug CNN layer behavior in isolation -- Generate test outputs for training workflow -- Match Python training script's inference mode - ---- - -## Architecture - -**Two implementations:** - -1. **CNN v1** (render pipeline, texture atlas weights) - - 3 fixed layers - - RGBA16Float intermediates - - BGRA8Unorm final output - -2. **CNN v2** (compute shaders, storage buffer weights) - - Dynamic layer count from binary - - 7D static features (RGBD + UV + sin + bias) - - RGBA32Uint packed f16 intermediates - - Storage buffer: ~3-5 KB weights - -**Core GPU utility:** `src/gpu/texture_readback.{h,cc}` -- Synchronous texture-to-CPU readback -- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm -- Protected with STRIP_ALL (0 bytes in release) - ---- - -## Usage - -```bash -cnn_test input.png output.png [OPTIONS] - -OPTIONS: - --cnn-version N CNN version: 1 (default) or 2 (ignored with --weights) - --weights PATH Load weights from .bin (forces CNN v2, overrides layer config) - --blend F Final blend amount (0.0-1.0, default: 1.0) - --format ppm|png Output format (default: png) - --layers N Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights) - --save-intermediates DIR Save intermediate layers to directory - --debug-hex Print first 8 pixels as hex (debug) - --help Show usage -``` - -**Examples:** -```bash -# CNN v1 (render pipeline, 3 layers) -./build/cnn_test input.png output.png --cnn-version 1 - -# CNN v2 (compute, storage buffer, uses asset system weights) -./build/cnn_test input.png output.png --cnn-version 2 - -# CNN v2 with runtime weight loading (loads layer config from .bin) -./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin - -# 50% blend with original (v2) -./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5 - -# Debug hex dump -./build/cnn_test input.png output.png --cnn-version 2 --debug-hex -``` - -**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments. 
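Since `cnn_test` processes a single image per invocation, batch runs need an external loop. A hypothetical Python wrapper (paths and output directory chosen for illustration):

```python
import glob
import os
import subprocess

os.makedirs("batch_out", exist_ok=True)
for src in sorted(glob.glob("training/input/*.png")):
    dst = os.path.join("batch_out", os.path.basename(src))
    subprocess.run(["./build/cnn_test", src, dst, "--cnn-version", "2"],
                   check=True)
```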
- ---- - -## Implementation Details - -### Core Readback Utility - -**File:** `src/gpu/texture_readback.{h,cc}` - -**Function:** -```cpp -std::vector<uint8_t> read_texture_pixels( - WGPUInstance instance, - WGPUDevice device, - WGPUTexture texture, - int width, - int height); -``` - -**Features:** -- Returns BGRA8 format (4 bytes per pixel) -- Synchronous blocking operation -- Cross-platform async callback handling (Win32 vs Native API) -- Automatic staging buffer creation and cleanup - -**Refactored OffscreenRenderTarget:** -```cpp -std::vector<uint8_t> OffscreenRenderTarget::read_pixels() { -#if !defined(STRIP_ALL) - return read_texture_pixels(instance_, device_, texture_, width_, height_); -#else - return std::vector<uint8_t>(); -#endif -} -``` - -### CNN v1 Pipeline (Render) - -**Fixed 3-layer architecture:** -- Ping-pong RGBA16Float textures -- CNNLayerParams (binding 3): layer_index, blend_amount -- Shader composer resolves #include directives - -### CNN v2 Pipeline (Compute) - -**Dynamic layer architecture:** -1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias) -2. **Layer computes:** N layers from binary weights (3-5 typically) - - Storage buffer weights (read-only) - - RGBA32Uint packed f16 textures (ping-pong) - - CNNv2LayerParams: kernel_size, channels, weight_offset, blend -3. **Readback:** RGBA32Uint → f16 decode → u8 clamp - -**Binary format:** Header (20B) + layer info (20B×N) + f16 weights - -**Weight Loading:** -- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`) -- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports) - - Layer count and kernel sizes parsed from binary header - - Overrides any `--layers` or `--cnn-version` arguments - - Enables runtime testing of training checkpoints without rebuild - ---- - -## Build Integration - -**CMakeLists.txt:** - -1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections) -2. Tool target: -```cmake -add_executable(cnn_test - tools/cnn_test.cc - src/tests/common/webgpu_test_fixture.cc - src/tests/common/offscreen_render_target.cc - ${PLATFORM_SOURCES} - ${GEN_DEMO_CC}) - -target_link_libraries(cnn_test PRIVATE - gpu util procedural ${DEMO_LIBS}) - -add_dependencies(cnn_test generate_demo_assets) - -target_compile_definitions(cnn_test PRIVATE - STB_IMAGE_IMPLEMENTATION - STB_IMAGE_WRITE_IMPLEMENTATION) -``` - -**Build:** -```bash -cmake -S . -B build -DDEMO_BUILD_TOOLS=ON -cmake --build build -j4 -``` - ---- - -## Validation Workflow (CNN v2) - -### 1. Train and Export -```bash -# Train and export weights -./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16 -``` - -### 2. Tool Inference -```bash -# Run tool with v2 -./build/cnn_test training/input/img_000.png output.png --cnn-version 2 -``` - -### 3. Visual Comparison -Compare output.png with training/target_X/img_000.png - ---- - -## Status - -**CNN v1:** Builds and runs, produces incorrect output (all white). Use CNNEffect in demo for visual validation. - -**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool. -- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin` -- Matches CNNv2Effect architecture -- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code -- Root cause under investigation (weight indexing? texture sampling? activation clamping?) 
-- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation - ---- - -## Technical Notes (Readback Fix) - -**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5) - -**Root Cause:** Callback mode mismatch -- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`) -- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode) -- Callback never fired → timeout → empty buffer - -**Fix Applied:** -1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents` -2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)` -3. Added pre-mapping device poll to ensure copy completes - -**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110 - -**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes - ---- - -## Limitations - -- **CNN v1:** Produces incorrect output, use for debugging only -- **Single image:** Batch processing requires shell loop -- **No real-time preview:** Offline processing only -- **PNG input:** stb_image (JPEG/PNG/BMP/TGA also supported) - ---- - -## Technical Notes - -**CNN v2 f16 decoding:** -- RGBA32Uint texture stores 8×f16 as 4×u32 -- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8 -- Handles denormals, infinity, NaN - -**Cross-platform:** -- macOS, Linux (native WebGPU) -- Windows (mingw-w64 cross-compile) - -**Size impact:** -- Debug/STRIP_ALL=OFF: compiled -- STRIP_ALL=ON: 0 bytes (compiled out) -- FINAL_STRIP=ON: tool not built diff --git a/doc/CNN_V2.md b/doc/CNN_V2.md deleted file mode 100644 index b7fd6f8..0000000 --- a/doc/CNN_V2.md +++ /dev/null @@ -1,813 +0,0 @@ -# CNN v2: Parametric Static Features - -**Technical Design Document** - ---- - -## Overview - -CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality. - -**Key improvements over v1:** -- 7D static feature input (vs 4D RGB) -- Multi-frequency position encoding (NeRF-style) -- Configurable mip-level for p0-p3 parametric features (0-3) -- Per-layer configurable kernel sizes (1×1, 3×3, 5×5) -- Variable channel counts per layer -- Float16 weight storage (~3.2 KB for 3-layer model) -- Bias integrated as static feature dimension -- Storage buffer architecture (dynamic layer count) -- Binary weight format v2 for runtime loading -- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping) - -**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational. - -**Breaking Change:** -- Models trained with `clamp()` incompatible. Retrain required. 
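The switch away from `clamp()` matters for training: `clamp()` has zero gradient outside [0,1], blocking learning at the boundaries, while `sigmoid()` keeps a nonzero gradient everywhere (the rationale noted below). A quick PyTorch check, with example values only:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)

torch.clamp(x, 0.0, 1.0).sum().backward()
print(x.grad)        # tensor([0., 1., 0.]) -- no gradient outside [0, 1]

x.grad = None
torch.sigmoid(x).sum().backward()
print(x.grad)        # all entries nonzero -- gradient still flows at the extremes
```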
- -**TODO:** -- 8-bit quantization with QAT for 2× size reduction (~1.6 KB) - ---- - -## Architecture - -### Pipeline Overview - -``` -Input RGBD → Static Features Compute → CNN Layers → Output RGBA - └─ computed once/frame ─┘ └─ multi-pass ─┘ -``` - -**Detailed Data Flow:** - -``` - ┌─────────────────────────────────────────┐ - │ Static Features (computed once) │ - │ 8D: p0,p1,p2,p3,uv_x,uv_y,sin10x,bias │ - └──────────────┬──────────────────────────┘ - │ - │ 8D (broadcast to all layers) - ├───────────────────────────┐ - │ │ - ┌──────────────┐ │ │ - │ Input RGBD │──────────────┤ │ - │ 4D │ 4D │ │ - └──────────────┘ │ │ - ▼ │ - ┌────────────┐ │ - │ Layer 0 │ (12D input) │ - │ (CNN) │ = 4D + 8D │ - │ 12D → 4D │ │ - └─────┬──────┘ │ - │ 4D output │ - │ │ - ├───────────────────────────┘ - │ │ - ▼ │ - ┌────────────┐ │ - │ Layer 1 │ (12D input) │ - │ (CNN) │ = 4D + 8D │ - │ 12D → 4D │ │ - └─────┬──────┘ │ - │ 4D output │ - │ │ - ├───────────────────────────┘ - ▼ │ - ... │ - │ │ - ▼ │ - ┌────────────┐ │ - │ Layer N │ (12D input) │ - │ (output) │◄──────────────────┘ - │ 12D → 4D │ - └─────┬──────┘ - │ 4D (RGBA) - ▼ - Output -``` - -**Key Points:** -- Static features computed once, broadcast to all CNN layers -- Each layer: previous 4D output + 8D static → 12D input → 4D output -- Ping-pong buffering between layers -- Layer 0 special case: uses input RGBD instead of previous layer output - -**Static Features Texture:** -- Name: `static_features` -- Format: `texture_storage_2d<rgba32uint, write>` (4×u32) -- Data: 8 float16 values packed via `pack2x16float()` -- Computed once per frame, read by all CNN layers -- Lifetime: Entire frame (all CNN layer passes) - -**CNN Layers:** -- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels -- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels -- All layers: uniform 12D input, 4D output (ping-pong buffer) -- Storage: `texture_storage_2d<rgba32uint>` (4 channels as 2×f16 pairs) - -**Activation Functions:** -- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping -- Middle layers: `ReLU` (max(0, x)) -- Rationale: Sigmoid prevents gradient blocking at boundaries, enabling better convergence -- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible, retrain required - ---- - -## Static Features (7D + 1 bias) - -### Feature Layout - -**8 float16 values per pixel:** - -```wgsl -// Slot 0-3: Parametric features (p0, p1, p2, p3) -// Sampled from configurable mip level (0=original, 1=half, 2=quarter, 3=eighth) -// Training sets mip_level via --mip-level flag, stored in binary format v2 -let p0 = ...; // RGB.r from selected mip level -let p1 = ...; // RGB.g from selected mip level -let p2 = ...; // RGB.b from selected mip level -let p3 = ...; // Depth or RGB channel from mip level - -// Slot 4-5: UV coordinates (normalized screen space) -let uv_x = coord.x / resolution.x; // Horizontal position [0,1] -let uv_y = coord.y / resolution.y; // Vertical position [0,1] - -// Slot 6: Multi-frequency position encoding -let sin20_y = sin(20.0 * uv_y); // Periodic feature (frequency=20, vertical) - -// Slot 7: Bias dimension (always 1.0) -let bias = 1.0; // Learned bias per output channel - -// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0] -``` - -### Input Channel Mapping - -**Weight tensor layout (12 input channels per layer):** - -| Input Channel | Feature | Description | -|--------------|---------|-------------| -| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) | -| 4-11 | 
Static features | 8D: p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias | - -**Static feature channel details:** -- Channel 4 → p0 (RGB.r from mip level) -- Channel 5 → p1 (RGB.g from mip level) -- Channel 6 → p2 (RGB.b from mip level) -- Channel 7 → p3 (depth or RGB channel from mip level) -- Channel 8 → p4 (uv_x: normalized horizontal position) -- Channel 9 → p5 (uv_y: normalized vertical position) -- Channel 10 → p6 (sin(20*uv_y): periodic encoding) -- Channel 11 → p7 (bias: constant 1.0) - -**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7. - -### Feature Rationale - -| Feature | Dimension | Purpose | Priority | -|---------|-----------|---------|----------| -| p0-p3 | 4D | Parametric auxiliary features (mips, gradients, etc.) | Essential | -| UV coords | 2D | Spatial position awareness | Essential | -| sin(20\*uv.y) | 1D | Periodic position encoding (vertical) | Medium | -| Bias | 1D | Learned bias (standard NN) | Essential | - -**Note:** Input image RGBD (mip 0) fed only to Layer 0. Subsequent layers see static features + previous layer output. - -**Why bias as static feature:** -- Simpler shader code (single weight array) -- Standard NN formulation: y = Wx (x includes bias term) -- Saves 56-112 bytes (no separate bias buffer) -- 7 features sufficient for initial implementation - -### Future Feature Extensions - -**Option: Additional encodings:** -- `sin(40*uv.y)` - Higher frequency encoding -- `gray_mip1` - Multi-scale luminance -- `dx`, `dy` - Sobel gradients -- `variance` - Local texture measure -- `laplacian` - Edge detection - -**Option: uint8 packing (16+ features):** -```wgsl -// texture_storage_2d<rgba8unorm> stores 16 uint8 values -// Trade precision for feature count -// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, -// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias] -``` -Requires quantization-aware training. - ---- - -## Layer Structure - -### Example 3-Layer Network - -``` -Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel) -Layer 1: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel) -Layer 2: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA) -``` - -**Output:** 4 channels (RGBA). Training targets preserve alpha from target images. 
- -### Weight Calculations - -**Per-layer weights (uniform 12D→4D, 3×3 kernels):** -``` -Layer 0: 12 × 3 × 3 × 4 = 432 weights -Layer 1: 12 × 3 × 3 × 4 = 432 weights -Layer 2: 12 × 3 × 3 × 4 = 432 weights -Total: 1296 weights -``` - -**Storage sizes:** -- f32: 1296 × 4 = 5,184 bytes (~5.1 KB) -- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended** - -**Comparison to v1:** -- v1: ~800 weights (3.2 KB f32) -- v2: ~1296 weights (2.5 KB f16) -- **Uniform architecture, smaller than v1 f32** - -### Kernel Size Guidelines - -**1×1 kernel (pointwise):** -- No spatial context, channel mixing only -- Weights: `12 × 4 = 48` per layer -- Use for: Fast inference, channel remapping - -**3×3 kernel (standard conv):** -- Local spatial context (recommended) -- Weights: `12 × 9 × 4 = 432` per layer -- Use for: Most layers (balanced quality/size) - -**5×5 kernel (large receptive field):** -- Wide spatial context -- Weights: `12 × 25 × 4 = 1200` per layer -- Use for: Output layer, fine detail enhancement - -### Channel Storage (4×f16 per texel) - -```wgsl -@group(0) @binding(1) var layer_input: texture_2d<u32>; - -fn unpack_channels(coord: vec2<i32>) -> vec4<f32> { - let packed = textureLoad(layer_input, coord, 0); - let v0 = unpack2x16float(packed.x); // [ch0, ch1] - let v1 = unpack2x16float(packed.y); // [ch2, ch3] - return vec4<f32>(v0.x, v0.y, v1.x, v1.y); -} - -fn pack_channels(values: vec4<f32>) -> vec4<u32> { - return vec4<u32>( - pack2x16float(vec2(values.x, values.y)), - pack2x16float(vec2(values.z, values.w)), - 0u, // Unused - 0u // Unused - ); -} -``` - ---- - -## Training Workflow - -### Script: `training/train_cnn_v2.py` - -**Static Feature Extraction:** - -```python -def compute_static_features(rgb, depth, mip_level=0): - """Generate parametric features (8D: p0-p3 + spatial). 
- - Args: - mip_level: 0=original, 1=half res, 2=quarter res, 3=eighth res - """ - h, w = rgb.shape[:2] - - # Generate mip level for p0-p3 (downsample then upsample) - if mip_level > 0: - mip_rgb = rgb.copy() - for _ in range(mip_level): - mip_rgb = cv2.pyrDown(mip_rgb) - for _ in range(mip_level): - mip_rgb = cv2.pyrUp(mip_rgb) - if mip_rgb.shape[:2] != (h, w): - mip_rgb = cv2.resize(mip_rgb, (w, h), interpolation=cv2.INTER_LINEAR) - else: - mip_rgb = rgb - - # Parametric features from mip level - p0, p1, p2, p3 = mip_rgb[..., 0], mip_rgb[..., 1], mip_rgb[..., 2], depth - - # UV coordinates (normalized) - uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0) - uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1) - - # Multi-frequency position encoding - sin10_x = np.sin(10.0 * uv_x) - - # Bias dimension (always 1.0) - bias = np.ones_like(p0) - - # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin10_x, bias] - return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias], axis=-1) -``` - -**Network Definition:** - -```python -class CNNv2(nn.Module): - def __init__(self, kernel_sizes, num_layers=3): - super().__init__() - if isinstance(kernel_sizes, int): - kernel_sizes = [kernel_sizes] * num_layers - self.kernel_sizes = kernel_sizes - self.layers = nn.ModuleList() - - # All layers: 12D input (4 prev + 8 static) → 4D output - for kernel_size in kernel_sizes: - self.layers.append( - nn.Conv2d(12, 4, kernel_size=kernel_size, - padding=kernel_size//2, bias=False) - ) - - def forward(self, input_rgbd, static_features): - # Layer 0: input RGBD (4D) + static (8D) = 12D - x = torch.cat([input_rgbd, static_features], dim=1) - x = self.layers[0](x) - x = torch.sigmoid(x) # Soft [0,1] for layer 0 - - # Layer 1+: previous output (4D) + static (8D) = 12D - for i in range(1, len(self.layers)): - x_input = torch.cat([x, static_features], dim=1) - x = self.layers[i](x_input) - if i < len(self.layers) - 1: - x = F.relu(x) - else: - x = torch.sigmoid(x) # Soft [0,1] for final layer - - return x # RGBA output -``` - -**Training Configuration:** - -```python -# Hyperparameters -kernel_sizes = [3, 3, 3] # Per-layer kernel sizes (e.g., [1,3,5]) -num_layers = 3 # Number of CNN layers -mip_level = 0 # Mip level for p0-p3: 0=orig, 1=half, 2=quarter, 3=eighth -grayscale_loss = False # Compute loss on grayscale (Y) instead of RGBA -learning_rate = 1e-3 -batch_size = 16 -epochs = 5000 - -# Dataset: Input RGB, Target RGBA (preserves alpha channel from image) -# Model outputs RGBA, loss compares all 4 channels (or grayscale if --grayscale-loss) - -# Training loop (standard PyTorch f32) -for epoch in range(epochs): - for rgb_batch, depth_batch, target_batch in dataloader: - # Compute static features (8D) with mip level - static_feat = compute_static_features(rgb_batch, depth_batch, mip_level) - - # Input RGBD (4D) - input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1) - - # Forward pass - output = model(input_rgbd, static_feat) - - # Loss computation (grayscale or RGBA) - if grayscale_loss: - # Convert RGBA to grayscale: Y = 0.299*R + 0.587*G + 0.114*B - output_gray = 0.299 * output[:, 0:1] + 0.587 * output[:, 1:2] + 0.114 * output[:, 2:3] - target_gray = 0.299 * target[:, 0:1] + 0.587 * target[:, 1:2] + 0.114 * target[:, 2:3] - loss = criterion(output_gray, target_gray) - else: - loss = criterion(output, target_batch) - - # Backward pass - optimizer.zero_grad() - loss.backward() - optimizer.step() -``` - -**Checkpoint Format:** - -```python -torch.save({ - 'state_dict': model.state_dict(), # f32 weights - 
'config': { - 'kernel_sizes': [3, 3, 3], # Per-layer kernel sizes - 'num_layers': 3, - 'mip_level': 0, # Mip level used for p0-p3 - 'grayscale_loss': False, # Whether grayscale loss was used - 'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin10_x', 'bias'] - }, - 'epoch': epoch, - 'loss': loss.item() -}, f'checkpoints/checkpoint_epoch_{epoch}.pth') -``` - ---- - -## Export Workflow - -### Script: `training/export_cnn_v2_shader.py` - -**Process:** -1. Load checkpoint (f32 PyTorch weights) -2. Extract layer configs (kernels, channels) -3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)` -4. Generate WGSL shader per layer -5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl` - -**Example Generated Shader:** - -```wgsl -// cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth - -const KERNEL_SIZE: u32 = 1u; -const IN_CHANNELS: u32 = 8u; // 7 features + bias -const OUT_CHANNELS: u32 = 16u; - -// Weights quantized to float16 (stored as f32 in shader) -const weights: array<f32, 128> = array( - 0.123047, -0.089844, 0.234375, 0.456055, ... -); - -@group(0) @binding(0) var static_features: texture_2d<u32>; -@group(0) @binding(1) var output_texture: texture_storage_2d<rgba32uint, write>; - -@compute @workgroup_size(8, 8) -fn main(@builtin(global_invocation_id) id: vec3<u32>) { - // Load static features (8D) - let static_feat = get_static_features(vec2<i32>(id.xy)); - - // Convolution (1×1 kernel = pointwise) - var output: array<f32, OUT_CHANNELS>; - for (var c: u32 = 0u; c < OUT_CHANNELS; c++) { - var sum: f32 = 0.0; - for (var k: u32 = 0u; k < IN_CHANNELS; k++) { - sum += weights[c * IN_CHANNELS + k] * static_feat[k]; - } - output[c] = max(0.0, sum); // ReLU activation - } - - // Pack and store (8×f16 per texel) - textureStore(output_texture, vec2<i32>(id.xy), pack_f16x8(output)); -} -``` - -**Float16 Quantization:** -- Training uses f32 throughout (PyTorch standard) -- Export converts to np.float16, then back to f32 for WGSL literals -- **Expected discrepancy:** <0.1% MSE (acceptable) -- Validation via HTML tool (see below) - ---- - -## Validation Workflow - -### HTML Tool: `tools/cnn_v2_test/index.html` - -**WebGPU-based testing tool** with layer visualization. - -**Usage:** -1. Open `tools/cnn_v2_test/index.html` in browser -2. Drop `.bin` weights file (from `export_cnn_v2_weights.py`) -3. Drop PNG test image -4. 
View results with layer inspection - -**Features:** -- Live CNN inference with WebGPU -- Layer-by-layer visualization (static features + all CNN layers) -- Weight visualization (per-layer kernels) -- View modes: CNN output, original, diff (×10) -- Blend control for comparing with original - -**Export weights:** -```bash -./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \ - --output-weights workspaces/main/cnn_v2_weights.bin -``` - -See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation - ---- - -## Implementation Checklist - -### Phase 1: Shaders (Core Infrastructure) - -- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute - - [ ] RGBD sampling from framebuffer - - [ ] UV coordinate calculation - - [ ] sin(10\*uv.x) computation - - [ ] Bias dimension (constant 1.0) - - [ ] Float16 packing via `pack2x16float()` - - [ ] Output to `texture_storage_2d<rgba32uint>` - -- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template - - [ ] Static features unpacking - - [ ] Previous layer unpacking (8×f16) - - [ ] Convolution implementation (1×1, 3×3, 5×5) - - [ ] ReLU activation - - [ ] Output packing (8×f16) - - [ ] Proper padding handling - -### Phase 2: C++ Effect Class - -- [ ] `src/effects/cnn_v2_effect.h` - Header - - [ ] Class declaration inheriting from `PostProcessEffect` - - [ ] Static features texture member - - [ ] Layer textures vector - - [ ] Pipeline and bind group members - -- [ ] `src/effects/cnn_v2_effect.cc` - Implementation - - [ ] Constructor: Load shaders, create textures - - [ ] `init()`: Create pipelines, bind groups - - [ ] `render()`: Multi-pass execution - - [ ] Pass 0: Compute static features - - [ ] Pass 1-N: CNN layers - - [ ] Final: Composite to output - - [ ] Proper resource cleanup - -- [ ] Integration - - [ ] Add to `src/gpu/demo_effects.h` includes - - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal) - - [ ] Add shaders to `workspaces/main/assets.txt` - - [ ] Add to `src/tests/gpu/test_demo_effects.cc` - -### Phase 3: Training Pipeline - -- [ ] `training/train_cnn_v2.py` - Training script - - [ ] Static feature extraction function - - [ ] CNNv2 PyTorch model class - - [ ] Patch-based dataloader - - [ ] Training loop with checkpointing - - [ ] Command-line argument parsing - - [ ] Inference mode (ground truth generation) - -- [ ] `training/export_cnn_v2_shader.py` - Export script - - [ ] Checkpoint loading - - [ ] Weight extraction and f16 quantization - - [ ] Per-layer WGSL generation - - [ ] File output to workspace shaders/ - - [ ] Metadata preservation - -### Phase 4: Tools & Validation - -- [x] HTML validation tool - WebGPU inference with layer visualization - - [ ] Command-line argument parsing - - [ ] Shader export orchestration - - [ ] Build orchestration - - [ ] Batch image processing - - [ ] Results display - -- [ ] `src/tools/cnn_test_main.cc` - Tool updates - - [ ] Add `--cnn-version v2` flag - - [ ] CNNv2Effect instantiation path - - [ ] Static features pass execution - - [ ] Multi-layer processing - -### Phase 5: Documentation - -- [ ] `doc/HOWTO.md` - Usage guide - - [ ] Training section (CNN v2) - - [ ] Export section - - [ ] Validation section - - [ ] Examples - -- [ ] `README.md` - Project overview update - - [ ] Mention CNN v2 capability - ---- - -## File Structure - -### New Files - -``` -# Shaders (generated by export script) -workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl # Static features compute 
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl # Input layer (generated) -workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl # Inner layer (generated) -workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl # Output layer (generated) - -# C++ implementation -src/effects/cnn_v2_effect.h # Effect class header -src/effects/cnn_v2_effect.cc # Effect implementation - -# Python training/export -training/train_cnn_v2.py # Training script -training/export_cnn_v2_shader.py # Shader generator -training/validation/ # Test images directory - -# Validation -tools/cnn_v2_test/index.html # WebGPU validation tool - -# Documentation -doc/CNN_V2.md # This file -``` - -### Modified Files - -``` -src/gpu/demo_effects.h # Add CNNv2Effect include -CMakeLists.txt # Add cnn_v2_effect.cc -workspaces/main/assets.txt # Add cnn_v2 shaders -workspaces/main/timeline.seq # Optional: add CNNv2Effect -src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case -src/tools/cnn_test_main.cc # Add --cnn-version v2 -doc/HOWTO.md # Add CNN v2 sections -TODO.md # Add CNN v2 task -``` - -### Unchanged (v1 Preserved) - -``` -training/train_cnn.py # Original training -src/effects/cnn_effect.* # Original effect -workspaces/main/shaders/cnn_*.wgsl # Original v1 shaders -``` - ---- - -## Performance Characteristics - -### Static Features Compute -- **Cost:** ~0.1ms @ 1080p -- **Frequency:** Once per frame -- **Operations:** sin(), texture sampling, packing - -### CNN Layers (Example 3-layer) -- **Layer0 (1×1, 8→16):** ~0.3ms -- **Layer1 (3×3, 23→8):** ~0.8ms -- **Layer2 (5×5, 15→4):** ~1.2ms -- **Total:** ~2.4ms @ 1080p - -### Memory Usage -- Static features: 1920×1080×8×2 = 33 MB (f16) -- Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels) -- Weights: ~6.4 KB (f16, in shader code) -- **Total GPU memory:** ~100 MB - ---- - -## Size Budget - -### CNN v1 vs v2 - -| Metric | v1 | v2 | Delta | -|--------|----|----|-------| -| Weights (count) | 800 | 3268 | +2468 | -| Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB | -| Storage (f16) | N/A | 6.5 KB | +6.5 KB | -| Shader code | ~500 lines | ~800 lines | +300 lines | - -### Mitigation Strategies - -**Reduce channels:** -- [16,8,4] → [8,4,4] saves ~50% weights -- [16,8,4] → [4,4,4] saves ~60% weights - -**Smaller kernels:** -- [1,3,5] → [1,3,3] saves ~30% weights -- [1,3,5] → [1,1,3] saves ~50% weights - -**Quantization:** -- int8 weights: saves 75% (requires QAT training) -- 4-bit weights: saves 87.5% (extreme, needs research) - -**Target:** Keep CNN v2 under 10 KB for 64k demo constraint - ---- - -## Future Extensions - -### Flexible Feature Layout (Binary Format v3) - -**TODO:** Support arbitrary feature vector layouts and ordering in binary format. - -**Current Limitation:** -- Feature layout hardcoded: `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]` -- Shader must match training script exactly -- Experimentation requires shader recompilation - -**Proposed Enhancement:** -- Add feature descriptor to binary format header -- Specify feature types, sources, and ordering -- Runtime shader generation or dynamic feature indexing -- Examples: `[R, G, B, dx, dy, uv_x, bias]` or `[mip1.r, mip2.g, laplacian, uv_x, sin20_x, bias]` - -**Benefits:** -- Training experiments without C++/shader changes -- A/B test different feature combinations -- Single binary format, multiple architectures -- Faster iteration on feature engineering - -**Implementation Options:** -1. **Static approach:** Generate shader code from descriptor at load time -2. 
**Dynamic approach:** Array-based indexing with feature map uniform -3. **Hybrid:** Precompile common layouts, fallback to dynamic - -See `doc/CNN_V2_BINARY_FORMAT.md` for proposed descriptor format. - ---- - -### More Features (uint8 Packing) - -```wgsl -// 16 uint8 features per texel (texture_storage_2d<rgba8unorm>) -// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, -// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias] -``` -- Trade precision for quantity -- Requires quantization-aware training - -### Temporal Features - -- Previous frame RGBA (motion awareness) -- Optical flow vectors -- Requires multi-frame buffer - -### Learned Position Encodings - -- Replace hand-crafted sin(10\*uv) with learned embeddings -- Requires separate embedding network -- Similar to NeRF position encoding - -### Dynamic Architecture - -- Runtime kernel size selection based on scene -- Conditional layer execution (skip connections) -- Layer pruning for performance - ---- - -## References - -- **v1 Implementation:** `src/effects/cnn_effect.*` -- **Training Guide:** `doc/HOWTO.md` (CNN Training section) -- **Test Tool:** `doc/CNN_TEST_TOOL.md` -- **Shader System:** `doc/SEQUENCE.md` -- **Size Measurement:** `doc/SIZE_MEASUREMENT.md` - ---- - -## Appendix: Design Decisions - -### Why Bias as Static Feature? - -**Alternatives considered:** -1. Separate bias array per layer (Option B) -2. Bias as static feature = 1.0 (Option A, chosen) - -**Decision rationale:** -- Simpler shader code (fewer bindings) -- Standard NN formulation (augmented input) -- Saves 56-112 bytes per model -- 7 features sufficient for v1 implementation -- Can extend to uint8 packing if >7 features needed - -### Why Float16 for Weights? - -**Alternatives considered:** -1. Keep f32 (larger, more accurate) -2. Use f16 (smaller, GPU-native) -3. Use int8 (smallest, needs QAT) - -**Decision rationale:** -- f16 saves 50% vs f32 (critical for 64k target) -- GPU-native support (pack2x16float in WGSL) -- <0.1% accuracy loss (acceptable) -- Simpler than int8 quantization - -### Why Multi-Frequency Position Encoding? - -**Inspiration:** NeRF (Neural Radiance Fields) - -**Benefits:** -- Helps network learn high-frequency details -- Better than raw UV coordinates -- Small footprint (1D per frequency) - -**Future:** Add sin(20\*uv), sin(40\*uv) if >7 features available - ---- - -## Related Documentation - -- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format) -- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization -- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated) -- `doc/HOWTO.md` - Training and validation workflows - ---- - -**Document Version:** 1.0 -**Last Updated:** 2026-02-12 -**Status:** Design approved, ready for implementation diff --git a/doc/CNN_V2_BINARY_FORMAT.md b/doc/CNN_V2_BINARY_FORMAT.md deleted file mode 100644 index 59c859d..0000000 --- a/doc/CNN_V2_BINARY_FORMAT.md +++ /dev/null @@ -1,235 +0,0 @@ -# CNN v2 Binary Weight Format Specification - -Binary format for storing trained CNN v2 weights with static feature architecture. 
- -**File Extension:** `.bin` -**Byte Order:** Little-endian -**Version:** 2.0 (supports mip-level for parametric features) -**Backward Compatible:** Version 1.0 files supported (mip_level=0) - ---- - -## File Structure - -**Version 2 (current):** -``` -┌─────────────────────┐ -│ Header (20 bytes) │ -├─────────────────────┤ -│ Layer Info │ -│ (20 bytes × N) │ -├─────────────────────┤ -│ Weight Data │ -│ (variable size) │ -└─────────────────────┘ -``` - -**Version 1 (legacy):** -``` -┌─────────────────────┐ -│ Header (16 bytes) │ -├─────────────────────┤ -│ Layer Info │ -│ (20 bytes × N) │ -├─────────────────────┤ -│ Weight Data │ -│ (variable size) │ -└─────────────────────┘ -``` - ---- - -## Header - -**Version 2 (20 bytes):** - -| Offset | Type | Field | Description | -|--------|------|----------------|--------------------------------------| -| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | -| 0x04 | u32 | version | Format version (2 for current) | -| 0x08 | u32 | num_layers | Number of CNN layers (excludes static features) | -| 0x0C | u32 | total_weights | Total f16 weight count across all layers | -| 0x10 | u32 | mip_level | Mip level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) | - -**Version 1 (16 bytes) - Legacy:** - -| Offset | Type | Field | Description | -|--------|------|----------------|--------------------------------------| -| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | -| 0x04 | u32 | version | Format version (1) | -| 0x08 | u32 | num_layers | Number of CNN layers | -| 0x0C | u32 | total_weights | Total f16 weight count | - -**Note:** Loaders should check version field and handle both formats. Version 1 files treated as mip_level=0. - ---- - -## Layer Info (20 bytes per layer) - -Repeated `num_layers` times: -- **Version 2:** Starting at offset 0x14 (20 bytes) -- **Version 1:** Starting at offset 0x10 (16 bytes) - -| Offset | Type | Field | Description | -|-------------|------|----------------|--------------------------------------| -| 0x00 | u32 | kernel_size | Convolution kernel dimension (3, 5, 7, etc.) | -| 0x04 | u32 | in_channels | Input channel count (includes 8 static features for Layer 1) | -| 0x08 | u32 | out_channels | Output channel count (max 8) | -| 0x0C | u32 | weight_offset | Weight array start index (f16 units, relative to weight data section) | -| 0x10 | u32 | weight_count | Number of f16 weights for this layer | - -**Layer Order:** Sequential (Layer 1, Layer 2, Layer 3, ...) - ---- - -## Weight Data (variable size) - -Starts at offset: -- **Version 2:** `20 + (num_layers × 20)` -- **Version 1:** `16 + (num_layers × 20)` - -**Format:** Packed f16 pairs stored as u32 -**Packing:** `u32 = (f16_hi << 16) | f16_lo` -**Storage:** Sequential by layer, then by output channel, input channel, spatial position - -**Weight Indexing:** -``` -weight_idx = output_ch × (in_channels × kernel_size²) + - input_ch × kernel_size² + - (ky × kernel_size + kx) -``` - -Where: -- `output_ch` ∈ [0, out_channels) -- `input_ch` ∈ [0, in_channels) -- `ky`, `kx` ∈ [0, kernel_size) - -**Unpacking f16 from u32:** -```c -uint32_t packed = weights_buffer[weight_idx / 2]; -uint16_t f16_bits = (weight_idx % 2 == 0) ? 
(packed & 0xFFFF) : (packed >> 16); -``` - ---- - -## Example: 3-Layer Network (Version 2) - -**Configuration:** -- Mip level: 0 (original resolution) -- Layer 0: 12→4, kernel 3×3 (432 weights) -- Layer 1: 12→4, kernel 3×3 (432 weights) -- Layer 2: 12→4, kernel 3×3 (432 weights) - -**File Layout:** -``` -Offset Size Content ------- ---- ------- -0x00 20 Header (magic, version=2, layers=3, weights=1296, mip_level=0) -0x14 20 Layer 0 info (kernel=3, in=12, out=4, offset=0, count=432) -0x28 20 Layer 1 info (kernel=3, in=12, out=4, offset=432, count=432) -0x3C 20 Layer 2 info (kernel=3, in=12, out=4, offset=864, count=432) -0x50 2592 Weight data (1296 u32 packed f16 pairs) - ---- -Total: 2672 bytes (~2.6 KB) -``` - ---- - -## Static Features - -Not stored in .bin file (computed at runtime): - -**8D Input Features:** -1. **p0** - Parametric feature 0 (from mip level) -2. **p1** - Parametric feature 1 (from mip level) -3. **p2** - Parametric feature 2 (from mip level) -4. **p3** - Parametric feature 3 (depth or from mip level) -5. **UV_X** - Normalized x coordinate [0,1] -6. **UV_Y** - Normalized y coordinate [0,1] -7. **sin(20 × UV_Y)** - Spatial frequency encoding (vertical, frequency=20) -8. **1.0** - Bias term - -**Mip Level Usage (p0-p3):** -- `mip_level=0`: RGB from original resolution (mip 0) -- `mip_level=1`: RGB from half resolution (mip 1), upsampled -- `mip_level=2`: RGB from quarter resolution (mip 2), upsampled -- `mip_level=3`: RGB from eighth resolution (mip 3), upsampled - -**Layer 0** receives input RGBD (4D) + static features (8D) = 12D input → 4D output. -**Layer 1+** receive previous layer output (4D) + static features (8D) = 12D input → 4D output. - ---- - -## Validation - -**Magic Check:** -```c -uint32_t magic; -fread(&magic, 4, 1, fp); -if (magic != 0x32_4E_4E_43) { error("Invalid CNN v2 file"); } -``` - -**Version Check:** -```c -uint32_t version; -fread(&version, 4, 1, fp); -if (version != 1 && version != 2) { error("Unsupported version"); } -uint32_t header_size = (version == 1) ? 16 : 20; -``` - -**Size Check:** -```c -expected_size = header_size + (num_layers × 20) + (total_weights × 2); -if (file_size != expected_size) { error("Size mismatch"); } -``` - -**Weight Offset Sanity:** -```c -// Each layer's offset should match cumulative count -uint32_t cumulative = 0; -for (int i = 0; i < num_layers; i++) { - if (layers[i].weight_offset != cumulative) { error("Invalid offset"); } - cumulative += layers[i].weight_count; -} -if (cumulative != total_weights) { error("Total mismatch"); } -``` - ---- - -## Future Extensions - -**TODO: Flexible Feature Layout** - -Current limitation: Feature vector layout is hardcoded as `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]`. - -Proposed enhancement for version 3: -- Add feature descriptor section to header -- Specify feature count, types, and ordering -- Support arbitrary 7D feature combinations (e.g., `[R, G, B, dx, dy, uv_x, bias]`) -- Allow runtime shader generation based on descriptor -- Enable experimentation without recompiling shaders - -Example descriptor format: -``` -struct FeatureDescriptor { - u32 feature_count; // Number of features (typically 7-8) - u32 feature_types[8]; // Type enum per feature - u32 feature_sources[8]; // Source enum (mip0, mip1, gradient, etc.) 
- u32 reserved[8]; // Future use -} -``` - -Benefits: -- Training can experiment with different feature combinations -- No shader recompilation needed -- Single binary format supports multiple architectures -- Easier A/B testing of feature effectiveness - ---- - -## Related Files - -- `training/export_cnn_v2_weights.py` - Binary export tool -- `src/effects/cnn_v2_effect.cc` - C++ loader -- `tools/cnn_v2_test/index.html` - WebGPU validator -- `doc/CNN_V2.md` - Architecture design diff --git a/doc/CNN_V2_DEBUG_TOOLS.md b/doc/CNN_V2_DEBUG_TOOLS.md deleted file mode 100644 index 8d1289a..0000000 --- a/doc/CNN_V2_DEBUG_TOOLS.md +++ /dev/null @@ -1,143 +0,0 @@ -# CNN v2 Debugging Tools - -Tools for investigating CNN v2 mismatch between HTML tool and cnn_test. - ---- - -## Identity Weight Generator - -**Purpose:** Generate trivial .bin files with identity passthrough for debugging. - -**Script:** `training/gen_identity_weights.py` - -**Usage:** -```bash -# 1×1 identity (default) -./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity.bin - -# 3×3 identity -./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3 - -# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc) -./training/gen_identity_weights.py output.bin --mix - -# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3 -./training/gen_identity_weights.py output.bin --p47 - -# Custom mip level -./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2 -``` - -**Output:** -- Single layer, 12D→4D (4 input channels + 8 static features) -- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3} -- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow) -- Static mode (--p47): Output Ch{i} = Input Ch{i+4} (static features only, visualizes p4-p7) -- Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3) - -**Validation:** -Load in HTML tool or cnn_test - output should match input (RGB only, ignoring static features). - ---- - -## Composited Layer Visualization - -**Purpose:** Save current layer view as single composited image (4 channels side-by-side, grayscale). - -**Location:** HTML tool - "Layer Visualization" panel - -**Usage:** -1. Load image + weights in HTML tool -2. Select layer to visualize (Static 0-3, Static 4-7, Layer 0, Layer 1, etc.) -3. Click "Save Composited" button -4. Downloads PNG: `composited_layer{N}_{W}x{H}.png` - -**Output:** -- 4 channels stacked horizontally -- Grayscale representation -- Useful for comparing layer activations across tools - ---- - -## Debugging Strategy - -### Track a) Binary Conversion Chain - -**Hypothesis:** Conversion error in .bin ↔ base64 ↔ Float32Array - -**Test:** -1. Generate identity weights: - ```bash - ./training/gen_identity_weights.py workspaces/main/weights/test_identity.bin - ``` - -2. Load in HTML tool - output should match input RGB - -3. If mismatch: - - Check Python export: f16 packing in `export_cnn_v2_weights.py` line 105 - - Check HTML parsing: `unpackF16()` in `index.html` line 805-815 - - Check weight indexing: `get_weight()` shader function - -**Key locations:** -- Python: `np.float16` → `view(np.uint32)` (line 105 of export script) -- JS: `DataView` → `unpackF16()` → manual f16 decode (line 773-803) -- WGSL: `unpack2x16float()` built-in (line 492 of shader) - -### Track b) Layer Visualization - -**Purpose:** Confirm layer outputs match between HTML and C++ - -**Method:** -1. Run identical input through both tools -2. Save composited layers from HTML tool -3. 
Compare with cnn_test output -4. Use identity weights to isolate weight loading from computation - -### Track c) Trivial Test Case - -**Use identity weights to test:** -- Weight loading (binary parsing) -- Feature generation (static features) -- Convolution (should be passthrough) -- Output packing - -**Expected behavior:** -- Input RGB → Output RGB (exact match) -- Static features ignored (all zeros in identity matrix) - ---- - -## Known Issues - -### ~~Layer 0 Visualization Scale~~ [FIXED] - -**Issue:** Layer 0 output displayed at 0.5× brightness (divided by 2). - -**Cause:** Line 1530 used `vizScale = 0.5` for all CNN layers, but Layer 0 is clamped [0,1] and doesn't need dimming. - -**Fix:** Use scale 1.0 for Layer 0 output (layerIdx=1), 0.5 only for middle layers (ReLU, unbounded). - -### Remaining Mismatch - -**Current:** HTML tool and cnn_test produce different outputs for same input/weights. - -**Suspects:** -1. F16 unpacking difference (CPU vs GPU vs JS) -2. Static feature generation (RGBD, UV, sin encoding) -3. Convolution kernel iteration order -4. Output packing/unpacking - -**Next steps:** -1. Test with identity weights (eliminates weight loading) -2. Compare composited layer outputs -3. Add debug visualization for static features -4. Hex dump comparison (first 8 pixels) - use `--debug-hex` flag in cnn_test - ---- - -## Related Documentation - -- `doc/CNN_V2.md` - CNN v2 architecture -- `doc/CNN_V2_WEB_TOOL.md` - HTML tool documentation -- `doc/CNN_TEST_TOOL.md` - cnn_test CLI tool -- `training/export_cnn_v2_weights.py` - Binary export format diff --git a/doc/CNN_V2_WEB_TOOL.md b/doc/CNN_V2_WEB_TOOL.md deleted file mode 100644 index b6f5b0b..0000000 --- a/doc/CNN_V2_WEB_TOOL.md +++ /dev/null @@ -1,348 +0,0 @@ -# CNN v2 Web Testing Tool - -Browser-based WebGPU tool for validating CNN v2 inference with layer visualization and weight inspection. 
- -**Location:** `tools/cnn_v2_test/index.html` - ---- - -## Status (2026-02-13) - -**Working:** -- ✅ WebGPU initialization and device setup -- ✅ Binary weight file parsing (v1 and v2 formats) -- ✅ Automatic mip-level detection from binary format v2 -- ✅ Weight statistics (min/max per layer) -- ✅ UI layout with collapsible panels -- ✅ Mode switching (Activations/Weights tabs) -- ✅ Canvas context management (2D for weights, WebGPU for activations) -- ✅ Weight visualization infrastructure (layer selection, grid layout) -- ✅ Layer naming matches codebase convention (Layer 0, Layer 1, Layer 2) -- ✅ Static features split visualization (Static 0-3, Static 4-7) -- ✅ All layers visible including output layer (Layer 2) -- ✅ Video playback support (MP4, WebM) with frame-by-frame controls -- ✅ Video looping (automatic continuous playback) -- ✅ Mip level selection (p0-p3 features at different resolutions) - -**Recent Changes (Latest):** -- Binary format v2 support: Reads mip_level from 20-byte header -- Backward compatible: v1 (16-byte header) → mip_level=0 -- Auto-update UI dropdown when loading weights with mip_level -- Display mip_level in metadata panel -- Code refactoring: Extracted FULLSCREEN_QUAD_VS shader (reused 3× across pipelines) -- Added helper methods: `getDimensions()`, `setVideoControlsEnabled()` -- Improved code organization with section headers and comments -- Moved Mip Level selector to bottom of left sidebar (removed "Features (p0-p3)" label) -- Added `loop` attribute to video element for automatic continuous playback - -**Previous Fixes:** -- Fixed Layer 2 not appearing (was excluded from layerOutputs due to isOutput check) -- Fixed canvas context switching (force clear before recreation) -- Added Static 0-3 / Static 4-7 buttons to view all 8 static feature channels -- Aligned naming with train_cnn_v2.py/.wgsl: Layer 0, Layer 1, Layer 2 (not Layer 1, 2, 3) -- Disabled Static buttons in weights mode (no learnable weights) - -**Known Issues:** -- Layer activation visualization may show black if texture data not properly unpacked -- Weight kernel display depends on correct 2D context creation after canvas recreation - ---- - -## Architecture - -### File Structure -- Single-file HTML tool (~1100 lines) -- Embedded shaders: STATIC_SHADER, CNN_SHADER, DISPLAY_SHADER, LAYER_VIZ_SHADER -- Shared WGSL component: FULLSCREEN_QUAD_VS (reused across render pipelines) -- **Embedded default weights:** DEFAULT_WEIGHTS_B64 (base64-encoded binary v2) - - Current: 4 layers (3×3, 5×5, 3×3, 3×3), 2496 f16 weights, mip_level=2 - - Source: `workspaces/main/weights/cnn_v2_weights.bin` - - Updates: Re-encode binary with `base64 -i <file>` and update constant -- Pure WebGPU (no external dependencies) - -### Code Organization - -**Recent Refactoring (2026-02-13):** -- Extracted `FULLSCREEN_QUAD_VS` constant: Reused fullscreen quad vertex shader (2 triangles covering NDC) -- Added helper methods to CNNTester class: - - `getDimensions()`: Returns current source dimensions (video or image) - - `setVideoControlsEnabled(enabled)`: Centralized video control enable/disable -- Consolidated duplicate vertex shader code (used in mipmap generation, display, layer visualization) -- Added section headers in JavaScript for better navigation -- Improved inline comments explaining shader architecture - -**Benefits:** -- Reduced code duplication (~40 lines saved) -- Easier maintenance (single source of truth for fullscreen quad) -- Clearer separation of concerns - -### Key Components - -**1. 
Weight Parsing** -- Reads binary format v2: header (20B) + layer info (20B×N) + f16 weights -- Backward compatible with v1: header (16B), mip_level defaults to 0 -- Computes min/max per layer via f16 unpacking -- Stores `{ layers[], weights[], mipLevel, fileSize }` -- Auto-sets UI mip-level dropdown from loaded weights - -**2. CNN Pipeline** -- Static features computation (RGBD + UV + sin + bias → 7D packed) -- Layer-by-layer convolution with storage buffer weights -- Ping-pong buffers for intermediate results -- Copy to persistent textures for visualization - -**3. Visualization Modes** - -**Activations Mode:** -- 4 grayscale views per layer (channels 0-3 of up to 8 total) -- WebGPU compute → unpack f16 → scale → grayscale -- Auto-scale: Static features = 1.0, CNN layers = 0.2 -- Static features: Shows R,G,B,D (first 4 of 8: RGBD+UV+sin+bias) -- CNN layers: Shows first 4 output channels - -**Weights Mode:** -- 2D canvas rendering per output channel -- Shows all input kernels horizontally -- Normalized by layer min/max → [0, 1] → grayscale -- 20px cells, 2px padding between kernels - -### Texture Management - -**Persistent Storage (layerTextures[]):** -- One texture per layer output (static + all CNN layers) -- `rgba32uint` format (packed f16 data) -- `COPY_DST` usage for storing results - -**Compute Buffers (computeTextures[]):** -- 2 textures for ping-pong computation -- Reused across all layers -- `COPY_SRC` usage for copying to persistent storage - -**Pipeline:** -``` -Static pass → copy to layerTextures[0] -For each CNN layer i: - Compute (ping-pong) → copy to layerTextures[i+1] -``` - -### Layer Indexing - -**UI Layer Buttons:** -- "Static" → layerOutputs[0] (7D input features) -- "Layer 1" → layerOutputs[1] (CNN layer 1 output, uses weights.layers[0]) -- "Layer 2" → layerOutputs[2] (CNN layer 2 output, uses weights.layers[1]) -- "Layer N" → layerOutputs[N] (CNN layer N output, uses weights.layers[N-1]) - -**Weights Table:** -- "Layer 1" → weights.layers[0] (first CNN layer weights) -- "Layer 2" → weights.layers[1] (second CNN layer weights) -- "Layer N" → weights.layers[N-1] - -**Consistency:** Both UI and weights table use same numbering (1, 2, 3...) for CNN layers. - ---- - -## Known Issues - -### Issue #1: Layer Activations Show Black - -**Symptom:** -- All 4 channel canvases render black -- UV gradient test (debug mode 10) works -- Raw packed data test (mode 11) shows black -- Unpacked f16 test (mode 12) shows black - -**Diagnosis:** -- Texture access works (UV gradient visible) -- Texture data is all zeros (packed.x = 0) -- Textures being read are empty - -**Root Cause:** -- `copyTextureToTexture` operations may not be executing -- Possible ordering issue (copies not submitted before visualization) -- Alternative: textures created with wrong usage flags - -**Investigation Steps Taken:** -1. Added `onSubmittedWorkDone()` wait before visualization -2. Verified texture creation with `COPY_SRC` and `COPY_DST` flags -3. Confirmed separate texture allocation per layer (no aliasing) -4. Added debug shader modes to isolate issue - -**Next Steps:** -- Verify encoder contains copy commands (add debug logging) -- Check if compute passes actually write data (add known-value test) -- Test copyTextureToTexture in isolation -- Consider CPU readback to verify texture contents - -### Issue #2: Weight Visualization Empty - -**Symptom:** -- Canvases created with correct dimensions (logged) -- No visual output (black canvases) -- Console logs show method execution - -**Potential Causes:** -1. 
Weight indexing calculation incorrect -2. Canvas not properly attached to DOM when rendering -3. 2D context operations not flushing -4. Min/max normalization producing black (all values equal?) - -**Debug Added:** -- Comprehensive logging of dimensions, indices, ranges -- Canvas context check before rendering - -**Next Steps:** -- Add test rendering (fixed gradient) to verify 2D context works -- Log sample weight values to verify data access -- Check if canvas is visible in DOM inspector -- Verify min/max calculation produces valid range - ---- - -## UI Layout - -### Header -- Controls: Blend slider, Depth input, View mode display -- Drop zone for .bin weight files - -### Content Area - -**Left Sidebar (300px):** -1. Drop zone for .bin weight files -2. Weights Info panel (file size, layer table with min/max) -3. Weights Visualization panel (per-layer kernel display) -4. **Mip Level selector** (bottom) - Select p0/p1/p2 for static features - -**Main Canvas (center):** -- CNN output display with video controls (Play/Pause, Frame ◄/►) -- Supports both PNG images and video files (MP4, WebM) -- Video loops automatically for continuous playback - -**Right Sidebar (panels):** -1. **Layer Visualization Panel** (top, flex: 1) - - Layer selection buttons (Static 0-3, Static 4-7, Layer 0, Layer 1, ...) - - 2×2 grid of channel views (grayscale activations) - - 4× zoom view at bottom - -### Footer -- Status line (GPU timing, dimensions, mode) -- Console log (scrollable, color-coded) - ---- - -## Shader Details - -### LAYER_VIZ_SHADER - -**Purpose:** Display single channel from packed layer texture - -**Inputs:** -- `@binding(0) layer_tex: texture_2d<u32>` - Packed f16 layer data -- `@binding(1) viz_params: vec2<f32>` - (channel_idx, scale) - -**Debug Modes:** -- Channel 10: UV gradient (texture coordinate test) -- Channel 11: Raw packed u32 data -- Channel 12: First unpacked f16 value - -**Normal Operation:** -- Unpack all 8 f16 channels from rgba32uint -- Select channel by index (0-7) -- Apply scale factor (1.0 for static, 0.2 for CNN) -- Clamp to [0, 1] and output grayscale - -**Scale Rationale:** -- Static features (RGBD, UV): already in [0, 1] range -- CNN activations: post-ReLU [0, ~5], need scaling for visibility - ---- - -## Binary Weight Format - -See `doc/CNN_V2_BINARY_FORMAT.md` for complete specification. - -**Quick Summary:** -- Header: 16 bytes (magic, version, layer count, total weights) -- Layer info: 20 bytes × N (kernel size, channels, offsets) -- Weights: Packed f16 pairs as u32 - ---- - -## Testing Workflow - -### Load & Parse -1. Drop PNG image → displays original -2. Drop .bin weights → parses and shows info table -3. Auto-runs CNN pipeline - -### Verify Pipeline -1. Check console for "Running CNN pipeline" -2. Verify "Completed in Xms" -3. Check "Layer visualization ready: N layers" - -### Debug Activations -1. Select "Activations" tab -2. Click layer buttons to switch -3. Check console for texture/canvas logs -4. If black: note which debug modes work (UV vs data) - -### Debug Weights -1. Select "Weights" tab -2. Click Layer 1 or Layer 2 (Layer 0 has no weights) -3. Check console for "Visualizing Layer N weights" -4. Check canvas dimensions logged -5. 
Verify weight range is non-trivial (not [0, 0]) - ---- - -## Integration with Main Project - -**Training Pipeline:** -```bash -# Generate weights -./training/train_cnn_v2.py --export-binary - -# Test in browser -open tools/cnn_v2_test/index.html -# Drop: workspaces/main/cnn_v2_weights.bin -# Drop: training/input/test.png -``` - -**Validation:** -- Compare against demo CNNv2Effect (visual check) -- Verify layer count matches binary file -- Check weight ranges match training logs - ---- - -## Future Enhancements - -- [ ] Fix layer activation visualization (black texture issue) -- [ ] Fix weight kernel display (empty canvas issue) -- [ ] Add per-channel auto-scaling (compute min/max from visible data) -- [ ] Export rendered outputs (download PNG) -- [ ] Side-by-side comparison with original -- [ ] Heatmap mode (color-coded activations) -- [ ] Weight statistics overlay (mean, std, sparsity) -- [ ] Batch processing (multiple images in sequence) -- [ ] Integration with Python training (live reload) - ---- - -## Code Metrics - -- Total lines: ~1100 -- JavaScript: ~700 lines -- WGSL shaders: ~300 lines -- HTML/CSS: ~100 lines - -**Dependencies:** None (pure WebGPU + HTML5) - ---- - -## Related Files - -- `doc/CNN_V2.md` - CNN v2 architecture and design -- `doc/CNN_TEST_TOOL.md` - C++ offline testing tool (deprecated) -- `training/train_cnn_v2.py` - Training script with binary export -- `workspaces/main/cnn_v2_weights.bin` - Trained weights diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md index 55fac50..8d30cca 100644 --- a/doc/COMPLETED.md +++ b/doc/COMPLETED.md @@ -67,7 +67,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. - **Changes**: - Added `get_common_uniforms()` helper to Effect base class - Refactored all render()/compute() signatures from 5 parameters to single `CommonPostProcessUniforms&` - - Fixed uninitialized uniforms in CircleMaskEffect and CNNEffect + - Fixed uninitialized uniforms in CircleMaskEffect and CNNv1Effect - Updated 19 effect implementations + headers - Fixed WGSL syntax error in FlashEffect (u.audio_intensity → audio_intensity) - **Impact**: @@ -93,7 +93,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. - All 36 tests pass (100%) - Processes 64×64 test image successfully - Ready for ground-truth validation vs Python training script - - Documented in `doc/CNN_TEST_TOOL.md` + - Documented in `cnn_v1/docs/CNN_TEST_TOOL.md` ## Recently Completed (February 10, 2026) @@ -103,7 +103,7 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. 
- Created `BindGroupLayoutBuilder` and `BindGroupBuilder` for declarative bind group creation - Created `RenderPipelineBuilder` to simplify pipeline setup with ShaderComposer integration - Created `SamplerCache` singleton to deduplicate sampler instances - - Refactored `post_process_helper.cc`, `cnn_effect.cc`, `rotating_cube_effect.cc` + - Refactored `post_process_helper.cc`, `cnn_v1_effect.cc`, `rotating_cube_effect.cc` - **Result**: - Bind group creation: 19 instances reduced from 14→4 lines each - Pipeline creation: 30-50 lines reduced to 8 lines diff --git a/doc/HOWTO.md b/doc/HOWTO.md index 0dc9ec7..4cafaa2 100644 --- a/doc/HOWTO.md +++ b/doc/HOWTO.md @@ -100,7 +100,7 @@ make run_util_tests # Utility tests Extracts patches at salient points, trains on center pixels only (matches WGSL sliding window): ```bash # Train with 32×32 patches at detected corners/edges -./training/train_cnn.py \ +./cnn_v1/training/train_cnn.py \ --input training/input/ --target training/output/ \ --patch-size 32 --patches-per-image 64 --detector harris \ --layers 3 --kernel_sizes 3,5,3 --epochs 5000 --batch_size 16 \ @@ -117,7 +117,7 @@ Extracts patches at salient points, trains on center pixels only (matches WGSL s ### Full-Image Processes entire image with sliding window (matches WGSL): ```bash -./training/train_cnn.py \ +./cnn_v1/training/train_cnn.py \ --input training/input/ --target training/output/ \ --layers 3 --kernel_sizes 3,5,3 --epochs 10000 --batch_size 8 \ --checkpoint-every 1000 @@ -126,10 +126,10 @@ Processes entire image with sliding window (matches WGSL): ### Export & Validation ```bash # Generate shaders from checkpoint -./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth +./cnn_v1/training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth # Generate ground truth (sliding window, no tiling) -./training/train_cnn.py --infer input.png \ +./cnn_v1/training/train_cnn.py --infer input.png \ --export-only checkpoints/checkpoint_epoch_5000.pth \ --output ground_truth.png ``` @@ -145,31 +145,31 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Complete Pipeline** (recommended): ```bash # Train → Export → Build → Validate (default config) -./scripts/train_cnn_v2_full.sh +./cnn_v2/scripts/train_cnn_v2_full.sh # Rapid debug (1 layer, 3×3, 5 epochs) -./scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin +./cnn_v2/scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin # Custom training parameters -./scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100 +./cnn_v2/scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100 # Custom architecture -./scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1 +./cnn_v2/scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1 # Custom output path -./scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin +./cnn_v2/scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin # Grayscale loss (compute loss on luminance instead of RGBA) -./scripts/train_cnn_v2_full.sh --grayscale-loss +./cnn_v2/scripts/train_cnn_v2_full.sh --grayscale-loss # Custom directories -./scripts/train_cnn_v2_full.sh --input training/input --target training/target_2 +./cnn_v2/scripts/train_cnn_v2_full.sh --input training/input --target training/target_2 # Full-image mode (instead of patch-based) 
-./scripts/train_cnn_v2_full.sh --full-image --image-size 256 +./cnn_v2/scripts/train_cnn_v2_full.sh --full-image --image-size 256 # See all options -./scripts/train_cnn_v2_full.sh --help +./cnn_v2/scripts/train_cnn_v2_full.sh --help ``` **Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector). @@ -184,33 +184,33 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Validation Only** (skip training): ```bash # Use latest checkpoint -./scripts/train_cnn_v2_full.sh --validate +./cnn_v2/scripts/train_cnn_v2_full.sh --validate # Use specific checkpoint -./scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth +./cnn_v2/scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth ``` **Manual Training:** ```bash # Default config -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --epochs 100 --batch-size 16 --checkpoint-every 5 # Custom architecture (per-layer kernel sizes) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --kernel-sizes 1,3,5 \ --epochs 5000 --batch-size 16 # Mip-level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --mip-level 1 \ --epochs 100 --batch-size 16 # Grayscale loss (compute loss on luminance Y = 0.299*R + 0.587*G + 0.114*B) -./training/train_cnn_v2.py \ +./cnn_v2/training/train_cnn_v2.py \ --input training/input/ --target training/target_2/ \ --grayscale-loss \ --epochs 100 --batch-size 16 @@ -236,7 +236,7 @@ Use `--quiet` for streamlined output in scripts (used automatically by train_cnn ``` -**Validation:** Use HTML tool (`tools/cnn_v2_test/index.html`) for CNN v2 validation. See `doc/CNN_V2_WEB_TOOL.md`. +**Validation:** Use HTML tool (`cnn_v2/tools/cnn_v2_test/index.html`) for CNN v2 validation. See `cnn_v2/docs/CNN_V2_WEB_TOOL.md`. --- @@ -323,11 +323,11 @@ See `doc/ASSET_SYSTEM.md` and `doc/WORKSPACE_SYSTEM.md`. **Status:** - **CNN v2:** ✅ Fully functional, matches CNNv2Effect -- **CNN v1:** ⚠️ Produces incorrect output, use CNNEffect in demo for validation +- **CNN v1:** ⚠️ Produces incorrect output, use CNNv1Effect in demo for validation **Note:** `--weights` loads layer count and kernel sizes from the binary file, overriding `--layers` and forcing CNN v2. -See `doc/CNN_TEST_TOOL.md` for full documentation. +See `cnn_v1/docs/CNN_TEST_TOOL.md` for full documentation. --- |
