| author | skal <pascal.massimino@gmail.com> | 2026-02-15 18:52:48 +0100 |
|---|---|---|
| committer | skal <pascal.massimino@gmail.com> | 2026-02-15 18:52:48 +0100 |
| commit | d4b67e2f6ab48ab9ec658140be4f1999f604559a (patch) | |
| tree | 2502b0dc89748f7cfe674d3c177bd1528ce1c231 /doc/CNN_FLATTEN_ANALYSIS.md | |
| parent | 161a59fa50bb92e3664c389fa03b95aefe349b3f (diff) | |
archive(cnn): move CNN v1 to cnn_v1/ subdirectory

Consolidate CNN v1 (CNNEffect) into dedicated directory:
- C++ effect: src/effects → cnn_v1/src/
- Shaders: workspaces/main/shaders/cnn → cnn_v1/shaders/
- Training: training/train_cnn.py → cnn_v1/training/
- Docs: doc/CNN*.md → cnn_v1/docs/

Updated all references:
- CMake source list
- C++ includes (relative paths: ../../cnn_v1/src/)
- Asset paths (../../cnn_v1/shaders/)
- Documentation cross-references

CNN v1 remains active in timeline. For new work, use CNN v2 with
enhanced features (7D static, storage buffer, sigmoid activation).

Tests: 34/34 passing (100%)
Diffstat (limited to 'doc/CNN_FLATTEN_ANALYSIS.md')
| -rw-r--r-- | doc/CNN_FLATTEN_ANALYSIS.md | 189 |
1 files changed, 0 insertions, 189 deletions
diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
deleted file mode 100644
index bf63c5d..0000000
--- a/doc/CNN_FLATTEN_ANALYSIS.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# CNN Shader Flatten Mode - Technical Analysis
-
-**Status:** Analysis complete - flatten mode NOT RECOMMENDED
-
-**Date:** February 2026
-
----
-
-## Context
-
-Current CNN architecture uses **3 sequential render passes** (linear chaining):
-- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
-- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
-- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
-
-Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
-
----
-
-## Current Architecture
-
-**Shader Structure:**
-- 1 pipeline with layer branching (`layer_index` uniform)
-- 5 bindings: sampler, input texture, uniforms, layer params, original capture
-- Total shader size: ~8 KB (snippets + weights)
-
-**Performance Profile:**
-- 3 render pass dispatches
-- 2 framebuffer writes + reads between layers
-- Memory bandwidth: ~2× framebuffer size per layer
-- Register pressure: Low (per-layer isolation)
-
-**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
-
----
-
-## Flatten Approaches Evaluated
-
-### Option A: Full Flatten (All 3 Layers)
-
-**Cascading Receptive Field:**
-
-To compute final output at position (x, y):
-- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
-- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
-- Each Layer 0 output needs 5×5 neighborhood of input samples
-
-**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
-
-**Intermediate Storage (per thread/pixel):**
-```
-Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
-Layer 1 outputs: 3×3 positions × 4 channels = 36 floats
-                                      TOTAL = 136 floats (544 bytes)
-```
-
-**GPU Register Pressure:**
-- Modern GPUs: 32-64 KB registers per SM, shared across warps
-- 544 bytes/thread → max 64 threads/SM (**low occupancy**)
-- Current multi-pass: ~4-8 bytes/thread (high occupancy)
-
-**Pros:**
-- 1 dispatch vs 3 (reduce CPU overhead)
-- Zero framebuffer bandwidth between layers
-
-**Cons:**
-- **Severe register pressure** (10-20× increase)
-- Reduced occupancy → potential performance loss
-- Complex shader (harder debug, larger binary)
-- 9×9 input sampling
-
-**Assessment:** ❌ **Not Recommended**
-Register cost outweighs bandwidth savings.
-
----
-
-### Option B: Partial Flatten (Layers 1 + 2)
-
-Keep Layer 0 separate, flatten only Layers 1 and 2.
-
-**Pass Structure:**
-1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
-2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader
-
-**Intermediate Storage:**
-```
-Layer 0 samples: 3×3 × 4 = 36 floats (read once)
-Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
-                   TOTAL = 72 floats (288 bytes)
-```
-
-**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs
-
-**Pros:**
-- 2 passes vs 3 (33% reduction)
-- 1 framebuffer write saved
-- More manageable register usage
-
-**Cons:**
-- Still significant register pressure (288 bytes vs ~8 bytes baseline)
-- Medium complexity increase
-- Layer 0 (heaviest kernel) still separate
-
-**Assessment:** ⚠️ **Marginal Benefit**
-Saves 1 pass but register cost still high.
-
----
-
-### Option C: Keep Current Multi-Pass ✅
-
-**Rationale:**
-- Current architecture well-suited to GPU design (high throughput via parallelism)
-- Minimal register usage → high occupancy → hides memory latency
-- Framebuffer bandwidth cost < register pressure cost
-- Clean separation aids debugging/iteration
-- Modular (easy to add/remove layers)
-
-**Alternative Optimizations (if bandwidth critical):**
-1. Merge passes via render pass load/store ops (Vulkan subpasses)
-2. Reduce intermediate channel count (4→3 or 2)
-3. Hybrid: Compute shaders + workgroup shared memory
-4. Layer pruning (2-layer vs 3-layer quality comparison)
-
----
-
-## Recommendation
-
-**✅ Keep current multi-pass architecture**
-
-### Decision Matrix
-
-| Factor | Multi-Pass | Partial Flatten | Full Flatten |
-|--------|-----------|----------------|--------------|
-| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
-| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
-| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
-| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
-| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
-| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
-
-**Modern GPU Architecture Favors:**
-- High parallelism (many small threads) over complex threads
-- Hiding latency via occupancy over minimizing operations
-- Memory bandwidth via caching, not elimination
-
----
-
-## Alternative: Compute Shader + Shared Memory
-
-**If bandwidth becomes critical:**
-- Use compute shader with workgroup shared memory
-- Load tile + halos into shared memory (9×9 input samples)
-- Compute all 3 layers for tile interior (avoids redundant sampling)
-- Requires explicit synchronization (`workgroupBarrier`)
-
-**Trade-offs:**
-- ✅ Low register pressure + low bandwidth
-- ❌ Compute pipeline complexity (no render pass integration)
-- ❌ Tile edge handling
-- ❌ Larger code size
-
----
-
-## Conclusion
-
-Current 3-pass architecture is **appropriate for demo64k**:
-- Size-efficient (modular shaders)
-- Performance adequate (bandwidth not bottleneck)
-- Maintainable (clean layer isolation)
-
-**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint.
-
-### Size Optimization Alternatives (Better ROI)
-
-If size optimization critical, focus on:
-1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
-2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
-3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
-
-These yield better size/performance than shader architecture changes.
-
----
-
-## References
-
-- `doc/CNN_EFFECT.md` - CNN implementation details
-- `doc/CNN.md` - High-level CNN design
-- `src/effects/cnn_effect.cc` - Current implementation
-- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
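
The arithmetic in the removed analysis (9×9 effective sampling, 136 floats of per-thread storage for a full flatten, 290 vec4s of weights) can be re-derived from the three layer shapes it lists. The following is a minimal sketch, not code from the repository: `ConvLayer`, `receptive_field`, `flatten_storage_floats`, `weight_vec4s` and `threads_per_sm` are hypothetical names, stride 1 is assumed, and the one extra input per layer (a bias term) is an inference made only because it reproduces the 200, 72 and 290 vec4 figures; the analysis itself does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    kernel: int        # square kernel side (5 means 5x5); stride 1 is assumed
    in_channels: int
    out_channels: int


# Layer shapes as listed in the deleted analysis (Layer 0, 1, 2).
LAYERS = [ConvLayer(5, 7, 4), ConvLayer(3, 7, 4), ConvLayer(3, 7, 1)]


def receptive_field(layers):
    """Side length of the input window that feeds one output of the last layer."""
    return 1 + sum(layer.kernel - 1 for layer in layers)


def flatten_storage_floats(layers):
    """Floats one thread must hold for all intermediate outputs if layers are fused."""
    total = 0
    for i, layer in enumerate(layers[:-1]):
        # Outputs of layer i are needed over the receptive field of the layers after it.
        side = receptive_field(layers[i + 1:])
        total += side * side * layer.out_channels
    return total


def weight_vec4s(layers, extra_inputs=1):
    """vec4 count for all weights; extra_inputs=1 (a bias term) is an ASSUMPTION."""
    floats = sum(
        layer.kernel ** 2 * (layer.in_channels + extra_inputs) * layer.out_channels
        for layer in layers
    )
    return (floats + 3) // 4  # round up to whole vec4s


def threads_per_sm(register_bytes, bytes_per_thread):
    """Upper bound on resident threads if registers were the only limit."""
    return register_bytes // bytes_per_thread


if __name__ == "__main__":
    storage = flatten_storage_floats(LAYERS)
    print("effective input window:", receptive_field(LAYERS))   # 9
    print("full-flatten storage:", storage, "floats,", storage * 4, "bytes")  # 136, 544
    print("weight buffer:", weight_vec4s(LAYERS), "vec4s")      # 290
    for kib in (32, 64):  # register-file sizes quoted in the analysis
        print(kib, "KB ->", threads_per_sm(kib * 1024, storage * 4), "threads")
```

With the 32 KB and 64 KB register-file sizes quoted in the analysis, this bound comes out at 60 and 120 threads respectively, so the "max 64 threads/SM" figure reads as a rounded version of the lower case.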
