From 409bbfb08fae03bfb7daa554a799bd8480806799 Mon Sep 17 00:00:00 2001
From: skal
Date: Wed, 11 Feb 2026 23:35:44 +0100
Subject: docs: Add CNN flatten mode technical analysis

Comprehensive analysis of single-pass CNN shader architecture:

- Full flatten (3 layers): 544 bytes/thread register pressure - NOT recommended
- Partial flatten (layers 1+2): 288 bytes/thread - marginal benefit
- Current multi-pass: Optimal for GPU occupancy and maintainability

Recommendation: Keep current 3-pass architecture.
Alternative size optimizations: weight quantization, kernel reduction.

handoff(Claude): CNN flatten analysis documented
---
 doc/CNN_FLATTEN_ANALYSIS.md | 189 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 doc/CNN_FLATTEN_ANALYSIS.md
(limited to 'doc/CNN_FLATTEN_ANALYSIS.md')

diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
new file mode 100644
index 0000000..88f3db6
--- /dev/null
+++ b/doc/CNN_FLATTEN_ANALYSIS.md
@@ -0,0 +1,189 @@
+# CNN Shader Flatten Mode - Technical Analysis
+
+**Status:** Analysis complete - flatten mode NOT RECOMMENDED
+
+**Date:** February 2026
+
+---
+
+## Context
+
+Current CNN architecture uses **3 sequential render passes** (linear chaining):
+- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
+- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
+- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
+
+Proposed **"flatten mode"**: Collapse all layers into a **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
+
+---
+
+## Current Architecture
+
+**Shader Structure:**
+- 1 pipeline with layer branching (`layer_index` uniform)
+- 5 bindings: sampler, input texture, uniforms, layer params, original capture
+- Total shader size: ~8 KB (snippets + weights)
+
+**Performance Profile:**
+- 3 render pass dispatches
+- 2 framebuffer writes + reads between layers
+- Memory bandwidth: ~2× framebuffer size per layer
+- Register pressure: Low (per-layer isolation)
+
+**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
+
+---
+
+## Flatten Approaches Evaluated
+
+### Option A: Full Flatten (All 3 Layers)
+
+**Cascading Receptive Field:**
+
+To compute final output at position (x, y):
+- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
+- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
+- Each Layer 0 output needs 5×5 neighborhood of input samples
+
+**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
+
+**Intermediate Storage (per thread/pixel):**
+```
+Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
+Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
+                                       TOTAL = 136 floats (544 bytes)
+```
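+
+For scale, a minimal sketch of the per-thread staging a full flatten would need
+(helper names such as `conv5x5_layer0`, `conv3x3_layer1/2`, `offset_5x5` and `uv`
+are illustrative, not functions from the current snippets):
+
+```
+// Illustrative sketch only - not part of the current shaders.
+// The two arrays below are the 136 floats (544 bytes) counted above,
+// and they must stay resident in registers for the whole pass.
+var l0_out: array<vec4f, 25>;   // 5×5 neighborhood of Layer 0 results
+var l1_out: array<vec4f, 9>;    // 3×3 neighborhood of Layer 1 results
+
+for (var i = 0u; i < 25u; i++) {
+  // each Layer 0 result reads its own 5×5 input window → 9×9 overall
+  l0_out[i] = conv5x5_layer0(uv + offset_5x5(i));
+}
+for (var i = 0u; i < 9u; i++) {
+  l1_out[i] = conv3x3_layer1(&l0_out, i);
+}
+let result = conv3x3_layer2(&l1_out);
+```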
+
+**GPU Register Pressure:**
+- Modern GPUs: 32-64 KB registers per SM, shared across warps
+- 544 bytes/thread (≈136 32-bit registers) → max 64 threads/SM (**low occupancy**)
+- Current multi-pass: ~4-8 bytes/thread (high occupancy)
+
+**Pros:**
+- 1 dispatch vs 3 (reduced CPU overhead)
+- Zero framebuffer bandwidth between layers
+
+**Cons:**
+- **Severe register pressure** (10-20× increase)
+- Reduced occupancy → potential performance loss
+- Complex shader (harder to debug, larger binary)
+- 9×9 input sampling per output pixel
+
+**Assessment:** ❌ **Not Recommended**
+Register cost outweighs the bandwidth savings.
+
+---
+
+### Option B: Partial Flatten (Layers 1 + 2)
+
+Keep Layer 0 separate, flatten only Layers 1 and 2.
+
+**Pass Structure:**
+1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
+2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in a single shader
+
+**Intermediate Storage:**
+```
+Layer 0 samples: 3×3 × 4 = 36 floats (read once)
+Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
+                   TOTAL = 72 floats (288 bytes)
+```
+
+**Receptive Field:** a 5×5 window of Layer 0 samples is touched in total to produce the 3×3 Layer 1 outputs, though only one 3×3 window needs to be live at a time
+
+**Pros:**
+- 2 passes vs 3 (33% reduction)
+- 1 framebuffer write saved
+- More manageable register usage
+
+**Cons:**
+- Still significant register pressure (288 bytes vs ~8 bytes baseline)
+- Medium complexity increase
+- Layer 0 (heaviest kernel) still separate
+
+**Assessment:** ⚠️ **Marginal Benefit**
+Saves one pass, but the register cost is still high.
+
+---
+
+### Option C: Keep Current Multi-Pass ✅
+
+**Rationale:**
+- Current architecture is well-suited to GPU design (high throughput via parallelism)
+- Minimal register usage → high occupancy → hides memory latency
+- Framebuffer bandwidth cost < register pressure cost
+- Clean separation aids debugging/iteration
+- Modular (easy to add/remove layers)
+
+**Alternative Optimizations (if bandwidth critical):**
+1. Merge passes via render pass load/store ops (Vulkan subpasses)
+2. Reduce intermediate channel count (4→3 or 2)
+3. Hybrid: compute shaders + workgroup shared memory
+4. Layer pruning (2-layer vs 3-layer quality comparison)
+
+---
+
+## Recommendation
+
+**✅ Keep current multi-pass architecture**
+
+### Decision Matrix
+
+| Factor | Multi-Pass | Partial Flatten | Full Flatten |
+|--------|-----------|----------------|--------------|
+| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
+| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
+| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
+| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
+| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
+| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
+
+**Modern GPU Architecture Favors:**
+- High parallelism (many simple threads) over fewer complex threads
+- Hiding latency via occupancy over minimizing operations
+- Managing memory bandwidth via caching rather than eliminating it
+
+---
+
+## Alternative: Compute Shader + Shared Memory
+
+**If bandwidth becomes critical:**
+- Use a compute shader with workgroup shared memory
+- Load the tile plus a 4-pixel halo per side into shared memory (matches the 9×9 receptive field)
+- Compute all 3 layers for the tile interior (avoids redundant sampling)
+- Requires explicit synchronization (`workgroupBarrier`)
+
+**Trade-offs:**
+- ✅ Low register pressure + low bandwidth
+- ❌ Compute pipeline complexity (no render pass integration)
+- ❌ Tile edge handling
+- ❌ Larger code size
+
+---
+
+## Conclusion
+
+Current 3-pass architecture is **appropriate for demo64k**:
+- Size-efficient (modular shaders)
+- Performance adequate (bandwidth is not the bottleneck)
+- Maintainable (clean layer isolation)
+
+**Flatten mode not recommended** unless profiling reveals a specific bandwidth constraint.
+
+### Size Optimization Alternatives (Better ROI)
+
+If size optimization is critical, focus on:
+1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
+2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
+3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
+
+These yield a better size/performance return than shader architecture changes.
+
+---
+
+## References
+
+- `doc/CNN_EFFECT.md` - CNN implementation details
+- `doc/CNN.md` - High-level CNN design
+- `src/gpu/effects/cnn_effect.cc` - Current implementation
+- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
--
cgit v1.2.3