# CNN Shader Flatten Mode - Technical Analysis

**Status:** Analysis complete - flatten mode NOT RECOMMENDED
**Date:** February 2026

---

## Context

Current CNN architecture uses **3 sequential render passes** (linear chaining):

- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original

Proposed **"flatten mode"**: collapse all layers into a **single shader pass** using intermediate arrays, eliminating framebuffer reads/writes between layers.

---

## Current Architecture

**Shader Structure:**
- 1 pipeline with layer branching (`layer_index` uniform)
- 5 bindings: sampler, input texture, uniforms, layer params, original capture
- Total shader size: ~8 KB (snippets + weights)

**Performance Profile:**
- 3 render pass dispatches
- 2 framebuffer writes + reads between layers
- Memory bandwidth: ~2× framebuffer size per layer
- Register pressure: low (per-layer isolation)

**Weight Buffer:** 290 vec4s (4.6 KB) - already unified

---

## Flatten Approaches Evaluated

### Option A: Full Flatten (All 3 Layers)

**Cascading Receptive Field:** To compute the final output at position (x, y):
- Layer 2 needs a 3×3 neighborhood of Layer 1 outputs
- Each Layer 1 output needs a 3×3 neighborhood of Layer 0 outputs
- Each Layer 0 output needs a 5×5 neighborhood of input samples

**Effective input sampling:** 9×9 pixels (vs current 5×5 max)

**Intermediate Storage (per thread/pixel):**
```
Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
TOTAL                                       = 136 floats (544 bytes)
```

**GPU Register Pressure:**
- Modern GPUs: 32-64 KB registers per SM, shared across warps
- 544 bytes/thread → max ~64 threads/SM (**low occupancy**)
- Current multi-pass: ~4-8 bytes/thread (high occupancy)

**Pros:**
- 1 dispatch vs 3 (reduces CPU overhead)
- Zero framebuffer bandwidth between layers

**Cons:**
- **Severe register pressure** (10-20× increase)
- Reduced occupancy → potential performance loss
- Complex shader (harder to debug, larger binary)
- 9×9 input sampling

**Assessment:** ❌ **Not Recommended** - register cost outweighs bandwidth savings.

---

### Option B: Partial Flatten (Layers 1 + 2)

Keep Layer 0 separate; flatten only Layers 1 and 2.

**Pass Structure:**
1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in a single shader

**Intermediate Storage:**
```
Layer 0 samples: 3×3 × 4 = 36 floats (read once)
Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
TOTAL                    = 72 floats (288 bytes)
```

**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs

**Pros:**
- 2 passes vs 3 (33% reduction)
- 1 framebuffer write saved
- More manageable register usage

**Cons:**
- Still significant register pressure (288 bytes vs ~8 bytes baseline)
- Medium complexity increase
- Layer 0 (heaviest kernel) still separate

**Assessment:** ⚠️ **Marginal Benefit** - saves 1 pass, but register cost is still high.

---

### Option C: Keep Current Multi-Pass ✅

**Rationale:**
- Current architecture is well-suited to GPU design (high throughput via parallelism)
- Minimal register usage → high occupancy → hides memory latency
- Framebuffer bandwidth cost < register pressure cost
- Clean separation aids debugging/iteration
- Modular (easy to add/remove layers)

**Alternative Optimizations (if bandwidth critical):**
1. Merge passes via render pass load/store ops (Vulkan subpasses)
2. Reduce intermediate channel count (4→3 or 2)
3. Hybrid: compute shaders + workgroup shared memory
4. Layer pruning (2-layer vs 3-layer quality comparison)

---

## Recommendation

**✅ Keep current multi-pass architecture**

### Decision Matrix

| Factor | Multi-Pass | Partial Flatten | Full Flatten |
|--------|-----------|----------------|--------------|
| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |

**Modern GPU architecture favors:**
- High parallelism (many small threads) over complex threads
- Hiding latency via occupancy over minimizing operations
- Managing memory bandwidth via caching, not elimination

---

## Alternative: Compute Shader + Shared Memory

**If bandwidth becomes critical:**
- Use a compute shader with workgroup shared memory
- Load tile + halo into shared memory (covering the 9×9 input footprint)
- Compute all 3 layers for the tile interior (avoids redundant sampling)
- Requires explicit synchronization (`workgroupBarrier`)

**Trade-offs:**
- ✅ Low register pressure + low bandwidth
- ❌ Compute pipeline complexity (no render pass integration)
- ❌ Tile edge handling
- ❌ Larger code size

---

## Conclusion

The current 3-pass architecture is **appropriate for demo64k**:
- Size-efficient (modular shaders)
- Performance adequate (bandwidth not a bottleneck)
- Maintainable (clean layer isolation)

**Flatten mode is not recommended** unless profiling reveals a specific bandwidth constraint.

### Size Optimization Alternatives (Better ROI)

If size optimization is critical, focus on:
1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)

These yield better size/performance trade-offs than shader architecture changes.
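The quantization saving in item 1 can be sanity-checked with a small sketch. This is illustrative Python/NumPy for the size arithmetic only, not the real weight exporter; `quantize_weights` and the random weight buffer are assumptions, and the int8 payload alone (~1.1 KB) leaves headroom within the ~2 KB estimate for scales, padding, and dequantization code:

```python
import numpy as np

def quantize_weights(weights: np.ndarray, bits: int = 8):
    """Symmetric linear quantization of a float32 weight buffer.

    Returns int8 codes plus the scale needed to dequantize
    (reconstruction: w ~= codes * scale).
    """
    qmax = 2 ** (bits - 1) - 1             # 127 for 8-bit
    scale = float(np.abs(weights).max()) / qmax
    codes = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

# 290 vec4s = 1160 float32 weights, matching the unified buffer size above.
rng = np.random.default_rng(0)
weights = (rng.standard_normal(290 * 4) * 0.1).astype(np.float32)

codes, scale = quantize_weights(weights)
dequant = codes.astype(np.float32) * scale

print(weights.nbytes)  # 4640 bytes (~4.6 KB, as in the document)
print(codes.nbytes)    # 1160 bytes (~1.1 KB int8 payload)
```

Rounding error per weight is bounded by half a quantization step (`scale / 2`), which is usually acceptable for the small convolutions here; a per-layer scale instead of a per-buffer scale tightens it further at negligible size cost.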
---

## References

- `doc/CNN_EFFECT.md` - CNN implementation details
- `doc/CNN.md` - High-level CNN design
- `src/gpu/effects/cnn_effect.cc` - Current implementation
- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets