Diffstat (limited to 'doc/CNN_FLATTEN_ANALYSIS.md')
| -rw-r--r-- | doc/CNN_FLATTEN_ANALYSIS.md | 189 |
1 file changed, 0 insertions, 189 deletions
diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
deleted file mode 100644
index bf63c5d..0000000
--- a/doc/CNN_FLATTEN_ANALYSIS.md
+++ /dev/null
@@ -1,189 +0,0 @@

# CNN Shader Flatten Mode - Technical Analysis

**Status:** Analysis complete - flatten mode NOT RECOMMENDED

**Date:** February 2026

---

## Context

The current CNN architecture uses **3 sequential render passes** (linear chaining):

- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original

Proposed **"flatten mode"**: collapse all layers into a **single shader pass** using intermediate arrays, eliminating the framebuffer reads/writes between layers.

---

## Current Architecture

**Shader Structure:**

- 1 pipeline with layer branching (`layer_index` uniform)
- 5 bindings: sampler, input texture, uniforms, layer params, original capture
- Total shader size: ~8 KB (snippets + weights)

**Performance Profile:**

- 3 render pass dispatches
- 2 framebuffer writes + reads between layers
- Memory bandwidth: ~2× framebuffer size per layer
- Register pressure: low (per-layer isolation)

**Weight Buffer:** 290 vec4s (4.6 KB) - already unified

---

## Flatten Approaches Evaluated

### Option A: Full Flatten (All 3 Layers)

**Cascading Receptive Field:**

To compute the final output at position (x, y):

- Layer 2 needs a 3×3 neighborhood of Layer 1 outputs
- Each Layer 1 output needs a 3×3 neighborhood of Layer 0 outputs
- Each Layer 0 output needs a 5×5 neighborhood of input samples

**Effective input sampling:** 9×9 pixels (vs the current 5×5 maximum)

**Intermediate Storage (per thread/pixel):**

```
Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
                                      TOTAL = 136 floats (544 bytes)
```

**GPU Register Pressure:**

- Modern GPUs: 32-64 KB of registers per SM, shared across all resident warps
- 544 bytes/thread → max ~64 threads/SM (**low occupancy**)
- Current multi-pass: ~4-8 bytes/thread (high occupancy)

**Pros:**

- 1 dispatch vs 3 (reduced CPU overhead)
- Zero framebuffer bandwidth between layers

**Cons:**

- **Severe register pressure** (10-20× increase)
- Reduced occupancy → potential performance loss
- Complex shader (harder to debug, larger binary)
- 9×9 input sampling per pixel

**Assessment:** ❌ **Not Recommended**
The register cost outweighs the bandwidth savings.

---

### Option B: Partial Flatten (Layers 1 + 2)

Keep Layer 0 separate; flatten only Layers 1 and 2.

**Pass Structure:**

1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
2. **Pass 2 (flattened):** compute Layer 1 + Layer 2 in a single shader

**Intermediate Storage:**

```
Layer 0 samples: 3×3 × 4 = 36 floats (read once)
Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
                  TOTAL  = 72 floats (288 bytes)
```

**Receptive Field:** 5×5 Layer 0 samples required for the 3×3 Layer 1 outputs

**Pros:**

- 2 passes vs 3 (33% reduction)
- 1 framebuffer write saved
- More manageable register usage

**Cons:**

- Still significant register pressure (288 bytes vs ~8 bytes baseline)
- Medium complexity increase
- Layer 0 (the heaviest kernel) remains a separate pass

**Assessment:** ⚠️ **Marginal Benefit**
Saves one pass, but the register cost is still high.

---

### Option C: Keep Current Multi-Pass ✅

**Rationale:**

- The current architecture is well suited to GPU design (high throughput via parallelism)
- Minimal register usage → high occupancy → hides memory latency
- Framebuffer bandwidth cost < register pressure cost
- Clean separation aids debugging/iteration
- Modular (easy to add/remove layers)

**Alternative Optimizations (if bandwidth becomes critical):**

1. Merge passes via render pass load/store ops (Vulkan subpasses)
2. Reduce the intermediate channel count (4→3 or 2)
3. Hybrid: compute shaders + workgroup shared memory
4. Layer pruning (2-layer vs 3-layer quality comparison)

---

## Recommendation

**✅ Keep the current multi-pass architecture**

### Decision Matrix

| Factor | Multi-Pass | Partial Flatten | Full Flatten |
|--------|-----------|----------------|--------------|
| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |

**Modern GPU architecture favors:**

- High parallelism (many small threads) over complex threads
- Hiding latency via occupancy over minimizing operations
- Managing memory bandwidth via caching, not elimination

---

## Alternative: Compute Shader + Shared Memory

**If bandwidth becomes critical:**

- Use a compute shader with workgroup shared memory
- Load a tile plus its halo into shared memory (9×9 input samples)
- Compute all 3 layers for the tile interior (avoids redundant sampling)
- Requires explicit synchronization (`workgroupBarrier`)

**Trade-offs:**

- ✅ Low register pressure + low bandwidth
- ❌ Compute pipeline complexity (no render pass integration)
- ❌ Tile edge handling
- ❌ Larger code size

---

## Conclusion

The current 3-pass architecture is **appropriate for demo64k**:

- Size-efficient (modular shaders)
- Performance adequate (bandwidth is not the bottleneck)
- Maintainable (clean layer isolation)

**Flatten mode is not recommended** unless profiling reveals a specific bandwidth constraint.

### Size Optimization Alternatives (Better ROI)

If size optimization is critical, focus on:

1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)

These yield better size/performance returns than shader architecture changes.
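As a rough illustration of option 1, a symmetric 8-bit scheme can be sketched as below. This is a minimal sketch, not the demo's actual weight format: the function names, the single per-buffer scale, and the size arithmetic (using the 290-vec4 figure from this analysis) are illustrative assumptions.

```python
def quantize_8bit(weights):
    """Symmetric 8-bit quantization sketch: one f32 scale for the whole
    buffer, then one signed byte per weight instead of a 4-byte float."""
    # Guard against an all-zero buffer (scale of 0.0 would divide by zero).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]  # each value fits in an int8
    return scale, q

def dequantize_8bit(scale, q):
    # Run once at load time to rebuild approximate f32 weights.
    return [scale * v for v in q]

# Size estimate with the figures above: 290 vec4s = 1160 scalar weights.
n_weights = 290 * 4
f32_bytes = n_weights * 4       # 4640 bytes ~= 4.6 KB today
int8_bytes = n_weights * 1 + 4  # 1164 bytes ~= 1.1 KB + decode code
```

The per-weight reconstruction error is bounded by roughly half the scale; the ~2 KB target above presumably leaves room for per-layer scales and the decode routine, or for a less aggressive per-kernel scheme.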
---

## References

- `doc/CNN_EFFECT.md` - CNN implementation details
- `doc/CNN.md` - High-level CNN design
- `src/effects/cnn_effect.cc` - Current implementation
- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
