# CNN Shader Flatten Mode - Technical Analysis

**Status:** Analysis complete - flatten mode NOT RECOMMENDED
**Date:** February 2026

---

## Context

Current CNN architecture uses **3 sequential render passes** (linear chaining):

- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original

Proposed **"flatten mode"**: collapse all layers into a **single shader pass** using intermediate arrays, eliminating framebuffer reads/writes between layers.

---

## Current Architecture

**Shader Structure:**
- 1 pipeline with layer branching (`layer_index` uniform)
- 5 bindings: sampler, input texture, uniforms, layer params, original capture
- Total shader size: ~8 KB (snippets + weights)

**Performance Profile:**
- 3 render pass dispatches
- 2 framebuffer writes + reads between layers
- Memory bandwidth: ~2× framebuffer size per layer
- Register pressure: low (per-layer isolation)

**Weight Buffer:** 290 vec4s (4.6 KB) - already unified

---

## Flatten Approaches Evaluated

### Option A: Full Flatten (All 3 Layers)

**Cascading Receptive Field:** To compute the final output at position (x, y):
- Layer 2 needs a 3×3 neighborhood of Layer 1 outputs
- Each Layer 1 output needs a 3×3 neighborhood of Layer 0 outputs
- Each Layer 0 output needs a 5×5 neighborhood of input samples

**Effective input sampling:** 9×9 pixels (vs current 5×5 max)

**Intermediate Storage (per thread/pixel):**
```
Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
TOTAL                                       = 136 floats (544 bytes)
```

**GPU Register Pressure:**
- Modern GPUs: 32-64 KB registers per SM, shared across warps
- 544 bytes/thread → max ~64 threads/SM (**low occupancy**)
- Current multi-pass: ~4-8 bytes/thread (high occupancy)

**Pros:**
- 1 dispatch vs 3 (reduces CPU overhead)
- Zero framebuffer bandwidth between layers

**Cons:**
- **Severe register pressure** (10-20× increase)
- Reduced occupancy → potential performance loss
- Complex shader (harder to debug, larger binary)
- 9×9 input sampling

**Assessment:** ❌ **Not Recommended** - register cost outweighs bandwidth savings.

---

### Option B: Partial Flatten (Layers 1 + 2)

Keep Layer 0 separate; flatten only Layers 1 and 2.

**Pass Structure:**
1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in a single shader

**Intermediate Storage:**
```
Layer 0 samples: 3×3 × 4 = 36 floats (read once)
Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
TOTAL                    = 72 floats (288 bytes)
```

**Receptive Field:** 5×5 Layer 0 samples required for 3×3 Layer 1 outputs

**Pros:**
- 2 passes vs 3 (33% reduction)
- 1 framebuffer write saved
- More manageable register usage

**Cons:**
- Still significant register pressure (288 bytes vs ~8 bytes baseline)
- Medium complexity increase
- Layer 0 (heaviest kernel) still separate

**Assessment:** ⚠️ **Marginal Benefit** - saves 1 pass, but register cost is still high.

---

### Option C: Keep Current Multi-Pass ✅

**Rationale:**
- Current architecture is well-suited to GPU design (high throughput via parallelism)
- Minimal register usage → high occupancy → hides memory latency
- Framebuffer bandwidth cost < register pressure cost
- Clean separation aids debugging/iteration
- Modular (easy to add/remove layers)

**Alternative Optimizations (if bandwidth critical):**
1. Merge passes via render pass load/store ops (Vulkan subpasses)
2. Reduce intermediate channel count (4→3 or 2)
3. Hybrid: compute shaders + workgroup shared memory
4. Layer pruning (2-layer vs 3-layer quality comparison)

---

## Recommendation

**✅ Keep current multi-pass architecture**

### Decision Matrix

| Factor | Multi-Pass | Partial Flatten | Full Flatten |
|--------|-----------|----------------|--------------|
| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |

**Modern GPU architecture favors:**
- High parallelism (many small threads) over complex threads
- Hiding latency via occupancy over minimizing operations
- Managing memory bandwidth via caching, not elimination

---

## Alternative: Compute Shader + Shared Memory

**If bandwidth becomes critical:**
- Use a compute shader with workgroup shared memory
- Load tile + halo into shared memory (covering the 9×9 input footprint)
- Compute all 3 layers for the tile interior (avoids redundant sampling)
- Requires explicit synchronization (`workgroupBarrier`)

**Trade-offs:**
- ✅ Low register pressure + low bandwidth
- ❌ Compute pipeline complexity (no render pass integration)
- ❌ Tile edge handling
- ❌ Larger code size

---

## Conclusion

The current 3-pass architecture is **appropriate for demo64k**:
- Size-efficient (modular shaders)
- Performance adequate (bandwidth not a bottleneck)
- Maintainable (clean layer isolation)

**Flatten mode is not recommended** unless profiling reveals a specific bandwidth constraint.

### Size Optimization Alternatives (Better ROI)

If size optimization is critical, focus on:
1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)

These yield better size/performance trade-offs than shader architecture changes.
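The quantization saving in item 1 can be sanity-checked with a small sketch. This is illustrative Python/NumPy for the size arithmetic only, not the real weight exporter; `quantize_weights` and the random weight buffer are assumptions, and the int8 payload alone (~1.1 KB) leaves headroom within the ~2 KB estimate for scales, padding, and dequantization code:

```python
import numpy as np

def quantize_weights(weights: np.ndarray, bits: int = 8):
    """Symmetric linear quantization of a float32 weight buffer.

    Returns int8 codes plus the scale needed to dequantize
    (reconstruction: w ~= codes * scale).
    """
    qmax = 2 ** (bits - 1) - 1             # 127 for 8-bit
    scale = float(np.abs(weights).max()) / qmax
    codes = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

# 290 vec4s = 1160 float32 weights, matching the unified buffer size above.
rng = np.random.default_rng(0)
weights = (rng.standard_normal(290 * 4) * 0.1).astype(np.float32)

codes, scale = quantize_weights(weights)
dequant = codes.astype(np.float32) * scale

print(weights.nbytes)  # 4640 bytes (~4.6 KB, as in the document)
print(codes.nbytes)    # 1160 bytes (~1.1 KB int8 payload)
```

Rounding error per weight is bounded by half a quantization step (`scale / 2`), which is usually acceptable for the small convolutions here; a per-layer scale instead of a per-buffer scale tightens it further at negligible size cost.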
---

## References

- `doc/CNN_EFFECT.md` - CNN implementation details
- `doc/CNN.md` - High-level CNN design
- `src/gpu/effects/cnn_effect.cc` - Current implementation
- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets