# CNN Shader Flatten Mode - Technical Analysis

**Status:** Analysis complete - flatten mode NOT RECOMMENDED

**Date:** February 2026

---

## Context

Current CNN architecture uses **3 sequential render passes** (linear chaining):
- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original

Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.

---

## Current Architecture

**Shader Structure:**
- 1 pipeline with layer branching (`layer_index` uniform)
- 5 bindings: sampler, input texture, uniforms, layer params, original capture
- Total shader size: ~8 KB (snippets + weights)

**Performance Profile:**
- 3 render pass dispatches
- 2 framebuffer writes + reads between layers
- Memory bandwidth: ~2× framebuffer size per layer
- Register pressure: Low (per-layer isolation)

**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
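
The 290-vec4 figure is consistent with the three layers if the 7 input channels are padded to 8 for vec4 packing — that padding is an inference, not confirmed by the source. A quick sanity check:

```python
# Sanity-check of the 290-vec4 weight buffer size quoted above.
# ASSUMPTION (not confirmed by the source): the 7 input channels are
# padded to 8 so each kernel tap packs cleanly into two vec4s.

def layer_vec4s(kernel, in_ch_padded, out_ch):
    """Number of vec4s needed for one conv layer's weights."""
    floats = kernel * kernel * in_ch_padded * out_ch
    return floats // 4  # 4 floats per vec4

total = (
    layer_vec4s(5, 8, 4)    # Layer 0: 5x5, 7->4 channels -> 200 vec4s
    + layer_vec4s(3, 8, 4)  # Layer 1: 3x3, 7->4 channels ->  72 vec4s
    + layer_vec4s(3, 8, 1)  # Layer 2: 3x3, 7->1 channel  ->  18 vec4s
)
print(total, total * 16)  # 290 vec4s, 4640 bytes (~4.6 KB)
```

The same per-layer counts (200/72/18) reappear in the size-optimization section below.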

---

## Flatten Approaches Evaluated

### Option A: Full Flatten (All 3 Layers)

**Cascading Receptive Field:**

To compute final output at position (x, y):
- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
- Each Layer 0 output needs 5×5 neighborhood of input samples

**Effective input sampling:** 9×9 pixels (vs current 5×5 max)

**Intermediate Storage (per thread/pixel):**
```
Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
                                   TOTAL = 136 floats (544 bytes)
```
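
The receptive-field and storage figures above follow mechanically from the kernel sizes — each extra layer widens the field by (k − 1):

```python
# Cascading receptive field and per-pixel intermediate storage for the
# full flatten (Option A), derived from the kernel sizes above.

kernels = [5, 3, 3]  # Layer 0, 1, 2 kernel sizes

# Receptive field grows by (k - 1) per downstream layer: 5 + 2 + 2 = 9.
rf = kernels[0] + sum(k - 1 for k in kernels[1:])

# Intermediates live per output pixel:
l0_positions = (3 + 3 - 1) ** 2  # 3x3 L1 outputs, each needing a 3x3 L0 window -> 5x5
l1_positions = 3 * 3             # Layer 2's 3x3 window of Layer 1
floats = l0_positions * 4 + l1_positions * 4  # 4 channels each

print(rf, floats, floats * 4)  # 9, 136 floats, 544 bytes
```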

**GPU Register Pressure:**
- Modern GPUs: 32-64 KB of registers per SM, shared across resident warps
- 544 bytes/thread of intermediates alone → roughly 60-120 resident threads per SM (**low occupancy**)
- Current multi-pass: ~4-8 bytes/thread of intermediates (high occupancy)
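
The occupancy ceiling can be estimated as register file size over per-thread footprint — illustrative numbers, not measurements:

```python
# Rough occupancy ceiling, assuming the per-pixel intermediates occupy
# the register file. Sizes are illustrative, not profiled.

reg_file_bytes = 64 * 1024       # 64 KB register file per SM (upper end)
flatten_bytes_per_thread = 544   # Option A intermediates
multipass_bytes_per_thread = 8   # per-layer accumulator only

print(reg_file_bytes // flatten_bytes_per_thread)    # 120 threads max
print(reg_file_bytes // multipass_bytes_per_thread)  # 8192 (other limits apply first)
```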

**Pros:**
- 1 dispatch vs 3 (reduce CPU overhead)
- Zero framebuffer bandwidth between layers

**Cons:**
- **Severe register pressure** (10-20× increase)
- Reduced occupancy → potential performance loss
- Complex shader (harder debug, larger binary)
- 9×9 input sampling per output pixel (81 taps vs the current 25 max)

**Assessment:** ❌ **Not Recommended**
Register cost outweighs bandwidth savings.

---

### Option B: Partial Flatten (Layers 1 + 2)

Keep Layer 0 separate, flatten only Layers 1 and 2.

**Pass Structure:**
1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader

**Intermediate Storage:**
```
Layer 0 samples: 3×3 × 4 = 36 floats (one window live at a time)
Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
                 TOTAL = 72 floats (288 bytes)
```

**Receptive Field:** producing the 3×3 Layer 1 outputs requires reading a 5×5 neighborhood of Layer 0 samples from the framebuffer, but only a 3×3 window needs to be resident in registers at once
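
Per-pixel, the flattened pass would look roughly like the sketch below. Function name and weight shapes are illustrative, not the real `cnn_effect.cc` layout, and only the 4 feature channels are shown (the real shader also samples the original capture for its extra input channels):

```python
import numpy as np

# Per-pixel sketch of the flattened Layers 1+2 pass (Option B).
# l0 is the Layer 0 framebuffer; ReLU stands in for whatever
# activation the real shader uses.

def flattened_l1_l2(l0, x, y, w1, b1, w2, b2):
    """l0: (H, W, 4) Layer 0 output; returns the scalar Layer 2 output."""
    relu = lambda v: np.maximum(v, 0.0)
    l1 = np.empty((3, 3, 4), dtype=np.float32)  # the 36-float intermediate
    for j in range(3):                          # 3x3 Layer 1 positions
        for i in range(3):
            cy, cx = y + j - 1, x + i - 1
            # 3x3 Layer 0 window centred on this Layer 1 position
            win = l0[cy - 1:cy + 2, cx - 1:cx + 2, :]     # (3, 3, 4)
            l1[j, i] = relu(np.tensordot(win, w1, axes=3) + b1)
    # Layer 2: 3x3 conv over the in-register Layer 1 outputs
    return float(np.tensordot(l1, w2, axes=3) + b2)
```

The nine overlapping Layer 0 windows together cover the 5×5 neighborhood, which is where the redundant sampling (and register pressure) comes from.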

**Pros:**
- 2 passes vs 3 (33% reduction)
- 1 framebuffer write saved
- More manageable register usage

**Cons:**
- Still significant register pressure (288 bytes vs ~8 bytes baseline)
- Medium complexity increase
- Layer 0 (heaviest kernel) still separate

**Assessment:** ⚠️ **Marginal Benefit**
Saves 1 pass but register cost still high.

---

### Option C: Keep Current Multi-Pass ✅

**Rationale:**
- Current architecture well-suited to GPU design (high throughput via parallelism)
- Minimal register usage → high occupancy → hides memory latency
- Framebuffer bandwidth cost < register pressure cost
- Clean separation aids debugging/iteration
- Modular (easy to add/remove layers)

**Alternative Optimizations (if bandwidth critical):**
1. Merge passes via render pass load/store ops (Vulkan subpasses)
2. Reduce intermediate channel count (4→3 or 2)
3. Hybrid: Compute shaders + workgroup shared memory
4. Layer pruning (2-layer vs 3-layer quality comparison)

---

## Recommendation

**✅ Keep current multi-pass architecture**

### Decision Matrix

| Factor | Multi-Pass | Partial Flatten | Full Flatten |
|--------|-----------|----------------|--------------|
| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |

**Modern GPU Architecture Favors:**
- High parallelism (many simple threads) over fewer complex threads
- Hiding latency via high occupancy over minimizing raw operation counts
- Taming memory bandwidth via caching rather than eliminating traffic

---

## Alternative: Compute Shader + Shared Memory

**If bandwidth becomes critical:**
- Use compute shader with workgroup shared memory
- Load tile + halos into shared memory (9×9 input samples)
- Compute all 3 layers for tile interior (avoids redundant sampling)
- Requires explicit synchronization (`workgroupBarrier`)

**Trade-offs:**
- ✅ Low register pressure + low bandwidth
- ❌ Compute pipeline complexity (no render pass integration)
- ❌ Tile edge handling
- ❌ Larger code size
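
The shared-memory budget for the tile + halo load is easy to bound. The 8×8 workgroup tile is an assumed size for illustration:

```python
# Tile + halo sizing for the compute-shader variant. The 8x8 workgroup
# tile is an ASSUMPTION; the halo follows from the 9x9 receptive field.

tile = 8               # 8x8 workgroup tile (assumed)
halo = (9 - 1) // 2    # radius of the 9x9 cascaded receptive field
side = tile + 2 * halo
shared_floats = side * side * 7  # 7 input channels per sample
print(side, shared_floats * 4)   # 16x16 samples, 7168 bytes of shared memory
```

That fits comfortably in typical 16-32 KB workgroup shared memory, which is why this variant avoids the register-pressure problem.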

---

## Conclusion

Current 3-pass architecture is **appropriate for demo64k**:
- Size-efficient (modular shaders)
- Performance adequate (bandwidth not bottleneck)
- Maintainable (clean layer isolation)

**Flatten mode not recommended** unless profiling reveals specific bandwidth constraint.

### Size Optimization Alternatives (Better ROI)

If size optimization critical, focus on:
1. **Weight quantization:** 4.6 KB → ~2 KB (8-bit or 4-bit quantization)
2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)

These yield better size/performance than shader architecture changes.
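
Option 1 above could be as simple as symmetric per-layer 8-bit quantization — 4 bytes to 1 byte per weight plus one float scale per layer. The exact packing format for the demo is an open design choice; this is a minimal sketch:

```python
import numpy as np

# Sketch of option 1: symmetric 8-bit quantization of the weight buffer.
# Scale and scheme are illustrative; the real packing is undecided.

def quantize_i8(w):
    """float32 weights -> (int8 codes, per-tensor scale)."""
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_i8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(200, 4).astype(np.float32)  # ~Layer 0 sized
q, s = quantize_i8(w)
err = np.abs(dequantize_i8(q, s) - w).max()
print(q.nbytes, w.nbytes)  # 800 vs 3200 bytes (4x smaller)
```

Round-to-nearest bounds the reconstruction error by half a quantization step (`s / 2`), which is usually invisible for stylization-type CNN weights.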

---

## References

- `doc/CNN_EFFECT.md` - CNN implementation details
- `doc/CNN.md` - High-level CNN design
- `src/gpu/effects/cnn_effect.cc` - Current implementation
- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets