# CNN Post-Processing Effect

Neural network-based stylization for rendered scenes.

---

## Overview

Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.

**Key Features:**
- Position-aware layer 0 (coordinate input for vignetting, edge effects)
- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
- Original input available to all layers via framebuffer capture
- Configurable final blend with original scene
- Modular WGSL shader architecture
- Hardcoded weights (trained offline via PyTorch)
- ~5-9 KB binary footprint (see Size Budget)

---

## Architecture

### RGBD → Grayscale Pipeline

- **Input:** RGBD (RGB + inverse depth D = 1/z)
- **Output:** Grayscale (1 channel)
- **Layer Input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1,1]

**Architecture:**
- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
- **Final layer (N-1):** Conv2d(7→1) - output grayscale

```wgsl
// Inner layers: 7→4 (RGBD output, vec4-optimized)
fn cnn_conv3x3_7to4(
  tex: texture_2d<f32>,
  samp: sampler,
  uv: vec2<f32>,
  resolution: vec2<f32>,
  gray: f32,                               // Grayscale [-1,1]
  weights: array<vec4<f32>, 72>            // 9 pos × 4 ch × 2 vec4 (8 floats per filter)
) -> vec4<f32>

// Final layer: 7→1 (grayscale output, vec4-optimized)
fn cnn_conv3x3_7to1(
  tex: texture_2d<f32>,
  samp: sampler,
  uv: vec2<f32>,
  resolution: vec2<f32>,
  gray: f32,
  weights: array<vec4<f32>, 18>            // 9 pos × 2 vec4 (8 floats per filter)
) -> f32
```

**Input normalization:**
- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
- **Grayscale** computed once in fs_main using dot product: `dot(original.rgb, vec3(0.2126, 0.7152, 0.0722))`
- **Inter-layer data** stays in [-1,1] (no denormalization)
- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]

**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
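
For reference, the value conventions above in plain C++ (this mirrors the shader's math for documentation purposes only; the real work happens in WGSL):

```cpp
#include <cmath>

// [0,1] texture values <-> [-1,1] network values.
float to_signed(float x)   { return (x - 0.5f) * 2.0f; }   // fs_main normalization
float to_unsigned(float x) { return x * 0.5f + 0.5f; }     // final denormalization

// Rec. 709 luma weights used for the grayscale channel.
float luma(float r, float g, float b) {
  return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

// tanh keeps inner-layer outputs inside [-1,1]; the final layer has no activation.
float inner_activation(float x) { return std::tanh(x); }
```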

### Multi-Layer Architecture

CNNEffect supports multi-layer networks via automatic effect chaining:

1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
4. **Original input binding**: All layers access original via `@binding(4)`
5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`
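
Conceptually, the compiler expansion follows this rule (illustrative loop only; the real logic lives in `seq_compiler`, and the surrounding variables `total_layers`, `blend`, `start`, `end`, `priority` are assumed from the expanded example shown later in this document):

```cpp
// layers=N blend=B  ->  N chained CNNEffect instances; only the last one blends.
for (int i = 0; i < total_layers; ++i) {
  CNNEffectParams p;
  p.layer_index  = i;
  p.total_layers = total_layers;
  p.blend_amount = (i == total_layers - 1) ? blend : 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), start, end, priority + i);
}
```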

**Framebuffer Capture API:**
- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
- Generic mechanism usable by any effect
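
A minimal sketch of how an effect opts into this mechanism (member names assumed; see `src/gpu/effect.h` and `cnn_effect.h/cc` for the real interface):

```cpp
// Hypothetical chained effect: only the first layer of a chain requests
// pre-capture, so MainSequence blits the incoming frame into the
// "captured_frame" auxiliary texture exactly once per chain.
struct ChainedEffectSketch {
  int layer_index = 0;

  bool needs_framebuffer_capture() const {
    return layer_index == 0;  // later layers just sample the captured texture
  }
};
```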

### File Structure

```
src/gpu/effects/
  cnn_effect.h/cc         # CNNEffect class + framebuffer capture

workspaces/main/shaders/cnn/
  cnn_activation.wgsl     # tanh, ReLU, sigmoid, leaky_relu
  cnn_conv3x3.wgsl        # 3×3 convolution (standard + coord-aware)
  cnn_conv5x5.wgsl        # 5×5 convolution (standard + coord-aware)
  cnn_conv7x7.wgsl        # 7×7 convolution (standard + coord-aware)
  cnn_weights_generated.wgsl  # Weight arrays (auto-generated by train_cnn.py)
  cnn_layer.wgsl          # Main shader with layer switches (auto-generated by train_cnn.py)
```

---

## Training Workflow

### 1. Prepare Training Data

Input/target image pairs:
```
training/input/img_000.png   # RGBA (RGB + alpha)
training/output/img_000.png  # Grayscale target
```

**Note:** The alpha channel can hold inverse depth (1/z) or a constant (255); the network learns primarily from RGB.

### 2. Train Network

**Patch-based (Recommended)** - Preserves natural pixel scale:
```bash
python3 training/train_cnn.py \
  --input training/input --target training/output \
  --patch-size 32 --patches-per-image 64 --detector harris \
  --layers 3 --kernel-sizes 3,5,3 \
  --epochs 5000 --batch-size 16 --checkpoint-every 1000
```

**Detectors:** `harris` (corners), `fast` (features), `shi-tomasi` (corners), `gradient` (edges)

**Full-image (Legacy)** - Resizes to 256×256:
```bash
python3 training/train_cnn.py \
  --input training/input --target training/output \
  --layers 3 --kernel-sizes 3,5,3 \
  --epochs 10000 --batch-size 8 --checkpoint-every 1000
```

**Auto-generates:**
- `cnn_weights_generated.wgsl` - Weight arrays
- `cnn_layer.wgsl` - Layer shader

### 3. Export & Validate

```bash
# Export shaders
./training/train_cnn.py --export-only checkpoints/checkpoint_epoch_5000.pth

# Generate ground truth
./training/train_cnn.py --infer input.png \
  --export-only checkpoints/checkpoint_epoch_5000.pth --output ground_truth.png
```

### 4. Rebuild Demo

```bash
cmake --build build -j4 && ./build/demo64k
```

---

## Usage

### C++ Integration

**Single layer (manual):**
```cpp
#include "gpu/effects/cnn_effect.h"

CNNEffectParams p;
p.layer_index = 0;
p.total_layers = 1;
p.blend_amount = 1.0f;
auto cnn = std::make_shared<CNNEffect>(ctx, p);
timeline.add_effect(cnn, start_time, end_time);
```

**Multi-layer (automatic via timeline compiler):**

Use the timeline syntax shown below; `seq_compiler` expands it into one CNNEffect instance per layer.

### Timeline Examples

**Single-layer CNN (full stylization):**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=1
```

**Multi-layer CNN with blend:**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
```

Expands to:
```cpp
// Layer 0 (captures original, blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 0;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
}
// Layer 1 (blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 1;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
}
// Layer 2 (final blend=0.7)
{
  CNNEffectParams p;
  p.layer_index = 2;
  p.total_layers = 3;
  p.blend_amount = 0.7f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```

---

## Shader Structure

**Bindings:**
```wgsl
@group(0) @binding(0) var smplr: sampler;
@group(0) @binding(1) var txt: texture_2d<f32>;              // Current layer input
@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
@group(0) @binding(3) var<uniform> params: CNNLayerParams;
@group(0) @binding(4) var original_input: texture_2d<f32>;   // Layer 0 input (captured)
```

**Fragment shader logic:**
```wgsl
@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
    let uv = p.xy / uniforms.resolution;
    let original_raw = textureSample(original_input, smplr, uv);
    let original = (original_raw - 0.5) * 2.0;  // Normalize to [-1,1]
    let gray = dot(original.rgb, vec3<f32>(0.2126, 0.7152, 0.0722));
    var result = vec4<f32>(0.0);

    if (params.layer_index == 0) {
        result = cnn_conv3x3_7to4_src(txt, smplr, uv, uniforms.resolution,
                                      weights_layer0);
        result = cnn_tanh(result);
    }
    else if (params.layer_index == 1) {
        result = cnn_conv5x5_7to4(txt, smplr, uv, uniforms.resolution,
                                   gray, weights_layer1);
        result = cnn_tanh(result);
    }
    // ... other layers

    // Blend with ORIGINAL input (not previous layer)
    return mix(original_raw, result, params.blend_amount);
}
```

**Weight Storage (vec4-optimized):**

**Inner layers (7→4 RGBD output):**
```wgsl
// Structure: array<vec4<f32>, 72>
// 9 pos × 4 ch × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
const weights_layer0: array<vec4<f32>, 72> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),        // pos0_ch0 (rgba weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),    // pos0_ch0 (uv, gray, bias)
  vec4<f32>(w1_r, w1_g, w1_b, w1_d),        // pos0_ch1 (rgba weights)
  vec4<f32>(w1_u, w1_v, w1_gray, bias1),    // pos0_ch1 (uv, gray, bias)
  // ... 68 more vec4s
);
```

**Final layer (7→1 grayscale output):**
```wgsl
// Structure: array<vec4<f32>, 18>
// 9 pos × 2 vec4 (8 floats per filter: [rgba][uv,gray,1])
const weights_layerN: array<vec4<f32>, 18> = array(
  vec4<f32>(w0_r, w0_g, w0_b, w0_d),        // pos0 (rgba weights)
  vec4<f32>(w0_u, w0_v, w0_gray, bias0),    // pos0 (uv, gray, bias)
  // ... 16 more vec4s
);
```

**Optimization:** The bias is folded in as the 4th component of the `vec4(uv, gray, 1.0)` input, so two dot4 operations replace 8 scalar MADs per filter tap.
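
The scalar equivalent of one filter tap for one output channel, written in plain C++ for illustration only, shows how the bias rides on the trailing 1.0:

```cpp
// 7 weighted inputs + bias, expressed as two 4-wide dot products:
// dot(w_rgbd, rgbd) + dot(w_extra, [u, v, gray, 1]).
float filter_tap(const float w_rgbd[4], const float w_extra[4],
                 const float rgbd[4], float u, float v, float gray) {
  const float in1[4] = {u, v, gray, 1.0f};  // the 1.0 turns w_extra[3] into the bias
  float acc = 0.0f;
  for (int i = 0; i < 4; ++i) acc += w_rgbd[i] * rgbd[i];
  for (int i = 0; i < 4; ++i) acc += w_extra[i] * in1[i];
  return acc;
}
```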

---

## Size Budget

| Component | Size | Notes |
|-----------|------|-------|
| Activation functions | ~200 B | 4 functions |
| Conv3x3 (standard + coord) | ~500 B | Both variants |
| Conv5x5 (standard + coord) | ~700 B | Both variants |
| Conv7x7 (standard + coord) | ~900 B | Both variants |
| Main shader | ~800 B | Layer composition |
| C++ implementation | ~300 B | Effect class |
| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
| **Total** | **5-9 KB** | Acceptable for 64k |

**Optimization strategies:**
- Quantize weights (float32 → int8)
- Prune near-zero weights
- Use separable convolutions
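
None of these are implemented yet; as an illustration of the first item, a minimal int8 round trip might look like this (hypothetical helpers, not part of the codebase):

```cpp
#include <cstdint>
#include <cmath>

// Symmetric int8 quantization: one scale per weight array, scale = max|w| / 127.
int8_t quantize_weight(float w, float scale) {
  return static_cast<int8_t>(std::lround(w / scale));
}
float dequantize_weight(int8_t q, float scale) {
  return static_cast<float>(q) * scale;
}
```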

---

## Testing

```bash
./build/test_demo_effects  # CNN construction/shader tests
./build/demo64k            # Visual test
```

---

## Blend Parameter Behavior

**blend_amount** controls final compositing with original:
- `blend=0.0`: Pure original (no CNN effect)
- `blend=0.5`: 50% original + 50% CNN
- `blend=1.0`: Pure CNN output (full stylization)

**Important:** Blend uses captured layer 0 input, not previous layer output.
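
The composite itself is a plain linear interpolation (WGSL `mix`); a minimal C++ mirror of it, for reference only:

```cpp
// mix(a, b, t) = a*(1 - t) + b*t, where `original` is the captured layer 0
// input and `cnn` is the last layer's output.
float blend_with_original(float original, float cnn, float blend_amount) {
  return original * (1.0f - blend_amount) + cnn * blend_amount;
}
```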

**Example use cases:**
- `blend=1.0`: Full stylization (default)
- `blend=0.7`: Subtle effect preserving original details
- `blend=0.3`: Light artistic touch

## Troubleshooting

**Shader compilation fails:**
- Check `cnn_weights_generated.wgsl` syntax
- Verify snippets registered in `shaders.cc::InitShaderComposer()`
- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)

**Black/corrupted output:**
- Weights untrained (identity placeholder)
- Check `captured_frame` auxiliary texture is registered
- Verify layer priorities in timeline are sequential

**Wrong blend result:**
- Ensure layer 0 has `needs_framebuffer_capture() == true`
- Check MainSequence framebuffer capture logic
- Verify `original_input` binding is populated

**Training loss not decreasing:**
- Lower learning rate (`--learning-rate 0.0001`)
- More epochs (`--epochs 1000`)
- Check input/target image alignment

---

## Vec4 Optimization

**Architecture:** Weights stored as vec4 pairs for SIMD efficiency.

**Input representation:**
```wgsl
let rgbd = textureSample(...);              // vec4: [r, g, b, d]
let in1 = vec4<f32>(uv_norm, gray, 1.0);   // vec4: [u, v, gray, 1.0]
```

**Weight indexing:**
```wgsl
var sum = vec4<f32>(0.0);
var pos = 0;  // Direct weight array index
for (var dy = -1; dy <= 1; dy++) {
  for (var dx = -1; dx <= 1; dx++) {
    // rgbd/in1 hold the normalized sample and [uv, gray, 1] vector at this tap.
    // Unrolled channel loop (4 output channels)
    sum.r += dot(weights[pos + 0], rgbd) + dot(weights[pos + 1], in1);
    sum.g += dot(weights[pos + 2], rgbd) + dot(weights[pos + 3], in1);
    sum.b += dot(weights[pos + 4], rgbd) + dot(weights[pos + 5], in1);
    sum.a += dot(weights[pos + 6], rgbd) + dot(weights[pos + 7], in1);
    pos += 8;  // 4 channels × 2 vec4s per channel
  }
}
```

**Benefits:**
- **SIMD-native:** GPU executes `dot(vec4, vec4)` as single instruction (4 parallel MADs)
- **Memory bandwidth:** 2 vec4 loads vs 8 scalar loads (better cache alignment)
- **Bias integration:** Free via `[..., 1.0]` component (no separate add)
- **Code simplicity:** Eliminates inner loop, direct indexing with `pos`
- **Performance:** 2-3× GPU throughput improvement over scalar version

**Weight layout per filter (8 floats):**
- vec4[0]: [w_r, w_g, w_b, w_d]     (rgba input weights)
- vec4[1]: [w_u, w_v, w_gray, bias] (uv, grayscale, bias)

**3×3 kernel sizes:**
- Inner layer (7→4): 72 vec4s (9 pos × 4 ch × 2 vec4 = 1152 bytes)
- Final layer (7→1): 18 vec4s (9 pos × 1 ch × 2 vec4 = 288 bytes)
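
The same rule generalizes to the 5×5 and 7×7 kernels; a small compile-time check in plain C++ (illustrative only):

```cpp
// vec4 count per layer = K*K taps × out_channels × 2 vec4s per filter tap.
constexpr int cnn_vec4_count(int kernel, int out_channels) {
  return kernel * kernel * out_channels * 2;
}
static_assert(cnn_vec4_count(3, 4) == 72, "inner 3x3 layer");
static_assert(cnn_vec4_count(3, 1) == 18, "final 3x3 layer");
```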

---

## References

- **Training Script:** `training/train_cnn.py`
- **Shader Composition:** `doc/SEQUENCE.md`
- **Effect System:** `src/gpu/effect.h`