# CNN Post-Processing Effect

Neural network-based stylization for rendered scenes.

---

## Overview

Trainable convolutional neural network layers for artistic stylization (painterly, sketch, cel-shaded effects) with minimal runtime overhead.

**Key Features:**
- Position-aware layer 0 (coordinate input for vignetting, edge effects)
- Multi-layer convolutions (3×3, 5×5, 7×7 kernels) with automatic chaining
- Original input available to all layers via framebuffer capture
- Configurable final blend with original scene
- Modular WGSL shader architecture
- Hardcoded weights (trained offline via PyTorch)
- ~5-9 KB binary footprint (see Size Budget)

---

## Architecture

### RGBD → Grayscale Pipeline

- **Input:** RGBD (RGB + inverse depth D = 1/z)
- **Output:** Grayscale (1 channel)
- **Layer Input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1,1]

**Architecture:**
- **Inner layers (0..N-2):** Conv2d(7→4) - output RGBD
- **Final layer (N-1):** Conv2d(7→1) - output grayscale

```wgsl
// Inner layers: 7→4 (RGBD output)
fn cnn_conv3x3_7to4(
  tex: texture_2d<f32>,
  samp: sampler,
  uv: vec2<f32>,
  resolution: vec2<f32>,
  original: vec4<f32>,                     // Original RGBD in [-1,1]
  weights: array<array<f32, 8>, 36>        // 9 positions × 4 output channels, each (7 weights + bias)
) -> vec4<f32>

// Final layer: 7→1 (grayscale output)
fn cnn_conv3x3_7to1(
  tex: texture_2d<f32>,
  samp: sampler,
  uv: vec2<f32>,
  resolution: vec2<f32>,
  original: vec4<f32>,
  weights: array<array<f32, 8>, 9>         // 9 positions, each (7 weights + bias)
) -> f32
```

**Input normalization:**
- **fs_main** normalizes textures once: `(tex - 0.5) * 2` → [-1,1]
- **Conv functions** normalize UV coords: `(uv - 0.5) * 2` → [-1,1]
- **Grayscale** computed from the normalized RGB channels: `0.2126*R + 0.7152*G + 0.0722*B`
- **Inter-layer data** stays in [-1,1] (no denormalization)
- **Final output** denormalized for display: `(result + 1.0) * 0.5` → [0,1]

**Activation:** tanh for inner layers (output stays [-1,1]), none for final layer
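
For reference, the same convention in a minimal NumPy sketch (illustrative only; the authoritative math lives in the WGSL conv functions and `train_cnn.py`):

```python
import numpy as np

def to_signed(x):
    """Map [0,1] texture/UV values into the [-1,1] range used between layers."""
    return (np.asarray(x, dtype=np.float32) - 0.5) * 2.0

def to_unorm(x):
    """Denormalize the final grayscale result back to [0,1] for display."""
    return (x + 1.0) * 0.5

def layer_input(rgbd, uv):
    """Build the 7-channel per-pixel input [R, G, B, D, U, V, gray], all in [-1,1]."""
    rgbd_n = to_signed(rgbd)                # RGB + inverse depth, normalized
    uv_n = to_signed(uv)                    # screen-space coordinates, normalized
    gray = 0.2126 * rgbd_n[0] + 0.7152 * rgbd_n[1] + 0.0722 * rgbd_n[2]
    return np.concatenate([rgbd_n, uv_n, [gray]])

# A mid-gray pixel at the screen center maps to the all-zero feature vector.
print(layer_input([0.5, 0.5, 0.5, 0.5], [0.5, 0.5]))
```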

### Multi-Layer Architecture

CNNEffect supports multi-layer networks via automatic effect chaining:

1. **Timeline specifies total layers**: `CNNEffect layers=3 blend=0.7`
2. **Compiler expands to chain**: 3 separate CNNEffect instances (layer 0→1→2)
3. **Framebuffer capture**: Layer 0 captures original input to `"captured_frame"`
4. **Original input binding**: All layers access original via `@binding(4)`
5. **Final blend**: Last layer blends result with original: `mix(original, result, 0.7)`

**Framebuffer Capture API:**
- `Effect::needs_framebuffer_capture()` - effect requests pre-capture
- MainSequence automatically blits input → `"captured_frame"` auxiliary texture
- Generic mechanism usable by any effect

### File Structure

```
src/gpu/effects/
  cnn_effect.h/cc         # CNNEffect class + framebuffer capture

workspaces/main/shaders/cnn/
  cnn_activation.wgsl     # tanh, ReLU, sigmoid, leaky_relu
  cnn_conv3x3.wgsl        # 3×3 convolution (standard + coord-aware)
  cnn_conv5x5.wgsl        # 5×5 convolution (standard + coord-aware)
  cnn_conv7x7.wgsl        # 7×7 convolution (standard + coord-aware)
  cnn_weights_generated.wgsl  # Weight arrays (auto-generated by train_cnn.py)
  cnn_layer.wgsl          # Main shader with layer switches (auto-generated by train_cnn.py)
```

---

## Training Workflow

### 1. Prepare Training Data

Collect input/target image pairs:
- **Input:** RGBA (RGB + depth as alpha channel, D=1/z)
- **Target:** Grayscale stylized output

```bash
training/input/img_000.png   # RGBA render (RGB + depth)
training/output/img_000.png  # Grayscale target
```

**Note:** Input images must be RGBA where alpha = inverse depth (1/z)
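
If the renderer exports color and depth separately, a hypothetical packing helper could look like this (assumes NumPy float arrays and Pillow; the inverse-depth scaling is an assumption and must match whatever the demo writes into the alpha channel at runtime):

```python
import numpy as np
from PIL import Image

def pack_training_input(rgb, depth, out_path):
    """Pack RGB in [0,1] (H x W x 3) and linear depth z (H x W) into an RGBA PNG with A = 1/z."""
    inv_depth = 1.0 / np.maximum(depth, 1e-6)      # D = 1/z, guard against z == 0
    inv_depth = inv_depth / inv_depth.max()        # assumed scaling into [0,1] for 8-bit storage
    rgba = np.dstack([rgb, inv_depth])
    Image.fromarray(np.clip(rgba * 255.0 + 0.5, 0, 255).astype(np.uint8), "RGBA").save(out_path)

# pack_training_input(rgb, depth, "training/input/img_000.png")
```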

### 2. Train Network

```bash
python3 training/train_cnn.py \
  --input training/input \
  --target training/output \
  --layers 1 \
  --kernel-sizes 3 \
  --epochs 500 \
  --checkpoint-every 50
```

**Multi-layer example (3 layers with varying kernel sizes):**
```bash
python3 training/train_cnn.py \
  --input training/input \
  --target training/output \
  --layers 3 \
  --kernel-sizes 3,5,3 \
  --epochs 1000 \
  --checkpoint-every 100
```

**Note:** Training script auto-generates:
- `cnn_weights_generated.wgsl` - weight arrays for all layers
- `cnn_layer.wgsl` - shader with layer switches and original input binding

**Resume from checkpoint:**
```bash
python3 training/train_cnn.py \
  --input training/input \
  --target training/output \
  --resume training/checkpoints/checkpoint_epoch_200.pth
```

**Export WGSL from checkpoint (no training):**
```bash
python3 training/train_cnn.py \
  --export-only training/checkpoints/checkpoint_epoch_200.pth \
  --output workspaces/main/shaders/cnn/cnn_weights_generated.wgsl
```

**Generate ground truth (for shader validation):**
```bash
python3 training/train_cnn.py \
  --infer training/input/img_000.png \
  --export-only training/checkpoints/checkpoint_epoch_200.pth \
  --output training/ground_truth.png
```
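
One way to use the ground truth is a pixel diff against a frame captured from the running demo (hypothetical sketch; the capture file name is a placeholder):

```python
import numpy as np
from PIL import Image

def compare(ground_truth_path, capture_path, tolerance=2.0):
    """Report mean/max absolute difference between the PyTorch and WGSL outputs (8-bit units)."""
    gt = np.asarray(Image.open(ground_truth_path).convert("L"), dtype=np.float32)
    fb = np.asarray(Image.open(capture_path).convert("L"), dtype=np.float32)
    diff = np.abs(gt - fb)
    print(f"mean diff: {diff.mean():.2f}  max diff: {diff.max():.2f}")
    return diff.max() <= tolerance

# compare("training/ground_truth.png", "demo_capture.png")
```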

### 3. Rebuild Demo

The training script regenerates both `cnn_weights_generated.wgsl` and `cnn_layer.wgsl`, so a rebuild picks up the new weights:
```bash
cmake --build build -j4
./build/demo64k
```

---

## Usage

### C++ Integration

**Single layer (manual):**
```cpp
#include "gpu/effects/cnn_effect.h"

CNNEffectParams p;
p.layer_index = 0;
p.total_layers = 1;
p.blend_amount = 1.0f;
auto cnn = std::make_shared<CNNEffect>(ctx, p);
timeline.add_effect(cnn, start_time, end_time);
```

**Multi-layer (automatic via timeline compiler):**

Use the timeline syntax below; `seq_compiler` expands it into one `CNNEffect` instance per layer.

### Timeline Examples

**Single-layer CNN (full stylization):**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=1
```

**Multi-layer CNN with blend:**
```
SEQUENCE 10.0 0
  EFFECT + Hybrid3DEffect 0.00 5.00
  EFFECT + CNNEffect 0.50 5.00 layers=3 blend=0.7
```

Expands to:
```cpp
// Layer 0 (captures original, blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 0;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 1);
}
// Layer 1 (blend=1.0)
{
  CNNEffectParams p;
  p.layer_index = 1;
  p.total_layers = 3;
  p.blend_amount = 1.0f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 2);
}
// Layer 2 (final blend=0.7)
{
  CNNEffectParams p;
  p.layer_index = 2;
  p.total_layers = 3;
  p.blend_amount = 0.7f;
  seq->add_effect(std::make_shared<CNNEffect>(ctx, p), 0.50f, 5.00f, 3);
}
```

---

## Shader Structure

**Bindings:**
```wgsl
@group(0) @binding(0) var smplr: sampler;
@group(0) @binding(1) var txt: texture_2d<f32>;              // Current layer input
@group(0) @binding(2) var<uniform> uniforms: CommonUniforms;
@group(0) @binding(3) var<uniform> params: CNNLayerParams;
@group(0) @binding(4) var original_input: texture_2d<f32>;   // Layer 0 input (captured)
```

**Fragment shader logic:**
```wgsl
@fragment fn fs_main(@builtin(position) p: vec4<f32>) -> @location(0) vec4<f32> {
    let uv = p.xy / uniforms.resolution;
    let input = textureSample(txt, smplr, uv);               // Previous layer's output (scene render for layer 0)
    let original = textureSample(original_input, smplr, uv); // Captured layer 0 input

    var result = vec4<f32>(0.0);

    if (params.layer_index == 0) {
        result = cnn_conv3x3_with_coord(txt, smplr, uv, uniforms.resolution,
                                        rgba_weights_layer0, coord_weights_layer0, bias_layer0);
        result = cnn_tanh(result);
    }
    // ... other layers

    // Blend with ORIGINAL input (not previous layer)
    return mix(original, result, params.blend_amount);
}
```

**Weight Storage:**

**Inner layers (7→4 RGBD output):**
```wgsl
// Structure: array<array<f32, 8>, 36>
// 9 positions × 4 output channels, each with 7 weights + bias
const weights_layer0: array<array<f32, 8>, 36> = array(
  array<f32, 8>(w0_r, w0_g, w0_b, w0_d, w0_u, w0_v, w0_gray, bias0),  // pos0_ch0
  array<f32, 8>(w1_r, w1_g, w1_b, w1_d, w1_u, w1_v, w1_gray, bias1),  // pos0_ch1
  // ... 34 more entries
);
```

**Final layer (7→1 grayscale output):**
```wgsl
// Structure: array<array<f32, 8>, 9>
// 9 positions, each with 7 weights + bias
const weights_layerN: array<array<f32, 8>, 9> = array(
  array<f32, 8>(w0_r, w0_g, w0_b, w0_d, w0_u, w0_v, w0_gray, bias0),  // pos0
  // ... 8 more entries
);
```
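
This layout corresponds to flattening a PyTorch `Conv2d(7, 4, kernel_size=3)` weight tensor of shape `(4, 7, 3, 3)` position-major, channel-minor. The authoritative export code is `train_cnn.py`; the sketch below only illustrates the indexing:

```python
import torch.nn as nn

def flatten_inner_layer(conv: nn.Conv2d):
    """Flatten Conv2d(7, 4, 3) into 9 positions x 4 output channels = 36 rows of 7 weights + bias."""
    w = conv.weight.detach()   # shape (4, 7, 3, 3): (out_ch, in_ch, ky, kx)
    b = conv.bias.detach()     # shape (4,)
    rows = []
    for ky in range(3):
        for kx in range(3):                 # kernel position, row-major (pos0..pos8)
            for out_ch in range(4):         # output channel within the position
                weights = [w[out_ch, in_ch, ky, kx].item() for in_ch in range(7)]
                rows.append(weights + [b[out_ch].item()])
    return rows

print(len(flatten_inner_layer(nn.Conv2d(7, 4, 3))))  # 36 rows of 8 floats
```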

---

## Size Budget

| Component | Size | Notes |
|-----------|------|-------|
| Activation functions | ~200 B | 4 functions |
| Conv3x3 (standard + coord) | ~500 B | Both variants |
| Conv5x5 (standard + coord) | ~700 B | Both variants |
| Conv7x7 (standard + coord) | ~900 B | Both variants |
| Main shader | ~800 B | Layer composition |
| C++ implementation | ~300 B | Effect class |
| **Coord weights** | **+32 B** | Per-layer overhead (layer 0 only) |
| **RGBA weights** | **2-6 KB** | Depends on depth/kernel sizes |
| **Total** | **5-9 KB** | Acceptable for 64k |

**Optimization strategies:**
- Quantize weights (float32 → int8); see the sketch after this list
- Prune near-zero weights
- Use separable convolutions
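
A sketch of the first strategy, symmetric per-layer int8 quantization (not part of the current pipeline; shown only to illustrate the idea):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-layer quantization: float32 weights -> int8 values plus one f32 scale."""
    w = np.asarray(weights, dtype=np.float32)
    scale = max(np.abs(w).max() / 127.0, 1e-12)    # guard against an all-zero layer
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                                # dequantize as q * scale when emitting WGSL

q, scale = quantize_int8(np.random.randn(36, 8))
print(q.dtype, q.shape, scale)
```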

---

## Testing

```bash
./build/test_demo_effects  # CNN construction/shader tests
./build/demo64k            # Visual test
```

---

## Blend Parameter Behavior

**blend_amount** controls final compositing with original:
- `blend=0.0`: Pure original (no CNN effect)
- `blend=0.5`: 50% original + 50% CNN
- `blend=1.0`: Pure CNN output (full stylization)

**Important:** The blend uses the captured layer 0 input, not the previous layer's output (see the sketch at the end of this section).

**Example use cases:**
- `blend=1.0`: Full stylization (default)
- `blend=0.7`: Subtle effect preserving original details
- `blend=0.3`: Light artistic touch
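
Numerically the composite is a plain linear interpolation, the same `mix` used in the shader (pure illustration):

```python
def mix(original, result, blend):
    """WGSL-style mix(): blend=0 keeps the original, blend=1 keeps the CNN result."""
    return original * (1.0 - blend) + result * blend

print(mix(0.2, 0.8, 0.7))  # 0.62: 30% original + 70% CNN output
```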

## Troubleshooting

**Shader compilation fails:**
- Check `cnn_weights_generated.wgsl` syntax
- Verify snippets registered in `shaders.cc::InitShaderComposer()`
- Ensure `cnn_layer.wgsl` has 5 bindings (including `original_input`)

**Black/corrupted output:**
- Weights untrained (identity placeholder)
- Check `captured_frame` auxiliary texture is registered
- Verify layer priorities in timeline are sequential

**Wrong blend result:**
- Ensure layer 0 has `needs_framebuffer_capture() == true`
- Check MainSequence framebuffer capture logic
- Verify `original_input` binding is populated

**Training loss not decreasing:**
- Lower learning rate (`--learning-rate 0.0001`)
- More epochs (`--epochs 1000`)
- Check input/target image alignment

---

## References

- **Training Script:** `training/train_cnn.py`
- **Shader Composition:** `doc/SEQUENCE.md`
- **Effect System:** `src/gpu/effect.h`