# CNN RGBD→Grayscale Architecture Implementation

## Summary

Implemented a CNN architecture upgrade: RGBD input → grayscale output, with each layer receiving a 7-channel augmented input.

## Changes Made

### Architecture

- **Input:** RGBD (4 channels: RGB + inverse depth D = 1/z)
- **Output:** grayscale (1 channel)
- **Per-layer input:** 7 channels = [RGBD, UV coords, grayscale], all normalized to [-1, 1]

**Layer Configuration:**
- Inner layers (0..N-2): Conv2d(7→4) - output RGBD with tanh activation
- Final layer (N-1): Conv2d(7→1) - output grayscale, no activation
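
The configuration above amounts to a channel spec per layer. A minimal sketch (the helper name `layer_channels` is illustrative, not a function from `train_cnn.py`):

```python
def layer_channels(num_layers):
    """Channel spec per layer: inner layers map 7->4 (RGBD), final maps 7->1 (gray)."""
    return [(7, 4)] * (num_layers - 1) + [(7, 1)]

# A 3-layer network: two inner RGBD layers plus the grayscale head.
# layer_channels(3) -> [(7, 4), (7, 4), (7, 1)]
```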

### Input Normalization (all to [-1,1])

- **RGBD:** `(rgbd - 0.5) * 2`
- **UV coords:** `(uv - 0.5) * 2`
- **Grayscale:** `(0.2126*R + 0.7152*G + 0.0722*B - 0.5) * 2`

**Rationale:** Zero-centered inputs suit the tanh activation and improve gradient flow.
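
A NumPy sketch of the three normalizations for a single pixel (the function name `normalize_channels7` is illustrative; the training script operates on full tensors):

```python
import numpy as np

def normalize_channels7(rgbd, uv):
    """Build the 7-channel layer input for one pixel, all in [-1, 1].

    rgbd: (4,) array in [0, 1]; uv: (2,) array in [0, 1].
    """
    r, g, b = rgbd[0], rgbd[1], rgbd[2]
    gray = 0.2126 * r + 0.7152 * g + 0.0722 * b   # BT.709 luma
    rgbd_n = (rgbd - 0.5) * 2.0
    uv_n = (uv - 0.5) * 2.0
    gray_n = (gray - 0.5) * 2.0
    return np.concatenate([rgbd_n, uv_n, [gray_n]])
```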

### Modified Files

**Training (`/Users/skal/demo/training/train_cnn.py`):**
1. Removed `CoordConv2d` class
2. Updated `SimpleCNN`:
   - Inner layers: `Conv2d(7, 4)` - RGBD output
   - Final layer: `Conv2d(7, 1)` - grayscale output
3. Updated `forward()`:
   - Normalize RGBD/coords/gray to [-1,1]
   - Concatenate 7-channel input for each layer
   - Apply tanh (inner) or none (final)
   - Denormalize final output
4. Updated `export_weights_to_wgsl()`:
   - Inner: `array<array<f32, 8>, 36>` (9 pos × 4 ch × 8 values)
   - Final: `array<array<f32, 8>, 9>` (9 pos × 8 values)
5. Updated `generate_layer_shader()`:
   - Use `cnn_conv3x3_7to4` for inner layers
   - Use `cnn_conv3x3_7to1` for final layer
   - Denormalize outputs from [-1,1] to [0,1]
6. Updated `ImagePairDataset`:
   - Load RGBA input (was RGB)
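
The WGSL array shapes in step 4 follow from the kernel geometry. A quick check of the arithmetic (the assumption that each 8-value entry packs 7 per-input-channel weights plus one extra slot is mine, inferred from the 7-channel input, not stated by the export code):

```python
positions = 9          # 3x3 kernel taps
in_channels = 7        # RGBD + uv_x + uv_y + gray
inner_out = 4          # inner layers emit RGBD
final_out = 1          # final layer emits grayscale

values_per_entry = in_channels + 1      # 7 weights + 1 extra slot -> f32 x 8
inner_entries = positions * inner_out   # array<array<f32, 8>, 36>
final_entries = positions * final_out   # array<array<f32, 8>, 9>
```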

**Shaders (`/Users/skal/demo/workspaces/main/shaders/cnn/cnn_conv3x3.wgsl`):**
1. Added `cnn_conv3x3_7to4()`:
   - 7-channel input: [RGBD, uv_x, uv_y, gray]
   - 4-channel output: RGBD
   - Weights: `array<array<f32, 8>, 36>`
2. Added `cnn_conv3x3_7to1()`:
   - 7-channel input: [RGBD, uv_x, uv_y, gray]
   - 1-channel output: grayscale
   - Weights: `array<array<f32, 8>, 9>`
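
For checking shader parity on the CPU, a naive NumPy reference of the 7→1 convolution could look like this. It is a sketch under my own assumptions about the weight layout (entry index = kernel position, first 7 values = per-channel weights, 8th value treated as a bias term); the actual WGSL layout may differ:

```python
import numpy as np

def conv3x3_7to1_ref(patch7, weights):
    """Reference 3x3 convolution over a 7-channel neighborhood.

    patch7: (3, 3, 7) array of 7-channel inputs in [-1, 1].
    weights: (9, 8) array mirroring array<array<f32, 8>, 9>.
    Returns the raw (pre-denormalization) grayscale value.
    """
    out = 0.0
    for pos in range(9):
        px = patch7[pos // 3, pos % 3]                       # 7-channel sample at this tap
        out += float(np.dot(px, weights[pos, :7])) + weights[pos, 7]
    return out
```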

**Documentation (`/Users/skal/demo/doc/CNN_EFFECT.md`):**
1. Updated architecture section with RGBD→grayscale pipeline
2. Updated training data requirements (RGBA input)
3. Updated weight storage format

### No C++ Changes

CNNLayerParams and bind groups remain unchanged.

## Data Flow

1. Layer 0 captures original RGBD to `captured_frame`
2. Each layer:
   - Samples previous layer output (RGBD in [0,1])
   - Normalizes RGBD to [-1,1]
   - Computes UV coords and grayscale, normalizes to [-1,1]
   - Concatenates 7-channel input
   - Applies convolution with layer-specific weights
   - Applies tanh (inner layers only)
   - Yields RGBD (inner) or grayscale (final) in [-1,1]
   - Denormalizes to [0,1] for texture storage
   - Blends with original
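
The tail of the per-layer loop (activation → denormalize → blend) can be mocked for a single pixel. The blend factor here is illustrative, not a value from the demo:

```python
import numpy as np

def inner_layer_store(conv_raw, original_rgbd01, blend=0.5):
    """One inner-layer store step for a single pixel, following the data flow above.

    conv_raw: (4,) raw convolution output; original_rgbd01: (4,) RGBD in [0, 1].
    """
    activated = np.tanh(conv_raw)        # inner-layer activation, lands in [-1, 1]
    stored = activated * 0.5 + 0.5       # denormalize to [0, 1] for texture storage
    return blend * stored + (1.0 - blend) * original_rgbd01
```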

## Next Steps

1. **Prepare RGBD training data:**
   - Input: RGBA images (RGB + depth in alpha)
   - Target: Grayscale stylized output

2. **Train network:**
   ```bash
   python3 training/train_cnn.py \
     --input training/input \
     --target training/output \
     --layers 3 \
     --epochs 1000
   ```

3. **Verify generated shaders:**
   - Check `cnn_weights_generated.wgsl` structure
   - Check `cnn_layer.wgsl` uses new conv functions

4. **Test in demo:**
   ```bash
   cmake --build build -j4
   ./build/demo64k
   ```

## Design Rationale

**Why [-1,1] normalization?**
- Centered inputs for tanh (operates best around 0)
- Better gradient flow
- Standard ML practice for normalized data
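
The gradient-flow point is easy to quantify: tanh'(x) = 1 − tanh²(x) equals 1 at zero and collapses quickly away from it, so inputs centered near 0 keep gradients alive:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])
grad = 1.0 - np.tanh(x) ** 2   # derivative of tanh
# grad[0] is exactly 1; by x = 3 the gradient has fallen below 0.01
```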

**Why RGBD throughout instead of RGB?**
- Depth information propagates through network
- Enables depth-aware stylization
- Consistent 4-channel processing

**Why 7-channel input?**
- Coordinates: position-dependent effects (vignettes)
- Grayscale: luminance-aware processing
- RGBD: full color+depth information
- Enables richer feature learning

## Testing Checklist

- [ ] Train network with RGBD input data
- [ ] Verify `cnn_weights_generated.wgsl` structure
- [ ] Verify `cnn_layer.wgsl` uses `7to4`/`7to1` functions
- [ ] Build demo without errors
- [ ] Visual test: inner layers show RGBD evolution
- [ ] Visual test: final layer produces grayscale
- [ ] Visual test: blending works correctly
- [ ] Compare quality with previous RGB→RGB architecture