# CNN Shader Testing Tool

Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Supports both CNN v1 (render pipeline) and v2 (compute, storage buffer).

---

## Purpose

- Validate trained weights against ground truth
- Debug CNN layer behavior in isolation
- Generate test outputs for training workflow
- Match Python training script's inference mode

---

## Architecture

**Two implementations:**

1. **CNN v1** (render pipeline, texture atlas weights)
   - 3 fixed layers
   - RGBA16Float intermediates
   - BGRA8Unorm final output

2. **CNN v2** (compute shaders, storage buffer weights)
   - Dynamic layer count from binary
   - 7D static features (RGBD + UV + sin + bias)
   - RGBA32Uint packed f16 intermediates
   - Storage buffer: ~3-5 KB weights

**Core GPU utility:** `src/gpu/texture_readback.{h,cc}`
- Synchronous texture-to-CPU readback
- Supports RGBA16Float, RGBA32Uint, BGRA8Unorm
- Guarded by `STRIP_ALL` (compiled out, 0 bytes in release builds)

---

## Usage

```bash
cnn_test input.png output.png [OPTIONS]

OPTIONS:
  --cnn-version N          CNN version: 1 (default) or 2
  --blend F                Final blend amount (0.0-1.0, default: 1.0)
  --format ppm|png         Output format (default: png)
  --layers N               Number of CNN layers (1-10, v1 only, default: 3)
  --save-intermediates DIR Save intermediate layers to directory
  --debug-hex              Print first 8 pixels as hex (debug)
  --help                   Show usage
```

**Examples:**
```bash
# CNN v1 (render pipeline, 3 layers)
./build/cnn_test input.png output.png --cnn-version 1

# CNN v2 (compute, storage buffer, dynamic layers)
./build/cnn_test input.png output.png --cnn-version 2

# 50% blend with original (v2)
./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5

# Debug hex dump
./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
```
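
To make the `--debug-hex` option concrete, here is a minimal sketch of dumping the first 8 pixels of a readback buffer as hex. The exact output format the tool prints is not documented here, so the layout (one pixel per line, bytes in buffer order) is an assumption, and `format_pixels_hex` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: format the first `max_pixels` 4-byte pixels as hex,
// one pixel per line, in the byte order they appear in the buffer.
// Trailing bytes that do not form a full pixel are ignored.
inline std::string format_pixels_hex(const std::vector<uint8_t>& pixels,
                                     size_t max_pixels = 8) {
  std::string out;
  const size_t count = std::min(max_pixels, pixels.size() / 4);
  for (size_t i = 0; i < count; ++i) {
    char line[16];  // "xx xx xx xx\n" is 12 chars plus the terminator
    std::snprintf(line, sizeof(line), "%02x %02x %02x %02x\n",
                  static_cast<unsigned>(pixels[4 * i]),
                  static_cast<unsigned>(pixels[4 * i + 1]),
                  static_cast<unsigned>(pixels[4 * i + 2]),
                  static_cast<unsigned>(pixels[4 * i + 3]));
    out += line;
  }
  return out;
}
```

A dump like this is mainly useful for spotting all-zero or all-`ff` buffers before looking at the saved image.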

---

## Implementation Details

### Core Readback Utility

**File:** `src/gpu/texture_readback.{h,cc}`

**Function:**
```cpp
std::vector<uint8_t> read_texture_pixels(
    WGPUInstance instance,
    WGPUDevice device,
    WGPUTexture texture,
    int width,
    int height);
```

**Features:**
- Returns BGRA8 format (4 bytes per pixel)
- Synchronous blocking operation
- Cross-platform async callback handling (Win32 vs Native API)
- Automatic staging buffer creation and cleanup
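
WebGPU requires `bytesPerRow` in texture-to-buffer copies to be a multiple of 256, so a staging buffer is usually wider than the tightly packed image. The sketch below shows the stride arithmetic and the row-by-row unpadding such a readback needs. It assumes 4 bytes per pixel; `padded_bytes_per_row` and `unpad_rows` are hypothetical names, not functions from `texture_readback.cc`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// WebGPU requires bytesPerRow to be a multiple of 256 for buffer<->texture copies.
constexpr uint32_t kRowAlignment = 256;

// Hypothetical helper: round the tight row size up to the next multiple of 256.
inline uint32_t padded_bytes_per_row(uint32_t width, uint32_t bytes_per_pixel) {
  const uint32_t tight = width * bytes_per_pixel;
  return (tight + kRowAlignment - 1) / kRowAlignment * kRowAlignment;
}

// Hypothetical helper: copy rows out of a padded (mapped) staging buffer
// into a tightly packed vector of width * height * bytes_per_pixel bytes.
inline std::vector<uint8_t> unpad_rows(const uint8_t* mapped, uint32_t width,
                                       uint32_t height, uint32_t bytes_per_pixel) {
  assert(mapped != nullptr && width > 0 && height > 0);
  const size_t tight = static_cast<size_t>(width) * bytes_per_pixel;
  const size_t padded = padded_bytes_per_row(width, bytes_per_pixel);
  std::vector<uint8_t> out(tight * height);
  for (uint32_t y = 0; y < height; ++y) {
    std::memcpy(out.data() + y * tight, mapped + y * padded, tight);
  }
  return out;
}
```

For example, a 100-pixel-wide BGRA8 row is 400 bytes tight but 512 bytes in the staging buffer.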

**Refactored OffscreenRenderTarget:**
```cpp
std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
#if !defined(STRIP_ALL)
  return read_texture_pixels(instance_, device_, texture_, width_, height_);
#else
  return std::vector<uint8_t>();
#endif
}
```

### CNN v1 Pipeline (Render)

**Fixed 3-layer architecture:**
- Ping-pong RGBA16Float textures
- CNNLayerParams (binding 3): layer_index, blend_amount
- Shader composer resolves #include directives
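
The `blend_amount` parameter (exposed as `--blend`) is described only as a 0.0-1.0 blend with the original image. One plausible reading is a per-channel linear interpolation, as WGSL's `mix` does; the sketch below assumes that reading and is not taken from the shader source. `blend_channel` is a hypothetical name.

```cpp
#include <algorithm>

// Assumed semantics: linear interpolation between the original input and the
// CNN output, equivalent to WGSL mix(original, cnn, blend).
// blend = 0.0 keeps the original, blend = 1.0 (the default) keeps the CNN output.
inline float blend_channel(float original, float cnn, float blend) {
  const float t = std::min(std::max(blend, 0.0f), 1.0f);  // clamp to [0, 1]
  return original * (1.0f - t) + cnn * t;
}
```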

### CNN v2 Pipeline (Compute)

**Dynamic layer architecture:**
1. **Static features compute:** Generate 7D features (RGBD + UV + sin + bias)
2. **Layer computes:** N layers from binary weights (3-5 typically)
   - Storage buffer weights (read-only)
   - RGBA32Uint packed f16 textures (ping-pong)
   - CNNv2LayerParams: kernel_size, channels, weight_offset, blend
3. **Readback:** RGBA32Uint → f16 decode → u8 clamp

**Binary format:** Header (20B) + layer info (20B×N) + f16 weights
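
From the sizes stated above, the weights start at byte `20 + 20×N`. The sketch below does only that arithmetic: the field layout inside the header and layer records is not described here, so both are treated as opaque. It also assumes the weights are tightly packed 2-byte f16 values running to the end of the file; the function names are hypothetical.

```cpp
#include <cstddef>

constexpr size_t kHeaderBytes = 20;     // header size from the format description
constexpr size_t kLayerInfoBytes = 20;  // per-layer record size
constexpr size_t kF16Bytes = 2;         // one half-precision weight

// Byte offset of the first f16 weight in a file with `layer_count` layers.
inline size_t weights_byte_offset(size_t layer_count) {
  return kHeaderBytes + kLayerInfoBytes * layer_count;
}

// Number of f16 weights implied by the file size (assumption: weights fill the
// rest of the file). Returns 0 if the file is too small for its header and
// layer records, or if the weight section has an odd byte count.
inline size_t weight_count(size_t file_bytes, size_t layer_count) {
  const size_t offset = weights_byte_offset(layer_count);
  if (file_bytes < offset) return 0;
  const size_t payload = file_bytes - offset;
  if (payload % kF16Bytes != 0) return 0;
  return payload / kF16Bytes;
}
```

A check like this can catch a truncated or mismatched weights file before any layer is dispatched.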

---

## Build Integration

**CMakeLists.txt:**

1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
2. Tool target:
```cmake
add_executable(cnn_test
    tools/cnn_test.cc
    src/tests/common/webgpu_test_fixture.cc
    src/tests/common/offscreen_render_target.cc
    ${PLATFORM_SOURCES}
    ${GEN_DEMO_CC})

target_link_libraries(cnn_test PRIVATE
    gpu util procedural ${DEMO_LIBS})

add_dependencies(cnn_test generate_demo_assets)

target_compile_definitions(cnn_test PRIVATE
    STB_IMAGE_IMPLEMENTATION
    STB_IMAGE_WRITE_IMPLEMENTATION)
```

**Build:**
```bash
cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
cmake --build build -j4
```

---

## Validation Workflow (CNN v2)

### 1. Train and Export
```bash
# Train and export weights
./scripts/train_cnn_v2_full.sh --epochs 200 --batch-size 16
```

### 2. Tool Inference
```bash
# Run tool with v2
./build/cnn_test training/input/img_000.png output.png --cnn-version 2
```

### 3. Visual Comparison
Compare `output.png` with `training/target_X/img_000.png`.
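
A visual check can be backed by a simple numeric one. The sketch below computes the mean and maximum absolute difference between two decoded 8-bit pixel buffers. It assumes both images were already decoded to the same size and channel count (for example with stb_image, which the tool already uses); `DiffStats` and `compare_pixels` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Summary of the per-byte difference between two images.
struct DiffStats {
  double mean_abs;  // mean absolute difference over all channel bytes
  int max_abs;      // largest absolute difference of any single byte
};

// Hypothetical helper: compare two same-sized 8-bit pixel buffers.
inline DiffStats compare_pixels(const std::vector<uint8_t>& a,
                                const std::vector<uint8_t>& b) {
  assert(a.size() == b.size());
  DiffStats stats{0.0, 0};
  if (a.empty()) return stats;
  uint64_t sum = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    sum += static_cast<uint64_t>(d);
    if (d > stats.max_abs) stats.max_abs = d;
  }
  stats.mean_abs = static_cast<double>(sum) / static_cast<double>(a.size());
  return stats;
}
```

A small mean with a large maximum points at localized errors, while a large mean suggests a systematic problem such as a wrong channel order.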

---

## Status

**CNN v1:** Builds and runs, but produces incorrect output (all white). Use CNNEffect in the demo for visual validation.

**CNN v2:** ⚠️ Partially functional. Readback works but output differs from HTML validation tool.
- Loads binary weights from `workspaces/main/weights/cnn_v2_weights.bin`
- Matches CNNv2Effect architecture
- **Known Issue:** Visual output differs from `tools/cnn_v2_test/index.html` despite matching shader code
- Root cause under investigation; current suspects are weight indexing, texture sampling, and activation clamping
- Use HTML tool (`tools/cnn_v2_test/index.html`) for accurate validation

---

## Technical Notes (Readback Fix)

**Original Bug:** Buffer mapping returned `WGPUMapAsyncStatus_Unknown` (status=5)

**Root Cause:** Callback mode mismatch
- Used `WGPUCallbackMode_WaitAnyOnly` (fires only during `wgpuInstanceWaitAny`)
- Called `wgpuInstanceProcessEvents` in wait loop (wrong API for this mode)
- Callback never fired → timeout → empty buffer

**Fix Applied:**
1. Changed callback mode to `WGPUCallbackMode_AllowProcessEvents`
2. Replaced `wgpuInstanceProcessEvents` with `wgpuDevicePoll(device, true, nullptr)`
3. Added pre-mapping device poll to ensure copy completes
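
The failure mode is easier to see in a model that leaves WebGPU out entirely. The sketch below is a plain C++ stand-in for the blocking wait: a loop pumps events until a completion flag is set or the poll budget runs out. If the pump function never delivers the callback (as happened with the mismatched callback mode), the loop times out. `wait_until_done` and its parameters are hypothetical and do not call any WebGPU API.

```cpp
#include <functional>

// Model of a blocking wait on an async operation.
// `pump_events` stands in for whatever call is supposed to dispatch callbacks;
// `done` is the flag the completion callback would set.
// Returns true if the flag was set within `max_polls` iterations.
inline bool wait_until_done(const std::function<void()>& pump_events,
                            const bool& done, int max_polls) {
  for (int i = 0; i < max_polls && !done; ++i) {
    pump_events();
  }
  return done;
}
```

With a pump that eventually sets the flag the function returns `true`; with a pump that never does, it returns `false`, which corresponds to the timeout and empty buffer described above.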

**Relevant Code:** `src/gpu/texture_readback.cc` lines 97-110

**Reference:** WebGPU spec - Asynchronous Operations, Callback Modes

---

## Limitations

- **CNN v1:** Produces incorrect output; use for debugging only
- **Single image:** Batch processing requires shell loop
- **No real-time preview:** Offline processing only
- **PNG input:** Loaded via stb_image; JPEG, BMP, and TGA are also supported

---

## Technical Notes

**CNN v2 f16 decoding:**
- RGBA32Uint texture stores 8×f16 as 4×u32
- Custom decoder: extract u16, decode f16→f32, clamp [0,1]→u8
- Handles denormals, infinity, NaN
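
The decode steps above can be sketched as follows. This is a standard IEEE 754 binary16 decoder, not the tool's actual code. Three details are assumptions: the low 16 bits of each `u32` hold the first channel (matching WGSL `pack2x16float`), NaN maps to 0, and the `u8` conversion rounds to nearest. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode IEEE 754 binary16 bits (1 sign, 5 exponent, 10 mantissa) to float.
inline float f16_to_f32(uint16_t h) {
  const uint32_t sign = (h >> 15) & 0x1u;
  const uint32_t exponent = (h >> 10) & 0x1Fu;
  const uint32_t mantissa = h & 0x3FFu;
  float value;
  if (exponent == 0) {
    // Zero or denormal: mantissa * 2^-24
    value = std::ldexp(static_cast<float>(mantissa), -24);
  } else if (exponent == 31) {
    // Infinity (mantissa == 0) or NaN (mantissa != 0)
    value = mantissa ? std::numeric_limits<float>::quiet_NaN()
                     : std::numeric_limits<float>::infinity();
  } else {
    // Normal: (1024 + mantissa) * 2^(exponent - 15 - 10)
    value = std::ldexp(static_cast<float>(mantissa | 0x400u),
                       static_cast<int>(exponent) - 25);
  }
  return sign ? -value : value;
}

// Clamp to [0, 1] and convert to 8 bits. Assumptions: NaN maps to 0,
// and values are rounded to nearest.
inline uint8_t f32_to_u8(float v) {
  if (std::isnan(v)) return 0;
  const float clamped = std::min(std::max(v, 0.0f), 1.0f);
  return static_cast<uint8_t>(clamped * 255.0f + 0.5f);
}

// Split one u32 into two f16 values (assumed order: low half first).
inline void unpack_u32(uint32_t packed, float out[2]) {
  out[0] = f16_to_f32(static_cast<uint16_t>(packed & 0xFFFFu));
  out[1] = f16_to_f32(static_cast<uint16_t>(packed >> 16));
}
```

Applying `unpack_u32` to each of the 4 `u32` components of a texel yields the 8 channel values that the texture stores per pixel.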

**Cross-platform:**
- macOS, Linux (native WebGPU)
- Windows (mingw-w64 cross-compile)

**Size impact:**
- Debug/STRIP_ALL=OFF: compiled in
- STRIP_ALL=ON: 0 bytes (compiled out)
- FINAL_STRIP=ON: tool not built