# CNN Shader Testing Tool

Standalone tool for validating trained CNN shaders with GPU-to-CPU readback.

---

## Purpose

- Validate trained weights (`cnn_weights_generated.wgsl`) against ground truth
- Debug CNN layer behavior in isolation
- Generate test outputs for the patch-based training workflow
- Match the Python training script's inference mode (`train_cnn.py --infer`)

---

## Architecture

**Two-part implementation:**

1. **Core GPU utility:** `src/gpu/texture_readback.{h,cc}` (~150 lines)
   - Synchronous texture-to-CPU readback
   - Reusable for screenshots, validation, video export
   - Protected with STRIP_ALL (0 bytes in release builds)
2. **Standalone tool:** `tools/cnn_test.cc` (~450 lines)
   - Custom CNN inference pipeline
   - No MainSequence dependency
   - Asset-based shader loading with automatic include resolution

---

## Usage

```bash
cnn_test input.png output.png [OPTIONS]

OPTIONS:
  --blend F          Final blend amount (0.0-1.0, default: 1.0)
  --format ppm|png   Output format (default: png)
  --help             Show usage
```

**Examples:**

```bash
# Full CNN processing
./build/cnn_test input.png output.png

# 50% blend with original
./build/cnn_test input.png output.png --blend 0.5

# No CNN effect (original passthrough)
./build/cnn_test input.png output.png --blend 0.0

# PPM output format
./build/cnn_test input.png output.ppm --format ppm
```

---

## Implementation Details

### Core Readback Utility

**File:** `src/gpu/texture_readback.{h,cc}`

**Function:**

```cpp
std::vector<uint8_t> read_texture_pixels(
    WGPUInstance instance,
    WGPUDevice device,
    WGPUTexture texture,
    int width,
    int height);
```

**Features:**

- Returns BGRA8 format (4 bytes per pixel)
- Synchronous blocking operation
- Cross-platform async callback handling (Win32 vs Native API)
- Automatic staging buffer creation and cleanup

**Refactored OffscreenRenderTarget:**

```cpp
std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
#if !defined(STRIP_ALL)
  return read_texture_pixels(instance_, device_, texture_, width_, height_);
#else
  return std::vector<uint8_t>();
#endif
}
```

### CNN Processing Pipeline

**Fixed 3-layer architecture** (matches the trained CNN):

1. Layer 0: Initial convolution
2. Layer 1: Intermediate convolution
3. Layer 2: Final convolution + blend with original

**Ping-pong textures:**

- 2 intermediate render targets
- 1 original input reference (binding 4)

**Uniforms:**

- `CommonPostProcessUniforms` (binding 2): resolution, aspect_ratio, time, beat, audio_intensity
- `CNNLayerParams` (binding 3): layer_index, blend_amount

**Shader composition:**

- Uses `ShaderComposer::Get()` via `RenderPipelineBuilder`
- Automatically resolves `#include` directives
- Registers CNN snippets: activation, conv3×3, conv5×5, weights

---

## Build Integration

**CMakeLists.txt:**

1. Added `src/gpu/texture_readback.cc` to GPU_SOURCES (both sections)
2. Tool target:

```cmake
add_executable(cnn_test
  tools/cnn_test.cc
  src/tests/common/webgpu_test_fixture.cc
  src/tests/common/offscreen_render_target.cc
  ${PLATFORM_SOURCES}
  ${GEN_DEMO_CC})
target_link_libraries(cnn_test PRIVATE gpu util procedural ${DEMO_LIBS})
add_dependencies(cnn_test generate_demo_assets)
target_compile_definitions(cnn_test PRIVATE
  STB_IMAGE_IMPLEMENTATION
  STB_IMAGE_WRITE_IMPLEMENTATION)
```

**Build:**

```bash
cmake -S . -B build -DDEMO_BUILD_TOOLS=ON
cmake --build build -j4
```

---

## Validation Workflow

### 1. Ground Truth Generation

```bash
# Generate ground truth from Python
./training/train_cnn.py --infer test.png \
  --export-only training/checkpoints/checkpoint_epoch_5000.pth \
  --output ground_truth.png
```

### 2. Tool Inference

```bash
# Run the tool (always 3 layers, matching the trained CNN)
./build/cnn_test test.png tool_output.png --blend 1.0
```

### 3. Comparison

```bash
# Compare (MSE should be low)
python -c "
import numpy as np
from PIL import Image
gt = np.array(Image.open('ground_truth.png'))
out = np.array(Image.open('tool_output.png'))
mse = np.mean((gt.astype(float) - out.astype(float)) ** 2)
print(f'MSE: {mse:.4f}')
assert mse < 10.0, f'MSE too high: {mse}'
"
```

---

## Known Issues

**BUG: Black output (uninitialized input texture)**

- Tool produces all-black output (MSE 64860 vs ground truth)
- Root cause: the first intermediate texture is not initialized with the input image
- Multi-layer processing therefore starts from uninitialized data
- Fix required: copy input_texture → intermediate_textures[0] before the layer loop

---

## Limitations

- **Fixed layer count:** Cannot run partial networks (3 layers hardcoded)
- **Single image:** Batch processing requires a shell loop
- **No real-time preview:** Offline processing only
- **stb_image inputs only:** JPEG/PNG/BMP/TGA supported via stb_image

---

## Future Enhancements

- Batch processing (directory input)
- Interactive preview mode
- Per-layer weight inspection
- Checksum validation against training checkpoints
- CUDA/Metal direct backends (bypass WebGPU overhead)

---

## Technical Notes

**The number of layers is fixed by the trained CNN architecture:**

- Defined in `cnn_weights_generated.wgsl`
- Cannot meaningfully run partial networks (layer outputs have different formats/ranges)
- The tool always processes the full 3-layer stack

**Blend parameter:**

- Applied only to the final layer (layer 2)
- Intermediate layers always use blend=1.0
- `mix(input, cnn_output, blend_amount)` in the shader

**Cross-platform:**

- Tested on macOS (native WebGPU)
- Builds on Windows via mingw-w64 cross-compile
- Linux support via native WebGPU

**Size impact:**

- Debug/STRIP_ALL=OFF: ~150 lines compiled
- STRIP_ALL=ON: 0 bytes (entirely compiled out)
- FINAL_STRIP=ON: 0 bytes (tool not built)
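**Appendix: BGRA readback to RGB.** `read_texture_pixels` returns BGRA8 bytes, so callers that want RGB (e.g. for comparison against PNG ground truth) need a channel swizzle. A minimal sketch in NumPy; the function name `bgra_bytes_to_rgb` is illustrative, not part of the tool:

```python
import numpy as np

def bgra_bytes_to_rgb(pixels: bytes, width: int, height: int) -> np.ndarray:
    """Convert a flat BGRA8 buffer (the layout returned by
    read_texture_pixels, 4 bytes per pixel) into an (H, W, 3)
    RGB array by dropping alpha and swapping the B and R channels."""
    bgra = np.frombuffer(pixels, dtype=np.uint8).reshape(height, width, 4)
    return bgra[..., [2, 1, 0]]  # B,G,R,A -> R,G,B

# One blue-ish pixel: B=255, G=128, R=64, A=255
rgb = bgra_bytes_to_rgb(bytes([255, 128, 64, 255]), 1, 1)
print(rgb[0, 0].tolist())  # [64, 128, 255]
```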
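**Appendix: blend semantics.** The `--blend` parameter maps to `mix(input, cnn_output, blend_amount)` on the final layer only. A CPU reference of that formula, useful for sanity-checking tool output at intermediate blend values (a sketch, not the shader itself):

```python
import numpy as np

def final_blend(original: np.ndarray, cnn_output: np.ndarray,
                blend: float) -> np.ndarray:
    """CPU reference for the final-layer blend:
    mix(input, cnn_output, blend_amount).
    blend=0.0 -> original passthrough; blend=1.0 -> full CNN output."""
    orig = original.astype(np.float32)
    out = cnn_output.astype(np.float32)
    return ((1.0 - blend) * orig + blend * out).round().astype(np.uint8)

original = np.full((2, 2, 3), 100, dtype=np.uint8)
cnn_out = np.full((2, 2, 3), 200, dtype=np.uint8)
blended = final_blend(original, cnn_out, 0.5)
print(blended[0, 0].tolist())  # [150, 150, 150]
```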
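**Appendix: batch processing.** Since the tool handles one image at a time, batch processing needs a loop around it. A hedged sketch that builds one `cnn_test` invocation per PNG in a directory; the helper name `batch_commands` is hypothetical, and the tool path is assumed to be `./build/cnn_test` as in the usage examples:

```python
from pathlib import Path

def batch_commands(input_dir: str, output_dir: str, blend: float = 1.0,
                   tool: str = "./build/cnn_test"):
    """Build one cnn_test command per .png in input_dir, writing each
    result to output_dir under the same filename."""
    out = Path(output_dir)
    return [
        [tool, str(p), str(out / p.name), "--blend", str(blend)]
        for p in sorted(Path(input_dir).glob("*.png"))
    ]
```

Each command list can then be executed with `subprocess.run(cmd, check=True)`.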