# CNN v2: Parametric Static Features

**Technical Design Document**

---

## Overview

CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality.

**Key improvements over v1:**

- 7D static feature input (vs v1's 4D RGBD input)
- Multi-frequency position encoding (NeRF-style)
- Per-layer configurable kernel sizes (1×1, 3×3, 5×5)
- Variable channel counts per layer
- Float16 weight storage (~3.2 KB for a 3-layer model)
- Bias integrated as a static feature dimension
- Storage buffer architecture (dynamic layer count)
- Binary weight format for runtime loading

**Status:** ✅ Complete. Training pipeline functional, validation tools ready.

**TODO:** 8-bit quantization with QAT for a 2× size reduction (~1.6 KB)

---

## Architecture

### Pipeline Overview

```
Input RGBD → Static Features Compute → CNN Layers → Output RGBA
             └─ computed once/frame ─┘  └─ multi-pass ─┘
```

**Detailed Data Flow:**

```
┌──────────────────────────────────────────────┐
│ Static Features (computed once)              │
│ 8D: p0, p1, p2, p3, uv_x, uv_y, sin10x, bias │
└──────────────────────┬───────────────────────┘
                       │
                       │ 8D (broadcast to all layers)
                       ▼
Input RGBD (4D) ──► ┌────────────┐
                    │  Layer 0   │  12D input = 4D (input RGBD) + 8D (static)
                    │  (CNN)     │
                    │  12D → 4D  │
                    └─────┬──────┘
                          │ 4D output
                          ▼
                    ┌────────────┐
                    │  Layer 1   │  12D input = 4D (previous) + 8D (static)
                    │  (CNN)     │
                    │  12D → 4D  │
                    └─────┬──────┘
                          │ 4D output
                          ▼
                         ...
                          ▼
                    ┌────────────┐
                    │  Layer N   │  12D input = 4D (previous) + 8D (static)
                    │  (output)  │
                    │  12D → 4D  │
                    └─────┬──────┘
                          │ 4D (RGBA)
                          ▼
                        Output
```

**Key Points:**

- Static features are computed once and broadcast to all CNN layers
- Each layer: previous 4D output + 8D static → 12D input → 4D output
- Ping-pong buffering between layers
- Layer 0 special case: uses the input RGBD instead of a previous layer output

**Static Features Texture:**

- Name: `static_features`
- Format: `texture_storage_2d<rgba32uint, write>` (4×u32)
- Data: 8 float16 values packed via `pack2x16float()`
- Computed once per frame, read by all CNN layers
- Lifetime: entire frame (all CNN layer passes)

**CNN Layers:**

- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels
- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels
- All layers: uniform 12D input, 4D output (ping-pong buffer)
- Storage: `texture_storage_2d<rgba32uint, write>` (4 channels as 2×f16 pairs)
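For reference, the texel layout written by the static-features pass can be reproduced on the CPU. The sketch below is a NumPy illustration of the `pack2x16float()` packing (8 float16 values into the 4 u32 components of one texel); `pack_static_texel` is a hypothetical helper name used only for illustration, not part of the codebase.

```python
import numpy as np

def pack_static_texel(features):
    """Pack 8 feature values into 4 u32s, mirroring WGSL pack2x16float().

    features: 8 floats in slot order [p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]
    """
    f16 = np.asarray(features, dtype=np.float16)
    assert f16.shape == (8,)
    bits = f16.view(np.uint16).astype(np.uint32)  # raw f16 bit patterns
    # pack2x16float(vec2(a, b)) puts `a` in the low 16 bits and `b` in the high 16 bits
    return bits[0::2] | (bits[1::2] << 16)        # 4 × u32

# Example: pixel at uv = (0.25, 0.5) with parametric features 0.1..0.4
texel = pack_static_texel([0.1, 0.2, 0.3, 0.4, 0.25, 0.5, np.sin(10 * 0.25), 1.0])
print([hex(v) for v in texel])
```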
---

## Static Features (7D + 1 bias)

### Feature Layout

**8 float16 values per pixel:**

```wgsl
// Slot 0-3: Parametric features (p0, p1, p2, p3)
// Can be: mip1/2 RGBD, grayscale, gradients, etc.
// Distinct from the input image RGBD (which is fed only to Layer 0)
let p0 = ...;  // Parametric feature 0 (e.g., mip1.r or grayscale)
let p1 = ...;  // Parametric feature 1
let p2 = ...;  // Parametric feature 2
let p3 = ...;  // Parametric feature 3

// Slot 4-5: UV coordinates (normalized screen space)
let uv_x = coord.x / resolution.x;  // Horizontal position [0,1]
let uv_y = coord.y / resolution.y;  // Vertical position [0,1]

// Slot 6: Multi-frequency position encoding
let sin10_x = sin(10.0 * uv_x);     // Periodic feature (frequency = 10)

// Slot 7: Bias dimension (always 1.0)
let bias = 1.0;                     // Learned bias per output channel

// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(10*uv.x), 1.0]
```

### Feature Rationale

| Feature       | Dimension | Purpose                                                | Priority  |
|---------------|-----------|--------------------------------------------------------|-----------|
| p0-p3         | 4D        | Parametric auxiliary features (mips, gradients, etc.)  | Essential |
| UV coords     | 2D        | Spatial position awareness                             | Essential |
| sin(10\*uv.x) | 1D        | Periodic position encoding                             | Medium    |
| Bias          | 1D        | Learned bias (standard NN)                             | Essential |

**Note:** The input image RGBD (mip 0) is fed only to Layer 0. Subsequent layers see the static features + the previous layer's output.

**Why bias as a static feature:**

- Simpler shader code (single weight array)
- Standard NN formulation: y = Wx (x includes the bias term)
- Saves 56-112 bytes (no separate bias buffer)
- 7 features are sufficient for the initial implementation

### Future Feature Extensions

**Option: Replace sin(10\*uv.x) with:**

- `sin(20*uv.x)` - Higher-frequency encoding
- `gray_mip1` - Multi-scale luminance
- `dx`, `dy` - Sobel gradients
- `variance` - Local texture measure
- `laplacian` - Edge detection

**Option: uint8 packing (16+ features):**

```wgsl
// A texture_storage_2d<rgba32uint> texel holds 16 uint8 values (4 per u32)
// Trade precision for feature count
// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
//  sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias]
```

Requires quantization-aware training.

---

## Layer Structure

### Example 3-Layer Network

```
Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel)
Layer 1: previous (4D)   + static (8D) = 12D → 4 channels (3×3 kernel)
Layer 2: previous (4D)   + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA)
```

**Output:** 4 channels (RGBA). Training targets preserve the alpha channel of the target images.
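The weight counts in the next section follow directly from `in_channels × kernel² × out_channels` per layer. The small helper below (hypothetical, not part of the training scripts) reproduces that arithmetic for arbitrary per-layer kernel sizes:

```python
def cnn_v2_weight_budget(kernel_sizes, in_channels=12, out_channels=4):
    """Return (per-layer weight counts, total weights, f16 bytes) for a bias-free stack."""
    per_layer = [in_channels * k * k * out_channels for k in kernel_sizes]
    total = sum(per_layer)
    return per_layer, total, total * 2  # f16 = 2 bytes per weight

print(cnn_v2_weight_budget([3, 3, 3]))  # ([432, 432, 432], 1296, 2592)
print(cnn_v2_weight_budget([1, 3, 5]))  # ([48, 432, 1200], 1680, 3360)
```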
### Weight Calculations

**Per-layer weights (uniform 12D→4D, 3×3 kernels):**

```
Layer 0: 12 × 3 × 3 × 4 = 432 weights
Layer 1: 12 × 3 × 3 × 4 = 432 weights
Layer 2: 12 × 3 × 3 × 4 = 432 weights
Total:   1296 weights
```

**Storage sizes:**

- f32: 1296 × 4 = 5,184 bytes (~5.1 KB)
- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended**

**Comparison to v1:**

- v1: ~800 weights (3.2 KB f32)
- v2: ~1296 weights (2.5 KB f16)
- **Uniform architecture, smaller than v1 f32**

### Kernel Size Guidelines

**1×1 kernel (pointwise):**

- No spatial context, channel mixing only
- Weights: `12 × 4 = 48` per layer
- Use for: fast inference, channel remapping

**3×3 kernel (standard conv):**

- Local spatial context (recommended)
- Weights: `12 × 9 × 4 = 432` per layer
- Use for: most layers (balanced quality/size)

**5×5 kernel (large receptive field):**

- Wide spatial context
- Weights: `12 × 25 × 4 = 1200` per layer
- Use for: output layer, fine detail enhancement

### Channel Storage (4×f16 per texel)

```wgsl
@group(0) @binding(1) var layer_input: texture_2d<u32>;

fn unpack_channels(coord: vec2<i32>) -> vec4<f32> {
    let packed = textureLoad(layer_input, coord, 0);
    let v0 = unpack2x16float(packed.x);  // [ch0, ch1]
    let v1 = unpack2x16float(packed.y);  // [ch2, ch3]
    return vec4<f32>(v0.x, v0.y, v1.x, v1.y);
}

fn pack_channels(values: vec4<f32>) -> vec4<u32> {
    return vec4<u32>(
        pack2x16float(vec2<f32>(values.x, values.y)),
        pack2x16float(vec2<f32>(values.z, values.w)),
        0u,  // Unused
        0u   // Unused
    );
}
```

---

## Training Workflow

### Script: `training/train_cnn_v2.py`

**Static Feature Extraction:**

```python
import numpy as np

def compute_static_features(rgb, depth):
    """Generate parametric features (8D: p0-p3 + spatial)."""
    h, w = rgb.shape[:2]

    # Parametric features (example: use the input RGBD, but could be mips/gradients)
    p0, p1, p2, p3 = rgb[..., 0], rgb[..., 1], rgb[..., 2], depth

    # UV coordinates (normalized)
    uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0)
    uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1)

    # Multi-frequency position encoding
    sin10_x = np.sin(10.0 * uv_x)

    # Bias dimension (always 1.0)
    bias = np.ones_like(p0)

    # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin10_x, bias]
    return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias], axis=-1)
```

**Network Definition:**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNv2(nn.Module):
    def __init__(self, kernel_sizes, num_layers=3):
        super().__init__()
        if isinstance(kernel_sizes, int):
            kernel_sizes = [kernel_sizes] * num_layers
        self.kernel_sizes = kernel_sizes
        self.layers = nn.ModuleList()

        # All layers: 12D input (4 prev + 8 static) → 4D output
        for kernel_size in kernel_sizes:
            self.layers.append(
                nn.Conv2d(12, 4, kernel_size=kernel_size,
                          padding=kernel_size // 2, bias=False)
            )

    def forward(self, input_rgbd, static_features):
        # Layer 0: input RGBD (4D) + static (8D) = 12D
        x = torch.cat([input_rgbd, static_features], dim=1)
        x = self.layers[0](x)
        x = torch.clamp(x, 0, 1)  # Layer 0 output (4 channels)

        # Layer 1+: previous output (4D) + static (8D) = 12D
        for i in range(1, len(self.layers)):
            x_input = torch.cat([x, static_features], dim=1)
            x = self.layers[i](x_input)
            if i < len(self.layers) - 1:
                x = F.relu(x)
            else:
                x = torch.clamp(x, 0, 1)  # Final output [0,1]
        return x  # RGBA output
```
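A quick shape check of the model above (a throwaway sketch, not part of `train_cnn_v2.py`) confirms the 12D concatenation, the RGBA output, and the 1296-weight total for `[3, 3, 3]` kernels:

```python
import torch

model = CNNv2(kernel_sizes=[3, 3, 3], num_layers=3)

# One 64×64 training patch: RGBD input (4 channels) and the 8D static features
input_rgbd = torch.rand(1, 4, 64, 64)
static_features = torch.rand(1, 8, 64, 64)

with torch.no_grad():
    out = model(input_rgbd, static_features)

print(out.shape)                                    # torch.Size([1, 4, 64, 64]) — RGBA
print(sum(p.numel() for p in model.parameters()))   # 1296 weights for [3, 3, 3]
```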
**Training Configuration:**

```python
# Hyperparameters
kernel_sizes = [3, 3, 3]   # Per-layer kernel sizes (e.g., [1, 3, 5])
num_layers = 3             # Number of CNN layers
learning_rate = 1e-3
batch_size = 16
epochs = 5000

# Dataset: input RGB + depth, target RGBA (the alpha channel of the target
# images is preserved). The model outputs RGBA; the loss compares all 4 channels.

# Model, loss, and optimizer (the loss/optimizer choices here are examples)
model = CNNv2(kernel_sizes=kernel_sizes, num_layers=num_layers)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop (standard PyTorch, f32)
for epoch in range(epochs):
    for rgb_batch, depth_batch, target_batch in dataloader:
        # Compute static features (8D)
        static_feat = compute_static_features(rgb_batch, depth_batch)

        # Input RGBD (4D)
        input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1)

        # Forward pass
        output = model(input_rgbd, static_feat)
        loss = criterion(output, target_batch)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

**Checkpoint Format:**

```python
torch.save({
    'state_dict': model.state_dict(),  # f32 weights
    'config': {
        'kernel_sizes': [3, 3, 3],     # Per-layer kernel sizes
        'num_layers': 3,
        'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin10_x', 'bias']
    },
    'epoch': epoch,
    'loss': loss.item()
}, f'checkpoints/checkpoint_epoch_{epoch}.pth')
```

---

## Export Workflow

### Script: `training/export_cnn_v2_shader.py`

**Process:**

1. Load the checkpoint (f32 PyTorch weights)
2. Extract layer configs (kernels, channels)
3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)`
4. Generate a WGSL shader per layer
5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl`

**Example Generated Shader:**

```wgsl
// cnn_v2_layer_1.wgsl - Auto-generated from checkpoint_epoch_5000.pth
const KERNEL_SIZE: u32 = 1u;
const IN_CHANNELS: u32 = 12u;  // 4 previous-layer channels + 8 static (7 features + bias)
const OUT_CHANNELS: u32 = 4u;

// Weights quantized to float16 (stored as f32 literals in the shader)
const weights: array<f32, 48> = array<f32, 48>(
    0.123047, -0.089844, 0.234375, 0.456055,
    ...
);

@group(0) @binding(0) var static_features: texture_2d<u32>;
@group(0) @binding(1) var layer_input: texture_2d<u32>;
@group(0) @binding(2) var output_texture: texture_storage_2d<rgba32uint, write>;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let coord = vec2<i32>(id.xy);

    // Gather the 12D input: previous layer output (4D) + static features (8D)
    var input_vec: array<f32, IN_CHANNELS>;
    let prev = unpack_channels(coord);
    input_vec[0] = prev.x;
    input_vec[1] = prev.y;
    input_vec[2] = prev.z;
    input_vec[3] = prev.w;
    let static_feat = get_static_features(coord);
    for (var k: u32 = 0u; k < 8u; k++) {
        input_vec[4u + k] = static_feat[k];
    }

    // Convolution (1×1 kernel = pointwise)
    var output: array<f32, OUT_CHANNELS>;
    for (var c: u32 = 0u; c < OUT_CHANNELS; c++) {
        var sum: f32 = 0.0;
        for (var k: u32 = 0u; k < IN_CHANNELS; k++) {
            sum += weights[c * IN_CHANNELS + k] * input_vec[k];
        }
        output[c] = max(0.0, sum);  // ReLU activation
    }

    // Pack the 4 output channels (2×f16 per u32) and store
    textureStore(output_texture, coord,
                 pack_channels(vec4<f32>(output[0], output[1], output[2], output[3])));
}
```

**Float16 Quantization:**

- Training uses f32 throughout (PyTorch standard)
- Export converts to np.float16, then back to f32 for WGSL literals
- **Expected discrepancy:** <0.1% MSE (acceptable)
- Validation via the HTML tool (see below)
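The quoted <0.1% figure can be spot-checked per checkpoint before export. A minimal sketch, assuming the checkpoint format shown above (the `1e-3` threshold here is illustrative):

```python
import numpy as np
import torch

ckpt = torch.load('checkpoints/checkpoint_epoch_5000.pth', map_location='cpu')

for name, tensor in ckpt['state_dict'].items():
    w32 = tensor.numpy().astype(np.float32)
    w16 = w32.astype(np.float16).astype(np.float32)   # f32 → f16 → f32 round trip
    rel_mse = np.mean((w32 - w16) ** 2) / (np.mean(w32 ** 2) + 1e-12)
    print(f'{name}: relative MSE {rel_mse:.2e}')
    assert rel_mse < 1e-3, f'f16 quantization error too large for {name}'
```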
---

## Validation Workflow

### HTML Tool: `tools/cnn_v2_test/index.html`

**WebGPU-based testing tool** with layer visualization.

**Usage:**

1. Open `tools/cnn_v2_test/index.html` in a browser
2. Drop a `.bin` weights file (from `export_cnn_v2_weights.py`)
3. Drop a PNG test image
4. View the results with layer inspection

**Features:**

- Live CNN inference with WebGPU
- Layer-by-layer visualization (static features + all CNN layers)
- Weight visualization (per-layer kernels)
- View modes: CNN output, original, diff (×10)
- Blend control for comparing with the original

**Export weights:**

```bash
./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \
    --output-weights workspaces/main/cnn_v2_weights.bin
```

See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation.
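For a CPU-side reference image to hold against the tool's diff (×10) view, the checkpoint can also be run directly in PyTorch. A minimal sketch, assuming the `CNNv2` class and `compute_static_features()` from `training/train_cnn_v2.py` (shown above) are importable, Pillow is installed, and `test.png` stands in for any test image; reusing the image's alpha as the depth channel is an assumption for images without a depth map:

```python
import numpy as np
import torch
from PIL import Image

def to_nchw(a):
    """HWC numpy array → 1×C×H×W float32 tensor."""
    return torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float()

img = np.asarray(Image.open('test.png').convert('RGBA'), dtype=np.float32) / 255.0
rgb, alpha = img[..., :3], img[..., 3]           # alpha reused as the depth channel
static = compute_static_features(rgb, alpha)     # H×W×8

ckpt = torch.load('checkpoints/checkpoint_epoch_100.pth', map_location='cpu')
cfg = ckpt['config']
model = CNNv2(kernel_sizes=cfg['kernel_sizes'], num_layers=cfg['num_layers'])
model.load_state_dict(ckpt['state_dict'])
model.eval()

with torch.no_grad():
    out = model(to_nchw(img), to_nchw(static))   # input RGBD (4D) + static (8D)

ref = (out[0].permute(1, 2, 0).numpy().clip(0, 1) * 255).astype(np.uint8)
Image.fromarray(ref, 'RGBA').save('reference_output.png')  # compare against the web tool
```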
---

## Implementation Checklist

### Phase 1: Shaders (Core Infrastructure)

- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute
  - [ ] RGBD sampling from the framebuffer
  - [ ] UV coordinate calculation
  - [ ] sin(10\*uv.x) computation
  - [ ] Bias dimension (constant 1.0)
  - [ ] Float16 packing via `pack2x16float()`
  - [ ] Output to `texture_storage_2d`
- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template
  - [ ] Static features unpacking (8×f16)
  - [ ] Previous layer unpacking (4×f16)
  - [ ] Convolution implementation (1×1, 3×3, 5×5)
  - [ ] ReLU activation
  - [ ] Output packing (4×f16)
  - [ ] Proper padding handling

### Phase 2: C++ Effect Class

- [ ] `src/gpu/effects/cnn_v2_effect.h` - Header
  - [ ] Class declaration inheriting from `PostProcessEffect`
  - [ ] Static features texture member
  - [ ] Layer textures vector
  - [ ] Pipeline and bind group members
- [ ] `src/gpu/effects/cnn_v2_effect.cc` - Implementation
  - [ ] Constructor: load shaders, create textures
  - [ ] `init()`: create pipelines, bind groups
  - [ ] `render()`: multi-pass execution
    - [ ] Pass 0: compute static features
    - [ ] Pass 1-N: CNN layers
    - [ ] Final: composite to output
  - [ ] Proper resource cleanup
- [ ] Integration
  - [ ] Add to `src/gpu/demo_effects.h` includes
  - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal)
  - [ ] Add shaders to `workspaces/main/assets.txt`
  - [ ] Add to `src/tests/gpu/test_demo_effects.cc`

### Phase 3: Training Pipeline

- [ ] `training/train_cnn_v2.py` - Training script
  - [ ] Static feature extraction function
  - [ ] CNNv2 PyTorch model class
  - [ ] Patch-based dataloader
  - [ ] Training loop with checkpointing
  - [ ] Command-line argument parsing
  - [ ] Inference mode (ground-truth generation)
- [ ] `training/export_cnn_v2_shader.py` - Export script
  - [ ] Checkpoint loading
  - [ ] Weight extraction and f16 quantization
  - [ ] Per-layer WGSL generation
  - [ ] File output to workspace shaders/
  - [ ] Metadata preservation

### Phase 4: Tools & Validation

- [x] HTML validation tool - WebGPU inference with layer visualization
- [ ] Command-line argument parsing
- [ ] Shader export orchestration
- [ ] Build orchestration
- [ ] Batch image processing
- [ ] Results display
- [ ] `src/tools/cnn_test_main.cc` - Tool updates
  - [ ] Add `--cnn-version v2` flag
  - [ ] CNNv2Effect instantiation path
  - [ ] Static features pass execution
  - [ ] Multi-layer processing

### Phase 5: Documentation

- [ ] `doc/HOWTO.md` - Usage guide
  - [ ] Training section (CNN v2)
  - [ ] Export section
  - [ ] Validation section
  - [ ] Examples
- [ ] `README.md` - Project overview update
  - [ ] Mention CNN v2 capability

---

## File Structure

### New Files

```
# Shaders (generated by export script)
workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl    # Static features compute
workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl   # Input layer (generated)
workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl   # Inner layer (generated)
workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl   # Output layer (generated)

# C++ implementation
src/gpu/effects/cnn_v2_effect.h      # Effect class header
src/gpu/effects/cnn_v2_effect.cc     # Effect implementation

# Python training/export
training/train_cnn_v2.py             # Training script
training/export_cnn_v2_shader.py     # Shader generator
training/validation/                 # Test images directory

# Validation
tools/cnn_v2_test/index.html         # WebGPU validation tool

# Documentation
doc/CNN_V2.md                        # This file
```

### Modified Files

```
src/gpu/demo_effects.h               # Add CNNv2Effect include
CMakeLists.txt                       # Add cnn_v2_effect.cc
workspaces/main/assets.txt           # Add cnn_v2 shaders
workspaces/main/timeline.seq         # Optional: add CNNv2Effect
src/tests/gpu/test_demo_effects.cc   # Add CNNv2 test case
src/tools/cnn_test_main.cc           # Add --cnn-version v2
doc/HOWTO.md                         # Add CNN v2 sections
TODO.md                              # Add CNN v2 task
```

### Unchanged (v1 Preserved)

```
training/train_cnn.py                # Original training
src/gpu/effects/cnn_effect.*         # Original effect
workspaces/main/shaders/cnn_*.wgsl   # Original v1 shaders
```

---

## Performance Characteristics

### Static Features Compute

- **Cost:** ~0.1 ms @ 1080p
- **Frequency:** once per frame
- **Operations:** sin(), texture sampling, packing

### CNN Layers (example 3-layer configuration)

Estimates for a wider example configuration (kernels [1,3,5], variable per-layer channel counts); the uniform 12D→4D network above is cheaper.

- **Layer 0 (1×1, 8→16):** ~0.3 ms
- **Layer 1 (3×3, 23→8):** ~0.8 ms
- **Layer 2 (5×5, 15→4):** ~1.2 ms
- **Total:** ~2.3 ms @ 1080p

### Memory Usage

- Static features: 1920×1080×8×2 = 33 MB (f16)
- Layer buffers: 1920×1080×16×2 = 66 MB (worst case, 16 channels; uniform 4-channel layers need ~17 MB each)
- Weights: ~6.4 KB (f16, in shader code)
- **Total GPU memory:** ~100 MB

---

## Size Budget

### CNN v1 vs v2

| Metric          | v1         | v2         | Delta      |
|-----------------|------------|------------|------------|
| Weights (count) | 800        | 3268       | +2468      |
| Storage (f32)   | 3.2 KB     | 13.1 KB    | +9.9 KB    |
| Storage (f16)   | N/A        | 6.5 KB     | +6.5 KB    |
| Shader code     | ~500 lines | ~800 lines | +300 lines |

The v2 figures budget for a wider example configuration (variable channel counts, kernels [1,3,5]); the uniform 12D→4D 3-layer example above needs only ~1296 weights (~2.5 KB f16).

### Mitigation Strategies

**Reduce channels:**

- [16,8,4] → [8,4,4] saves ~50% of the weights
- [16,8,4] → [4,4,4] saves ~60% of the weights

**Smaller kernels:**

- [1,3,5] → [1,3,3] saves ~30% of the weights
- [1,3,5] → [1,1,3] saves ~50% of the weights

**Quantization:**

- int8 weights: saves 75% (requires QAT training)
- 4-bit weights: saves 87.5% (extreme, needs research)

**Target:** keep CNN v2 under 10 KB for the 64k demo constraint
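The int8 row above assumes a simple symmetric quantization scheme. The sketch below is illustrative only (`quantize_int8` is not an existing script, and real int8 export would use QAT as noted); it shows the storage math and round-trip error on a weight-sized random array:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-layer int8 quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1296).astype(np.float32)   # 3-layer uniform example: 1296 weights

q, scale = quantize_int8(w)
dequant = q.astype(np.float32) * scale
print('int8 size:', q.nbytes + 4, 'bytes (vs', w.astype(np.float16).nbytes, 'bytes f16)')
print('relative MSE:', np.mean((w - dequant) ** 2) / np.mean(w ** 2))
```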
---

## Future Extensions

### More Features (uint8 Packing)

```wgsl
// 16 uint8 features per texel (texture_storage_2d<rgba32uint>, 4 per u32)
// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
//  sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias]
```

- Trade precision for quantity
- Requires quantization-aware training

### Temporal Features

- Previous frame RGBA (motion awareness)
- Optical flow vectors
- Requires a multi-frame buffer

### Learned Position Encodings

- Replace the hand-crafted sin(10\*uv) with learned embeddings
- Requires a separate embedding network
- Similar to NeRF position encoding

### Dynamic Architecture

- Runtime kernel size selection based on the scene
- Conditional layer execution (skip connections)
- Layer pruning for performance

---

## References

- **v1 Implementation:** `src/gpu/effects/cnn_effect.*`
- **Training Guide:** `doc/HOWTO.md` (CNN Training section)
- **Test Tool:** `doc/CNN_TEST_TOOL.md`
- **Shader System:** `doc/SEQUENCE.md`
- **Size Measurement:** `doc/SIZE_MEASUREMENT.md`

---

## Appendix: Design Decisions

### Why Bias as a Static Feature?

**Alternatives considered:**

1. Separate bias array per layer (Option B)
2. Bias as a static feature = 1.0 (Option A, chosen)

**Decision rationale:**

- Simpler shader code (fewer bindings)
- Standard NN formulation (augmented input)
- Saves 56-112 bytes per model
- 7 features are sufficient for the initial implementation
- Can extend to uint8 packing if more than 7 features are needed

### Why Float16 for Weights?

**Alternatives considered:**

1. Keep f32 (larger, more accurate)
2. Use f16 (smaller, GPU-native)
3. Use int8 (smallest, needs QAT)

**Decision rationale:**

- f16 saves 50% vs f32 (critical for the 64k target)
- GPU-native support (pack2x16float in WGSL)
- <0.1% accuracy loss (acceptable)
- Simpler than int8 quantization

### Why Multi-Frequency Position Encoding?

**Inspiration:** NeRF (Neural Radiance Fields)

**Benefits:**

- Helps the network learn high-frequency details
- Better than raw UV coordinates
- Small footprint (1D per frequency)

**Future:** add sin(20\*uv), sin(40\*uv) if more than 7 feature slots become available

---

## Related Documentation

- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format)
- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization
- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated)
- `doc/HOWTO.md` - Training and validation workflows

---

**Document Version:** 1.0
**Last Updated:** 2026-02-12
**Status:** Design approved, ready for implementation