# CNN v2: Parametric Static Features **Technical Design Document** --- ## Overview CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality. **Key improvements over v1:** - 7D static feature input (vs 4D RGB) - Multi-frequency position encoding (NeRF-style) - Per-layer configurable kernel sizes (1×1, 3×3, 5×5) - Variable channel counts per layer - Float16 weight storage (~3.2 KB for 3-layer model) - Bias integrated as static feature dimension - Storage buffer architecture (dynamic layer count) - Binary weight format for runtime loading **Status:** ✅ Complete. Training pipeline functional, validation tools ready. **TODO:** 8-bit quantization with QAT for 2× size reduction (~1.6 KB) --- ## Architecture ### Pipeline Overview ``` Input RGBD → Static Features Compute → CNN Layers → Output RGBA └─ computed once/frame ─┘ └─ multi-pass ─┘ ``` **Static Features Texture:** - Name: `static_features` - Format: `texture_storage_2d` (4×u32) - Data: 8 float16 values packed via `pack2x16float()` - Computed once per frame, read by all CNN layers - Lifetime: Entire frame (all CNN layer passes) **CNN Layers:** - Input Layer: 7D static features → C₀ channels - Inner Layers: (7D + Cᵢ₋₁) → Cᵢ channels - Output Layer: (7D + Cₙ) → 4D RGBA - Storage: `texture_storage_2d` (8×f16 per texel recommended) --- ## Static Features (7D + 1 bias) ### Feature Layout **8 float16 values per pixel:** ```wgsl // Slot 0-3: RGBD (core pixel data) let r = rgba.r; // Red channel let g = rgba.g; // Green channel let b = rgba.b; // Blue channel let d = depth; // Depth value // Slot 4-5: UV coordinates (normalized screen space) let uv_x = coord.x / resolution.x; // Horizontal position [0,1] let uv_y = coord.y / resolution.y; // Vertical position [0,1] // Slot 6: Multi-frequency position encoding let sin10_x = sin(10.0 * uv_x); // Periodic feature (frequency=10) // Slot 7: Bias dimension (always 1.0) let bias = 1.0; // Learned bias per output channel // Packed storage: [R, G, B, D, uv.x, uv.y, sin(10*uv.x), 1.0] ``` ### Feature Rationale | Feature | Dimension | Purpose | Priority | |---------|-----------|---------|----------| | RGBD | 4D | Core pixel information | Essential | | UV coords | 2D | Spatial position awareness | Essential | | sin(10\*uv.x) | 1D | Periodic position encoding | Medium | | Bias | 1D | Learned bias (standard NN) | Essential | **Why bias as static feature:** - Simpler shader code (single weight array) - Standard NN formulation: y = Wx (x includes bias term) - Saves 56-112 bytes (no separate bias buffer) - 7 features sufficient for initial implementation ### Future Feature Extensions **Option: Replace sin(10\*uv.x) with:** - `sin(20*uv.x)` - Higher frequency encoding - `gray_mip1` - Multi-scale luminance - `dx`, `dy` - Sobel gradients - `variance` - Local texture measure - `laplacian` - Edge detection **Option: uint8 packing (16+ features):** ```wgsl // texture_storage_2d stores 16 uint8 values // Trade precision for feature count // [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, // sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias] ``` Requires quantization-aware training. --- ## Layer Structure ### Example 3-Layer Network ``` Input: 7D static → 16 channels (1×1 kernel, pointwise) Layer1: (7+16)D → 8 channels (3×3 kernel, spatial) Layer2: (7+8)D → 4 channels (5×5 kernel, large receptive field) ``` ### Weight Calculations **Per-layer weights:** ``` Input: 7 × 1 × 1 × 16 = 112 weights Layer1: (7+16) × 3 × 3 × 8 = 1656 weights Layer2: (7+8) × 5 × 5 × 4 = 1500 weights Total: 3268 weights ``` **Storage sizes:** - f32: 3268 × 4 = 13,072 bytes (~12.8 KB) - f16: 3268 × 2 = 6,536 bytes (~6.4 KB) ✓ **recommended** **Comparison to v1:** - v1: ~800 weights (3.2 KB f32) - v2: ~3268 weights (6.4 KB f16) - **Growth: 2× size for parametric features** ### Kernel Size Guidelines **1×1 kernel (pointwise):** - No spatial context, channel mixing only - Weights: `(7 + C_in) × C_out` - Use for: Input layer, bottleneck layers **3×3 kernel (standard conv):** - Local spatial context - Weights: `(7 + C_in) × 9 × C_out` - Use for: Most inner layers **5×5 kernel (large receptive field):** - Wide spatial context - Weights: `(7 + C_in) × 25 × C_out` - Use for: Output layer, detail enhancement ### Channel Storage (8×f16 per texel) ```wgsl @group(0) @binding(1) var layer_input: texture_2d; fn unpack_channels(coord: vec2) -> array { let packed = textureLoad(layer_input, coord, 0); return array( unpack2x16float(packed.x).x, unpack2x16float(packed.x).y, unpack2x16float(packed.y).x, unpack2x16float(packed.y).y, unpack2x16float(packed.z).x, unpack2x16float(packed.z).y, unpack2x16float(packed.w).x, unpack2x16float(packed.w).y ); } fn pack_channels(values: array) -> vec4 { return vec4( pack2x16float(vec2(values[0], values[1])), pack2x16float(vec2(values[2], values[3])), pack2x16float(vec2(values[4], values[5])), pack2x16float(vec2(values[6], values[7])) ); } ``` --- ## Training Workflow ### Script: `training/train_cnn_v2.py` **Static Feature Extraction:** ```python def compute_static_features(rgb, depth): """Generate 7D static features + bias dimension.""" h, w = rgb.shape[:2] # RGBD channels r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2] # UV coordinates (normalized) uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0) uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1) # Multi-frequency position encoding sin10_x = np.sin(10.0 * uv_x) # Bias dimension (always 1.0) bias = np.ones_like(r) # Stack: [R, G, B, D, uv.x, uv.y, sin10_x, bias] return np.stack([r, g, b, depth, uv_x, uv_y, sin10_x, bias], axis=-1) ``` **Network Definition:** ```python class CNNv2(nn.Module): def __init__(self, kernels=[1,3,5], channels=[16,8,4]): super().__init__() # Input layer: 8D (7 features + bias) → channels[0] self.layer0 = nn.Conv2d(8, channels[0], kernel_size=kernels[0], padding=kernels[0]//2, bias=False) # Inner layers: (7 features + bias + C_prev) → C_next in_ch_1 = 8 + channels[0] # static + layer0 output self.layer1 = nn.Conv2d(in_ch_1, channels[1], kernel_size=kernels[1], padding=kernels[1]//2, bias=False) # Output layer: (7 features + bias + C_last) → 4 (RGBA) in_ch_2 = 8 + channels[1] self.layer2 = nn.Conv2d(in_ch_2, 4, kernel_size=kernels[2], padding=kernels[2]//2, bias=False) def forward(self, static_features, layer0_input=None): # Layer 0: Use full 8D static features (includes bias) x0 = self.layer0(static_features) x0 = F.relu(x0) # Layer 1: Concatenate static + layer0 output x1_input = torch.cat([static_features, x0], dim=1) x1 = self.layer1(x1_input) x1 = F.relu(x1) # Layer 2: Concatenate static + layer1 output x2_input = torch.cat([static_features, x1], dim=1) output = self.layer2(x2_input) return torch.sigmoid(output) # RGBA output [0,1] ``` **Training Configuration:** ```python # Hyperparameters kernels = [1, 3, 5] # Per-layer kernel sizes channels = [16, 8, 4] # Per-layer output channels learning_rate = 1e-3 batch_size = 16 epochs = 5000 # Training loop (standard PyTorch f32) for epoch in range(epochs): for rgb_batch, depth_batch, target_batch in dataloader: # Compute static features static_feat = compute_static_features(rgb_batch, depth_batch) # Forward pass output = model(static_feat) loss = criterion(output, target_batch) # Backward pass optimizer.zero_grad() loss.backward() optimizer.step() ``` **Checkpoint Format:** ```python torch.save({ 'state_dict': model.state_dict(), # f32 weights 'config': { 'kernels': [1, 3, 5], 'channels': [16, 8, 4], 'features': ['R', 'G', 'B', 'D', 'uv.x', 'uv.y', 'sin10_x', 'bias'] }, 'epoch': epoch, 'loss': loss.item() }, f'checkpoints/checkpoint_epoch_{epoch}.pth') ``` --- ## Export Workflow ### Script: `training/export_cnn_v2_shader.py` **Process:** 1. Load checkpoint (f32 PyTorch weights) 2. Extract layer configs (kernels, channels) 3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)` 4. Generate WGSL shader per layer 5. Write to `workspaces//shaders/cnn_v2_*.wgsl` **Example Generated Shader:** ```wgsl // cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth const KERNEL_SIZE: u32 = 1u; const IN_CHANNELS: u32 = 8u; // 7 features + bias const OUT_CHANNELS: u32 = 16u; // Weights quantized to float16 (stored as f32 in shader) const weights: array = array( 0.123047, -0.089844, 0.234375, 0.456055, ... ); @group(0) @binding(0) var static_features: texture_2d; @group(0) @binding(1) var output_texture: texture_storage_2d; @compute @workgroup_size(8, 8) fn main(@builtin(global_invocation_id) id: vec3) { // Load static features (8D) let static_feat = get_static_features(vec2(id.xy)); // Convolution (1×1 kernel = pointwise) var output: array; for (var c: u32 = 0u; c < OUT_CHANNELS; c++) { var sum: f32 = 0.0; for (var k: u32 = 0u; k < IN_CHANNELS; k++) { sum += weights[c * IN_CHANNELS + k] * static_feat[k]; } output[c] = max(0.0, sum); // ReLU activation } // Pack and store (8×f16 per texel) textureStore(output_texture, vec2(id.xy), pack_f16x8(output)); } ``` **Float16 Quantization:** - Training uses f32 throughout (PyTorch standard) - Export converts to np.float16, then back to f32 for WGSL literals - **Expected discrepancy:** <0.1% MSE (acceptable) - Validation via `validate_cnn_v2.sh` compares outputs --- ## Validation Workflow ### Script: `scripts/validate_cnn_v2.sh` **End-to-end pipeline:** ```bash ./scripts/validate_cnn_v2.sh checkpoints/checkpoint_epoch_5000.pth ``` **Steps automated:** 1. Export checkpoint → .wgsl shaders 2. Rebuild `cnn_test` tool 3. Process test images with CNN v2 4. Display input/output results **Usage:** ```bash # Basic usage ./scripts/validate_cnn_v2.sh checkpoint.pth # Custom paths ./scripts/validate_cnn_v2.sh checkpoint.pth \ -i my_test_images/ \ -o results/ \ -b build_release # Skip rebuild (iterate on checkpoint only) ./scripts/validate_cnn_v2.sh checkpoint.pth --skip-build # Skip export (iterate on test images only) ./scripts/validate_cnn_v2.sh checkpoint.pth --skip-export # Show help ./scripts/validate_cnn_v2.sh --help ``` **Options:** - `-b, --build-dir DIR` - Build directory (default: build) - `-w, --workspace NAME` - Workspace name (default: main) - `-i, --images DIR` - Test images directory (default: training/validation) - `-o, --output DIR` - Output directory (default: validation_results) - `--skip-build` - Use existing cnn_test binary - `--skip-export` - Use existing .wgsl shaders - `-h, --help` - Show full usage **Output:** - Input images: `/*.png` - Output images: `/*_output.png` - Opens results directory in system file browser --- ## Implementation Checklist ### Phase 1: Shaders (Core Infrastructure) - [ ] `workspaces/main/shaders/cnn_v2_static.wgsl` - Static features compute - [ ] RGBD sampling from framebuffer - [ ] UV coordinate calculation - [ ] sin(10\*uv.x) computation - [ ] Bias dimension (constant 1.0) - [ ] Float16 packing via `pack2x16float()` - [ ] Output to `texture_storage_2d` - [ ] `workspaces/main/shaders/cnn_v2_layer_template.wgsl` - Layer template - [ ] Static features unpacking - [ ] Previous layer unpacking (8×f16) - [ ] Convolution implementation (1×1, 3×3, 5×5) - [ ] ReLU activation - [ ] Output packing (8×f16) - [ ] Proper padding handling ### Phase 2: C++ Effect Class - [ ] `src/gpu/effects/cnn_v2_effect.h` - Header - [ ] Class declaration inheriting from `PostProcessEffect` - [ ] Static features texture member - [ ] Layer textures vector - [ ] Pipeline and bind group members - [ ] `src/gpu/effects/cnn_v2_effect.cc` - Implementation - [ ] Constructor: Load shaders, create textures - [ ] `init()`: Create pipelines, bind groups - [ ] `render()`: Multi-pass execution - [ ] Pass 0: Compute static features - [ ] Pass 1-N: CNN layers - [ ] Final: Composite to output - [ ] Proper resource cleanup - [ ] Integration - [ ] Add to `src/gpu/demo_effects.h` includes - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal) - [ ] Add shaders to `workspaces/main/assets.txt` - [ ] Add to `src/tests/gpu/test_demo_effects.cc` ### Phase 3: Training Pipeline - [ ] `training/train_cnn_v2.py` - Training script - [ ] Static feature extraction function - [ ] CNNv2 PyTorch model class - [ ] Patch-based dataloader - [ ] Training loop with checkpointing - [ ] Command-line argument parsing - [ ] Inference mode (ground truth generation) - [ ] `training/export_cnn_v2_shader.py` - Export script - [ ] Checkpoint loading - [ ] Weight extraction and f16 quantization - [ ] Per-layer WGSL generation - [ ] File output to workspace shaders/ - [ ] Metadata preservation ### Phase 4: Tools & Validation - [ ] `scripts/validate_cnn_v2.sh` - End-to-end validation - [ ] Command-line argument parsing - [ ] Shader export orchestration - [ ] Build orchestration - [ ] Batch image processing - [ ] Results display - [ ] `src/tools/cnn_test_main.cc` - Tool updates - [ ] Add `--cnn-version v2` flag - [ ] CNNv2Effect instantiation path - [ ] Static features pass execution - [ ] Multi-layer processing ### Phase 5: Documentation - [ ] `doc/HOWTO.md` - Usage guide - [ ] Training section (CNN v2) - [ ] Export section - [ ] Validation section - [ ] Examples - [ ] `README.md` - Project overview update - [ ] Mention CNN v2 capability --- ## File Structure ### New Files ``` # Shaders (generated by export script) workspaces/main/shaders/cnn_v2_static.wgsl # Static features compute workspaces/main/shaders/cnn_v2_layer_0.wgsl # Input layer (generated) workspaces/main/shaders/cnn_v2_layer_1.wgsl # Inner layer (generated) workspaces/main/shaders/cnn_v2_layer_2.wgsl # Output layer (generated) # C++ implementation src/gpu/effects/cnn_v2_effect.h # Effect class header src/gpu/effects/cnn_v2_effect.cc # Effect implementation # Python training/export training/train_cnn_v2.py # Training script training/export_cnn_v2_shader.py # Shader generator training/validation/ # Test images directory # Scripts scripts/validate_cnn_v2.sh # End-to-end validation # Documentation doc/CNN_V2.md # This file ``` ### Modified Files ``` src/gpu/demo_effects.h # Add CNNv2Effect include CMakeLists.txt # Add cnn_v2_effect.cc workspaces/main/assets.txt # Add cnn_v2 shaders workspaces/main/timeline.seq # Optional: add CNNv2Effect src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case src/tools/cnn_test_main.cc # Add --cnn-version v2 doc/HOWTO.md # Add CNN v2 sections TODO.md # Add CNN v2 task ``` ### Unchanged (v1 Preserved) ``` training/train_cnn.py # Original training src/gpu/effects/cnn_effect.* # Original effect workspaces/main/shaders/cnn_*.wgsl # Original shaders ``` --- ## Performance Characteristics ### Static Features Compute - **Cost:** ~0.1ms @ 1080p - **Frequency:** Once per frame - **Operations:** sin(), texture sampling, packing ### CNN Layers (Example 3-layer) - **Layer0 (1×1, 8→16):** ~0.3ms - **Layer1 (3×3, 23→8):** ~0.8ms - **Layer2 (5×5, 15→4):** ~1.2ms - **Total:** ~2.4ms @ 1080p ### Memory Usage - Static features: 1920×1080×8×2 = 33 MB (f16) - Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels) - Weights: ~6.4 KB (f16, in shader code) - **Total GPU memory:** ~100 MB --- ## Size Budget ### CNN v1 vs v2 | Metric | v1 | v2 | Delta | |--------|----|----|-------| | Weights (count) | 800 | 3268 | +2468 | | Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB | | Storage (f16) | N/A | 6.5 KB | +6.5 KB | | Shader code | ~500 lines | ~800 lines | +300 lines | ### Mitigation Strategies **Reduce channels:** - [16,8,4] → [8,4,4] saves ~50% weights - [16,8,4] → [4,4,4] saves ~60% weights **Smaller kernels:** - [1,3,5] → [1,3,3] saves ~30% weights - [1,3,5] → [1,1,3] saves ~50% weights **Quantization:** - int8 weights: saves 75% (requires QAT training) - 4-bit weights: saves 87.5% (extreme, needs research) **Target:** Keep CNN v2 under 10 KB for 64k demo constraint --- ## Future Extensions ### More Features (uint8 Packing) ```wgsl // 16 uint8 features per texel (texture_storage_2d) // [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, // sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias] ``` - Trade precision for quantity - Requires quantization-aware training ### Temporal Features - Previous frame RGBA (motion awareness) - Optical flow vectors - Requires multi-frame buffer ### Learned Position Encodings - Replace hand-crafted sin(10\*uv) with learned embeddings - Requires separate embedding network - Similar to NeRF position encoding ### Dynamic Architecture - Runtime kernel size selection based on scene - Conditional layer execution (skip connections) - Layer pruning for performance --- ## References - **v1 Implementation:** `src/gpu/effects/cnn_effect.*` - **Training Guide:** `doc/HOWTO.md` (CNN Training section) - **Test Tool:** `doc/CNN_TEST_TOOL.md` - **Shader System:** `doc/SEQUENCE.md` - **Size Measurement:** `doc/SIZE_MEASUREMENT.md` --- ## Appendix: Design Decisions ### Why Bias as Static Feature? **Alternatives considered:** 1. Separate bias array per layer (Option B) 2. Bias as static feature = 1.0 (Option A, chosen) **Decision rationale:** - Simpler shader code (fewer bindings) - Standard NN formulation (augmented input) - Saves 56-112 bytes per model - 7 features sufficient for v1 implementation - Can extend to uint8 packing if >7 features needed ### Why Float16 for Weights? **Alternatives considered:** 1. Keep f32 (larger, more accurate) 2. Use f16 (smaller, GPU-native) 3. Use int8 (smallest, needs QAT) **Decision rationale:** - f16 saves 50% vs f32 (critical for 64k target) - GPU-native support (pack2x16float in WGSL) - <0.1% accuracy loss (acceptable) - Simpler than int8 quantization ### Why Multi-Frequency Position Encoding? **Inspiration:** NeRF (Neural Radiance Fields) **Benefits:** - Helps network learn high-frequency details - Better than raw UV coordinates - Small footprint (1D per frequency) **Future:** Add sin(20\*uv), sin(40\*uv) if >7 features available --- **Document Version:** 1.0 **Last Updated:** 2026-02-12 **Status:** Design approved, ready for implementation