Diffstat (limited to 'doc')
-rw-r--r--doc/CNN_V2.md813
-rw-r--r--doc/CNN_V2_BINARY_FORMAT.md235
-rw-r--r--doc/CNN_V2_DEBUG_TOOLS.md143
-rw-r--r--doc/CNN_V2_WEB_TOOL.md348
-rw-r--r--doc/HOWTO.md32
5 files changed, 16 insertions, 1555 deletions
diff --git a/doc/CNN_V2.md b/doc/CNN_V2.md
deleted file mode 100644
index b7fd6f8..0000000
--- a/doc/CNN_V2.md
+++ /dev/null
@@ -1,813 +0,0 @@
-# CNN v2: Parametric Static Features
-
-**Technical Design Document**
-
----
-
-## Overview
-
-CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality.
-
-**Key improvements over v1:**
-- 7D static feature input (vs 4D RGB)
-- Multi-frequency position encoding (NeRF-style)
-- Configurable mip-level for p0-p3 parametric features (0-3)
-- Per-layer configurable kernel sizes (1×1, 3×3, 5×5)
-- Variable channel counts per layer
-- Float16 weight storage (~2.5 KB for 3-layer model)
-- Bias integrated as static feature dimension
-- Storage buffer architecture (dynamic layer count)
-- Binary weight format v2 for runtime loading
-- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping)
-
-**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational.
-
-**Breaking Change:**
-- Models trained with `clamp()` are incompatible; retraining is required.
-
-**TODO:**
-- 8-bit quantization with QAT for 2× size reduction (~1.6 KB)
-
----
-
-## Architecture
-
-### Pipeline Overview
-
-```
-Input RGBD → Static Features Compute → CNN Layers → Output RGBA
- └─ computed once/frame ─┘ └─ multi-pass ─┘
-```
-
-**Detailed Data Flow:**
-
-```
- ┌─────────────────────────────────────────┐
- │ Static Features (computed once) │
- │ 8D: p0,p1,p2,p3,uv_x,uv_y,sin20y,bias │
- └──────────────┬──────────────────────────┘
- │
- │ 8D (broadcast to all layers)
- ├───────────────────────────┐
- │ │
- ┌──────────────┐ │ │
- │ Input RGBD │──────────────┤ │
- │ 4D │ 4D │ │
- └──────────────┘ │ │
- ▼ │
- ┌────────────┐ │
- │ Layer 0 │ (12D input) │
- │ (CNN) │ = 4D + 8D │
- │ 12D → 4D │ │
- └─────┬──────┘ │
- │ 4D output │
- │ │
- ├───────────────────────────┘
- │ │
- ▼ │
- ┌────────────┐ │
- │ Layer 1 │ (12D input) │
- │ (CNN) │ = 4D + 8D │
- │ 12D → 4D │ │
- └─────┬──────┘ │
- │ 4D output │
- │ │
- ├───────────────────────────┘
- ▼ │
- ... │
- │ │
- ▼ │
- ┌────────────┐ │
- │ Layer N │ (12D input) │
- │ (output) │◄──────────────────┘
- │ 12D → 4D │
- └─────┬──────┘
- │ 4D (RGBA)
- ▼
- Output
-```
-
-**Key Points:**
-- Static features computed once, broadcast to all CNN layers
-- Each layer: previous 4D output + 8D static → 12D input → 4D output
-- Ping-pong buffering between layers
-- Layer 0 special case: uses input RGBD instead of previous layer output
-
-**Static Features Texture:**
-- Name: `static_features`
-- Format: `texture_storage_2d<rgba32uint, write>` (4×u32)
-- Data: 8 float16 values packed via `pack2x16float()`
-- Computed once per frame, read by all CNN layers
-- Lifetime: Entire frame (all CNN layer passes)
-
-**CNN Layers:**
-- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels
-- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels
-- All layers: uniform 12D input, 4D output (ping-pong buffer)
-- Storage: `texture_storage_2d<rgba32uint>` (4 channels as 2×f16 pairs)
-
-**Activation Functions:**
-- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping
-- Middle layers: `ReLU` (max(0, x))
-- Rationale: Sigmoid prevents gradient blocking at boundaries, enabling better convergence
-- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible, retrain required
-
----
-
-## Static Features (7D + 1 bias)
-
-### Feature Layout
-
-**8 float16 values per pixel:**
-
-```wgsl
-// Slot 0-3: Parametric features (p0, p1, p2, p3)
-// Sampled from configurable mip level (0=original, 1=half, 2=quarter, 3=eighth)
-// Training sets mip_level via --mip-level flag, stored in binary format v2
-let p0 = ...; // RGB.r from selected mip level
-let p1 = ...; // RGB.g from selected mip level
-let p2 = ...; // RGB.b from selected mip level
-let p3 = ...; // Depth or RGB channel from mip level
-
-// Slot 4-5: UV coordinates (normalized screen space)
-let uv_x = coord.x / resolution.x; // Horizontal position [0,1]
-let uv_y = coord.y / resolution.y; // Vertical position [0,1]
-
-// Slot 6: Multi-frequency position encoding
-let sin20_y = sin(20.0 * uv_y); // Periodic feature (frequency=20, vertical)
-
-// Slot 7: Bias dimension (always 1.0)
-let bias = 1.0; // Learned bias per output channel
-
-// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0]
-```
-
-### Input Channel Mapping
-
-**Weight tensor layout (12 input channels per layer):**
-
-| Input Channel | Feature | Description |
-|--------------|---------|-------------|
-| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) |
-| 4-11 | Static features | 8D: p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias |
-
-**Static feature channel details:**
-- Channel 4 → p0 (RGB.r from mip level)
-- Channel 5 → p1 (RGB.g from mip level)
-- Channel 6 → p2 (RGB.b from mip level)
-- Channel 7 → p3 (depth or RGB channel from mip level)
-- Channel 8 → p4 (uv_x: normalized horizontal position)
-- Channel 9 → p5 (uv_y: normalized vertical position)
-- Channel 10 → p6 (sin(20*uv_y): periodic encoding)
-- Channel 11 → p7 (bias: constant 1.0)
-
-**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7.
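-
-In code form, the 12D assembly is a plain concatenation (NumPy sketch, channel-last layout; `assemble_input` is illustrative, not a project function):
-
-```python
-import numpy as np
-
-def assemble_input(prev: np.ndarray, static: np.ndarray) -> np.ndarray:
-    """Channels 0-3: previous layer output (input RGBD for Layer 0);
-    channels 4-11: static features [p0-p3, uv_x, uv_y, sin20_y, bias]."""
-    assert prev.shape[-1] == 4 and static.shape[-1] == 8
-    return np.concatenate([prev, static], axis=-1)  # shape (..., 12)
-```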
-
-### Feature Rationale
-
-| Feature | Dimension | Purpose | Priority |
-|---------|-----------|---------|----------|
-| p0-p3 | 4D | Parametric auxiliary features (mips, gradients, etc.) | Essential |
-| UV coords | 2D | Spatial position awareness | Essential |
-| sin(20\*uv.y) | 1D | Periodic position encoding (vertical) | Medium |
-| Bias | 1D | Learned bias (standard NN) | Essential |
-
-**Note:** Input image RGBD (mip 0) is fed only to Layer 0; subsequent layers see static features plus the previous layer's output.
-
-**Why bias as static feature:**
-- Simpler shader code (single weight array)
-- Standard NN formulation: y = Wx (x includes bias term)
-- Saves 56-112 bytes (no separate bias buffer)
-- 7 features sufficient for initial implementation
-
-### Future Feature Extensions
-
-**Option: Additional encodings:**
-- `sin(40*uv.y)` - Higher frequency encoding
-- `gray_mip1` - Multi-scale luminance
-- `dx`, `dy` - Sobel gradients
-- `variance` - Local texture measure
-- `laplacian` - Edge detection
-
-**Option: uint8 packing (16+ features):**
-```wgsl
-// texture_storage_2d<rgba8unorm> stores 16 uint8 values
-// Trade precision for feature count
-// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
-// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias]
-```
-Requires quantization-aware training.
-
----
-
-## Layer Structure
-
-### Example 3-Layer Network
-
-```
-Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel)
-Layer 1: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel)
-Layer 2: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA)
-```
-
-**Output:** 4 channels (RGBA). Training targets preserve alpha from target images.
-
-### Weight Calculations
-
-**Per-layer weights (uniform 12D→4D, 3×3 kernels):**
-```
-Layer 0: 12 × 3 × 3 × 4 = 432 weights
-Layer 1: 12 × 3 × 3 × 4 = 432 weights
-Layer 2: 12 × 3 × 3 × 4 = 432 weights
-Total: 1296 weights
-```
-
-**Storage sizes:**
-- f32: 1296 × 4 = 5,184 bytes (~5.1 KB)
-- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended**
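-
-These figures follow from `in × kernel² × out`; a quick check that reproduces the numbers above (illustrative helper, not part of the training scripts):
-
-```python
-def layer_weights(in_ch: int, kernel: int, out_ch: int) -> int:
-    """Per-layer weight count; bias is folded into the 12D input."""
-    return in_ch * kernel * kernel * out_ch
-
-total = 3 * layer_weights(12, 3, 4)   # 3 layers × 432 = 1296 weights
-print(total * 2, total * 4)           # 2592 bytes (f16), 5184 bytes (f32)
-```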
-
-**Comparison to v1:**
-- v1: ~800 weights (3.2 KB f32)
-- v2: ~1296 weights (2.5 KB f16)
-- **Uniform architecture, smaller than v1 f32**
-
-### Kernel Size Guidelines
-
-**1×1 kernel (pointwise):**
-- No spatial context, channel mixing only
-- Weights: `12 × 4 = 48` per layer
-- Use for: Fast inference, channel remapping
-
-**3×3 kernel (standard conv):**
-- Local spatial context (recommended)
-- Weights: `12 × 9 × 4 = 432` per layer
-- Use for: Most layers (balanced quality/size)
-
-**5×5 kernel (large receptive field):**
-- Wide spatial context
-- Weights: `12 × 25 × 4 = 1200` per layer
-- Use for: Output layer, fine detail enhancement
-
-### Channel Storage (4×f16 per texel)
-
-```wgsl
-@group(0) @binding(1) var layer_input: texture_2d<u32>;
-
-fn unpack_channels(coord: vec2<i32>) -> vec4<f32> {
- let packed = textureLoad(layer_input, coord, 0);
- let v0 = unpack2x16float(packed.x); // [ch0, ch1]
- let v1 = unpack2x16float(packed.y); // [ch2, ch3]
- return vec4<f32>(v0.x, v0.y, v1.x, v1.y);
-}
-
-fn pack_channels(values: vec4<f32>) -> vec4<u32> {
- return vec4<u32>(
- pack2x16float(vec2(values.x, values.y)),
- pack2x16float(vec2(values.z, values.w)),
- 0u, // Unused
- 0u // Unused
- );
-}
-```
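-
-For reference, a NumPy sketch of the same 2×f16-per-u32 packing on the CPU side (assumes a little-endian host, as the .bin format does; helper names are illustrative):
-
-```python
-import numpy as np
-
-def pack2x16float(lo: float, hi: float) -> int:
-    """Pack two values as f16 halves of one u32 (lo in bits 0-15)."""
-    return int(np.array([lo, hi], dtype=np.float16).view(np.uint32)[0])
-
-def unpack2x16float(packed: int) -> tuple[float, float]:
-    """Inverse: split one u32 back into two f32 values."""
-    halves = np.array([packed], dtype=np.uint32).view(np.float16)
-    return float(halves[0]), float(halves[1])
-```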
-
----
-
-## Training Workflow
-
-### Script: `training/train_cnn_v2.py`
-
-**Static Feature Extraction:**
-
-```python
-import cv2
-import numpy as np
-
-def compute_static_features(rgb, depth, mip_level=0):
- """Generate parametric features (8D: p0-p3 + spatial).
-
- Args:
- mip_level: 0=original, 1=half res, 2=quarter res, 3=eighth res
- """
- h, w = rgb.shape[:2]
-
- # Generate mip level for p0-p3 (downsample then upsample)
- if mip_level > 0:
- mip_rgb = rgb.copy()
- for _ in range(mip_level):
- mip_rgb = cv2.pyrDown(mip_rgb)
- for _ in range(mip_level):
- mip_rgb = cv2.pyrUp(mip_rgb)
- if mip_rgb.shape[:2] != (h, w):
- mip_rgb = cv2.resize(mip_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
- else:
- mip_rgb = rgb
-
- # Parametric features from mip level
- p0, p1, p2, p3 = mip_rgb[..., 0], mip_rgb[..., 1], mip_rgb[..., 2], depth
-
- # UV coordinates (normalized)
- uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0)
- uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1)
-
-    # Multi-frequency position encoding (vertical, frequency=20)
-    sin20_y = np.sin(20.0 * uv_y)
-
- # Bias dimension (always 1.0)
- bias = np.ones_like(p0)
-
-    # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin20_y, bias]
-    return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias], axis=-1)
-```
-
-**Network Definition:**
-
-```python
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-class CNNv2(nn.Module):
- def __init__(self, kernel_sizes, num_layers=3):
- super().__init__()
- if isinstance(kernel_sizes, int):
- kernel_sizes = [kernel_sizes] * num_layers
- self.kernel_sizes = kernel_sizes
- self.layers = nn.ModuleList()
-
- # All layers: 12D input (4 prev + 8 static) → 4D output
- for kernel_size in kernel_sizes:
- self.layers.append(
- nn.Conv2d(12, 4, kernel_size=kernel_size,
- padding=kernel_size//2, bias=False)
- )
-
- def forward(self, input_rgbd, static_features):
- # Layer 0: input RGBD (4D) + static (8D) = 12D
- x = torch.cat([input_rgbd, static_features], dim=1)
- x = self.layers[0](x)
- x = torch.sigmoid(x) # Soft [0,1] for layer 0
-
- # Layer 1+: previous output (4D) + static (8D) = 12D
- for i in range(1, len(self.layers)):
- x_input = torch.cat([x, static_features], dim=1)
- x = self.layers[i](x_input)
- if i < len(self.layers) - 1:
- x = F.relu(x)
- else:
- x = torch.sigmoid(x) # Soft [0,1] for final layer
-
- return x # RGBA output
-```
-
-**Training Configuration:**
-
-```python
-# Hyperparameters
-kernel_sizes = [3, 3, 3] # Per-layer kernel sizes (e.g., [1,3,5])
-num_layers = 3 # Number of CNN layers
-mip_level = 0 # Mip level for p0-p3: 0=orig, 1=half, 2=quarter, 3=eighth
-grayscale_loss = False # Compute loss on grayscale (Y) instead of RGBA
-learning_rate = 1e-3
-batch_size = 16
-epochs = 5000
-
-# Dataset: Input RGB, Target RGBA (preserves alpha channel from image)
-# Model outputs RGBA, loss compares all 4 channels (or grayscale if --grayscale-loss)
-
-# Model/loss/optimizer setup (loss and optimizer choices illustrative)
-model = CNNv2(kernel_sizes, num_layers)
-criterion = nn.MSELoss()
-optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
-
-# Training loop (standard PyTorch f32)
-for epoch in range(epochs):
- for rgb_batch, depth_batch, target_batch in dataloader:
- # Compute static features (8D) with mip level
- static_feat = compute_static_features(rgb_batch, depth_batch, mip_level)
-
- # Input RGBD (4D)
- input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1)
-
- # Forward pass
- output = model(input_rgbd, static_feat)
-
- # Loss computation (grayscale or RGBA)
- if grayscale_loss:
- # Convert RGBA to grayscale: Y = 0.299*R + 0.587*G + 0.114*B
- output_gray = 0.299 * output[:, 0:1] + 0.587 * output[:, 1:2] + 0.114 * output[:, 2:3]
-            target_gray = 0.299 * target_batch[:, 0:1] + 0.587 * target_batch[:, 1:2] + 0.114 * target_batch[:, 2:3]
- loss = criterion(output_gray, target_gray)
- else:
- loss = criterion(output, target_batch)
-
- # Backward pass
- optimizer.zero_grad()
- loss.backward()
- optimizer.step()
-```
-
-**Checkpoint Format:**
-
-```python
-torch.save({
- 'state_dict': model.state_dict(), # f32 weights
- 'config': {
- 'kernel_sizes': [3, 3, 3], # Per-layer kernel sizes
- 'num_layers': 3,
- 'mip_level': 0, # Mip level used for p0-p3
- 'grayscale_loss': False, # Whether grayscale loss was used
-        'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin20_y', 'bias']
- },
- 'epoch': epoch,
- 'loss': loss.item()
-}, f'checkpoints/checkpoint_epoch_{epoch}.pth')
-```
-
----
-
-## Export Workflow
-
-### Script: `training/export_cnn_v2_shader.py`
-
-**Process:**
-1. Load checkpoint (f32 PyTorch weights)
-2. Extract layer configs (kernels, channels)
-3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)`
-4. Generate WGSL shader per layer
-5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl`
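-
-Steps 3-4 as a minimal sketch (illustrative helper, not the actual export script):
-
-```python
-import numpy as np
-
-def weights_to_wgsl(weights_f32: np.ndarray) -> str:
-    """Round weights through f16, then emit them as f32 WGSL literals."""
-    w = weights_f32.astype(np.float16).astype(np.float32).ravel()
-    body = ", ".join(f"{v:.6f}" for v in w)
-    return f"const weights: array<f32, {w.size}> = array({body});"
-```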
-
-**Example Generated Shader:**
-
-```wgsl
-// cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth
-
-const KERNEL_SIZE: u32 = 1u;
-const IN_CHANNELS: u32 = 12u; // 4D input RGBD + 8D static (bias included)
-const OUT_CHANNELS: u32 = 4u;
-
-// Weights quantized to float16 (stored as f32 literals in the shader)
-const weights: array<f32, 48> = array(
-    0.123047, -0.089844, 0.234375, 0.456055, ...
-);
-
-@group(0) @binding(0) var static_features: texture_2d<u32>;
-@group(0) @binding(1) var layer_input: texture_2d<u32>;
-@group(0) @binding(2) var output_texture: texture_storage_2d<rgba32uint, write>;
-
-@compute @workgroup_size(8, 8)
-fn main(@builtin(global_invocation_id) id: vec3<u32>) {
-    // Assemble 12D input: input RGBD (4D) + static features (8D)
-    var input_feat = get_input_features(vec2<i32>(id.xy));
-
-    // Convolution (1×1 kernel = pointwise)
-    var output: vec4<f32>;
-    for (var c: u32 = 0u; c < OUT_CHANNELS; c++) {
-        var sum: f32 = 0.0;
-        for (var k: u32 = 0u; k < IN_CHANNELS; k++) {
-            sum += weights[c * IN_CHANNELS + k] * input_feat[k];
-        }
-        output[c] = 1.0 / (1.0 + exp(-sum)); // Sigmoid (layer 0 & final layer)
-    }
-
-    // Pack and store (4 channels as 2×f16 pairs)
-    textureStore(output_texture, vec2<i32>(id.xy), pack_channels(output));
-}
-```
-
-**Float16 Quantization:**
-- Training uses f32 throughout (PyTorch standard)
-- Export converts to np.float16, then back to f32 for WGSL literals
-- **Expected discrepancy:** <0.1% MSE (acceptable)
-- Validation via HTML tool (see below)
-
----
-
-## Validation Workflow
-
-### HTML Tool: `tools/cnn_v2_test/index.html`
-
-**WebGPU-based testing tool** with layer visualization.
-
-**Usage:**
-1. Open `tools/cnn_v2_test/index.html` in browser
-2. Drop `.bin` weights file (from `export_cnn_v2_weights.py`)
-3. Drop PNG test image
-4. View results with layer inspection
-
-**Features:**
-- Live CNN inference with WebGPU
-- Layer-by-layer visualization (static features + all CNN layers)
-- Weight visualization (per-layer kernels)
-- View modes: CNN output, original, diff (×10)
-- Blend control for comparing with original
-
-**Export weights:**
-```bash
-./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \
- --output-weights workspaces/main/cnn_v2_weights.bin
-```
-
-See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation.
-
----
-
-## Implementation Checklist
-
-### Phase 1: Shaders (Core Infrastructure)
-
-- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute
- - [ ] RGBD sampling from framebuffer
- - [ ] UV coordinate calculation
-  - [ ] sin(20\*uv.y) computation
- - [ ] Bias dimension (constant 1.0)
- - [ ] Float16 packing via `pack2x16float()`
- - [ ] Output to `texture_storage_2d<rgba32uint>`
-
-- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template
- - [ ] Static features unpacking
-  - [ ] Previous layer unpacking (4×f16)
-  - [ ] Convolution implementation (1×1, 3×3, 5×5)
-  - [ ] Activation (ReLU for middle layers, sigmoid for layer 0 / final layer)
-  - [ ] Output packing (4×f16)
- - [ ] Proper padding handling
-
-### Phase 2: C++ Effect Class
-
-- [ ] `src/effects/cnn_v2_effect.h` - Header
- - [ ] Class declaration inheriting from `PostProcessEffect`
- - [ ] Static features texture member
- - [ ] Layer textures vector
- - [ ] Pipeline and bind group members
-
-- [ ] `src/effects/cnn_v2_effect.cc` - Implementation
- - [ ] Constructor: Load shaders, create textures
- - [ ] `init()`: Create pipelines, bind groups
- - [ ] `render()`: Multi-pass execution
- - [ ] Pass 0: Compute static features
- - [ ] Pass 1-N: CNN layers
- - [ ] Final: Composite to output
- - [ ] Proper resource cleanup
-
-- [ ] Integration
- - [ ] Add to `src/gpu/demo_effects.h` includes
- - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal)
- - [ ] Add shaders to `workspaces/main/assets.txt`
- - [ ] Add to `src/tests/gpu/test_demo_effects.cc`
-
-### Phase 3: Training Pipeline
-
-- [ ] `training/train_cnn_v2.py` - Training script
- - [ ] Static feature extraction function
- - [ ] CNNv2 PyTorch model class
- - [ ] Patch-based dataloader
- - [ ] Training loop with checkpointing
- - [ ] Command-line argument parsing
- - [ ] Inference mode (ground truth generation)
-
-- [ ] `training/export_cnn_v2_shader.py` - Export script
- - [ ] Checkpoint loading
- - [ ] Weight extraction and f16 quantization
- - [ ] Per-layer WGSL generation
- - [ ] File output to workspace shaders/
- - [ ] Metadata preservation
-
-### Phase 4: Tools & Validation
-
-- [x] HTML validation tool - WebGPU inference with layer visualization
- - [ ] Command-line argument parsing
- - [ ] Shader export orchestration
- - [ ] Build orchestration
- - [ ] Batch image processing
- - [ ] Results display
-
-- [ ] `src/tools/cnn_test_main.cc` - Tool updates
- - [ ] Add `--cnn-version v2` flag
- - [ ] CNNv2Effect instantiation path
- - [ ] Static features pass execution
- - [ ] Multi-layer processing
-
-### Phase 5: Documentation
-
-- [ ] `doc/HOWTO.md` - Usage guide
- - [ ] Training section (CNN v2)
- - [ ] Export section
- - [ ] Validation section
- - [ ] Examples
-
-- [ ] `README.md` - Project overview update
- - [ ] Mention CNN v2 capability
-
----
-
-## File Structure
-
-### New Files
-
-```
-# Shaders (generated by export script)
-workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl # Static features compute
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl # Input layer (generated)
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl # Inner layer (generated)
-workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl # Output layer (generated)
-
-# C++ implementation
-src/effects/cnn_v2_effect.h # Effect class header
-src/effects/cnn_v2_effect.cc # Effect implementation
-
-# Python training/export
-training/train_cnn_v2.py # Training script
-training/export_cnn_v2_shader.py # Shader generator
-training/validation/ # Test images directory
-
-# Validation
-tools/cnn_v2_test/index.html # WebGPU validation tool
-
-# Documentation
-doc/CNN_V2.md # This file
-```
-
-### Modified Files
-
-```
-src/gpu/demo_effects.h # Add CNNv2Effect include
-CMakeLists.txt # Add cnn_v2_effect.cc
-workspaces/main/assets.txt # Add cnn_v2 shaders
-workspaces/main/timeline.seq # Optional: add CNNv2Effect
-src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case
-src/tools/cnn_test_main.cc # Add --cnn-version v2
-doc/HOWTO.md # Add CNN v2 sections
-TODO.md # Add CNN v2 task
-```
-
-### Unchanged (v1 Preserved)
-
-```
-training/train_cnn.py # Original training
-src/effects/cnn_effect.* # Original effect
-workspaces/main/shaders/cnn_*.wgsl # Original v1 shaders
-```
-
----
-
-## Performance Characteristics
-
-### Static Features Compute
-- **Cost:** ~0.1ms @ 1080p
-- **Frequency:** Once per frame
-- **Operations:** sin(), texture sampling, packing
-
-### CNN Layers (Example 3-layer)
-- **Layer 0 (1×1, 12→16):** ~0.3ms
-- **Layer 1 (3×3, 24→8):** ~0.8ms
-- **Layer 2 (5×5, 16→4):** ~1.2ms
-- **Total:** ~2.4ms @ 1080p
-
-### Memory Usage
-- Static features: 1920×1080×8×2 = 33 MB (f16)
-- Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels)
-- Weights: ~6.4 KB (f16, in shader code)
-- **Total GPU memory:** ~100 MB
-
----
-
-## Size Budget
-
-### CNN v1 vs v2
-
-| Metric | v1 | v2 | Delta |
-|--------|----|----|-------|
-| Weights (count) | 800 | 3268 | +2468 |
-| Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB |
-| Storage (f16) | N/A | 6.5 KB | +6.5 KB |
-| Shader code | ~500 lines | ~800 lines | +300 lines |
-
-*v2 figures are for the larger [1,3,5]-kernel / [16,8,4]-channel example; the uniform 3-layer 12→4 model is 1296 weights (~2.5 KB f16).*
-
-### Mitigation Strategies
-
-**Reduce channels:**
-- [16,8,4] → [8,4,4] saves ~50% weights
-- [16,8,4] → [4,4,4] saves ~60% weights
-
-**Smaller kernels:**
-- [1,3,5] → [1,3,3] saves ~30% weights
-- [1,3,5] → [1,1,3] saves ~50% weights
-
-**Quantization:**
-- int8 weights: saves 75% (requires QAT training)
-- 4-bit weights: saves 87.5% (extreme, needs research)
-
-**Target:** Keep CNN v2 under 10 KB for 64k demo constraint
-
----
-
-## Future Extensions
-
-### Flexible Feature Layout (Binary Format v3)
-
-**TODO:** Support arbitrary feature vector layouts and ordering in binary format.
-
-**Current Limitation:**
-- Feature layout hardcoded: `[p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias]`
-- Shader must match training script exactly
-- Experimentation requires shader recompilation
-
-**Proposed Enhancement:**
-- Add feature descriptor to binary format header
-- Specify feature types, sources, and ordering
-- Runtime shader generation or dynamic feature indexing
-- Examples: `[R, G, B, dx, dy, uv_x, bias]` or `[mip1.r, mip2.g, laplacian, uv_x, sin20_x, bias]`
-
-**Benefits:**
-- Training experiments without C++/shader changes
-- A/B test different feature combinations
-- Single binary format, multiple architectures
-- Faster iteration on feature engineering
-
-**Implementation Options:**
-1. **Static approach:** Generate shader code from descriptor at load time
-2. **Dynamic approach:** Array-based indexing with feature map uniform
-3. **Hybrid:** Precompile common layouts, fallback to dynamic
-
-See `doc/CNN_V2_BINARY_FORMAT.md` for proposed descriptor format.
-
----
-
-### More Features (uint8 Packing)
-
-```wgsl
-// 16 uint8 features per texel (texture_storage_2d<rgba8unorm>)
-// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y,
-// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias]
-```
-- Trade precision for quantity
-- Requires quantization-aware training
-
-### Temporal Features
-
-- Previous frame RGBA (motion awareness)
-- Optical flow vectors
-- Requires multi-frame buffer
-
-### Learned Position Encodings
-
-- Replace hand-crafted sin(20\*uv) with learned embeddings
-- Requires separate embedding network
-- Similar to NeRF position encoding
-
-### Dynamic Architecture
-
-- Runtime kernel size selection based on scene
-- Conditional layer execution (skip connections)
-- Layer pruning for performance
-
----
-
-## References
-
-- **v1 Implementation:** `src/effects/cnn_effect.*`
-- **Training Guide:** `doc/HOWTO.md` (CNN Training section)
-- **Test Tool:** `doc/CNN_TEST_TOOL.md`
-- **Shader System:** `doc/SEQUENCE.md`
-- **Size Measurement:** `doc/SIZE_MEASUREMENT.md`
-
----
-
-## Appendix: Design Decisions
-
-### Why Bias as Static Feature?
-
-**Alternatives considered:**
-1. Separate bias array per layer (Option B)
-2. Bias as static feature = 1.0 (Option A, chosen)
-
-**Decision rationale:**
-- Simpler shader code (fewer bindings)
-- Standard NN formulation (augmented input)
-- Saves 56-112 bytes per model
-- 7 features sufficient for v1 implementation
-- Can extend to uint8 packing if >7 features needed
-
-### Why Float16 for Weights?
-
-**Alternatives considered:**
-1. Keep f32 (larger, more accurate)
-2. Use f16 (smaller, GPU-native)
-3. Use int8 (smallest, needs QAT)
-
-**Decision rationale:**
-- f16 saves 50% vs f32 (critical for 64k target)
-- GPU-native support (pack2x16float in WGSL)
-- <0.1% accuracy loss (acceptable)
-- Simpler than int8 quantization
-
-### Why Multi-Frequency Position Encoding?
-
-**Inspiration:** NeRF (Neural Radiance Fields)
-
-**Benefits:**
-- Helps network learn high-frequency details
-- Better than raw UV coordinates
-- Small footprint (1D per frequency)
-
-**Future:** Add further frequencies (e.g., sin(40\*uv)) if >7 features available
-
----
-
-## Related Documentation
-
-- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format)
-- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization
-- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated)
-- `doc/HOWTO.md` - Training and validation workflows
-
----
-
-**Document Version:** 1.0
-**Last Updated:** 2026-02-12
-**Status:** Implemented (see Overview status)
diff --git a/doc/CNN_V2_BINARY_FORMAT.md b/doc/CNN_V2_BINARY_FORMAT.md
deleted file mode 100644
index 59c859d..0000000
--- a/doc/CNN_V2_BINARY_FORMAT.md
+++ /dev/null
@@ -1,235 +0,0 @@
-# CNN v2 Binary Weight Format Specification
-
-Binary format for storing trained CNN v2 weights with static feature architecture.
-
-**File Extension:** `.bin`
-**Byte Order:** Little-endian
-**Version:** 2.0 (supports mip-level for parametric features)
-**Backward Compatible:** Version 1.0 files supported (mip_level=0)
-
----
-
-## File Structure
-
-**Version 2 (current):**
-```
-┌─────────────────────┐
-│ Header (20 bytes) │
-├─────────────────────┤
-│ Layer Info │
-│ (20 bytes × N) │
-├─────────────────────┤
-│ Weight Data │
-│ (variable size) │
-└─────────────────────┘
-```
-
-**Version 1 (legacy):**
-```
-┌─────────────────────┐
-│ Header (16 bytes) │
-├─────────────────────┤
-│ Layer Info │
-│ (20 bytes × N) │
-├─────────────────────┤
-│ Weight Data │
-│ (variable size) │
-└─────────────────────┘
-```
-
----
-
-## Header
-
-**Version 2 (20 bytes):**
-
-| Offset | Type | Field | Description |
-|--------|------|----------------|--------------------------------------|
-| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") |
-| 0x04 | u32 | version | Format version (2 for current) |
-| 0x08 | u32 | num_layers | Number of CNN layers (excludes static features) |
-| 0x0C | u32 | total_weights | Total f16 weight count across all layers |
-| 0x10 | u32 | mip_level | Mip level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) |
-
-**Version 1 (16 bytes) - Legacy:**
-
-| Offset | Type | Field | Description |
-|--------|------|----------------|--------------------------------------|
-| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") |
-| 0x04 | u32 | version | Format version (1) |
-| 0x08 | u32 | num_layers | Number of CNN layers |
-| 0x0C | u32 | total_weights | Total f16 weight count |
-
-**Note:** Loaders should check version field and handle both formats. Version 1 files treated as mip_level=0.
-
----
-
-## Layer Info (20 bytes per layer)
-
-Repeated `num_layers` times:
-- **Version 2:** Starting at offset 0x14 (20 bytes)
-- **Version 1:** Starting at offset 0x10 (16 bytes)
-
-| Offset | Type | Field | Description |
-|-------------|------|----------------|--------------------------------------|
-| 0x00 | u32 | kernel_size | Convolution kernel dimension (3, 5, 7, etc.) |
-| 0x04        | u32  | in_channels    | Input channel count (includes the 8 static feature channels) |
-| 0x08 | u32 | out_channels | Output channel count (max 8) |
-| 0x0C | u32 | weight_offset | Weight array start index (f16 units, relative to weight data section) |
-| 0x10 | u32 | weight_count | Number of f16 weights for this layer |
-
-**Layer Order:** Sequential (Layer 0, Layer 1, Layer 2, ...)
-
----
-
-## Weight Data (variable size)
-
-Starts at offset:
-- **Version 2:** `20 + (num_layers × 20)`
-- **Version 1:** `16 + (num_layers × 20)`
-
-**Format:** Packed f16 pairs stored as u32
-**Packing:** `u32 = (f16_hi << 16) | f16_lo`
-**Storage:** Sequential by layer, then by output channel, input channel, spatial position
-
-**Weight Indexing:**
-```
-weight_idx = output_ch × (in_channels × kernel_size²) +
- input_ch × kernel_size² +
- (ky × kernel_size + kx)
-```
-
-Where:
-- `output_ch` ∈ [0, out_channels)
-- `input_ch` ∈ [0, in_channels)
-- `ky`, `kx` ∈ [0, kernel_size)
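-
-The same indexing as a small Python helper (sketch; names are illustrative):
-
-```python
-def weight_idx(out_ch, in_ch, ky, kx, in_channels, kernel_size):
-    """Flat index into a layer's f16 weight array (matches the formula above)."""
-    return (out_ch * in_channels * kernel_size * kernel_size
-            + in_ch * kernel_size * kernel_size
-            + ky * kernel_size + kx)
-```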
-
-**Unpacking f16 from u32:**
-```c
-uint32_t packed = weights_buffer[weight_idx / 2];
-uint16_t f16_bits = (weight_idx % 2 == 0) ? (packed & 0xFFFF) : (packed >> 16);
-```
-
----
-
-## Example: 3-Layer Network (Version 2)
-
-**Configuration:**
-- Mip level: 0 (original resolution)
-- Layer 0: 12→4, kernel 3×3 (432 weights)
-- Layer 1: 12→4, kernel 3×3 (432 weights)
-- Layer 2: 12→4, kernel 3×3 (432 weights)
-
-**File Layout:**
-```
-Offset Size Content
------- ---- -------
-0x00 20 Header (magic, version=2, layers=3, weights=1296, mip_level=0)
-0x14 20 Layer 0 info (kernel=3, in=12, out=4, offset=0, count=432)
-0x28 20 Layer 1 info (kernel=3, in=12, out=4, offset=432, count=432)
-0x3C 20 Layer 2 info (kernel=3, in=12, out=4, offset=864, count=432)
-0x50    2592  Weight data (1296 f16 weights packed into 648 u32)
- ----
-Total: 2672 bytes (~2.6 KB)
-```
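-
-A minimal reader sketch for the layout above (Python `struct`, little-endian; assumes a well-formed file and performs only the checks shown):
-
-```python
-import struct
-
-def read_cnn_v2(path: str):
-    with open(path, "rb") as f:
-        magic, version = struct.unpack("<II", f.read(8))
-        assert magic == 0x324E4E43, "not a CNN v2 file"
-        num_layers, total_weights = struct.unpack("<II", f.read(8))
-        # v2 header carries mip_level; v1 (16-byte header) implies mip 0
-        mip_level = struct.unpack("<I", f.read(4))[0] if version == 2 else 0
-        layers = [struct.unpack("<5I", f.read(20)) for _ in range(num_layers)]
-        weights = f.read(total_weights * 2)  # packed f16, 2 bytes each
-    return mip_level, layers, weights
-```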
-
----
-
-## Static Features
-
-Not stored in .bin file (computed at runtime):
-
-**8D Input Features:**
-1. **p0** - Parametric feature 0 (from mip level)
-2. **p1** - Parametric feature 1 (from mip level)
-3. **p2** - Parametric feature 2 (from mip level)
-4. **p3** - Parametric feature 3 (depth or from mip level)
-5. **UV_X** - Normalized x coordinate [0,1]
-6. **UV_Y** - Normalized y coordinate [0,1]
-7. **sin(20 × UV_Y)** - Spatial frequency encoding (vertical, frequency=20)
-8. **1.0** - Bias term
-
-**Mip Level Usage (p0-p3):**
-- `mip_level=0`: RGB from original resolution (mip 0)
-- `mip_level=1`: RGB from half resolution (mip 1), upsampled
-- `mip_level=2`: RGB from quarter resolution (mip 2), upsampled
-- `mip_level=3`: RGB from eighth resolution (mip 3), upsampled
-
-**Layer 0** receives input RGBD (4D) + static features (8D) = 12D input → 4D output.
-**Layer 1+** receive previous layer output (4D) + static features (8D) = 12D input → 4D output.
-
----
-
-## Validation
-
-**Magic Check:**
-```c
-uint32_t magic;
-fread(&magic, 4, 1, fp);
-if (magic != 0x324E4E43) { error("Invalid CNN v2 file"); }
-```
-
-**Version Check:**
-```c
-uint32_t version;
-fread(&version, 4, 1, fp);
-if (version != 1 && version != 2) { error("Unsupported version"); }
-uint32_t header_size = (version == 1) ? 16 : 20;
-```
-
-**Size Check:**
-```c
-uint32_t expected_size = header_size + (num_layers * 20) + (total_weights * 2);
-if (file_size != expected_size) { error("Size mismatch"); }
-```
-
-**Weight Offset Sanity:**
-```c
-// Each layer's offset should match cumulative count
-uint32_t cumulative = 0;
-for (int i = 0; i < num_layers; i++) {
- if (layers[i].weight_offset != cumulative) { error("Invalid offset"); }
- cumulative += layers[i].weight_count;
-}
-if (cumulative != total_weights) { error("Total mismatch"); }
-```
-
----
-
-## Future Extensions
-
-**TODO: Flexible Feature Layout**
-
-Current limitation: Feature vector layout is hardcoded as `[p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias]`.
-
-Proposed enhancement for version 3:
-- Add feature descriptor section to header
-- Specify feature count, types, and ordering
-- Support arbitrary 7D feature combinations (e.g., `[R, G, B, dx, dy, uv_x, bias]`)
-- Allow runtime shader generation based on descriptor
-- Enable experimentation without recompiling shaders
-
-Example descriptor format:
-```
-struct FeatureDescriptor {
- u32 feature_count; // Number of features (typically 7-8)
- u32 feature_types[8]; // Type enum per feature
- u32 feature_sources[8]; // Source enum (mip0, mip1, gradient, etc.)
- u32 reserved[8]; // Future use
-}
-```
-
-Benefits:
-- Training can experiment with different feature combinations
-- No shader recompilation needed
-- Single binary format supports multiple architectures
-- Easier A/B testing of feature effectiveness
-
----
-
-## Related Files
-
-- `training/export_cnn_v2_weights.py` - Binary export tool
-- `src/effects/cnn_v2_effect.cc` - C++ loader
-- `tools/cnn_v2_test/index.html` - WebGPU validator
-- `doc/CNN_V2.md` - Architecture design
diff --git a/doc/CNN_V2_DEBUG_TOOLS.md b/doc/CNN_V2_DEBUG_TOOLS.md
deleted file mode 100644
index 8d1289a..0000000
--- a/doc/CNN_V2_DEBUG_TOOLS.md
+++ /dev/null
@@ -1,143 +0,0 @@
-# CNN v2 Debugging Tools
-
-Tools for investigating the CNN v2 output mismatch between the HTML tool and cnn_test.
-
----
-
-## Identity Weight Generator
-
-**Purpose:** Generate trivial .bin files with identity passthrough for debugging.
-
-**Script:** `training/gen_identity_weights.py`
-
-**Usage:**
-```bash
-# 1×1 identity (default)
-./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity.bin
-
-# 3×3 identity
-./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3
-
-# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc)
-./training/gen_identity_weights.py output.bin --mix
-
-# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3
-./training/gen_identity_weights.py output.bin --p47
-
-# Custom mip level
-./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2
-```
-
-**Output:**
-- Single layer, 12D→4D (4 input channels + 8 static features)
-- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3}
-- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow)
-- Static mode (--p47): Output Ch{i} = Input Ch{i+8} (static features only, visualizes p4-p7 from input channels 8-11)
-- Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3)
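-
-For reference, the identity mapping in weight-matrix form (a sketch assuming the PyTorch `(out, in, ky, kx)` layout; not the generator's actual code):
-
-```python
-import numpy as np
-
-def identity_weights_1x1() -> np.ndarray:
-    """1×1 kernel, 12 inputs → 4 outputs: out[i] = in[i], statics ignored."""
-    w = np.zeros((4, 12, 1, 1), dtype=np.float16)
-    for i in range(4):
-        w[i, i, 0, 0] = 1.0  # pass channel i through unchanged
-    return w
-```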
-
-**Validation:**
-Load in HTML tool or cnn_test - output should match input (RGB only, ignoring static features).
-
----
-
-## Composited Layer Visualization
-
-**Purpose:** Save current layer view as single composited image (4 channels side-by-side, grayscale).
-
-**Location:** HTML tool - "Layer Visualization" panel
-
-**Usage:**
-1. Load image + weights in HTML tool
-2. Select layer to visualize (Static 0-3, Static 4-7, Layer 0, Layer 1, etc.)
-3. Click "Save Composited" button
-4. Downloads PNG: `composited_layer{N}_{W}x{H}.png`
-
-**Output:**
-- 4 channels stacked horizontally
-- Grayscale representation
-- Useful for comparing layer activations across tools
-
----
-
-## Debugging Strategy
-
-### Track a) Binary Conversion Chain
-
-**Hypothesis:** Conversion error in .bin ↔ base64 ↔ Float32Array
-
-**Test:**
-1. Generate identity weights:
- ```bash
- ./training/gen_identity_weights.py workspaces/main/weights/test_identity.bin
- ```
-
-2. Load in HTML tool - output should match input RGB
-
-3. If mismatch:
- - Check Python export: f16 packing in `export_cnn_v2_weights.py` line 105
- - Check HTML parsing: `unpackF16()` in `index.html` line 805-815
- - Check weight indexing: `get_weight()` shader function
-
-**Key locations:**
-- Python: `np.float16` → `view(np.uint32)` (line 105 of export script)
-- JS: `DataView` → `unpackF16()` → manual f16 decode (line 773-803)
-- WGSL: `unpack2x16float()` built-in (line 492 of shader)
-
-### Track b) Layer Visualization
-
-**Purpose:** Confirm layer outputs match between HTML and C++
-
-**Method:**
-1. Run identical input through both tools
-2. Save composited layers from HTML tool
-3. Compare with cnn_test output
-4. Use identity weights to isolate weight loading from computation
-
-### Track c) Trivial Test Case
-
-**Use identity weights to test:**
-- Weight loading (binary parsing)
-- Feature generation (static features)
-- Convolution (should be passthrough)
-- Output packing
-
-**Expected behavior:**
-- Input RGB → Output RGB (exact match)
-- Static features ignored (all zeros in identity matrix)
-
----
-
-## Known Issues
-
-### ~~Layer 0 Visualization Scale~~ [FIXED]
-
-**Issue:** Layer 0 output displayed at 0.5× brightness (divided by 2).
-
-**Cause:** Line 1530 used `vizScale = 0.5` for all CNN layers, but Layer 0 is clamped [0,1] and doesn't need dimming.
-
-**Fix:** Use scale 1.0 for Layer 0 output (layerIdx=1), 0.5 only for middle layers (ReLU, unbounded).
-
-### Remaining Mismatch
-
-**Current:** HTML tool and cnn_test produce different outputs for same input/weights.
-
-**Suspects:**
-1. F16 unpacking difference (CPU vs GPU vs JS)
-2. Static feature generation (RGBD, UV, sin encoding)
-3. Convolution kernel iteration order
-4. Output packing/unpacking
-
-**Next steps:**
-1. Test with identity weights (eliminates weight loading)
-2. Compare composited layer outputs
-3. Add debug visualization for static features
-4. Hex dump comparison (first 8 pixels) - use `--debug-hex` flag in cnn_test
-
----
-
-## Related Documentation
-
-- `doc/CNN_V2.md` - CNN v2 architecture
-- `doc/CNN_V2_WEB_TOOL.md` - HTML tool documentation
-- `doc/CNN_TEST_TOOL.md` - cnn_test CLI tool
-- `training/export_cnn_v2_weights.py` - Binary export format
diff --git a/doc/CNN_V2_WEB_TOOL.md b/doc/CNN_V2_WEB_TOOL.md
deleted file mode 100644
index b6f5b0b..0000000
--- a/doc/CNN_V2_WEB_TOOL.md
+++ /dev/null
@@ -1,348 +0,0 @@
-# CNN v2 Web Testing Tool
-
-Browser-based WebGPU tool for validating CNN v2 inference with layer visualization and weight inspection.
-
-**Location:** `tools/cnn_v2_test/index.html`
-
----
-
-## Status (2026-02-13)
-
-**Working:**
-- ✅ WebGPU initialization and device setup
-- ✅ Binary weight file parsing (v1 and v2 formats)
-- ✅ Automatic mip-level detection from binary format v2
-- ✅ Weight statistics (min/max per layer)
-- ✅ UI layout with collapsible panels
-- ✅ Mode switching (Activations/Weights tabs)
-- ✅ Canvas context management (2D for weights, WebGPU for activations)
-- ✅ Weight visualization infrastructure (layer selection, grid layout)
-- ✅ Layer naming matches codebase convention (Layer 0, Layer 1, Layer 2)
-- ✅ Static features split visualization (Static 0-3, Static 4-7)
-- ✅ All layers visible including output layer (Layer 2)
-- ✅ Video playback support (MP4, WebM) with frame-by-frame controls
-- ✅ Video looping (automatic continuous playback)
-- ✅ Mip level selection (p0-p3 features at different resolutions)
-
-**Recent Changes (Latest):**
-- Binary format v2 support: Reads mip_level from 20-byte header
-- Backward compatible: v1 (16-byte header) → mip_level=0
-- Auto-update UI dropdown when loading weights with mip_level
-- Display mip_level in metadata panel
-- Code refactoring: Extracted FULLSCREEN_QUAD_VS shader (reused 3× across pipelines)
-- Added helper methods: `getDimensions()`, `setVideoControlsEnabled()`
-- Improved code organization with section headers and comments
-- Moved Mip Level selector to bottom of left sidebar (removed "Features (p0-p3)" label)
-- Added `loop` attribute to video element for automatic continuous playback
-
-**Previous Fixes:**
-- Fixed Layer 2 not appearing (was excluded from layerOutputs due to isOutput check)
-- Fixed canvas context switching (force clear before recreation)
-- Added Static 0-3 / Static 4-7 buttons to view all 8 static feature channels
-- Aligned naming with train_cnn_v2.py/.wgsl: Layer 0, Layer 1, Layer 2 (not Layer 1, 2, 3)
-- Disabled Static buttons in weights mode (no learnable weights)
-
-**Known Issues:**
-- Layer activation visualization may show black if texture data not properly unpacked
-- Weight kernel display depends on correct 2D context creation after canvas recreation
-
----
-
-## Architecture
-
-### File Structure
-- Single-file HTML tool (~1100 lines)
-- Embedded shaders: STATIC_SHADER, CNN_SHADER, DISPLAY_SHADER, LAYER_VIZ_SHADER
-- Shared WGSL component: FULLSCREEN_QUAD_VS (reused across render pipelines)
-- **Embedded default weights:** DEFAULT_WEIGHTS_B64 (base64-encoded binary v2)
- - Current: 4 layers (3×3, 5×5, 3×3, 3×3), 2496 f16 weights, mip_level=2
- - Source: `workspaces/main/weights/cnn_v2_weights.bin`
- - Updates: Re-encode binary with `base64 -i <file>` and update constant
-- Pure WebGPU (no external dependencies)
-
-### Code Organization
-
-**Recent Refactoring (2026-02-13):**
-- Extracted `FULLSCREEN_QUAD_VS` constant: Reused fullscreen quad vertex shader (2 triangles covering NDC)
-- Added helper methods to CNNTester class:
- - `getDimensions()`: Returns current source dimensions (video or image)
- - `setVideoControlsEnabled(enabled)`: Centralized video control enable/disable
-- Consolidated duplicate vertex shader code (used in mipmap generation, display, layer visualization)
-- Added section headers in JavaScript for better navigation
-- Improved inline comments explaining shader architecture
-
-**Benefits:**
-- Reduced code duplication (~40 lines saved)
-- Easier maintenance (single source of truth for fullscreen quad)
-- Clearer separation of concerns
-
-### Key Components
-
-**1. Weight Parsing**
-- Reads binary format v2: header (20B) + layer info (20B×N) + f16 weights
-- Backward compatible with v1: header (16B), mip_level defaults to 0
-- Computes min/max per layer via f16 unpacking
-- Stores `{ layers[], weights[], mipLevel, fileSize }`
-- Auto-sets UI mip-level dropdown from loaded weights
-
-**2. CNN Pipeline**
-- Static features computation (p0-p3 + UV + sin + bias → 8D packed)
-- Layer-by-layer convolution with storage buffer weights
-- Ping-pong buffers for intermediate results
-- Copy to persistent textures for visualization
-
-**3. Visualization Modes**
-
-**Activations Mode:**
-- 4 grayscale views per layer (channels 0-3 of up to 8 total)
-- WebGPU compute → unpack f16 → scale → grayscale
-- Auto-scale: Static features = 1.0, CNN layers = 0.2
-- Static features: Shows R,G,B,D (first 4 of 8: RGBD+UV+sin+bias)
-- CNN layers: Shows first 4 output channels
-
-**Weights Mode:**
-- 2D canvas rendering per output channel
-- Shows all input kernels horizontally
-- Normalized by layer min/max → [0, 1] → grayscale
-- 20px cells, 2px padding between kernels
-
-### Texture Management
-
-**Persistent Storage (layerTextures[]):**
-- One texture per layer output (static + all CNN layers)
-- `rgba32uint` format (packed f16 data)
-- `COPY_DST` usage for storing results
-
-**Compute Buffers (computeTextures[]):**
-- 2 textures for ping-pong computation
-- Reused across all layers
-- `COPY_SRC` usage for copying to persistent storage
-
-**Pipeline:**
-```
-Static pass → copy to layerTextures[0]
-For each CNN layer i:
- Compute (ping-pong) → copy to layerTextures[i+1]
-```
-
-### Layer Indexing
-
-**UI Layer Buttons:**
-- "Static 0-3" / "Static 4-7" → layerOutputs[0] (8D static input features)
-- "Layer 0" → layerOutputs[1] (CNN layer 0 output, uses weights.layers[0])
-- "Layer 1" → layerOutputs[2] (CNN layer 1 output, uses weights.layers[1])
-- "Layer N" → layerOutputs[N+1] (CNN layer N output, uses weights.layers[N])
-
-**Weights Table:**
-- "Layer 0" → weights.layers[0] (first CNN layer weights)
-- "Layer 1" → weights.layers[1] (second CNN layer weights)
-- "Layer N" → weights.layers[N]
-
-**Consistency:** Both UI and weights table use the same zero-based numbering (Layer 0, 1, 2, ...), matching train_cnn_v2.py and the generated shaders.
-
----
-
-## Known Issues
-
-### Issue #1: Layer Activations Show Black
-
-**Symptom:**
-- All 4 channel canvases render black
-- UV gradient test (debug mode 10) works
-- Raw packed data test (mode 11) shows black
-- Unpacked f16 test (mode 12) shows black
-
-**Diagnosis:**
-- Texture access works (UV gradient visible)
-- Texture data is all zeros (packed.x = 0)
-- Textures being read are empty
-
-**Root Cause:**
-- `copyTextureToTexture` operations may not be executing
-- Possible ordering issue (copies not submitted before visualization)
-- Alternative: textures created with wrong usage flags
-
-**Investigation Steps Taken:**
-1. Added `onSubmittedWorkDone()` wait before visualization
-2. Verified texture creation with `COPY_SRC` and `COPY_DST` flags
-3. Confirmed separate texture allocation per layer (no aliasing)
-4. Added debug shader modes to isolate issue
-
-**Next Steps:**
-- Verify encoder contains copy commands (add debug logging)
-- Check if compute passes actually write data (add known-value test)
-- Test copyTextureToTexture in isolation
-- Consider CPU readback to verify texture contents
-
-### Issue #2: Weight Visualization Empty
-
-**Symptom:**
-- Canvases created with correct dimensions (logged)
-- No visual output (black canvases)
-- Console logs show method execution
-
-**Potential Causes:**
-1. Weight indexing calculation incorrect
-2. Canvas not properly attached to DOM when rendering
-3. 2D context operations not flushing
-4. Min/max normalization producing black (all values equal?)
-
-**Debug Added:**
-- Comprehensive logging of dimensions, indices, ranges
-- Canvas context check before rendering
-
-**Next Steps:**
-- Add test rendering (fixed gradient) to verify 2D context works
-- Log sample weight values to verify data access
-- Check if canvas is visible in DOM inspector
-- Verify min/max calculation produces valid range
-
----
-
-## UI Layout
-
-### Header
-- Controls: Blend slider, Depth input, View mode display
-- Drop zone for .bin weight files
-
-### Content Area
-
-**Left Sidebar (300px):**
-1. Drop zone for .bin weight files
-2. Weights Info panel (file size, layer table with min/max)
-3. Weights Visualization panel (per-layer kernel display)
-4. **Mip Level selector** (bottom) - Select mip level (0-3) for the p0-p3 static features
-
-**Main Canvas (center):**
-- CNN output display with video controls (Play/Pause, Frame ◄/►)
-- Supports both PNG images and video files (MP4, WebM)
-- Video loops automatically for continuous playback
-
-**Right Sidebar (panels):**
-1. **Layer Visualization Panel** (top, flex: 1)
- - Layer selection buttons (Static 0-3, Static 4-7, Layer 0, Layer 1, ...)
- - 2×2 grid of channel views (grayscale activations)
- - 4× zoom view at bottom
-
-### Footer
-- Status line (GPU timing, dimensions, mode)
-- Console log (scrollable, color-coded)
-
----
-
-## Shader Details
-
-### LAYER_VIZ_SHADER
-
-**Purpose:** Display single channel from packed layer texture
-
-**Inputs:**
-- `@binding(0) layer_tex: texture_2d<u32>` - Packed f16 layer data
-- `@binding(1) viz_params: vec2<f32>` - (channel_idx, scale)
-
-**Debug Modes:**
-- Channel 10: UV gradient (texture coordinate test)
-- Channel 11: Raw packed u32 data
-- Channel 12: First unpacked f16 value
-
-**Normal Operation:**
-- Unpack all 8 f16 channels from rgba32uint
-- Select channel by index (0-7)
-- Apply scale factor (1.0 for static, 0.2 for CNN)
-- Clamp to [0, 1] and output grayscale
-
-**Scale Rationale:**
-- Static features (RGBD, UV): already in [0, 1] range
-- CNN activations: post-ReLU [0, ~5], need scaling for visibility
-
----
-
-## Binary Weight Format
-
-See `doc/CNN_V2_BINARY_FORMAT.md` for complete specification.
-
-**Quick Summary:**
-- Header: 20 bytes for v2 (magic, version, layer count, total weights, mip level); 16 bytes for legacy v1
-- Layer info: 20 bytes × N (kernel size, channels, offsets)
-- Weights: Packed f16 pairs as u32
-
----
-
-## Testing Workflow
-
-### Load & Parse
-1. Drop PNG image → displays original
-2. Drop .bin weights → parses and shows info table
-3. Auto-runs CNN pipeline
-
-### Verify Pipeline
-1. Check console for "Running CNN pipeline"
-2. Verify "Completed in Xms"
-3. Check "Layer visualization ready: N layers"
-
-### Debug Activations
-1. Select "Activations" tab
-2. Click layer buttons to switch
-3. Check console for texture/canvas logs
-4. If black: note which debug modes work (UV vs data)
-
-### Debug Weights
-1. Select "Weights" tab
-2. Click a CNN layer button (Static buttons are disabled - no learnable weights)
-3. Check console for "Visualizing Layer N weights"
-4. Check canvas dimensions logged
-5. Verify weight range is non-trivial (not [0, 0])
-
----
-
-## Integration with Main Project
-
-**Training Pipeline:**
-```bash
-# Generate weights
-./training/train_cnn_v2.py --export-binary
-
-# Test in browser
-open tools/cnn_v2_test/index.html
-# Drop: workspaces/main/cnn_v2_weights.bin
-# Drop: training/input/test.png
-```
-
-**Validation:**
-- Compare against demo CNNv2Effect (visual check)
-- Verify layer count matches binary file
-- Check weight ranges match training logs
-
----
-
-## Future Enhancements
-
-- [ ] Fix layer activation visualization (black texture issue)
-- [ ] Fix weight kernel display (empty canvas issue)
-- [ ] Add per-channel auto-scaling (compute min/max from visible data)
-- [ ] Export rendered outputs (download PNG)
-- [ ] Side-by-side comparison with original
-- [ ] Heatmap mode (color-coded activations)
-- [ ] Weight statistics overlay (mean, std, sparsity)
-- [ ] Batch processing (multiple images in sequence)
-- [ ] Integration with Python training (live reload)
-
----
-
-## Code Metrics
-
-- Total lines: ~1100
-- JavaScript: ~700 lines
-- WGSL shaders: ~300 lines
-- HTML/CSS: ~100 lines
-
-**Dependencies:** None (pure WebGPU + HTML5)
-
----
-
-## Related Files
-
-- `doc/CNN_V2.md` - CNN v2 architecture and design
-- `doc/CNN_TEST_TOOL.md` - C++ offline testing tool (deprecated)
-- `training/train_cnn_v2.py` - Training script with binary export
-- `workspaces/main/cnn_v2_weights.bin` - Trained weights
diff --git a/doc/HOWTO.md b/doc/HOWTO.md
index 0dc9ec7..a309b27 100644
--- a/doc/HOWTO.md
+++ b/doc/HOWTO.md
@@ -145,31 +145,31 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
**Complete Pipeline** (recommended):
```bash
# Train → Export → Build → Validate (default config)
-./scripts/train_cnn_v2_full.sh
+./cnn_v2/scripts/train_cnn_v2_full.sh
# Rapid debug (1 layer, 3×3, 5 epochs)
-./scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin
+./cnn_v2/scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin
# Custom training parameters
-./scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100
+./cnn_v2/scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100
# Custom architecture
-./scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1
+./cnn_v2/scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1
# Custom output path
-./scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin
+./cnn_v2/scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin
# Grayscale loss (compute loss on luminance instead of RGBA)
-./scripts/train_cnn_v2_full.sh --grayscale-loss
+./cnn_v2/scripts/train_cnn_v2_full.sh --grayscale-loss
# Custom directories
-./scripts/train_cnn_v2_full.sh --input training/input --target training/target_2
+./cnn_v2/scripts/train_cnn_v2_full.sh --input training/input --target training/target_2
# Full-image mode (instead of patch-based)
-./scripts/train_cnn_v2_full.sh --full-image --image-size 256
+./cnn_v2/scripts/train_cnn_v2_full.sh --full-image --image-size 256
# See all options
-./scripts/train_cnn_v2_full.sh --help
+./cnn_v2/scripts/train_cnn_v2_full.sh --help
```
**Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector).
@@ -184,33 +184,33 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
**Validation Only** (skip training):
```bash
# Use latest checkpoint
-./scripts/train_cnn_v2_full.sh --validate
+./cnn_v2/scripts/train_cnn_v2_full.sh --validate
# Use specific checkpoint
-./scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth
+./cnn_v2/scripts/train_cnn_v2_full.sh --validate checkpoints/checkpoint_epoch_50.pth
```
**Manual Training:**
```bash
# Default config
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--epochs 100 --batch-size 16 --checkpoint-every 5
# Custom architecture (per-layer kernel sizes)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--kernel-sizes 1,3,5 \
--epochs 5000 --batch-size 16
# Mip-level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--mip-level 1 \
--epochs 100 --batch-size 16
# Grayscale loss (compute loss on luminance Y = 0.299*R + 0.587*G + 0.114*B)
-./training/train_cnn_v2.py \
+./cnn_v2/training/train_cnn_v2.py \
--input training/input/ --target training/target_2/ \
--grayscale-loss \
--epochs 100 --batch-size 16
@@ -236,7 +236,7 @@ Use `--quiet` for streamlined output in scripts (used automatically by train_cnn
```
-**Validation:** Use HTML tool (`tools/cnn_v2_test/index.html`) for CNN v2 validation. See `doc/CNN_V2_WEB_TOOL.md`.
+**Validation:** Use HTML tool (`cnn_v2/tools/cnn_v2_test/index.html`) for CNN v2 validation. See `cnn_v2/docs/CNN_V2_WEB_TOOL.md`.
---