From 161a59fa50bb92e3664c389fa03b95aefe349b3f Mon Sep 17 00:00:00 2001
From: skal
Date: Sun, 15 Feb 2026 18:44:17 +0100
Subject: refactor(cnn): isolate CNN v2 to cnn_v2/ subdirectory

Move all CNN v2 files to a dedicated cnn_v2/ directory to prepare for
CNN v3 development. Zero functional changes.

Structure:
- cnn_v2/src/      - C++ effect implementation
- cnn_v2/shaders/  - WGSL shaders (6 files)
- cnn_v2/weights/  - Binary weights (3 files)
- cnn_v2/training/ - Python training scripts (4 files)
- cnn_v2/scripts/  - Shell scripts (train_cnn_v2_full.sh)
- cnn_v2/tools/    - Validation tools (HTML)
- cnn_v2/docs/     - Documentation (4 markdown files)

Changes:
- Update CMake source list to cnn_v2/src/cnn_v2_effect.cc
- Update assets.txt with relative paths to cnn_v2/
- Update includes to ../../cnn_v2/src/cnn_v2_effect.h
- Add PROJECT_ROOT resolution to Python/shell scripts
- Update doc references in HOWTO.md, TODO.md
- Add cnn_v2/README.md

Verification: 34/34 tests passing, demo runs correctly.

Co-Authored-By: Claude Sonnet 4.5
---
 cnn_v2/docs/CNN_V2.md               | 813 ++++++++++++++++++++++++++++++++++++
 cnn_v2/docs/CNN_V2_BINARY_FORMAT.md | 235 +++++++++++
 cnn_v2/docs/CNN_V2_DEBUG_TOOLS.md   | 143 +++++++
 cnn_v2/docs/CNN_V2_WEB_TOOL.md      | 348 +++++++++++++++
 4 files changed, 1539 insertions(+)
 create mode 100644 cnn_v2/docs/CNN_V2.md
 create mode 100644 cnn_v2/docs/CNN_V2_BINARY_FORMAT.md
 create mode 100644 cnn_v2/docs/CNN_V2_DEBUG_TOOLS.md
 create mode 100644 cnn_v2/docs/CNN_V2_WEB_TOOL.md

diff --git a/cnn_v2/docs/CNN_V2.md b/cnn_v2/docs/CNN_V2.md
new file mode 100644
index 0000000..b7fd6f8
--- /dev/null
+++ b/cnn_v2/docs/CNN_V2.md
@@ -0,0 +1,813 @@
# CNN v2: Parametric Static Features

**Technical Design Document**

---

## Overview

CNN v2 extends the original CNN post-processing effect with parametric static features, enabling richer spatial and frequency-domain inputs for improved visual quality.

**Key improvements over v1:**
- 7D static feature input (vs. 4D RGBD)
- Multi-frequency position encoding (NeRF-style)
- Configurable mip level for p0-p3 parametric features (0-3)
- Per-layer configurable kernel sizes (1×1, 3×3, 5×5)
- Variable channel counts per layer
- Float16 weight storage (~3.2 KB for 3-layer model)
- Bias integrated as static feature dimension
- Storage buffer architecture (dynamic layer count)
- Binary weight format v2 for runtime loading
- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping)

**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational.

**Breaking Change:**
- Models trained with the previous `clamp()` activation are incompatible; retraining is required.
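
To make the activation change concrete, here is a minimal PyTorch sketch of the two output activations (illustrative only, not the training script itself):

```python
import torch

x = torch.linspace(-3.0, 3.0, 7, requires_grad=True)

# v1-style output: hard clamp. The gradient of clamp(x, 0, 1) is zero
# wherever x < 0 or x > 1, so saturated outputs stop receiving updates
# ("gradient blocking at boundaries").
y_v1 = torch.clamp(x, 0.0, 1.0)

# v2 output: sigmoid. Its gradient sigmoid(x) * (1 - sigmoid(x)) is
# nonzero everywhere, so saturated outputs can still be pulled back
# toward the target during training.
y_v2 = torch.sigmoid(x)
```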
+ +**TODO:** +- 8-bit quantization with QAT for 2× size reduction (~1.6 KB) + +--- + +## Architecture + +### Pipeline Overview + +``` +Input RGBD → Static Features Compute → CNN Layers → Output RGBA + └─ computed once/frame ─┘ └─ multi-pass ─┘ +``` + +**Detailed Data Flow:** + +``` + ┌─────────────────────────────────────────┐ + │ Static Features (computed once) │ + │ 8D: p0,p1,p2,p3,uv_x,uv_y,sin10x,bias │ + └──────────────┬──────────────────────────┘ + │ + │ 8D (broadcast to all layers) + ├───────────────────────────┐ + │ │ + ┌──────────────┐ │ │ + │ Input RGBD │──────────────┤ │ + │ 4D │ 4D │ │ + └──────────────┘ │ │ + ▼ │ + ┌────────────┐ │ + │ Layer 0 │ (12D input) │ + │ (CNN) │ = 4D + 8D │ + │ 12D → 4D │ │ + └─────┬──────┘ │ + │ 4D output │ + │ │ + ├───────────────────────────┘ + │ │ + ▼ │ + ┌────────────┐ │ + │ Layer 1 │ (12D input) │ + │ (CNN) │ = 4D + 8D │ + │ 12D → 4D │ │ + └─────┬──────┘ │ + │ 4D output │ + │ │ + ├───────────────────────────┘ + ▼ │ + ... │ + │ │ + ▼ │ + ┌────────────┐ │ + │ Layer N │ (12D input) │ + │ (output) │◄──────────────────┘ + │ 12D → 4D │ + └─────┬──────┘ + │ 4D (RGBA) + ▼ + Output +``` + +**Key Points:** +- Static features computed once, broadcast to all CNN layers +- Each layer: previous 4D output + 8D static → 12D input → 4D output +- Ping-pong buffering between layers +- Layer 0 special case: uses input RGBD instead of previous layer output + +**Static Features Texture:** +- Name: `static_features` +- Format: `texture_storage_2d` (4×u32) +- Data: 8 float16 values packed via `pack2x16float()` +- Computed once per frame, read by all CNN layers +- Lifetime: Entire frame (all CNN layer passes) + +**CNN Layers:** +- Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels +- Layer 1+: previous output (4D) + static (8D) = 12D → 4 channels +- All layers: uniform 12D input, 4D output (ping-pong buffer) +- Storage: `texture_storage_2d` (4 channels as 2×f16 pairs) + +**Activation Functions:** +- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping +- Middle layers: `ReLU` (max(0, x)) +- Rationale: Sigmoid prevents gradient blocking at boundaries, enabling better convergence +- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible, retrain required + +--- + +## Static Features (7D + 1 bias) + +### Feature Layout + +**8 float16 values per pixel:** + +```wgsl +// Slot 0-3: Parametric features (p0, p1, p2, p3) +// Sampled from configurable mip level (0=original, 1=half, 2=quarter, 3=eighth) +// Training sets mip_level via --mip-level flag, stored in binary format v2 +let p0 = ...; // RGB.r from selected mip level +let p1 = ...; // RGB.g from selected mip level +let p2 = ...; // RGB.b from selected mip level +let p3 = ...; // Depth or RGB channel from mip level + +// Slot 4-5: UV coordinates (normalized screen space) +let uv_x = coord.x / resolution.x; // Horizontal position [0,1] +let uv_y = coord.y / resolution.y; // Vertical position [0,1] + +// Slot 6: Multi-frequency position encoding +let sin20_y = sin(20.0 * uv_y); // Periodic feature (frequency=20, vertical) + +// Slot 7: Bias dimension (always 1.0) +let bias = 1.0; // Learned bias per output channel + +// Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0] +``` + +### Input Channel Mapping + +**Weight tensor layout (12 input channels per layer):** + +| Input Channel | Feature | Description | +|--------------|---------|-------------| +| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) | +| 4-11 | Static features | 8D: p0, p1, p2, 
p3, uv_x, uv_y, sin20_y, bias | + +**Static feature channel details:** +- Channel 4 → p0 (RGB.r from mip level) +- Channel 5 → p1 (RGB.g from mip level) +- Channel 6 → p2 (RGB.b from mip level) +- Channel 7 → p3 (depth or RGB channel from mip level) +- Channel 8 → p4 (uv_x: normalized horizontal position) +- Channel 9 → p5 (uv_y: normalized vertical position) +- Channel 10 → p6 (sin(20*uv_y): periodic encoding) +- Channel 11 → p7 (bias: constant 1.0) + +**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7. + +### Feature Rationale + +| Feature | Dimension | Purpose | Priority | +|---------|-----------|---------|----------| +| p0-p3 | 4D | Parametric auxiliary features (mips, gradients, etc.) | Essential | +| UV coords | 2D | Spatial position awareness | Essential | +| sin(20\*uv.y) | 1D | Periodic position encoding (vertical) | Medium | +| Bias | 1D | Learned bias (standard NN) | Essential | + +**Note:** Input image RGBD (mip 0) fed only to Layer 0. Subsequent layers see static features + previous layer output. + +**Why bias as static feature:** +- Simpler shader code (single weight array) +- Standard NN formulation: y = Wx (x includes bias term) +- Saves 56-112 bytes (no separate bias buffer) +- 7 features sufficient for initial implementation + +### Future Feature Extensions + +**Option: Additional encodings:** +- `sin(40*uv.y)` - Higher frequency encoding +- `gray_mip1` - Multi-scale luminance +- `dx`, `dy` - Sobel gradients +- `variance` - Local texture measure +- `laplacian` - Edge detection + +**Option: uint8 packing (16+ features):** +```wgsl +// texture_storage_2d stores 16 uint8 values +// Trade precision for feature count +// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, +// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, var, bias] +``` +Requires quantization-aware training. + +--- + +## Layer Structure + +### Example 3-Layer Network + +``` +Layer 0: input RGBD (4D) + static (8D) = 12D → 4 channels (3×3 kernel) +Layer 1: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel) +Layer 2: previous (4D) + static (8D) = 12D → 4 channels (3×3 kernel, output RGBA) +``` + +**Output:** 4 channels (RGBA). Training targets preserve alpha from target images. 
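
The parameter counts in the next section follow mechanically from this uniform 12D→4D layout; a few lines of Python reproduce them (sanity check only):

```python
IN_CH, OUT_CH = 12, 4  # 4D previous output + 8D static -> 4D RGBA

def weights_per_layer(kernel_size: int) -> int:
    # Bias-free convolution: in_channels * k * k * out_channels
    return IN_CH * kernel_size * kernel_size * OUT_CH

total = sum(weights_per_layer(k) for k in (3, 3, 3))
print(total, 2 * total)  # 1296 weights, 2592 bytes as f16
```
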
### Weight Calculations

**Per-layer weights (uniform 12D→4D, 3×3 kernels):**
```
Layer 0: 12 × 3 × 3 × 4 = 432 weights
Layer 1: 12 × 3 × 3 × 4 = 432 weights
Layer 2: 12 × 3 × 3 × 4 = 432 weights
Total:   1296 weights
```

**Storage sizes:**
- f32: 1296 × 4 = 5,184 bytes (~5.1 KB)
- f16: 1296 × 2 = 2,592 bytes (~2.5 KB) ✓ **recommended**

**Comparison to v1:**
- v1: ~800 weights (3.2 KB f32)
- v2: ~1296 weights (2.5 KB f16)
- **Uniform architecture, smaller than v1 f32**

### Kernel Size Guidelines

**1×1 kernel (pointwise):**
- No spatial context, channel mixing only
- Weights: `12 × 4 = 48` per layer
- Use for: Fast inference, channel remapping

**3×3 kernel (standard conv):**
- Local spatial context (recommended)
- Weights: `12 × 9 × 4 = 432` per layer
- Use for: Most layers (balanced quality/size)

**5×5 kernel (large receptive field):**
- Wide spatial context
- Weights: `12 × 25 × 4 = 1200` per layer
- Use for: Output layer, fine detail enhancement

### Channel Storage (4×f16 per texel)

```wgsl
@group(0) @binding(1) var layer_input: texture_2d<u32>;

fn unpack_channels(coord: vec2<i32>) -> vec4<f32> {
    let packed = textureLoad(layer_input, coord, 0);
    let v0 = unpack2x16float(packed.x); // [ch0, ch1]
    let v1 = unpack2x16float(packed.y); // [ch2, ch3]
    return vec4<f32>(v0.x, v0.y, v1.x, v1.y);
}

fn pack_channels(values: vec4<f32>) -> vec4<u32> {
    return vec4<u32>(
        pack2x16float(vec2<f32>(values.x, values.y)),
        pack2x16float(vec2<f32>(values.z, values.w)),
        0u, // Unused
        0u  // Unused
    );
}
```

---

## Training Workflow

### Script: `training/train_cnn_v2.py`

**Static Feature Extraction:**

```python
import cv2
import numpy as np

def compute_static_features(rgb, depth, mip_level=0):
    """Generate parametric features (8D: p0-p3 + spatial).

    Args:
        mip_level: 0=original, 1=half res, 2=quarter res, 3=eighth res
    """
    h, w = rgb.shape[:2]

    # Generate mip level for p0-p3 (downsample then upsample)
    if mip_level > 0:
        mip_rgb = rgb.copy()
        for _ in range(mip_level):
            mip_rgb = cv2.pyrDown(mip_rgb)
        for _ in range(mip_level):
            mip_rgb = cv2.pyrUp(mip_rgb)
        if mip_rgb.shape[:2] != (h, w):
            mip_rgb = cv2.resize(mip_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
    else:
        mip_rgb = rgb

    # Parametric features from mip level
    p0, p1, p2, p3 = mip_rgb[..., 0], mip_rgb[..., 1], mip_rgb[..., 2], depth

    # UV coordinates (normalized)
    uv_x = np.linspace(0, 1, w)[None, :].repeat(h, axis=0)
    uv_y = np.linspace(0, 1, h)[:, None].repeat(w, axis=1)

    # Multi-frequency position encoding
    sin10_x = np.sin(10.0 * uv_x)

    # Bias dimension (always 1.0)
    bias = np.ones_like(p0)

    # Stack: [p0, p1, p2, p3, uv.x, uv.y, sin10_x, bias]
    return np.stack([p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias], axis=-1)
```

**Network Definition:**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNv2(nn.Module):
    def __init__(self, kernel_sizes, num_layers=3):
        super().__init__()
        if isinstance(kernel_sizes, int):
            kernel_sizes = [kernel_sizes] * num_layers
        self.kernel_sizes = kernel_sizes
        self.layers = nn.ModuleList()

        # All layers: 12D input (4 prev + 8 static) → 4D output
        for kernel_size in kernel_sizes:
            self.layers.append(
                nn.Conv2d(12, 4, kernel_size=kernel_size,
                          padding=kernel_size // 2, bias=False)
            )

    def forward(self, input_rgbd, static_features):
        # Layer 0: input RGBD (4D) + static (8D) = 12D
        x = torch.cat([input_rgbd, static_features], dim=1)
        x = self.layers[0](x)
        x = torch.sigmoid(x)  # Soft [0,1] for layer 0

        # Layer 1+: previous output (4D) + static (8D) = 12D
        for i in range(1, 
len(self.layers)):
            x_input = torch.cat([x, static_features], dim=1)
            x = self.layers[i](x_input)
            if i < len(self.layers) - 1:
                x = F.relu(x)
            else:
                x = torch.sigmoid(x)  # Soft [0,1] for final layer

        return x  # RGBA output
```

**Training Configuration:**

```python
# Hyperparameters
kernel_sizes = [3, 3, 3]  # Per-layer kernel sizes (e.g., [1,3,5])
num_layers = 3            # Number of CNN layers
mip_level = 0             # Mip level for p0-p3: 0=orig, 1=half, 2=quarter, 3=eighth
grayscale_loss = False    # Compute loss on grayscale (Y) instead of RGBA
learning_rate = 1e-3
batch_size = 16
epochs = 5000

# Dataset: Input RGB, Target RGBA (preserves alpha channel from image)
# Model outputs RGBA, loss compares all 4 channels (or grayscale if --grayscale-loss)

# Training loop (standard PyTorch f32)
for epoch in range(epochs):
    for rgb_batch, depth_batch, target_batch in dataloader:
        # Compute static features (8D) with mip level
        static_feat = compute_static_features(rgb_batch, depth_batch, mip_level)

        # Input RGBD (4D)
        input_rgbd = torch.cat([rgb_batch, depth_batch.unsqueeze(1)], dim=1)

        # Forward pass
        output = model(input_rgbd, static_feat)

        # Loss computation (grayscale or RGBA)
        if grayscale_loss:
            # Convert RGBA to grayscale: Y = 0.299*R + 0.587*G + 0.114*B
            output_gray = 0.299 * output[:, 0:1] + 0.587 * output[:, 1:2] + 0.114 * output[:, 2:3]
            target_gray = 0.299 * target_batch[:, 0:1] + 0.587 * target_batch[:, 1:2] + 0.114 * target_batch[:, 2:3]
            loss = criterion(output_gray, target_gray)
        else:
            loss = criterion(output, target_batch)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

**Checkpoint Format:**

```python
torch.save({
    'state_dict': model.state_dict(),  # f32 weights
    'config': {
        'kernel_sizes': [3, 3, 3],  # Per-layer kernel sizes
        'num_layers': 3,
        'mip_level': 0,             # Mip level used for p0-p3
        'grayscale_loss': False,    # Whether grayscale loss was used
        'features': ['p0', 'p1', 'p2', 'p3', 'uv.x', 'uv.y', 'sin10_x', 'bias']
    },
    'epoch': epoch,
    'loss': loss.item()
}, f'checkpoints/checkpoint_epoch_{epoch}.pth')
```

---

## Export Workflow

### Script: `training/export_cnn_v2_shader.py`

**Process:**
1. Load checkpoint (f32 PyTorch weights)
2. Extract layer configs (kernels, channels)
3. Quantize weights to float16: `weights_f16 = weights_f32.astype(np.float16)`
4. Generate WGSL shader per layer
5. Write to `workspaces/<workspace>/shaders/cnn_v2/cnn_v2_*.wgsl`

**Example Generated Shader:**

```wgsl
// cnn_v2_layer_0.wgsl - Auto-generated from checkpoint_epoch_5000.pth

const KERNEL_SIZE: u32 = 1u;
const IN_CHANNELS: u32 = 8u; // 7 features + bias
const OUT_CHANNELS: u32 = 16u;

// Weights quantized to float16 (stored as f32 in shader)
const weights: array<f32, 128> = array<f32, 128>(
    0.123047, -0.089844, 0.234375, 0.456055, ... 
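    /* remaining weight literals elided in this excerpt; the export script
       emits the full IN_CHANNELS × OUT_CHANNELS (128-value) array here */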
+); + +@group(0) @binding(0) var static_features: texture_2d; +@group(0) @binding(1) var output_texture: texture_storage_2d; + +@compute @workgroup_size(8, 8) +fn main(@builtin(global_invocation_id) id: vec3) { + // Load static features (8D) + let static_feat = get_static_features(vec2(id.xy)); + + // Convolution (1×1 kernel = pointwise) + var output: array; + for (var c: u32 = 0u; c < OUT_CHANNELS; c++) { + var sum: f32 = 0.0; + for (var k: u32 = 0u; k < IN_CHANNELS; k++) { + sum += weights[c * IN_CHANNELS + k] * static_feat[k]; + } + output[c] = max(0.0, sum); // ReLU activation + } + + // Pack and store (8×f16 per texel) + textureStore(output_texture, vec2(id.xy), pack_f16x8(output)); +} +``` + +**Float16 Quantization:** +- Training uses f32 throughout (PyTorch standard) +- Export converts to np.float16, then back to f32 for WGSL literals +- **Expected discrepancy:** <0.1% MSE (acceptable) +- Validation via HTML tool (see below) + +--- + +## Validation Workflow + +### HTML Tool: `tools/cnn_v2_test/index.html` + +**WebGPU-based testing tool** with layer visualization. + +**Usage:** +1. Open `tools/cnn_v2_test/index.html` in browser +2. Drop `.bin` weights file (from `export_cnn_v2_weights.py`) +3. Drop PNG test image +4. View results with layer inspection + +**Features:** +- Live CNN inference with WebGPU +- Layer-by-layer visualization (static features + all CNN layers) +- Weight visualization (per-layer kernels) +- View modes: CNN output, original, diff (×10) +- Blend control for comparing with original + +**Export weights:** +```bash +./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \ + --output-weights workspaces/main/cnn_v2_weights.bin +``` + +See `doc/CNN_V2_WEB_TOOL.md` for detailed documentation + +--- + +## Implementation Checklist + +### Phase 1: Shaders (Core Infrastructure) + +- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl` - Static features compute + - [ ] RGBD sampling from framebuffer + - [ ] UV coordinate calculation + - [ ] sin(10\*uv.x) computation + - [ ] Bias dimension (constant 1.0) + - [ ] Float16 packing via `pack2x16float()` + - [ ] Output to `texture_storage_2d` + +- [ ] `workspaces/main/shaders/cnn_v2/cnn_v2_layer_template.wgsl` - Layer template + - [ ] Static features unpacking + - [ ] Previous layer unpacking (8×f16) + - [ ] Convolution implementation (1×1, 3×3, 5×5) + - [ ] ReLU activation + - [ ] Output packing (8×f16) + - [ ] Proper padding handling + +### Phase 2: C++ Effect Class + +- [ ] `src/effects/cnn_v2_effect.h` - Header + - [ ] Class declaration inheriting from `PostProcessEffect` + - [ ] Static features texture member + - [ ] Layer textures vector + - [ ] Pipeline and bind group members + +- [ ] `src/effects/cnn_v2_effect.cc` - Implementation + - [ ] Constructor: Load shaders, create textures + - [ ] `init()`: Create pipelines, bind groups + - [ ] `render()`: Multi-pass execution + - [ ] Pass 0: Compute static features + - [ ] Pass 1-N: CNN layers + - [ ] Final: Composite to output + - [ ] Proper resource cleanup + +- [ ] Integration + - [ ] Add to `src/gpu/demo_effects.h` includes + - [ ] Add `cnn_v2_effect.cc` to `CMakeLists.txt` (headless + normal) + - [ ] Add shaders to `workspaces/main/assets.txt` + - [ ] Add to `src/tests/gpu/test_demo_effects.cc` + +### Phase 3: Training Pipeline + +- [ ] `training/train_cnn_v2.py` - Training script + - [ ] Static feature extraction function + - [ ] CNNv2 PyTorch model class + - [ ] Patch-based dataloader + - [ ] Training loop with checkpointing + - [ ] Command-line 
argument parsing + - [ ] Inference mode (ground truth generation) + +- [ ] `training/export_cnn_v2_shader.py` - Export script + - [ ] Checkpoint loading + - [ ] Weight extraction and f16 quantization + - [ ] Per-layer WGSL generation + - [ ] File output to workspace shaders/ + - [ ] Metadata preservation + +### Phase 4: Tools & Validation + +- [x] HTML validation tool - WebGPU inference with layer visualization + - [ ] Command-line argument parsing + - [ ] Shader export orchestration + - [ ] Build orchestration + - [ ] Batch image processing + - [ ] Results display + +- [ ] `src/tools/cnn_test_main.cc` - Tool updates + - [ ] Add `--cnn-version v2` flag + - [ ] CNNv2Effect instantiation path + - [ ] Static features pass execution + - [ ] Multi-layer processing + +### Phase 5: Documentation + +- [ ] `doc/HOWTO.md` - Usage guide + - [ ] Training section (CNN v2) + - [ ] Export section + - [ ] Validation section + - [ ] Examples + +- [ ] `README.md` - Project overview update + - [ ] Mention CNN v2 capability + +--- + +## File Structure + +### New Files + +``` +# Shaders (generated by export script) +workspaces/main/shaders/cnn_v2/cnn_v2_static.wgsl # Static features compute +workspaces/main/shaders/cnn_v2/cnn_v2_layer_0.wgsl # Input layer (generated) +workspaces/main/shaders/cnn_v2/cnn_v2_layer_1.wgsl # Inner layer (generated) +workspaces/main/shaders/cnn_v2/cnn_v2_layer_2.wgsl # Output layer (generated) + +# C++ implementation +src/effects/cnn_v2_effect.h # Effect class header +src/effects/cnn_v2_effect.cc # Effect implementation + +# Python training/export +training/train_cnn_v2.py # Training script +training/export_cnn_v2_shader.py # Shader generator +training/validation/ # Test images directory + +# Validation +tools/cnn_v2_test/index.html # WebGPU validation tool + +# Documentation +doc/CNN_V2.md # This file +``` + +### Modified Files + +``` +src/gpu/demo_effects.h # Add CNNv2Effect include +CMakeLists.txt # Add cnn_v2_effect.cc +workspaces/main/assets.txt # Add cnn_v2 shaders +workspaces/main/timeline.seq # Optional: add CNNv2Effect +src/tests/gpu/test_demo_effects.cc # Add CNNv2 test case +src/tools/cnn_test_main.cc # Add --cnn-version v2 +doc/HOWTO.md # Add CNN v2 sections +TODO.md # Add CNN v2 task +``` + +### Unchanged (v1 Preserved) + +``` +training/train_cnn.py # Original training +src/effects/cnn_effect.* # Original effect +workspaces/main/shaders/cnn_*.wgsl # Original v1 shaders +``` + +--- + +## Performance Characteristics + +### Static Features Compute +- **Cost:** ~0.1ms @ 1080p +- **Frequency:** Once per frame +- **Operations:** sin(), texture sampling, packing + +### CNN Layers (Example 3-layer) +- **Layer0 (1×1, 8→16):** ~0.3ms +- **Layer1 (3×3, 23→8):** ~0.8ms +- **Layer2 (5×5, 15→4):** ~1.2ms +- **Total:** ~2.4ms @ 1080p + +### Memory Usage +- Static features: 1920×1080×8×2 = 33 MB (f16) +- Layer buffers: 1920×1080×16×2 = 66 MB (max 16 channels) +- Weights: ~6.4 KB (f16, in shader code) +- **Total GPU memory:** ~100 MB + +--- + +## Size Budget + +### CNN v1 vs v2 + +| Metric | v1 | v2 | Delta | +|--------|----|----|-------| +| Weights (count) | 800 | 3268 | +2468 | +| Storage (f32) | 3.2 KB | 13.1 KB | +9.9 KB | +| Storage (f16) | N/A | 6.5 KB | +6.5 KB | +| Shader code | ~500 lines | ~800 lines | +300 lines | + +### Mitigation Strategies + +**Reduce channels:** +- [16,8,4] → [8,4,4] saves ~50% weights +- [16,8,4] → [4,4,4] saves ~60% weights + +**Smaller kernels:** +- [1,3,5] → [1,3,3] saves ~30% weights +- [1,3,5] → [1,1,3] saves ~50% weights + +**Quantization:** +- 
int8 weights: saves 75% (requires QAT training) +- 4-bit weights: saves 87.5% (extreme, needs research) + +**Target:** Keep CNN v2 under 10 KB for 64k demo constraint + +--- + +## Future Extensions + +### Flexible Feature Layout (Binary Format v3) + +**TODO:** Support arbitrary feature vector layouts and ordering in binary format. + +**Current Limitation:** +- Feature layout hardcoded: `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]` +- Shader must match training script exactly +- Experimentation requires shader recompilation + +**Proposed Enhancement:** +- Add feature descriptor to binary format header +- Specify feature types, sources, and ordering +- Runtime shader generation or dynamic feature indexing +- Examples: `[R, G, B, dx, dy, uv_x, bias]` or `[mip1.r, mip2.g, laplacian, uv_x, sin20_x, bias]` + +**Benefits:** +- Training experiments without C++/shader changes +- A/B test different feature combinations +- Single binary format, multiple architectures +- Faster iteration on feature engineering + +**Implementation Options:** +1. **Static approach:** Generate shader code from descriptor at load time +2. **Dynamic approach:** Array-based indexing with feature map uniform +3. **Hybrid:** Precompile common layouts, fallback to dynamic + +See `doc/CNN_V2_BINARY_FORMAT.md` for proposed descriptor format. + +--- + +### More Features (uint8 Packing) + +```wgsl +// 16 uint8 features per texel (texture_storage_2d) +// [R, G, B, D, uv.x, uv.y, sin10.x, sin10.y, +// sin20.x, sin20.y, dx, dy, gray_mip1, gray_mip2, variance, bias] +``` +- Trade precision for quantity +- Requires quantization-aware training + +### Temporal Features + +- Previous frame RGBA (motion awareness) +- Optical flow vectors +- Requires multi-frame buffer + +### Learned Position Encodings + +- Replace hand-crafted sin(10\*uv) with learned embeddings +- Requires separate embedding network +- Similar to NeRF position encoding + +### Dynamic Architecture + +- Runtime kernel size selection based on scene +- Conditional layer execution (skip connections) +- Layer pruning for performance + +--- + +## References + +- **v1 Implementation:** `src/effects/cnn_effect.*` +- **Training Guide:** `doc/HOWTO.md` (CNN Training section) +- **Test Tool:** `doc/CNN_TEST_TOOL.md` +- **Shader System:** `doc/SEQUENCE.md` +- **Size Measurement:** `doc/SIZE_MEASUREMENT.md` + +--- + +## Appendix: Design Decisions + +### Why Bias as Static Feature? + +**Alternatives considered:** +1. Separate bias array per layer (Option B) +2. Bias as static feature = 1.0 (Option A, chosen) + +**Decision rationale:** +- Simpler shader code (fewer bindings) +- Standard NN formulation (augmented input) +- Saves 56-112 bytes per model +- 7 features sufficient for v1 implementation +- Can extend to uint8 packing if >7 features needed + +### Why Float16 for Weights? + +**Alternatives considered:** +1. Keep f32 (larger, more accurate) +2. Use f16 (smaller, GPU-native) +3. Use int8 (smallest, needs QAT) + +**Decision rationale:** +- f16 saves 50% vs f32 (critical for 64k target) +- GPU-native support (pack2x16float in WGSL) +- <0.1% accuracy loss (acceptable) +- Simpler than int8 quantization + +### Why Multi-Frequency Position Encoding? 
+ +**Inspiration:** NeRF (Neural Radiance Fields) + +**Benefits:** +- Helps network learn high-frequency details +- Better than raw UV coordinates +- Small footprint (1D per frequency) + +**Future:** Add sin(20\*uv), sin(40\*uv) if >7 features available + +--- + +## Related Documentation + +- `doc/CNN_V2_BINARY_FORMAT.md` - Binary weight file specification (.bin format) +- `doc/CNN_V2_WEB_TOOL.md` - WebGPU testing tool with layer visualization +- `doc/CNN_TEST_TOOL.md` - C++ offline validation tool (deprecated) +- `doc/HOWTO.md` - Training and validation workflows + +--- + +**Document Version:** 1.0 +**Last Updated:** 2026-02-12 +**Status:** Design approved, ready for implementation diff --git a/cnn_v2/docs/CNN_V2_BINARY_FORMAT.md b/cnn_v2/docs/CNN_V2_BINARY_FORMAT.md new file mode 100644 index 0000000..59c859d --- /dev/null +++ b/cnn_v2/docs/CNN_V2_BINARY_FORMAT.md @@ -0,0 +1,235 @@ +# CNN v2 Binary Weight Format Specification + +Binary format for storing trained CNN v2 weights with static feature architecture. + +**File Extension:** `.bin` +**Byte Order:** Little-endian +**Version:** 2.0 (supports mip-level for parametric features) +**Backward Compatible:** Version 1.0 files supported (mip_level=0) + +--- + +## File Structure + +**Version 2 (current):** +``` +┌─────────────────────┐ +│ Header (20 bytes) │ +├─────────────────────┤ +│ Layer Info │ +│ (20 bytes × N) │ +├─────────────────────┤ +│ Weight Data │ +│ (variable size) │ +└─────────────────────┘ +``` + +**Version 1 (legacy):** +``` +┌─────────────────────┐ +│ Header (16 bytes) │ +├─────────────────────┤ +│ Layer Info │ +│ (20 bytes × N) │ +├─────────────────────┤ +│ Weight Data │ +│ (variable size) │ +└─────────────────────┘ +``` + +--- + +## Header + +**Version 2 (20 bytes):** + +| Offset | Type | Field | Description | +|--------|------|----------------|--------------------------------------| +| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | +| 0x04 | u32 | version | Format version (2 for current) | +| 0x08 | u32 | num_layers | Number of CNN layers (excludes static features) | +| 0x0C | u32 | total_weights | Total f16 weight count across all layers | +| 0x10 | u32 | mip_level | Mip level for p0-p3 features (0=original, 1=half, 2=quarter, 3=eighth) | + +**Version 1 (16 bytes) - Legacy:** + +| Offset | Type | Field | Description | +|--------|------|----------------|--------------------------------------| +| 0x00 | u32 | magic | Magic number: `0x32_4E_4E_43` ("CNN2") | +| 0x04 | u32 | version | Format version (1) | +| 0x08 | u32 | num_layers | Number of CNN layers | +| 0x0C | u32 | total_weights | Total f16 weight count | + +**Note:** Loaders should check version field and handle both formats. Version 1 files treated as mip_level=0. + +--- + +## Layer Info (20 bytes per layer) + +Repeated `num_layers` times: +- **Version 2:** Starting at offset 0x14 (20 bytes) +- **Version 1:** Starting at offset 0x10 (16 bytes) + +| Offset | Type | Field | Description | +|-------------|------|----------------|--------------------------------------| +| 0x00 | u32 | kernel_size | Convolution kernel dimension (3, 5, 7, etc.) | +| 0x04 | u32 | in_channels | Input channel count (includes 8 static features for Layer 1) | +| 0x08 | u32 | out_channels | Output channel count (max 8) | +| 0x0C | u32 | weight_offset | Weight array start index (f16 units, relative to weight data section) | +| 0x10 | u32 | weight_count | Number of f16 weights for this layer | + +**Layer Order:** Sequential (Layer 1, Layer 2, Layer 3, ...) 
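
The two tables above are enough to write a loader; below is a minimal Python reader sketch (field names follow the tables, weight packing is described in the next section, error handling omitted):

```python
import struct

import numpy as np

def read_cnn_v2(path):
    data = open(path, "rb").read()
    magic, version, num_layers, total_weights = struct.unpack_from("<4I", data, 0)
    assert magic == 0x324E4E43, "not a CNN2 file"  # "CNN2" in little-endian
    assert version in (1, 2), "unsupported version"
    header_size = 20 if version == 2 else 16
    mip_level = struct.unpack_from("<I", data, 16)[0] if version == 2 else 0

    fields = ("kernel_size", "in_channels", "out_channels",
              "weight_offset", "weight_count")
    layers = [dict(zip(fields, struct.unpack_from("<5I", data, header_size + 20 * i)))
              for i in range(num_layers)]

    # Weight data: u32-packed f16 pairs, low half first (see next section).
    # In a little-endian file this is byte-identical to a flat f16 stream.
    weights = np.frombuffer(data, dtype="<f2", count=total_weights,
                            offset=header_size + 20 * num_layers)
    return {"mip_level": mip_level, "layers": layers, "weights": weights}
```
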
+ +--- + +## Weight Data (variable size) + +Starts at offset: +- **Version 2:** `20 + (num_layers × 20)` +- **Version 1:** `16 + (num_layers × 20)` + +**Format:** Packed f16 pairs stored as u32 +**Packing:** `u32 = (f16_hi << 16) | f16_lo` +**Storage:** Sequential by layer, then by output channel, input channel, spatial position + +**Weight Indexing:** +``` +weight_idx = output_ch × (in_channels × kernel_size²) + + input_ch × kernel_size² + + (ky × kernel_size + kx) +``` + +Where: +- `output_ch` ∈ [0, out_channels) +- `input_ch` ∈ [0, in_channels) +- `ky`, `kx` ∈ [0, kernel_size) + +**Unpacking f16 from u32:** +```c +uint32_t packed = weights_buffer[weight_idx / 2]; +uint16_t f16_bits = (weight_idx % 2 == 0) ? (packed & 0xFFFF) : (packed >> 16); +``` + +--- + +## Example: 3-Layer Network (Version 2) + +**Configuration:** +- Mip level: 0 (original resolution) +- Layer 0: 12→4, kernel 3×3 (432 weights) +- Layer 1: 12→4, kernel 3×3 (432 weights) +- Layer 2: 12→4, kernel 3×3 (432 weights) + +**File Layout:** +``` +Offset Size Content +------ ---- ------- +0x00 20 Header (magic, version=2, layers=3, weights=1296, mip_level=0) +0x14 20 Layer 0 info (kernel=3, in=12, out=4, offset=0, count=432) +0x28 20 Layer 1 info (kernel=3, in=12, out=4, offset=432, count=432) +0x3C 20 Layer 2 info (kernel=3, in=12, out=4, offset=864, count=432) +0x50 2592 Weight data (1296 u32 packed f16 pairs) + ---- +Total: 2672 bytes (~2.6 KB) +``` + +--- + +## Static Features + +Not stored in .bin file (computed at runtime): + +**8D Input Features:** +1. **p0** - Parametric feature 0 (from mip level) +2. **p1** - Parametric feature 1 (from mip level) +3. **p2** - Parametric feature 2 (from mip level) +4. **p3** - Parametric feature 3 (depth or from mip level) +5. **UV_X** - Normalized x coordinate [0,1] +6. **UV_Y** - Normalized y coordinate [0,1] +7. **sin(20 × UV_Y)** - Spatial frequency encoding (vertical, frequency=20) +8. **1.0** - Bias term + +**Mip Level Usage (p0-p3):** +- `mip_level=0`: RGB from original resolution (mip 0) +- `mip_level=1`: RGB from half resolution (mip 1), upsampled +- `mip_level=2`: RGB from quarter resolution (mip 2), upsampled +- `mip_level=3`: RGB from eighth resolution (mip 3), upsampled + +**Layer 0** receives input RGBD (4D) + static features (8D) = 12D input → 4D output. +**Layer 1+** receive previous layer output (4D) + static features (8D) = 12D input → 4D output. + +--- + +## Validation + +**Magic Check:** +```c +uint32_t magic; +fread(&magic, 4, 1, fp); +if (magic != 0x32_4E_4E_43) { error("Invalid CNN v2 file"); } +``` + +**Version Check:** +```c +uint32_t version; +fread(&version, 4, 1, fp); +if (version != 1 && version != 2) { error("Unsupported version"); } +uint32_t header_size = (version == 1) ? 16 : 20; +``` + +**Size Check:** +```c +expected_size = header_size + (num_layers × 20) + (total_weights × 2); +if (file_size != expected_size) { error("Size mismatch"); } +``` + +**Weight Offset Sanity:** +```c +// Each layer's offset should match cumulative count +uint32_t cumulative = 0; +for (int i = 0; i < num_layers; i++) { + if (layers[i].weight_offset != cumulative) { error("Invalid offset"); } + cumulative += layers[i].weight_count; +} +if (cumulative != total_weights) { error("Total mismatch"); } +``` + +--- + +## Future Extensions + +**TODO: Flexible Feature Layout** + +Current limitation: Feature vector layout is hardcoded as `[p0, p1, p2, p3, uv_x, uv_y, sin10_x, bias]`. 
+ +Proposed enhancement for version 3: +- Add feature descriptor section to header +- Specify feature count, types, and ordering +- Support arbitrary 7D feature combinations (e.g., `[R, G, B, dx, dy, uv_x, bias]`) +- Allow runtime shader generation based on descriptor +- Enable experimentation without recompiling shaders + +Example descriptor format: +``` +struct FeatureDescriptor { + u32 feature_count; // Number of features (typically 7-8) + u32 feature_types[8]; // Type enum per feature + u32 feature_sources[8]; // Source enum (mip0, mip1, gradient, etc.) + u32 reserved[8]; // Future use +} +``` + +Benefits: +- Training can experiment with different feature combinations +- No shader recompilation needed +- Single binary format supports multiple architectures +- Easier A/B testing of feature effectiveness + +--- + +## Related Files + +- `training/export_cnn_v2_weights.py` - Binary export tool +- `src/effects/cnn_v2_effect.cc` - C++ loader +- `tools/cnn_v2_test/index.html` - WebGPU validator +- `doc/CNN_V2.md` - Architecture design diff --git a/cnn_v2/docs/CNN_V2_DEBUG_TOOLS.md b/cnn_v2/docs/CNN_V2_DEBUG_TOOLS.md new file mode 100644 index 0000000..8d1289a --- /dev/null +++ b/cnn_v2/docs/CNN_V2_DEBUG_TOOLS.md @@ -0,0 +1,143 @@ +# CNN v2 Debugging Tools + +Tools for investigating CNN v2 mismatch between HTML tool and cnn_test. + +--- + +## Identity Weight Generator + +**Purpose:** Generate trivial .bin files with identity passthrough for debugging. + +**Script:** `training/gen_identity_weights.py` + +**Usage:** +```bash +# 1×1 identity (default) +./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity.bin + +# 3×3 identity +./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3 + +# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc) +./training/gen_identity_weights.py output.bin --mix + +# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3 +./training/gen_identity_weights.py output.bin --p47 + +# Custom mip level +./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2 +``` + +**Output:** +- Single layer, 12D→4D (4 input channels + 8 static features) +- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3} +- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow) +- Static mode (--p47): Output Ch{i} = Input Ch{i+4} (static features only, visualizes p4-p7) +- Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3) + +**Validation:** +Load in HTML tool or cnn_test - output should match input (RGB only, ignoring static features). + +--- + +## Composited Layer Visualization + +**Purpose:** Save current layer view as single composited image (4 channels side-by-side, grayscale). + +**Location:** HTML tool - "Layer Visualization" panel + +**Usage:** +1. Load image + weights in HTML tool +2. Select layer to visualize (Static 0-3, Static 4-7, Layer 0, Layer 1, etc.) +3. Click "Save Composited" button +4. Downloads PNG: `composited_layer{N}_{W}x{H}.png` + +**Output:** +- 4 channels stacked horizontally +- Grayscale representation +- Useful for comparing layer activations across tools + +--- + +## Debugging Strategy + +### Track a) Binary Conversion Chain + +**Hypothesis:** Conversion error in .bin ↔ base64 ↔ Float32Array + +**Test:** +1. Generate identity weights: + ```bash + ./training/gen_identity_weights.py workspaces/main/weights/test_identity.bin + ``` + +2. Load in HTML tool - output should match input RGB + +3. 
If mismatch: + - Check Python export: f16 packing in `export_cnn_v2_weights.py` line 105 + - Check HTML parsing: `unpackF16()` in `index.html` line 805-815 + - Check weight indexing: `get_weight()` shader function + +**Key locations:** +- Python: `np.float16` → `view(np.uint32)` (line 105 of export script) +- JS: `DataView` → `unpackF16()` → manual f16 decode (line 773-803) +- WGSL: `unpack2x16float()` built-in (line 492 of shader) + +### Track b) Layer Visualization + +**Purpose:** Confirm layer outputs match between HTML and C++ + +**Method:** +1. Run identical input through both tools +2. Save composited layers from HTML tool +3. Compare with cnn_test output +4. Use identity weights to isolate weight loading from computation + +### Track c) Trivial Test Case + +**Use identity weights to test:** +- Weight loading (binary parsing) +- Feature generation (static features) +- Convolution (should be passthrough) +- Output packing + +**Expected behavior:** +- Input RGB → Output RGB (exact match) +- Static features ignored (all zeros in identity matrix) + +--- + +## Known Issues + +### ~~Layer 0 Visualization Scale~~ [FIXED] + +**Issue:** Layer 0 output displayed at 0.5× brightness (divided by 2). + +**Cause:** Line 1530 used `vizScale = 0.5` for all CNN layers, but Layer 0 is clamped [0,1] and doesn't need dimming. + +**Fix:** Use scale 1.0 for Layer 0 output (layerIdx=1), 0.5 only for middle layers (ReLU, unbounded). + +### Remaining Mismatch + +**Current:** HTML tool and cnn_test produce different outputs for same input/weights. + +**Suspects:** +1. F16 unpacking difference (CPU vs GPU vs JS) +2. Static feature generation (RGBD, UV, sin encoding) +3. Convolution kernel iteration order +4. Output packing/unpacking + +**Next steps:** +1. Test with identity weights (eliminates weight loading) +2. Compare composited layer outputs +3. Add debug visualization for static features +4. Hex dump comparison (first 8 pixels) - use `--debug-hex` flag in cnn_test + +--- + +## Related Documentation + +- `doc/CNN_V2.md` - CNN v2 architecture +- `doc/CNN_V2_WEB_TOOL.md` - HTML tool documentation +- `doc/CNN_TEST_TOOL.md` - cnn_test CLI tool +- `training/export_cnn_v2_weights.py` - Binary export format diff --git a/cnn_v2/docs/CNN_V2_WEB_TOOL.md b/cnn_v2/docs/CNN_V2_WEB_TOOL.md new file mode 100644 index 0000000..b6f5b0b --- /dev/null +++ b/cnn_v2/docs/CNN_V2_WEB_TOOL.md @@ -0,0 +1,348 @@ +# CNN v2 Web Testing Tool + +Browser-based WebGPU tool for validating CNN v2 inference with layer visualization and weight inspection. 
+ +**Location:** `tools/cnn_v2_test/index.html` + +--- + +## Status (2026-02-13) + +**Working:** +- ✅ WebGPU initialization and device setup +- ✅ Binary weight file parsing (v1 and v2 formats) +- ✅ Automatic mip-level detection from binary format v2 +- ✅ Weight statistics (min/max per layer) +- ✅ UI layout with collapsible panels +- ✅ Mode switching (Activations/Weights tabs) +- ✅ Canvas context management (2D for weights, WebGPU for activations) +- ✅ Weight visualization infrastructure (layer selection, grid layout) +- ✅ Layer naming matches codebase convention (Layer 0, Layer 1, Layer 2) +- ✅ Static features split visualization (Static 0-3, Static 4-7) +- ✅ All layers visible including output layer (Layer 2) +- ✅ Video playback support (MP4, WebM) with frame-by-frame controls +- ✅ Video looping (automatic continuous playback) +- ✅ Mip level selection (p0-p3 features at different resolutions) + +**Recent Changes (Latest):** +- Binary format v2 support: Reads mip_level from 20-byte header +- Backward compatible: v1 (16-byte header) → mip_level=0 +- Auto-update UI dropdown when loading weights with mip_level +- Display mip_level in metadata panel +- Code refactoring: Extracted FULLSCREEN_QUAD_VS shader (reused 3× across pipelines) +- Added helper methods: `getDimensions()`, `setVideoControlsEnabled()` +- Improved code organization with section headers and comments +- Moved Mip Level selector to bottom of left sidebar (removed "Features (p0-p3)" label) +- Added `loop` attribute to video element for automatic continuous playback + +**Previous Fixes:** +- Fixed Layer 2 not appearing (was excluded from layerOutputs due to isOutput check) +- Fixed canvas context switching (force clear before recreation) +- Added Static 0-3 / Static 4-7 buttons to view all 8 static feature channels +- Aligned naming with train_cnn_v2.py/.wgsl: Layer 0, Layer 1, Layer 2 (not Layer 1, 2, 3) +- Disabled Static buttons in weights mode (no learnable weights) + +**Known Issues:** +- Layer activation visualization may show black if texture data not properly unpacked +- Weight kernel display depends on correct 2D context creation after canvas recreation + +--- + +## Architecture + +### File Structure +- Single-file HTML tool (~1100 lines) +- Embedded shaders: STATIC_SHADER, CNN_SHADER, DISPLAY_SHADER, LAYER_VIZ_SHADER +- Shared WGSL component: FULLSCREEN_QUAD_VS (reused across render pipelines) +- **Embedded default weights:** DEFAULT_WEIGHTS_B64 (base64-encoded binary v2) + - Current: 4 layers (3×3, 5×5, 3×3, 3×3), 2496 f16 weights, mip_level=2 + - Source: `workspaces/main/weights/cnn_v2_weights.bin` + - Updates: Re-encode binary with `base64 -i ` and update constant +- Pure WebGPU (no external dependencies) + +### Code Organization + +**Recent Refactoring (2026-02-13):** +- Extracted `FULLSCREEN_QUAD_VS` constant: Reused fullscreen quad vertex shader (2 triangles covering NDC) +- Added helper methods to CNNTester class: + - `getDimensions()`: Returns current source dimensions (video or image) + - `setVideoControlsEnabled(enabled)`: Centralized video control enable/disable +- Consolidated duplicate vertex shader code (used in mipmap generation, display, layer visualization) +- Added section headers in JavaScript for better navigation +- Improved inline comments explaining shader architecture + +**Benefits:** +- Reduced code duplication (~40 lines saved) +- Easier maintenance (single source of truth for fullscreen quad) +- Clearer separation of concerns + +### Key Components + +**1. 
Weight Parsing** +- Reads binary format v2: header (20B) + layer info (20B×N) + f16 weights +- Backward compatible with v1: header (16B), mip_level defaults to 0 +- Computes min/max per layer via f16 unpacking +- Stores `{ layers[], weights[], mipLevel, fileSize }` +- Auto-sets UI mip-level dropdown from loaded weights + +**2. CNN Pipeline** +- Static features computation (RGBD + UV + sin + bias → 7D packed) +- Layer-by-layer convolution with storage buffer weights +- Ping-pong buffers for intermediate results +- Copy to persistent textures for visualization + +**3. Visualization Modes** + +**Activations Mode:** +- 4 grayscale views per layer (channels 0-3 of up to 8 total) +- WebGPU compute → unpack f16 → scale → grayscale +- Auto-scale: Static features = 1.0, CNN layers = 0.2 +- Static features: Shows R,G,B,D (first 4 of 8: RGBD+UV+sin+bias) +- CNN layers: Shows first 4 output channels + +**Weights Mode:** +- 2D canvas rendering per output channel +- Shows all input kernels horizontally +- Normalized by layer min/max → [0, 1] → grayscale +- 20px cells, 2px padding between kernels + +### Texture Management + +**Persistent Storage (layerTextures[]):** +- One texture per layer output (static + all CNN layers) +- `rgba32uint` format (packed f16 data) +- `COPY_DST` usage for storing results + +**Compute Buffers (computeTextures[]):** +- 2 textures for ping-pong computation +- Reused across all layers +- `COPY_SRC` usage for copying to persistent storage + +**Pipeline:** +``` +Static pass → copy to layerTextures[0] +For each CNN layer i: + Compute (ping-pong) → copy to layerTextures[i+1] +``` + +### Layer Indexing + +**UI Layer Buttons:** +- "Static" → layerOutputs[0] (7D input features) +- "Layer 1" → layerOutputs[1] (CNN layer 1 output, uses weights.layers[0]) +- "Layer 2" → layerOutputs[2] (CNN layer 2 output, uses weights.layers[1]) +- "Layer N" → layerOutputs[N] (CNN layer N output, uses weights.layers[N-1]) + +**Weights Table:** +- "Layer 1" → weights.layers[0] (first CNN layer weights) +- "Layer 2" → weights.layers[1] (second CNN layer weights) +- "Layer N" → weights.layers[N-1] + +**Consistency:** Both UI and weights table use same numbering (1, 2, 3...) for CNN layers. + +--- + +## Known Issues + +### Issue #1: Layer Activations Show Black + +**Symptom:** +- All 4 channel canvases render black +- UV gradient test (debug mode 10) works +- Raw packed data test (mode 11) shows black +- Unpacked f16 test (mode 12) shows black + +**Diagnosis:** +- Texture access works (UV gradient visible) +- Texture data is all zeros (packed.x = 0) +- Textures being read are empty + +**Root Cause:** +- `copyTextureToTexture` operations may not be executing +- Possible ordering issue (copies not submitted before visualization) +- Alternative: textures created with wrong usage flags + +**Investigation Steps Taken:** +1. Added `onSubmittedWorkDone()` wait before visualization +2. Verified texture creation with `COPY_SRC` and `COPY_DST` flags +3. Confirmed separate texture allocation per layer (no aliasing) +4. Added debug shader modes to isolate issue + +**Next Steps:** +- Verify encoder contains copy commands (add debug logging) +- Check if compute passes actually write data (add known-value test) +- Test copyTextureToTexture in isolation +- Consider CPU readback to verify texture contents + +### Issue #2: Weight Visualization Empty + +**Symptom:** +- Canvases created with correct dimensions (logged) +- No visual output (black canvases) +- Console logs show method execution + +**Potential Causes:** +1. 
Weight indexing calculation incorrect +2. Canvas not properly attached to DOM when rendering +3. 2D context operations not flushing +4. Min/max normalization producing black (all values equal?) + +**Debug Added:** +- Comprehensive logging of dimensions, indices, ranges +- Canvas context check before rendering + +**Next Steps:** +- Add test rendering (fixed gradient) to verify 2D context works +- Log sample weight values to verify data access +- Check if canvas is visible in DOM inspector +- Verify min/max calculation produces valid range + +--- + +## UI Layout + +### Header +- Controls: Blend slider, Depth input, View mode display +- Drop zone for .bin weight files + +### Content Area + +**Left Sidebar (300px):** +1. Drop zone for .bin weight files +2. Weights Info panel (file size, layer table with min/max) +3. Weights Visualization panel (per-layer kernel display) +4. **Mip Level selector** (bottom) - Select p0/p1/p2 for static features + +**Main Canvas (center):** +- CNN output display with video controls (Play/Pause, Frame ◄/►) +- Supports both PNG images and video files (MP4, WebM) +- Video loops automatically for continuous playback + +**Right Sidebar (panels):** +1. **Layer Visualization Panel** (top, flex: 1) + - Layer selection buttons (Static 0-3, Static 4-7, Layer 0, Layer 1, ...) + - 2×2 grid of channel views (grayscale activations) + - 4× zoom view at bottom + +### Footer +- Status line (GPU timing, dimensions, mode) +- Console log (scrollable, color-coded) + +--- + +## Shader Details + +### LAYER_VIZ_SHADER + +**Purpose:** Display single channel from packed layer texture + +**Inputs:** +- `@binding(0) layer_tex: texture_2d` - Packed f16 layer data +- `@binding(1) viz_params: vec2` - (channel_idx, scale) + +**Debug Modes:** +- Channel 10: UV gradient (texture coordinate test) +- Channel 11: Raw packed u32 data +- Channel 12: First unpacked f16 value + +**Normal Operation:** +- Unpack all 8 f16 channels from rgba32uint +- Select channel by index (0-7) +- Apply scale factor (1.0 for static, 0.2 for CNN) +- Clamp to [0, 1] and output grayscale + +**Scale Rationale:** +- Static features (RGBD, UV): already in [0, 1] range +- CNN activations: post-ReLU [0, ~5], need scaling for visibility + +--- + +## Binary Weight Format + +See `doc/CNN_V2_BINARY_FORMAT.md` for complete specification. + +**Quick Summary:** +- Header: 16 bytes (magic, version, layer count, total weights) +- Layer info: 20 bytes × N (kernel size, channels, offsets) +- Weights: Packed f16 pairs as u32 + +--- + +## Testing Workflow + +### Load & Parse +1. Drop PNG image → displays original +2. Drop .bin weights → parses and shows info table +3. Auto-runs CNN pipeline + +### Verify Pipeline +1. Check console for "Running CNN pipeline" +2. Verify "Completed in Xms" +3. Check "Layer visualization ready: N layers" + +### Debug Activations +1. Select "Activations" tab +2. Click layer buttons to switch +3. Check console for texture/canvas logs +4. If black: note which debug modes work (UV vs data) + +### Debug Weights +1. Select "Weights" tab +2. Click Layer 1 or Layer 2 (Layer 0 has no weights) +3. Check console for "Visualizing Layer N weights" +4. Check canvas dimensions logged +5. 
Verify weight range is non-trivial (not [0, 0]) + +--- + +## Integration with Main Project + +**Training Pipeline:** +```bash +# Generate weights +./training/train_cnn_v2.py --export-binary + +# Test in browser +open tools/cnn_v2_test/index.html +# Drop: workspaces/main/cnn_v2_weights.bin +# Drop: training/input/test.png +``` + +**Validation:** +- Compare against demo CNNv2Effect (visual check) +- Verify layer count matches binary file +- Check weight ranges match training logs + +--- + +## Future Enhancements + +- [ ] Fix layer activation visualization (black texture issue) +- [ ] Fix weight kernel display (empty canvas issue) +- [ ] Add per-channel auto-scaling (compute min/max from visible data) +- [ ] Export rendered outputs (download PNG) +- [ ] Side-by-side comparison with original +- [ ] Heatmap mode (color-coded activations) +- [ ] Weight statistics overlay (mean, std, sparsity) +- [ ] Batch processing (multiple images in sequence) +- [ ] Integration with Python training (live reload) + +--- + +## Code Metrics + +- Total lines: ~1100 +- JavaScript: ~700 lines +- WGSL shaders: ~300 lines +- HTML/CSS: ~100 lines + +**Dependencies:** None (pure WebGPU + HTML5) + +--- + +## Related Files + +- `doc/CNN_V2.md` - CNN v2 architecture and design +- `doc/CNN_TEST_TOOL.md` - C++ offline testing tool (deprecated) +- `training/train_cnn_v2.py` - Training script with binary export +- `workspaces/main/cnn_v2_weights.bin` - Trained weights -- cgit v1.2.3