5 files changed, 88 insertions, 11 deletions
diff --git a/doc/CNN_TEST_TOOL.md b/doc/CNN_TEST_TOOL.md
index 82d5799..4307894 100644
--- a/doc/CNN_TEST_TOOL.md
+++ b/doc/CNN_TEST_TOOL.md
@@ -41,10 +41,11 @@ Standalone tool for validating trained CNN shaders with GPU-to-CPU readback. Sup
 cnn_test input.png output.png [OPTIONS]
 
 OPTIONS:
-  --cnn-version N          CNN version: 1 (default) or 2
+  --cnn-version N          CNN version: 1 (default) or 2 (ignored with --weights)
+  --weights PATH           Load weights from .bin (forces CNN v2, overrides layer config)
   --blend F                Final blend amount (0.0-1.0, default: 1.0)
   --format ppm|png         Output format (default: png)
-  --layers N               Number of CNN layers (1-10, v1 only, default: 3)
+  --layers N               Number of CNN layers (1-10, v1 only, default: 3, ignored with --weights)
   --save-intermediates DIR Save intermediate layers to directory
   --debug-hex              Print first 8 pixels as hex (debug)
   --help                   Show usage
@@ -55,9 +56,12 @@ OPTIONS:
 # CNN v1 (render pipeline, 3 layers)
 ./build/cnn_test input.png output.png --cnn-version 1
 
-# CNN v2 (compute, storage buffer, dynamic layers)
+# CNN v2 (compute, storage buffer, uses asset system weights)
 ./build/cnn_test input.png output.png --cnn-version 2
 
+# CNN v2 with runtime weight loading (loads layer config from .bin)
+./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin
+
 # 50% blend with original (v2)
 ./build/cnn_test input.png output.png --cnn-version 2 --blend 0.5
 
@@ -65,6 +69,8 @@ OPTIONS:
 ./build/cnn_test input.png output.png --cnn-version 2 --debug-hex
 ```
 
+**Important:** When using `--weights`, the layer count and kernel sizes are read from the binary file header, overriding any `--layers` or `--cnn-version` arguments.
+
 ---
 
 ## Implementation Details
@@ -119,6 +125,13 @@ std::vector<uint8_t> OffscreenRenderTarget::read_pixels() {
 
 **Binary format:** Header (20B) + layer info (20B×N) + f16 weights
 
+**Weight Loading:**
+- **Without `--weights`:** Loads from asset system (`ASSET_WEIGHTS_CNN_V2`)
+- **With `--weights PATH`:** Loads from external `.bin` file (e.g., checkpoint exports)
+  - Layer count and kernel sizes parsed from binary header
+  - Overrides any `--layers` or `--cnn-version` arguments
+  - Enables runtime testing of training checkpoints without a rebuild
+
 ---
 
 ## Build Integration
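The binary layout stated in this file (20-byte header, 20 bytes of layer info per layer, then f16 weights) can be sanity-checked against the file sizes quoted elsewhere in this change. A minimal sketch, assuming each layer stores 12 × 4 × k × k weights at 2 bytes each with no padding; `weight_count` and `expected_bin_size` are hypothetical helpers, not part of `cnn_test` or the export script:

```python
HEADER_BYTES = 20       # fixed header, per the documented binary format
LAYER_INFO_BYTES = 20   # one layer-info record per layer
F16_BYTES = 2           # each weight is a 16-bit float

IN_CHANNELS = 12        # 4 previous-layer channels + 8 static features
OUT_CHANNELS = 4


def weight_count(kernel_sizes):
    """Total number of weights for layers with the given square kernel sizes."""
    return sum(IN_CHANNELS * OUT_CHANNELS * k * k for k in kernel_sizes)


def expected_bin_size(kernel_sizes):
    """File size in bytes implied by the layout (assumes no padding)."""
    layers = len(kernel_sizes)
    return (HEADER_BYTES
            + LAYER_INFO_BYTES * layers
            + F16_BYTES * weight_count(kernel_sizes))


for sizes in ([1], [3], [3, 1, 3]):
    print(sizes, weight_count(sizes), expected_bin_size(sizes))
```

Under these assumptions a single 1×1 layer comes to 136 bytes and a single 3×3 layer to 904 bytes, matching the identity-weight sizes in `doc/CNN_V2_DEBUG_TOOLS.md`; kernel sizes 3,1,3 give 912 weights and 1904 bytes, matching the export summary quoted in `doc/COMPLETED.md`.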
diff --git a/doc/CNN_V2.md b/doc/CNN_V2.md
index 577cf9e..2d1d4c4 100644
--- a/doc/CNN_V2.md
+++ b/doc/CNN_V2.md
@@ -18,15 +18,15 @@ CNN v2 extends the original CNN post-processing effect with parametric static fe
 - Bias integrated as static feature dimension
 - Storage buffer architecture (dynamic layer count)
 - Binary weight format v2 for runtime loading
+- Sigmoid activation for layer 0 and final layer (smooth [0,1] mapping)
 
-**Status:** ✅ Complete. Training pipeline functional, validation tools ready, mip-level support integrated.
+**Status:** ✅ Complete. Sigmoid activation, stable training, validation tools operational.
 
-**Known Issues:**
-- ⚠️ **cnn_test output differs from HTML validation tool** - Visual discrepancy remains after fixing uv_y inversion and Layer 0 activation. Root cause under investigation. Both tools should produce identical output given same weights/input.
+**Breaking Change:**
+- Models trained with `clamp()` are incompatible with the new sigmoid activations and must be retrained.
 
 **TODO:**
 - 8-bit quantization with QAT for 2× size reduction (~1.6 KB)
-- Debug cnn_test vs HTML tool output difference
 
 ---
 
@@ -106,6 +106,12 @@ Input RGBD → Static Features Compute → CNN Layers → Output RGBA
 - All layers: uniform 12D input, 4D output (ping-pong buffer)
 - Storage: `texture_storage_2d<rgba32uint>` (4 channels as 2×f16 pairs)
 
+**Activation Functions:**
+- Layer 0 & final layer: `sigmoid(x)` for smooth [0,1] mapping
+- Middle layers: `ReLU` (max(0, x))
+- Rationale: Unlike `clamp`, whose gradient is zero outside [0,1], sigmoid keeps a non-zero gradient everywhere, which enables better convergence
+- Breaking change: Models trained with `clamp(x, 0, 1)` are incompatible; retraining is required
+
 ---
 
 ## Static Features (7D + 1 bias)
@@ -136,6 +142,27 @@ let bias = 1.0;                     // Learned bias per output channel
 // Packed storage: [p0, p1, p2, p3, uv.x, uv.y, sin(20*uv.y), 1.0]
 ```
 
+### Input Channel Mapping
+
+**Weight tensor layout (12 input channels per layer):**
+
+| Input Channel | Feature | Description |
+|--------------|---------|-------------|
+| 0-3 | Previous layer output | 4D RGBA from prior CNN layer (or input RGBD for Layer 0) |
+| 4-11 | Static features | 8D: p0, p1, p2, p3, uv_x, uv_y, sin20_y, bias |
+
+**Static feature channel details:**
+- Channel 4 → p0 (RGB.r from mip level)
+- Channel 5 → p1 (RGB.g from mip level)
+- Channel 6 → p2 (RGB.b from mip level)
+- Channel 7 → p3 (depth or RGB channel from mip level)
+- Channel 8 → p4 (uv_x: normalized horizontal position)
+- Channel 9 → p5 (uv_y: normalized vertical position)
+- Channel 10 → p6 (sin(20*uv_y): periodic encoding)
+- Channel 11 → p7 (bias: constant 1.0)
+
+**Note:** When generating identity weights, p4-p7 correspond to input channels 8-11, not 4-7.
+
 ### Feature Rationale
 
 | Feature | Dimension | Purpose | Priority |
@@ -311,7 +338,7 @@ class CNNv2(nn.Module):
         # Layer 0: input RGBD (4D) + static (8D) = 12D
         x = torch.cat([input_rgbd, static_features], dim=1)
         x = self.layers[0](x)
-        x = torch.clamp(x, 0, 1)  # Output layer 0 (4 channels)
+        x = torch.sigmoid(x)  # Soft [0,1] for layer 0
 
         # Layer 1+: previous output (4D) + static (8D) = 12D
         for i in range(1, len(self.layers)):
@@ -320,7 +347,7 @@ class CNNv2(nn.Module):
             if i < len(self.layers) - 1:
                 x = F.relu(x)
             else:
-                x = torch.clamp(x, 0, 1)  # Final output [0,1]
+                x = torch.sigmoid(x)  # Soft [0,1] for final layer
 
         return x  # RGBA output
 ```
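The rationale for swapping `clamp` for sigmoid can be illustrated numerically. The sketch below is plain Python rather than the PyTorch model code, and the helper names are introduced only for this illustration. It compares the derivative of `clamp(x, 0, 1)` with the derivative of `sigmoid(x)` at a few pre-activation values:

```python
import math


def clamp01(x):
    """clamp(x, 0, 1), the activation the old models were trained with."""
    return min(max(x, 0.0), 1.0)


def clamp01_grad(x):
    """Derivative of clamp(x, 0, 1): 1 inside (0, 1), 0 outside."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), strictly positive for finite x."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-2.0, 0.5, 2.0):
    print(f"x={x:+.1f}  clamp'={clamp01_grad(x):.3f}  sigmoid'={sigmoid_grad(x):.3f}")
```

Outside [0,1] the clamp derivative is exactly zero, so no gradient reaches the weights, while the sigmoid derivative stays positive. The two functions also map the same pre-activation to different outputs (for example, `clamp01(0.0)` is 0 while `sigmoid(0.0)` is 0.5), which is why weights trained with one cannot be reused with the other.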
diff --git a/doc/CNN_V2_DEBUG_TOOLS.md b/doc/CNN_V2_DEBUG_TOOLS.md
index b6dc65f..8d1289a 100644
--- a/doc/CNN_V2_DEBUG_TOOLS.md
+++ b/doc/CNN_V2_DEBUG_TOOLS.md
@@ -18,14 +18,21 @@ Tools for investigating CNN v2 mismatch between HTML tool and cnn_test.
 # 3×3 identity
 ./training/gen_identity_weights.py workspaces/main/weights/cnn_v2_identity_3x3.bin --kernel-size 3
 
+# Mix mode: 50-50 blend (0.5*p0+0.5*p4, etc)
+./training/gen_identity_weights.py output.bin --mix
+
+# Static features only: p4→ch0, p5→ch1, p6→ch2, p7→ch3
+./training/gen_identity_weights.py output.bin --p47
+
 # Custom mip level
 ./training/gen_identity_weights.py output.bin --kernel-size 1 --mip-level 2
 ```
 
 **Output:**
 - Single layer, 12D→4D (4 input channels + 8 static features)
-- Identity matrix: Output Ch{0,1,2,3} = Input Ch{0,1,2,3}
-- Static features (Ch 4-11) are zeroed
+- Identity mode: Output Ch{0,1,2,3} = Input Ch{0,1,2,3}
+- Mix mode (--mix): Output Ch{i} = 0.5*Input Ch{i} + 0.5*Input Ch{i+4} (50-50 blend, avoids overflow)
+- Static mode (--p47): Output Ch{i} = Input Ch{i+4} (static features only, visualizes p4-p7)
 - Minimal file size (~136 bytes for 1×1, ~904 bytes for 3×3)
 
 **Validation:**
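The input channel mapping documented in `doc/CNN_V2.md` (channels 0-3 carry the previous layer's output, channels 4-11 carry p0-p7) determines where a 1×1 weight matrix needs its non-zero entries. The following is a hypothetical sketch, not the code of `gen_identity_weights.py`: the helper names and the `[out][in]` nested-list layout are assumptions made for illustration, and the static routing follows the note in `doc/CNN_V2.md` that p4-p7 sit at input channels 8-11.

```python
IN_CHANNELS = 12    # 4 previous-layer channels + 8 static features (p0..p7)
OUT_CHANNELS = 4
STATIC_OFFSET = 4   # p0 is documented at input channel 4


def static_channel(p_index):
    """Input channel that carries static feature p<p_index>."""
    if not 0 <= p_index < 8:
        raise ValueError("static features are p0..p7")
    return STATIC_OFFSET + p_index


def routing_weights(sources):
    """Build a 4x12 matrix; sources[i] lists (input_channel, weight) taps for output i."""
    w = [[0.0] * IN_CHANNELS for _ in range(OUT_CHANNELS)]
    for out_ch, taps in enumerate(sources):
        for in_ch, weight in taps:
            w[out_ch][in_ch] = weight
    return w


def apply_1x1(w, pixel):
    """Apply a 1x1 convolution (a matrix-vector product) to one 12D pixel."""
    return [sum(wi * xi for wi, xi in zip(row, pixel)) for row in w]


# Identity: output channel i copies input channel i.
identity = routing_weights([[(i, 1.0)] for i in range(OUT_CHANNELS)])
# Static-only: output channel i copies p(4+i), i.e. input channels 8..11.
p4_to_p7 = routing_weights(
    [[(static_channel(4 + i), 1.0)] for i in range(OUT_CHANNELS)]
)
```

Feeding a test pixel whose channel values equal their own indices makes the routing easy to read off: the identity matrix returns channels 0-3 and the static-only matrix returns channels 8-11.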
diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md
index 01c4408..c7b2cae 100644
--- a/doc/COMPLETED.md
+++ b/doc/COMPLETED.md
@@ -455,3 +455,12 @@ Use `read @doc/archive/FILENAME.md` to access archived documents.
 - **test_mesh tool**: Implemented a standalone `test_mesh` tool for visualizing OBJ files with debug normal display.
 - **Task #39: Visual Debugging System**: Implemented a comprehensive set of wireframe primitives (Sphere, Cone, Cross, Line, Trajectory) in `VisualDebug`. Updated `test_3d_render` to demonstrate usage.
 - **Task #68: Mesh Wireframe Rendering**: Added `add_mesh_wireframe` to `VisualDebug` to visualize triangle edges for mesh objects. Integrated into `Renderer3D` debug path and `test_mesh` tool.
+
+#### CNN v2 Training Pipeline Improvements (February 14, 2026) 🎯
+- **Critical Training Fixes**: Resolved checkpoint saving and argument handling bugs in the CNN v2 training pipeline. **Bug 1 (Missing Checkpoints)**: Training completed successfully, but no checkpoint was saved when `epochs < checkpoint_every`. Solution: Always save a final checkpoint after training completes, regardless of interval settings. **Bug 2 (Stale Checkpoints)**: Old checkpoint files from previous runs with different parameters weren't overwritten because of an `if not exists` check. Solution: Remove the existence check and always overwrite the final checkpoint. **Bug 3 (Ignored num_layers)**: When comma-separated kernel sizes were provided (e.g., `--kernel-sizes 3,1,3`), the `--num-layers` parameter was used only for validation instead of being derived from the list length. Solution: Derive `num_layers` from the length of the kernel_sizes list when multiple values are provided. **Bug 4 (Argument Passing)**: The shell script passed unquoted variables to Python, potentially causing parsing issues with special characters. Solution: Quote all shell variables when passing them to Python scripts.
+
+- **Output Streamlining**: Reduced verbose training pipeline output by roughly 70%. **Export Section**: Added a `--quiet` flag to `export_cnn_v2_weights.py`, producing a single-line summary instead of a detailed layer-by-layer breakdown (e.g., "Exported 3 layers, 912 weights, 1904 bytes → test.bin"). **Validation Section**: Changed from printing 10+ lines per image (loading, processing, saving) to a compact single-line format showing all images at once (e.g., "Processing images: img_000 img_001 img_002 ✓"). **Result**: Training pipeline output dropped from ~100 lines to ~30 lines while preserving essential information, which makes rapid iteration more pleasant.
+
+- **Documentation Updates**: Updated `doc/HOWTO.md` CNN v2 training section to document new behavior: always saves final checkpoint, derives num_layers from kernel_sizes list, uses streamlined output with `--quiet` flag. Added examples for both verbose and quiet export modes.
+
+- **Files Modified**: `training/train_cnn_v2.py` (checkpoint saving logic, num_layers derivation), `scripts/train_cnn_v2_full.sh` (variable quoting, validation output, checkpoint validation), `training/export_cnn_v2_weights.py` (--quiet flag support), `doc/HOWTO.md` (documentation). **Impact**: The training pipeline is now robust for rapid experimentation with different architectures and no longer requires manual checkpoint management or workarounds for short training runs.
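The Bug 3 fix (deriving `num_layers` from the kernel-size list) can be sketched as follows. This is not the code in `training/train_cnn_v2.py`: `resolve_architecture` is a hypothetical helper, and repeating a single kernel size across `num_layers` layers is an assumption about how the single-value case behaves.

```python
def resolve_architecture(kernel_sizes_arg, num_layers):
    """Return (kernel_sizes, num_layers) from CLI-style inputs.

    kernel_sizes_arg: comma-separated string such as "3,1,3" or "3".
    num_layers: requested layer count; only used when one size is given.
    """
    sizes = [int(tok) for tok in kernel_sizes_arg.split(",") if tok.strip()]
    if not sizes:
        raise ValueError("at least one kernel size is required")
    if len(sizes) > 1:
        # Multiple values: the list length defines the layer count.
        return sizes, len(sizes)
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # Single value (assumption): reuse it for every layer.
    return sizes * num_layers, num_layers


print(resolve_architecture("3,1,3", 1))
print(resolve_architecture("3", 2))
```

With this shape, a mismatched `--num-layers` can no longer silently disagree with an explicit kernel-size list, because the list always takes precedence.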
diff --git a/doc/HOWTO.md b/doc/HOWTO.md
index 85ce801..506bf0a 100644
--- a/doc/HOWTO.md
+++ b/doc/HOWTO.md
@@ -139,12 +139,18 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
 # Train → Export → Build → Validate (default config)
 ./scripts/train_cnn_v2_full.sh
 
+# Rapid debug (1 layer, 3×3, 5 epochs)
+./scripts/train_cnn_v2_full.sh --num-layers 1 --kernel-sizes 3 --epochs 5 --output-weights test.bin
+
 # Custom training parameters
 ./scripts/train_cnn_v2_full.sh --epochs 500 --batch-size 32 --checkpoint-every 100
 
 # Custom architecture
 ./scripts/train_cnn_v2_full.sh --kernel-sizes 3,5,3 --num-layers 3 --mip-level 1
 
+# Custom output path
+./scripts/train_cnn_v2_full.sh --output-weights workspaces/test/cnn_weights.bin
+
 # Grayscale loss (compute loss on luminance instead of RGBA)
 ./scripts/train_cnn_v2_full.sh --grayscale-loss
 
@@ -160,8 +166,11 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
 
 **Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector).
 - Live progress with single-line update
+- Always saves final checkpoint (regardless of --checkpoint-every interval)
+- When multiple kernel sizes are provided (e.g., 3,5,3), num_layers is derived from the list length
 - Validates all input images on final epoch
 - Exports binary weights (storage buffer architecture)
+- Streamlined output: single-line export summary, compact validation
 - All parameters configurable via command-line
 
 **Validation Only** (skip training):
@@ -201,12 +210,19 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding
 
 **Export Binary Weights:**
 ```bash
+# Verbose output (shows all layer details)
 ./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \
   --output-weights workspaces/main/cnn_v2_weights.bin
+
+# Quiet mode (single-line summary)
+./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \
+  --output-weights workspaces/main/cnn_v2_weights.bin \
+  --quiet
 ```
 
 Generates binary format: header + layer info + f16 weights (~3.2 KB for 3-layer model).
 Storage buffer architecture allows dynamic layer count.
+Use `--quiet` for streamlined output in scripts (used automatically by train_cnn_v2_full.sh).
 
 **TODO:** 8-bit quantization for 2× size reduction (~1.6 KB). Requires quantization-aware training (QAT).
 
@@ -268,6 +284,9 @@ See `doc/ASSET_SYSTEM.md` and `doc/WORKSPACE_SYSTEM.md`.
 # CNN v2 (recommended, fully functional)
 ./build/cnn_test input.png output.png --cnn-version 2
 
+# CNN v2 with runtime weight loading (loads layer config from .bin)
+./build/cnn_test input.png output.png --weights checkpoints/checkpoint_epoch_100.pth.bin
+
 # CNN v1 (produces incorrect output, debug only)
 ./build/cnn_test input.png output.png --cnn-version 1
 
@@ -282,6 +301,8 @@ See `doc/ASSET_SYSTEM.md` and `doc/WORKSPACE_SYSTEM.md`.
 - **CNN v2:** ✅ Fully functional, matches CNNv2Effect
 - **CNN v1:** ⚠️ Produces incorrect output, use CNNEffect in demo for validation
 
+**Note:** `--weights` loads layer count and kernel sizes from the binary file, overriding `--layers` and forcing CNN v2.
+
 See `doc/CNN_TEST_TOOL.md` for full documentation.
 
 ---