From f72f404e755149e80350dc6eb34015d5e7630d44 Mon Sep 17 00:00:00 2001 From: skal Date: Sat, 14 Feb 2026 01:37:50 +0100 Subject: Document CNN v2 training pipeline improvements - HOWTO.md: Document always-save-checkpoint behavior and --quiet flag - COMPLETED.md: Add milestone entry for Feb 14 CNN v2 fixes - Details: checkpoint saving, num_layers derivation, output streamlining Co-Authored-By: Claude Sonnet 4.5 --- doc/COMPLETED.md | 9 +++++++++ doc/HOWTO.md | 10 ++++++++++ 2 files changed, 19 insertions(+) (limited to 'doc') diff --git a/doc/COMPLETED.md b/doc/COMPLETED.md index 01c4408..c7b2cae 100644 --- a/doc/COMPLETED.md +++ b/doc/COMPLETED.md @@ -455,3 +455,12 @@ Use `read @doc/archive/FILENAME.md` to access archived documents. - **test_mesh tool**: Implemented a standalone `test_mesh` tool for visualizing OBJ files with debug normal display. - **Task #39: Visual Debugging System**: Implemented a comprehensive set of wireframe primitives (Sphere, Cone, Cross, Line, Trajectory) in `VisualDebug`. Updated `test_3d_render` to demonstrate usage. - **Task #68: Mesh Wireframe Rendering**: Added `add_mesh_wireframe` to `VisualDebug` to visualize triangle edges for mesh objects. Integrated into `Renderer3D` debug path and `test_mesh` tool. + +#### CNN v2 Training Pipeline Improvements (February 14, 2026) 🎯 +- **Critical Training Fixes**: Resolved checkpoint saving and argument handling bugs in CNN v2 training pipeline. **Bug 1 (Missing Checkpoints)**: Training completed successfully but no checkpoint saved when `epochs < checkpoint_every` interval. Solution: Always save final checkpoint after training completes, regardless of interval settings. **Bug 2 (Stale Checkpoints)**: Old checkpoint files from previous runs with different parameters weren't overwritten due to `if not exists` check. Solution: Remove existence check, always overwrite final checkpoint. **Bug 3 (Ignored num_layers)**: When providing comma-separated kernel sizes (e.g., `--kernel-sizes 3,1,3`), the `--num-layers` parameter was used only for validation but not derived from list length. Solution: Derive `num_layers` from kernel_sizes list length when multiple values provided. **Bug 4 (Argument Passing)**: Shell script passed unquoted variables to Python, potentially causing parsing issues with special characters. Solution: Quote all shell variables when passing to Python scripts. + +- **Output Streamlining**: Reduced verbose training pipeline output by 90%. **Export Section**: Added `--quiet` flag to `export_cnn_v2_weights.py`, producing single-line summary instead of detailed layer-by-layer breakdown (e.g., "Exported 3 layers, 912 weights, 1904 bytes → test.bin"). **Validation Section**: Changed from printing 10+ lines per image (loading, processing, saving) to compact single-line format showing all images at once (e.g., "Processing images: img_000 img_001 img_002 ✓"). **Result**: Training pipeline output reduced from ~100 lines to ~30 lines while preserving essential information. Makes rapid iteration more pleasant. + +- **Documentation Updates**: Updated `doc/HOWTO.md` CNN v2 training section to document new behavior: always saves final checkpoint, derives num_layers from kernel_sizes list, uses streamlined output with `--quiet` flag. Added examples for both verbose and quiet export modes. + +- **Files Modified**: `training/train_cnn_v2.py` (checkpoint saving logic, num_layers derivation), `scripts/train_cnn_v2_full.sh` (variable quoting, validation output, checkpoint validation), `training/export_cnn_v2_weights.py` (--quiet flag support), `doc/HOWTO.md` (documentation). **Impact**: Training pipeline now robust for rapid experimentation with different architectures, no longer requires manual checkpoint management or workarounds for short training runs. diff --git a/doc/HOWTO.md b/doc/HOWTO.md index c98f6ee..506bf0a 100644 --- a/doc/HOWTO.md +++ b/doc/HOWTO.md @@ -166,8 +166,11 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Defaults:** 200 epochs, 3×3 kernels, 8→4→4 channels, batch-size 16, patch-based (8×8, harris detector). - Live progress with single-line update +- Always saves final checkpoint (regardless of --checkpoint-every interval) +- When multiple kernel sizes provided (e.g., 3,5,3), num_layers derived from list length - Validates all input images on final epoch - Exports binary weights (storage buffer architecture) +- Streamlined output: single-line export summary, compact validation - All parameters configurable via command-line **Validation Only** (skip training): @@ -207,12 +210,19 @@ Enhanced CNN with parametric static features (7D input: RGBD + UV + sin encoding **Export Binary Weights:** ```bash +# Verbose output (shows all layer details) ./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \ --output-weights workspaces/main/cnn_v2_weights.bin + +# Quiet mode (single-line summary) +./training/export_cnn_v2_weights.py checkpoints/checkpoint_epoch_100.pth \ + --output-weights workspaces/main/cnn_v2_weights.bin \ + --quiet ``` Generates binary format: header + layer info + f16 weights (~3.2 KB for 3-layer model). Storage buffer architecture allows dynamic layer count. +Use `--quiet` for streamlined output in scripts (used automatically by train_cnn_v2_full.sh). **TODO:** 8-bit quantization for 2× size reduction (~1.6 KB). Requires quantization-aware training (QAT). -- cgit v1.2.3