archive(cnn): move CNN v1 to cnn_v1/ subdirectory

Consolidate CNN v1 (CNNEffect) into dedicated directory:
- C++ effect: src/effects → cnn_v1/src/
- Shaders: workspaces/main/shaders/cnn → cnn_v1/shaders/
- Training: training/train_cnn.py → cnn_v1/training/
- Docs: doc/CNN*.md → cnn_v1/docs/

Updated all references:
- CMake source list
- C++ includes (relative paths: ../../cnn_v1/src/)
- Asset paths (../../cnn_v1/shaders/)
- Documentation cross-references

CNN v1 remains active in timeline. For new work, use CNN v2 with enhanced
features (7D static, storage buffer, sigmoid activation).

Tests: 34/34 passing (100%)
author: skal <pascal.massimino@gmail.com> 2026-02-15 18:52:48 +0100
committer: skal <pascal.massimino@gmail.com> 2026-02-15 18:52:48 +0100
commit: d4b67e2f6ab48ab9ec658140be4f1999f604559a (patch)
tree: 2502b0dc89748f7cfe674d3c177bd1528ce1c231 /doc/CNN_FLATTEN_ANALYSIS.md
parent: 161a59fa50bb92e3664c389fa03b95aefe349b3f (diff)
1 files changed, 0 insertions, 189 deletions
diff --git a/doc/CNN_FLATTEN_ANALYSIS.md b/doc/CNN_FLATTEN_ANALYSIS.md
deleted file mode 100644
index bf63c5d..0000000
--- a/doc/CNN_FLATTEN_ANALYSIS.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# CNN Shader Flatten Mode - Technical Analysis
-
-**Status:** Analysis complete - flatten mode NOT RECOMMENDED
-
-**Date:** February 2026
-
----
-
-## Context
-
-Current CNN architecture uses **3 sequential render passes** (linear chaining):
-- **Layer 0:** 5×5 conv (7→4 channels) → framebuffer
-- **Layer 1:** 3×3 conv (7→4 channels) → reads L0 output, writes framebuffer
-- **Layer 2:** 3×3 conv (7→1 channel) → reads L1 output, blends with original
-
-Proposed **"flatten mode"**: Collapse all layers into **single shader pass** using intermediate arrays, eliminating framebuffer read/write between layers.
-
----
-
-## Current Architecture
-
-**Shader Structure:**
-- 1 pipeline with layer branching (`layer_index` uniform)
-- 5 bindings: sampler, input texture, uniforms, layer params, original capture
-- Total shader size: ~8 KB (snippets + weights)
-
-**Performance Profile:**
-- 3 render pass dispatches
-- 2 framebuffer writes + reads between layers
-- Memory bandwidth: ~2× framebuffer size per layer
-- Register pressure: Low (per-layer isolation)
-
-**Weight Buffer:** 290 vec4s (4.6 KB) - already unified
-
----
-
-## Flatten Approaches Evaluated
-
-### Option A: Full Flatten (All 3 Layers)
-
-**Cascading Receptive Field:**
-
-To compute final output at position (x, y):
-- Layer 2 needs 3×3 neighborhood of Layer 1 outputs
-- Each Layer 1 output needs 3×3 neighborhood of Layer 0 outputs
-- Each Layer 0 output needs 5×5 neighborhood of input samples
-
-**Effective input sampling:** 9×9 pixels (vs current 5×5 max)
-
-**Intermediate Storage (per thread/pixel):**
-```
-Layer 0 outputs: 5×5 positions × 4 channels = 100 floats
-Layer 1 outputs: 3×3 positions × 4 channels =  36 floats
-                                   TOTAL = 136 floats (544 bytes)
-```
-
-**GPU Register Pressure:**
-- Modern GPUs: 32-64 KB registers per SM, shared across warps
-- 544 bytes/thread → roughly 60-120 threads/SM at those register file sizes (**low occupancy**)
-- Current multi-pass: ~4-8 bytes/thread (high occupancy)
-
-**Pros:**
-- 1 dispatch vs 3 (reduce CPU overhead)
-- Zero framebuffer bandwidth between layers
-
-**Cons:**
-- **Severe register pressure** (544 bytes vs ~4-8 bytes per thread, roughly 70-135× by the estimates above)
-- Reduced occupancy → potential performance loss
-- Complex shader (harder to debug, larger binary)
-- 9×9 input sampling
-
-**Assessment:** ❌ **Not Recommended**
-Register cost outweighs bandwidth savings.
-
----
-
-### Option B: Partial Flatten (Layers 1 + 2)
-
-Keep Layer 0 separate, flatten only Layers 1 and 2.
-
-**Pass Structure:**
-1. **Pass 1:** Layer 0 (5×5 conv) → framebuffer
-2. **Pass 2 (flattened):** Compute Layer 1 + Layer 2 in single shader
-
-**Intermediate Storage:**
-```
-Layer 0 samples: 3×3 × 4 = 36 floats (read once)
-Layer 1 outputs: 3×3 × 4 = 36 floats (computed)
-                 TOTAL = 72 floats (288 bytes)
-```
-
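-**Receptive Field:** the 3×3 Layer 1 outputs together span a 5×5 window of Layer 0 samples (100 floats if cached in full; the 36-float figure above counts one 3×3 window at a time)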
-
-**Pros:**
-- 2 passes vs 3 (33% reduction)
-- 1 framebuffer write saved
-- More manageable register usage
-
-**Cons:**
-- Still significant register pressure (288 bytes vs ~8 bytes baseline)
-- Medium complexity increase
-- Layer 0 (heaviest kernel) still separate
-
-**Assessment:** ⚠️ **Marginal Benefit**
-Saves 1 pass but register cost still high.
-
----
-
-### Option C: Keep Current Multi-Pass ✅
-
-**Rationale:**
-- Current architecture well-suited to GPU design (high throughput via parallelism)
-- Minimal register usage → high occupancy → hides memory latency
-- Framebuffer bandwidth cost < register pressure cost
-- Clean separation aids debugging/iteration
-- Modular (easy to add/remove layers)
-
-**Alternative Optimizations (if bandwidth critical):**
-1. Merge passes via render pass load/store ops (Vulkan subpasses; subpass inputs expose only the current pixel, so 3×3 neighborhood reads are not covered)
-2. Reduce intermediate channel count (4→3 or 2)
-3. Hybrid: Compute shaders + workgroup shared memory
-4. Layer pruning (2-layer vs 3-layer quality comparison)
-
----
-
-## Recommendation
-
-**✅ Keep current multi-pass architecture**
-
-### Decision Matrix
-
-| Factor | Multi-Pass | Partial Flatten | Full Flatten |
-|--------|-----------|----------------|--------------|
-| Register pressure | ✅ Low | ⚠️ High | ❌ Extreme |
-| Occupancy | ✅ High | ⚠️ Medium | ❌ Low |
-| Memory bandwidth | ⚠️ Medium | ✅ Lower | ✅ Lowest |
-| Shader complexity | ✅ Simple | ⚠️ Medium | ❌ High |
-| Debuggability | ✅ Easy | ⚠️ Harder | ❌ Very hard |
-| Binary size | ✅ Small | ⚠️ Larger | ⚠️ Largest |
-
-**Modern GPU Architecture Favors:**
-- High parallelism (many lightweight threads) over fewer, more complex threads
-- Hiding latency via occupancy over minimizing operation count
-- Absorbing memory traffic via caching rather than eliminating it
-
----
-
-## Alternative: Compute Shader + Shared Memory
-
-**If bandwidth becomes critical:**
-- Use compute shader with workgroup shared memory
-- Load tile + halo into shared memory (4-pixel halo per side, matching the 9×9 per-pixel receptive field)
-- Compute all 3 layers for tile interior (avoids redundant sampling)
-- Requires explicit synchronization (`workgroupBarrier`)
-
-**Trade-offs:**
-- ✅ Low register pressure + low bandwidth
-- ❌ Compute pipeline complexity (no render pass integration)
-- ❌ Tile edge handling
-- ❌ Larger code size
-
----
-
-## Conclusion
-
-Current 3-pass architecture is **appropriate for demo64k**:
-- Size-efficient (modular shaders)
-- Performance adequate (bandwidth not bottleneck)
-- Maintainable (clean layer isolation)
-
-**Flatten mode is not recommended** unless profiling reveals a specific bandwidth constraint.
-
-### Size Optimization Alternatives (Better ROI)
-
-If size optimization is critical, focus on:
-1. **Weight quantization:** 4.6 KB → ~1.2 KB at 8 bits or ~0.6 KB at 4 bits (raw weights, excluding any scale factors)
-2. **Kernel size reduction:** 5×5 → 3×3 for Layer 0 (200 vec4s → 72 vec4s)
-3. **Channel reduction:** 7 inputs → 4 inputs (remove UV/grayscale channels)
-
-These yield a better size/performance return than shader architecture changes.
-
----
-
-## References
-
-- `doc/CNN_EFFECT.md` - CNN implementation details
-- `doc/CNN.md` - High-level CNN design
-- `src/effects/cnn_effect.cc` - Current implementation
-- `workspaces/main/shaders/cnn_*.wgsl` - Shader snippets
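
The figures quoted in the deleted analysis (9×9 receptive field, 136 intermediate floats for a full flatten, 290 weight vec4s) can be reproduced with a short calculation. Below is a minimal Python sketch, not part of the commit. The `Layer` type and helper names are hypothetical, and the weight count assumes one bias term per output channel folded in as an eighth input, which is an inference from the 200/72/290 figures rather than something the document states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    kernel: int  # square kernel side (5 means 5x5)
    in_ch: int   # input channels, excluding the assumed bias term
    out_ch: int  # output channels


# Layer shapes as listed in the "Context" section of the analysis.
LAYERS = [Layer(5, 7, 4), Layer(3, 7, 4), Layer(3, 7, 1)]


def receptive_field(layers):
    """Side length of the input window that feeds one final output pixel."""
    side = 1
    for layer in layers:
        side += layer.kernel - 1
    return side


def intermediate_floats(layers):
    """Floats one thread must hold if every layer runs in a single pass."""
    total = 0
    side = 1
    # Walk backwards: each later kernel widens the window of earlier outputs.
    for i in range(len(layers) - 1, 0, -1):
        side += layers[i].kernel - 1
        total += side * side * layers[i - 1].out_ch
    return total


def weight_vec4s(layer, bias=True):
    """vec4 slots for one layer's weights (bias handling is an assumption)."""
    inputs = layer.in_ch + (1 if bias else 0)
    floats = layer.kernel ** 2 * inputs * layer.out_ch
    return -(-floats // 4)  # ceiling division


if __name__ == "__main__":
    floats = intermediate_floats(LAYERS)
    vec4s = sum(weight_vec4s(layer) for layer in LAYERS)
    print("receptive field:", receptive_field(LAYERS))
    print("intermediate storage:", floats, "floats,", floats * 4, "bytes")
    print("weights:", vec4s, "vec4s,", vec4s * 16, "bytes")
```

Under these assumptions the per-layer weight counts come out to 200, 72 and 18 vec4s, which is consistent with the 200 → 72 reduction quoted for shrinking Layer 0 to a 3×3 kernel.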