rfc: Add L4 Feed architecture spec (DuckDB + LanceDB)
Commit 875c9b7957 (parent 282eecab24)

# RFC-0130: L4 Feed — Temporal Event Store

**Status:** Draft
**Author:** Frankie (Silicon Architect)
**Date:** 2026-02-03
**Target:** Janus SDK v0.2.0

---

## Summary

L4 Feed is the temporal event storage layer for Libertaria. It stores social primitives (posts, reactions, follows) using a hybrid approach:

- **DuckDB:** Structured queries (time series, aggregations)
- **LanceDB:** Vector search for semantic similarity

## Kenya Compliance

| Constraint | Status | Implementation |
|------------|--------|----------------|
| RAM <10MB | ✅ Planned | DuckDB in-memory mode, LanceDB mmap |
| No cloud | ✅ | Embedded storage only |
| <1MB binary | ⚠️ TBD | Stripped DuckDB + custom LanceDB bindings (current estimate ~13MB, see Dependencies) |
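
To make the RAM row concrete, here is a rough back-of-the-envelope check of how many raw 384-dim float32 embeddings (the dimension used in the FeedEvent structure) would fit if the whole 10MB budget were spent on vectors. The helper name and the assumption that nothing else shares the budget are illustrative only; real usage depends on DuckDB buffers and on how much of the LanceDB file is actually paged in via mmap.

```python
# Back-of-the-envelope sizing; assumes float32 vectors and no other RAM consumers.
EMBEDDING_DIM = 384                    # from the FeedEvent embedding field
BYTES_PER_FLOAT32 = 4
RAM_BUDGET_BYTES = 10 * 1024 * 1024    # the "<10MB" constraint, read here as 10 MiB


def max_resident_embeddings(budget_bytes: int = RAM_BUDGET_BYTES) -> int:
    """Upper bound on embeddings that fit in the budget (hypothetical helper)."""
    bytes_per_embedding = EMBEDDING_DIM * BYTES_PER_FLOAT32  # 1536 bytes each
    return budget_bytes // bytes_per_embedding


if __name__ == "__main__":
    print(max_resident_embeddings())  # 6826
```

Under these assumptions only a few thousand vectors can be resident at once, which is why the table leans on mmap rather than loading the embedding table into memory.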

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        L4 Feed Layer                        │
├─────────────────────────────────────────────────────────────┤
│          ┌──────────────┐        ┌──────────────┐           │
│          │    DuckDB    │        │   LanceDB    │           │
│          │   (events)   │        │ (embeddings) │           │
│          ├──────────────┤        ├──────────────┤           │
│          │ - Timeline   │        │ - ANN search │           │
│          │ - Counts     │        │ - Similarity │           │
│          │ - Replies    │        │ - Clustering │           │
│          └──────────────┘        └──────────────┘           │
│                  │                       │                  │
│                  └───────────┬───────────┘                  │
│                              │                              │
│                      ┌───────▼───────┐                      │
│                      │   FeedStore   │                      │
│                      └───────────────┘                      │
└─────────────────────────────────────────────────────────────┘
```

## Data Model

### Event Types

```zig
pub const EventType = enum {
    post,     // Original content
    reaction, // like, boost, bookmark
    follow,   // Social graph edge (directed)
    mention,  // @username reference
    hashtag,  // #topic tag
    edit,     // Content modification
    delete,   // Tombstone (soft delete)
};
```

### FeedEvent Structure

| Field | Type | Description |
|-------|------|-------------|
| id | u64 | Snowflake ID (time-sortable, 64-bit) |
| event_type | EventType | Enum discriminator |
| author | [32]u8 | DID (Decentralized Identifier) |
| timestamp | i64 | Unix nanoseconds |
| content_hash | [32]u8 | Blake3 hash of canonical content |
| parent_id | ?u64 | Parent event, for replies/threading |
| embedding | ?[384]f32 | 384-dim vector (stored in LanceDB) |
| tags | [][]const u8 | Hashtags |
| mentions | [][32]u8 | Referenced DIDs |
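
The `id` field is described as a time-sortable Snowflake ID, but this RFC does not fix a bit layout. The sketch below assumes the commonly used split of a 41-bit millisecond timestamp, a 10-bit node ID and a 12-bit sequence counter; `EPOCH_MS`, `make_snowflake` and `snowflake_timestamp_ms` are hypothetical names chosen for illustration.

```python
# Assumed layout (not specified in this RFC):
# 41-bit ms timestamp | 10-bit node id | 12-bit sequence
EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch
TIME_BITS = 41
NODE_BITS = 10
SEQ_BITS = 12


def make_snowflake(timestamp_ms: int, node_id: int, sequence: int) -> int:
    """Pack timestamp, node and sequence into one integer that fits in u64."""
    elapsed = timestamp_ms - EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence out of range")
    return (elapsed << (NODE_BITS + SEQ_BITS)) | (node_id << SEQ_BITS) | sequence


def snowflake_timestamp_ms(snowflake: int) -> int:
    """Recover the millisecond timestamp from the high bits."""
    return (snowflake >> (NODE_BITS + SEQ_BITS)) + EPOCH_MS
```

Because the timestamp occupies the high bits, comparing two IDs numerically orders them by creation time first, which is what lets `id` double as a sort key.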

## DuckDB Schema

```sql
-- Events table (structured data)
CREATE TABLE events (
    id UBIGINT PRIMARY KEY,
    event_type TINYINT,
    author BLOB(32),
    timestamp BIGINT,
    content_hash BLOB(32),
    parent_id UBIGINT,
    tags VARCHAR[],
    embedding_ref INTEGER -- Index into LanceDB
);

-- Indexes for common queries
CREATE INDEX idx_author_time ON events(author, timestamp DESC);
CREATE INDEX idx_parent ON events(parent_id);
CREATE INDEX idx_time ON events(timestamp DESC);

-- FTS for content search (optional)
CREATE TABLE event_content (
    id UBIGINT PRIMARY KEY REFERENCES events(id),
    text_content VARCHAR
);
```

## LanceDB Schema

```python
# Python pseudocode for schema
import lancedb
from lancedb.pydantic import LanceModel, Vector

class Embedding(LanceModel):
    id: int              # Matches events.id
    vector: Vector(384)  # 384-dim embedding

    # Metadata for filtering
    event_type: int
    author: bytes        # 32 bytes DID
    timestamp: int
```

## Query Patterns

### 1. Timeline (Home Feed)
```sql
-- Assumes a follows(follower, following) table, which the DuckDB schema does not define yet
SELECT * FROM events
WHERE author IN (SELECT following FROM follows WHERE follower = ?)
ORDER BY timestamp DESC
LIMIT 50;
```
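
As a quick way to exercise the timeline query, the sketch below runs it against Python's built-in `sqlite3`, used here purely as a stand-in for DuckDB so the example has no dependencies. The reduced `events` columns, the `follows(follower, following)` table shape and the `home_timeline` helper are assumptions for illustration, not part of the RFC schema.

```python
import sqlite3

# sqlite3 stands in for DuckDB; the follows table shape is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER PRIMARY KEY, author BLOB, timestamp INTEGER);
    CREATE TABLE follows (follower BLOB, following BLOB);
    """
)


def home_timeline(conn, viewer: bytes, limit: int = 50):
    """Return the newest events from authors that `viewer` follows."""
    return conn.execute(
        """
        SELECT id, author, timestamp FROM events
        WHERE author IN (SELECT following FROM follows WHERE follower = ?)
        ORDER BY timestamp DESC
        LIMIT ?
        """,
        (viewer, limit),
    ).fetchall()


# Hypothetical 32-byte author identifiers
alice, bob, carol = b"a" * 32, b"b" * 32, b"c" * 32
conn.execute("INSERT INTO follows VALUES (?, ?)", (alice, bob))
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, bob, 100), (2, carol, 200), (3, bob, 300)],
)
timeline = home_timeline(conn, alice)  # only bob's events, newest first
```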

### 2. Thread (Conversation)
```sql
WITH RECURSIVE thread AS (
    SELECT * FROM events WHERE id = ?
    UNION ALL
    SELECT e.* FROM events e
    JOIN thread t ON e.parent_id = t.id
)
SELECT * FROM thread ORDER BY timestamp;
```
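
The recursive CTE starts from the root event and repeatedly joins in every event whose `parent_id` points at a row already in the thread. A minimal sketch of that behaviour, again using `sqlite3` as a dependency-free stand-in for DuckDB, with a reduced table and a hypothetical `load_thread` helper:

```python
import sqlite3

# sqlite3 stands in for DuckDB; only the columns the query needs are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, parent_id INTEGER, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, None, 10),  # thread root
        (2, 1, 20),     # reply to 1
        (3, 2, 30),     # reply to 2 (nested)
        (4, None, 40),  # unrelated post
        (5, 1, 50),     # second reply to 1
    ],
)


def load_thread(conn, root_id: int):
    """Return the ids of the root event and all its descendants, oldest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE thread AS (
            SELECT * FROM events WHERE id = ?
            UNION ALL
            SELECT e.* FROM events e
            JOIN thread t ON e.parent_id = t.id
        )
        SELECT id FROM thread ORDER BY timestamp
        """,
        (root_id,),
    ).fetchall()
    return [row[0] for row in rows]


thread_ids = load_thread(conn, 1)
```

Note that the query only walks downward from the given ID, so passing a reply's ID returns that reply's subtree rather than the whole conversation.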

### 3. Semantic Search (LanceDB)
```python
# Find similar posts
results = (
    table.search(query_embedding)
    .where("event_type = 0")  # Only posts
    .limit(20)
    .to_pandas()
)
```
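
For intuition about what the filtered search returns, the following brute-force sketch applies the same two steps in plain Python: filter by `event_type`, then rank by similarity. It is not how LanceDB works internally (LanceDB uses an ANN index), and cosine similarity is an assumption, since the RFC does not name a distance metric. `cosine_similarity` and `filtered_search` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(rows, query, event_type=0, limit=20):
    """Exact (non-ANN) search: filter on metadata, then rank by similarity."""
    candidates = [r for r in rows if r["event_type"] == event_type]
    candidates.sort(key=lambda r: cosine_similarity(r["vector"], query), reverse=True)
    return [r["id"] for r in candidates[:limit]]


# Tiny 2-dim vectors keep the example readable; the RFC uses 384 dims.
rows = [
    {"id": 1, "event_type": 0, "vector": [1.0, 0.0]},
    {"id": 2, "event_type": 0, "vector": [0.0, 1.0]},
    {"id": 3, "event_type": 1, "vector": [1.0, 0.0]},  # not a post, filtered out
    {"id": 4, "event_type": 0, "vector": [0.9, 0.1]},
]
result = filtered_search(rows, [1.0, 0.0], limit=2)
```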

## Synchronization Strategy

1. **Write Path:**
   - Insert into DuckDB (ACID transaction)
   - Generate embedding (local model, ONNX Runtime)
   - Insert into LanceDB (async, eventual consistency)

2. **Read Path:**
   - DuckDB: Structured queries, counts, timelines
   - LanceDB: Vector similarity, clustering
   - Hybrid: Vector + time filter (LanceDB filter API)
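
The write path above can be sketched as a synchronous structured write followed by deferred embedding work. In this minimal illustration plain dicts stand in for DuckDB and LanceDB and a caller-supplied function stands in for the ONNX model; `FeedStoreSketch` and its methods are hypothetical and only show the ordering and the eventual-consistency window, not a real binding.

```python
from collections import deque


class FeedStoreSketch:
    """Hypothetical stand-in: dicts replace DuckDB and LanceDB."""

    def __init__(self, embed):
        self.events = {}        # stand-in for the DuckDB events table
        self.vectors = {}       # stand-in for the LanceDB embeddings table
        self.pending = deque()  # event ids still waiting for an embedding
        self.embed = embed      # stand-in for the local embedding model

    def write(self, event_id, text):
        """Step 1: structured write happens immediately; embedding is deferred."""
        self.events[event_id] = text
        self.pending.append(event_id)

    def drain(self):
        """Steps 2 and 3: embed and index queued events; returns how many were indexed."""
        indexed = 0
        while self.pending:
            event_id = self.pending.popleft()
            self.vectors[event_id] = self.embed(self.events[event_id])
            indexed += 1
        return indexed

    def is_searchable(self, event_id):
        """True once the event has reached the vector index."""
        return event_id in self.vectors
```

Between `write` and `drain` an event is visible to timeline-style reads but not to vector search, which is the eventual-consistency gap the strategy accepts.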

## Implementation Phases

### Phase 1: DuckDB Core (Sprint 4)
- [ ] DuckDB Zig bindings (C API wrapper)
- [ ] Event storage/retrieval
- [ ] Timeline queries
- [ ] Thread reconstruction

### Phase 2: LanceDB Integration (Sprint 5)
- [ ] LanceDB Rust bindings (via C FFI)
- [ ] Embedding storage
- [ ] ANN search
- [ ] Hybrid queries

### Phase 3: Optimization (Sprint 6)
- [ ] WAL for durability
- [ ] Compression (zstd for content)
- [ ] Incremental backups
- [ ] RAM usage optimization

## Dependencies

| Library | Version | Purpose | Size |
|---------|---------|---------|------|
| DuckDB | 0.9.2 | Structured storage | ~15MB → 5MB stripped |
| LanceDB | 0.9.x | Vector storage | ~20MB → 8MB stripped |
| ONNX Runtime | 1.16 | Embeddings | Optional, ~50MB |

**Total binary impact:** ~13MB (DuckDB + LanceDB stripped, excluding ONNX)

## Open Questions

1. **Embedding Model:** All-MiniLM-L6-v2 (22MB) or something smaller?
2. **Sync Strategy:** LanceDB as an optional index (graceful degradation)?
3. **Replication:** Event sourcing for node-to-node sync?

---

*Sovereign; Kinetic; Anti-Fragile.* ⚡️