Architecture¶
Overview¶
Knowmarks is a local-first application. All data stays on your machine in a SQLite database. External services (LLM APIs, embedding APIs) are optional and configurable.
Component Diagram¶
┌─────────────┐
│ Browser │
│ Extension │
└──────┬──────┘
│ REST API
┌──────────┐ ┌─────────┴─────────┐ ┌──────────┐
│ CLI │ │ Web Dashboard │ │ MCP │
│ (Click) │ │ (FastAPI) │ │ Server │
└────┬─────┘ └─────────┬─────────┘ └────┬─────┘
│ │ │
└────────────────────┼────────────────────┘
│
┌───────────┴───────────┐
│ Core Layer │
│ │
│ ingest search │
│ embed cluster │
│ freshness projects │
│ curated classify │
│ llm connectors │
└───────────┬───────────┘
│
┌───────────┴───────────┐
│ SQLite Database │
│ WAL mode + FTS5 │
└───────────────────────┘
Key Modules¶
Core Layer (src/knowmarks/core/)¶
| Module | Responsibility |
|---|---|
| `db.py` | SQLite schema, CRUD operations, FTS5 virtual tables, vector storage as BLOBs |
| `embed.py` | Pluggable embedding providers (fastembed, Ollama, OpenAI API) |
| `extract.py` | Content extraction from HTML, PDF, YouTube, GitHub, local files |
| `ingest.py` | Full pipeline: fetch, extract, embed, classify, connect |
| `search.py` | Hybrid search with Reciprocal Rank Fusion (vector + FTS5) |
| `cluster.py` | Semantic clustering with post-merge consolidation |
| `freshness.py` | Vitality scoring and cluster-relative decay |
| `classify.py` | Content type classification |
| `projects.py` | Project association with score-gap detection |
| `curated.py` | Curated collections (keyword-seeded topic lenses) |
| `llm.py` | LLM client (OpenAI-compatible API) |
| `probes.py` | Source-aware staleness probes (GitHub, npm, HTTP) |
| `connectors/` | Import connectors (browser, GitHub, Karakeep, Readwise, file) |
Web Layer (src/knowmarks/web/)¶
- `app.py` — FastAPI routes and dashboard serving
- `api_v1.py` — REST API v1 (27 routes, Bearer auth, full MCP parity)
- `static/` — Dashboard HTML, CSS, JS (no build step)
MCP Layer (src/knowmarks/mcp/)¶
- `server.py` — 27 tools via FastMCP with stdio transport
Data Flow¶
Save¶
- URL received via CLI, dashboard, MCP, or REST API
- `ingest.py` orchestrates: validate URL, fetch content, extract text
- `extract.py` handles format-specific extraction (HTML via trafilatura, PDF via pdfplumber, etc.)
- `embed.py` generates a 384-dimension vector
- `classify.py` determines content type
- `db.py` stores everything in SQLite (content, embedding as BLOB, metadata)
- Project auto-association and near-duplicate detection run post-save
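The save pipeline can be sketched in a few lines. This is an illustrative condensation, not the real `ingest.py`: the `fetch`/`extract`/`embed`/`classify` callables and the column names are hypothetical stand-ins for the actual modules and schema.

```python
import sqlite3
import struct

def save_knowmark(db: sqlite3.Connection, url: str,
                  fetch, extract, embed, classify) -> int:
    """Sketch of the ingest flow: fetch -> extract -> embed -> classify -> store.

    The four callables stand in for the real modules; names and
    signatures here are hypothetical.
    """
    raw = fetch(url)                       # raw HTML/PDF bytes
    text, title = extract(raw)             # format-specific extraction
    vector = embed(text)                   # e.g. 384 floats
    content_type = classify(text)          # e.g. "article"
    # Vectors are stored as raw BLOBs; packing floats is one simple encoding.
    blob = struct.pack(f"{len(vector)}f", *vector)
    cur = db.execute(
        "INSERT INTO knowmarks (url, title, content, content_type, embedding) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, title, text, content_type, blob),
    )
    db.commit()
    return cur.lastrowid
```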
Search¶
- Query received
- `embed.py` generates a query vector
- `search.py` runs both:
    - Vector search: cosine similarity against all embeddings (loaded from SQLite BLOBs)
    - FTS5 search: full-text keyword matching
- Reciprocal Rank Fusion (k=60) merges both ranked lists
- Results returned with relevance explanations
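Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over the lists it appears in, so items ranked well by both vector and keyword search float to the top. A minimal sketch (the function name is illustrative, not the actual `search.py` API):

```python
def rrf_merge(vector_ranked: list[str], fts_ranked: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked lists of document IDs with Reciprocal Rank Fusion.

    score(d) = sum over lists containing d of 1 / (k + rank_of_d_in_list).
    """
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, fts_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

With k=60, a document ranked moderately in both lists outranks one that tops only a single list, which is the behavior hybrid search wants.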
Bulk Import¶
Imports of 10+ items use a phased pipeline:
- Metadata phase: Insert URL + title + domain with `fetch_status='pending'`. Items are keyword-searchable immediately via FTS5.
- Enrichment phase: A `ThreadPoolExecutor` (8 workers) runs fetch, extract, embed, and classify for each item. Each worker uses its own DB connection (SQLite WAL supports concurrent reads).
- Post-import sweep: Near-duplicate detection compares new embeddings against the full collection.
Database¶
SQLite with WAL mode for concurrent read access. Key tables:
- `knowmarks` — Core content (URL, title, content, metadata, embedding BLOB)
- `clusters` / `cluster_members` — Semantic topic clusters
- `projects` / `project_knowmarks` — Project associations
- `curated_collections` / `curated_members` — Keyword-seeded collections
- `connectors` — External source configuration and health
- `conversations` / `messages` — Persistent chat history
FTS5 virtual tables index title, content, URL, domain, and notes for full-text search.
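As a rough sketch of how such an FTS5 index can sit next to the content table, using SQLite's external-content pattern (the real schema, table names, and sync triggers differ; this is a minimal assumed layout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WAL only takes effect for file-backed databases; shown for completeness.
conn.execute("PRAGMA journal_mode=WAL")
# Content table, plus an external-content FTS5 index over the searchable columns.
conn.execute("""
    CREATE TABLE knowmarks (
        id INTEGER PRIMARY KEY, url TEXT, domain TEXT,
        title TEXT, content TEXT, notes TEXT, embedding BLOB)
""")
conn.execute("""
    CREATE VIRTUAL TABLE knowmarks_fts USING fts5(
        title, content, url, domain, notes,
        content='knowmarks', content_rowid='id')
""")
# With external content, the index must be kept in sync by the application
# (or by triggers); here we mirror the row manually.
conn.execute(
    "INSERT INTO knowmarks (id, url, domain, title, content) VALUES "
    "(1, 'https://example.com/wal', 'example.com', 'WAL notes', "
    "'SQLite write-ahead logging basics')")
conn.execute(
    "INSERT INTO knowmarks_fts (rowid, title, content, url, domain, notes) "
    "SELECT id, title, content, url, domain, notes FROM knowmarks WHERE id = 1")
hits = conn.execute(
    "SELECT rowid FROM knowmarks_fts WHERE knowmarks_fts MATCH 'logging'"
).fetchall()
```

The external-content form avoids storing the text twice: FTS5 keeps only the inverted index and reads row content from `knowmarks` on demand.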