Architecture¶
Overview¶
Knowmarks is a local-first application. All data stays on your machine in a SQLite database. External services (LLM APIs, embedding APIs) are optional and configurable.
Component Diagram¶
┌─────────────┐
│ Browser │
│ Extension │
└──────┬──────┘
│ REST API
┌──────────┐ ┌─────────┴─────────┐ ┌──────────┐
│ CLI │ │ Web Dashboard │ │ MCP │
│ (Click) │ │ (FastAPI) │ │ Server │
└────┬─────┘ └─────────┬─────────┘ └────┬─────┘
│ │ │
└────────────────────┼────────────────────┘
│
┌───────────┴───────────┐
│ Core Layer │
│ │
│ ingest search │
│ embed cluster │
│ freshness projects │
│ curated classify │
│ llm connectors │
└───────────┬───────────┘
│
┌───────────┴───────────┐
│ SQLite Database │
│ WAL mode + FTS5 │
└───────────────────────┘
Key Modules¶
Core Layer (src/knowmarks/core/)¶
| Module | Responsibility |
|---|---|
| `db.py` | SQLite schema, CRUD operations, FTS5 virtual tables, vector storage as BLOBs |
| `embed.py` | Pluggable embedding providers (fastembed, Ollama, OpenAI API) |
| `extract.py` | Content extraction from HTML, PDF, YouTube, GitHub, local files |
| `ingest.py` | Full pipeline: fetch, extract, embed, classify, connect |
| `search.py` | Hybrid search with Reciprocal Rank Fusion (vector + FTS5) |
| `cluster.py` | Semantic clustering with post-merge consolidation |
| `freshness.py` | Vitality scoring and cluster-relative decay |
| `classify.py` | Content type classification |
| `projects.py` | Project association with score-gap detection |
| `curated.py` | Curated collections (keyword-seeded topic lenses) |
| `llm.py` | LLM client (OpenAI-compatible API) |
| `probes.py` | Source-aware staleness probes (GitHub, npm, HTTP) |
| `connectors/` | Import connectors (browser, GitHub, Karakeep, Readwise, file) |
Web Layer (src/knowmarks/web/)¶
- `app.py` — FastAPI routes and dashboard serving
- `api_v1.py` — REST API v1 (27 routes, Bearer auth, full MCP parity)
- `static/` — Dashboard HTML, CSS, JS (no build step)
MCP Layer (src/knowmarks/mcp/)¶
- `server.py` — 27 tools via FastMCP with stdio transport
Data Flow¶
Save¶
- URL received via CLI, dashboard, MCP, or REST API
- `ingest.py` orchestrates: validate URL, fetch content, extract text
- `extract.py` handles format-specific extraction (HTML via trafilatura, PDF via pdfplumber, etc.)
- `embed.py` generates a 384-dimension vector
- `classify.py` determines content type
- `db.py` stores everything in SQLite (content, embedding as BLOB, metadata)
- Project auto-association and near-duplicate detection run post-save
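The save pipeline can be sketched in a few lines. This is an illustrative condensation, not the real `ingest.py`: the `fetch`/`extract`/`embed`/`classify` callables and the column names are hypothetical stand-ins for the actual modules and schema.

```python
import sqlite3
import struct

def save_knowmark(db: sqlite3.Connection, url: str,
                  fetch, extract, embed, classify) -> int:
    """Sketch of the ingest flow: fetch -> extract -> embed -> classify -> store.

    The four callables stand in for the real modules; names and
    signatures here are hypothetical.
    """
    raw = fetch(url)                       # raw HTML/PDF bytes
    text, title = extract(raw)             # format-specific extraction
    vector = embed(text)                   # e.g. 384 floats
    content_type = classify(text)          # e.g. "article"
    # Vectors are stored as raw BLOBs; packing floats is one simple encoding.
    blob = struct.pack(f"{len(vector)}f", *vector)
    cur = db.execute(
        "INSERT INTO knowmarks (url, title, content, content_type, embedding) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, title, text, content_type, blob),
    )
    db.commit()
    return cur.lastrowid
```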
Search¶
- Query received
- `embed.py` generates a query vector
- `search.py` runs both:
    - Vector search: cosine similarity against all embeddings (loaded from SQLite BLOBs)
    - FTS5 search: full-text keyword matching
- Reciprocal Rank Fusion (k=60) merges both ranked lists
- Results returned with relevance explanations
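Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over the lists it appears in, so items ranked well by both vector and keyword search float to the top. A minimal sketch (the function name is illustrative, not the actual `search.py` API):

```python
def rrf_merge(vector_ranked: list[str], fts_ranked: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked lists of document IDs with Reciprocal Rank Fusion.

    score(d) = sum over lists containing d of 1 / (k + rank_of_d_in_list).
    """
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, fts_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

With k=60, a document ranked moderately in both lists outranks one that tops only a single list, which is the behavior hybrid search wants.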
Bulk Import¶
Imports of 10+ items use a phased pipeline:
- Metadata phase: Insert URL + title + domain with `fetch_status='pending'`. Items are keyword-searchable immediately via FTS5.
- Enrichment phase: A `ThreadPoolExecutor` (8 workers) runs fetch, extract, embed, and classify for each item. Each worker uses its own DB connection (SQLite WAL supports concurrent reads).
- Post-import sweep: Near-duplicate detection compares new embeddings against the full collection.
Database¶
SQLite with WAL mode for concurrent read access. Key tables:
- `knowmarks` — Core content (URL, title, content, metadata, embedding BLOB)
- `clusters` / `cluster_members` — Semantic topic clusters
- `projects` / `project_knowmarks` — Project associations
- `curated_collections` / `curated_members` — Keyword-seeded collections
- `connectors` — External source configuration and health
- `conversations` / `messages` — Persistent chat history
FTS5 virtual tables index title, content, URL, domain, and notes for full-text search.
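As a rough sketch of how such an FTS5 index can sit next to the content table, using SQLite's external-content pattern (the real schema, table names, and sync triggers differ; this is a minimal assumed layout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WAL only takes effect for file-backed databases; shown for completeness.
conn.execute("PRAGMA journal_mode=WAL")
# Content table, plus an external-content FTS5 index over the searchable columns.
conn.execute("""
    CREATE TABLE knowmarks (
        id INTEGER PRIMARY KEY, url TEXT, domain TEXT,
        title TEXT, content TEXT, notes TEXT, embedding BLOB)
""")
conn.execute("""
    CREATE VIRTUAL TABLE knowmarks_fts USING fts5(
        title, content, url, domain, notes,
        content='knowmarks', content_rowid='id')
""")
# With external content, the index must be kept in sync by the application
# (or by triggers); here we mirror the row manually.
conn.execute(
    "INSERT INTO knowmarks (id, url, domain, title, content) VALUES "
    "(1, 'https://example.com/wal', 'example.com', 'WAL notes', "
    "'SQLite write-ahead logging basics')")
conn.execute(
    "INSERT INTO knowmarks_fts (rowid, title, content, url, domain, notes) "
    "SELECT id, title, content, url, domain, notes FROM knowmarks WHERE id = 1")
hits = conn.execute(
    "SELECT rowid FROM knowmarks_fts WHERE knowmarks_fts MATCH 'logging'"
).fetchall()
```

The external-content form avoids storing the text twice: FTS5 keeps only the inverted index and reads row content from `knowmarks` on demand.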