Architecture

Overview

Knowmarks is a local-first application. All data stays on your machine in a SQLite database. External services (LLM APIs, embedding APIs) are optional and configurable.

Component Diagram

                   ┌─────────────┐
                   │   Browser   │
                   │  Extension  │
                   └──────┬──────┘
                          │ REST API
┌──────────┐    ┌─────────┴─────────┐    ┌──────────┐
│   CLI    │    │   Web Dashboard   │    │   MCP    │
│  (Click) │    │    (FastAPI)      │    │  Server  │
└────┬─────┘    └─────────┬─────────┘    └────┬─────┘
     │                    │                    │
     └────────────────────┼────────────────────┘
                          │
              ┌───────────┴───────────┐
              │       Core Layer      │
              │                       │
              │  ingest   search      │
              │  embed    cluster     │
              │  freshness projects   │
              │  curated  classify    │
              │  llm      connectors  │
              └───────────┬───────────┘
                          │
              ┌───────────┴───────────┐
              │    SQLite Database    │
              │    WAL mode + FTS5    │
              └───────────────────────┘

Key Modules

Core Layer (src/knowmarks/core/)

Module          Responsibility
db.py           SQLite schema, CRUD operations, FTS5 virtual tables, vector storage as BLOBs
embed.py        Pluggable embedding providers (fastembed, Ollama, OpenAI API)
extract.py      Content extraction from HTML, PDF, YouTube, GitHub, local files
ingest.py       Full pipeline: fetch, extract, embed, classify, connect
search.py       Hybrid search with Reciprocal Rank Fusion (vector + FTS5)
cluster.py      Semantic clustering with post-merge consolidation
freshness.py    Vitality scoring and cluster-relative decay
classify.py     Content type classification
projects.py     Project association with score-gap detection
curated.py      Curated collections (keyword-seeded topic lenses)
llm.py          LLM client (OpenAI-compatible API)
probes.py       Source-aware staleness probes (GitHub, npm, HTTP)
connectors/     Import connectors (browser, GitHub, Karakeep, Readwise, file)
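As noted above, db.py stores embedding vectors as BLOBs in SQLite. A minimal sketch of that round-trip using the standard-library struct module (function names here are illustrative, not the actual db.py API):

```python
import sqlite3
import struct

def vector_to_blob(vec):
    """Pack a list of floats into a little-endian float32 BLOB."""
    return struct.pack(f"<{len(vec)}f", *vec)

def blob_to_vector(blob):
    """Unpack a float32 BLOB back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowmarks (id INTEGER PRIMARY KEY, embedding BLOB)")

vec = [0.1, -0.2, 0.3]
conn.execute("INSERT INTO knowmarks (embedding) VALUES (?)", (vector_to_blob(vec),))

(blob,) = conn.execute("SELECT embedding FROM knowmarks").fetchone()
restored = blob_to_vector(blob)
```

Storing vectors as float32 BLOBs keeps each 384-dimension embedding at a compact 1,536 bytes per row.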

Web Layer (src/knowmarks/web/)

  • app.py — FastAPI routes and dashboard serving
  • api_v1.py — REST API v1 (27 routes, Bearer auth, full MCP parity)
  • static/ — Dashboard HTML, CSS, JS (no build step)

MCP Layer (src/knowmarks/mcp/)

  • server.py — 27 tools via FastMCP with stdio transport

Data Flow

Save

  1. URL received via CLI, dashboard, MCP, or REST API
  2. ingest.py orchestrates: validate URL, fetch content, extract text
  3. extract.py handles format-specific extraction (HTML via trafilatura, PDF via pdfplumber, etc.)
  4. embed.py generates a 384-dimension vector
  5. classify.py determines content type
  6. db.py stores everything in SQLite (content, embedding as BLOB, metadata)
  7. Project auto-association and near-duplicate detection run post-save
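The save flow above is a linear pipeline. A hypothetical sketch of the orchestration, where the stage names mirror the modules listed earlier but the functions are stubs rather than the real Knowmarks APIs:

```python
# Stub stages standing in for the real modules; not the actual Knowmarks code.
def fetch(url):
    return "<p>saved content from " + url + "</p>"

def extract(html):                      # extract.py
    return html.replace("<p>", "").replace("</p>", "")

def embed(text):                        # embed.py
    return [0.0] * 384                  # stands in for a real 384-dimension vector

def classify(text):                     # classify.py
    return "article"

def save(url):
    """Orchestrate the save pipeline: fetch, extract, embed, classify."""
    record = {"url": url}
    record["text"] = extract(fetch(url))
    record["embedding"] = embed(record["text"])
    record["content_type"] = classify(record["text"])
    return record                       # db.py would persist this record

result = save("https://example.com/post")
```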

Search

  1. Query received
  2. embed.py generates a query vector
  3. search.py runs both:
       • Vector search: cosine similarity against all embeddings (loaded from SQLite BLOBs)
       • FTS5 search: full-text keyword matching
  4. Reciprocal Rank Fusion (k=60) merges the two ranked lists
  5. Results are returned with relevance explanations
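Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, so documents that rank well in both searches float to the top. A minimal sketch with k=60 (not the actual search.py code):

```python
def rrf_merge(ranked_lists, k=60):
    """Merge best-first ranked lists with Reciprocal Rank Fusion.

    A document's fused score is the sum of 1 / (k + rank) across
    every list it appears in (ranks start at 1).
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]   # ordered by cosine similarity
fts_hits = ["b", "d", "a"]      # ordered by FTS5 relevance
fused = rrf_merge([vector_hits, fts_hits])
# → ["b", "a", "d", "c"]: "b" wins because it ranks highly in both lists
```

With k=60, a single top rank contributes 1/61, so no one list can dominate the fused ordering.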

Bulk Import

Imports of 10+ items use a phased pipeline:

  1. Metadata phase: Insert URL + title + domain with fetch_status='pending'. Items are keyword-searchable immediately via FTS5.
  2. Enrichment phase: ThreadPoolExecutor (8 workers) runs fetch, extract, embed, classify for each item. Each worker uses its own DB connection (SQLite WAL supports concurrent reads).
  3. Post-import sweep: Near-duplicate detection compares new embeddings against the full collection.
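The first two phases can be sketched as follows. Each worker opens its own connection so WAL mode can serve concurrent readers while writers serialize; the table and function names are illustrative, not the actual ingest.py code:

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(fd)

conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE knowmarks (url TEXT PRIMARY KEY, fetch_status TEXT)")

# Metadata phase: insert rows immediately with fetch_status='pending',
# so they are keyword-searchable before enrichment completes.
urls = [f"https://example.com/{i}" for i in range(20)]
conn.executemany("INSERT INTO knowmarks VALUES (?, 'pending')", [(u,) for u in urls])
conn.commit()

def enrich(url):
    # Enrichment phase: each worker uses its own DB connection.
    local = sqlite3.connect(db_path)
    # ... fetch, extract, embed, classify would run here ...
    local.execute("UPDATE knowmarks SET fetch_status='done' WHERE url=?", (url,))
    local.commit()
    local.close()

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(enrich, urls))

done = conn.execute(
    "SELECT COUNT(*) FROM knowmarks WHERE fetch_status='done'"
).fetchone()[0]
```

Python's sqlite3 default busy timeout (5 seconds) lets the serialized writers queue up behind one another instead of failing with "database is locked".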

Database

SQLite with WAL mode for concurrent read access. Key tables:

  • knowmarks — Core content (URL, title, content, metadata, embedding BLOB)
  • clusters / cluster_members — Semantic topic clusters
  • projects / project_knowmarks — Project associations
  • curated_collections / curated_members — Keyword-seeded collections
  • connectors — External source configuration and health
  • conversations / messages — Persistent chat history

FTS5 virtual tables index title, content, URL, domain, and notes for full-text search.
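A minimal sketch of the FTS5 side with a reduced column set (the real index also covers URL, domain, and notes); this assumes an SQLite build with the FTS5 extension, which Python's bundled SQLite normally includes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced two-column version of the full-text index.
conn.execute("CREATE VIRTUAL TABLE knowmarks_fts USING fts5(title, content)")
conn.execute(
    "INSERT INTO knowmarks_fts VALUES (?, ?)",
    ("SQLite WAL mode", "Write-ahead logging allows concurrent readers."),
)
conn.execute(
    "INSERT INTO knowmarks_fts VALUES (?, ?)",
    ("FTS5 basics", "Full-text search over title and content columns."),
)

# MATCH queries all indexed columns; rank orders by BM25 relevance.
hits = conn.execute(
    "SELECT title FROM knowmarks_fts WHERE knowmarks_fts MATCH ? ORDER BY rank",
    ("concurrent",),
).fetchall()
```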