Transform a messy Obsidian vault into clean PARA + Zettelkasten structure using LLMs.
Most knowledge workers accumulate hundreds of notes in Obsidian over time, but the vault gradually becomes a tangled mess of long-form drafts, bullet dumps, and half-finished thoughts. ZettelVault fixes that. It reads every note from one or more source vaults, classifies each into the PARA method (Projects / Areas / Resources / Archive), then decomposes each note into atomic Zettelkasten notes - one idea per note, heavily cross-linked. Under the hood, it uses DSPy for structured LLM interaction, with dspy.RLM (Recursive Language Models) as the primary decomposition strategy.
Beyond vault restructuring, this project also serves as reference code for using dspy.RLM for document decomposition - a technique applicable to any use case where long-form documents need to be split into structured, atomic units.
- Pipeline
- RLM vs Predict/ChainOfThought
- Model Comparison
- Production Run (GLM-5)
- Cost Tracking
- Design Decisions
- Known Limitations
- Data Loss Prevention
- Setup
- Usage
- Testing
- Project Structure
- Potential Improvements
- References
The pipeline has five steps, each feeding into the next. The diagram below shows the overall flow, and the sections that follow explain each step in detail.
Source Vault(s)                                           Destination Vault
+------------------+                                      +--------------------+
| Note A (messy)   | 1. Read    2. Classify               | 1. Projects/       |
| Note B (messy)   | ---------> [PARA bucket] -------->   |   TopicA/          |
| Note C (messy)   |            [domain/tags]           | |     Atomic Note 1  |
| ...              |                                    | |     Atomic Note 2  |
+------------------+            3. Decompose            | | 2. Areas/          |
                                [RLM -> REPL]           | |   TopicB/          |
                                [code analysis]         | |     Atomic Note 3  |
                                [sub-LM calls]          | | 3. Resources/      |
                                            |           | |   TopicC/          |
                                  4. Write  |           | |     Atomic Note 4  |
                                  ----------+---------->| | 4. Archive/        |
                          5. Resolve links              | | MOC/               |
                                            +---------->| |   Domain-A.md      |
                                                          |   Domain-B.md      |
                                                          +--------------------+
The first thing ZettelVault needs to do is read every .md file from one or more source vaults. Rather than walking the filesystem and manually parsing Obsidian's internal structures, ZettelVault delegates this to vlt.
vlt is a fast, zero-dependency CLI tool (a compiled Go binary) purpose-built for operating on Obsidian vaults without requiring the Obsidian desktop app, Electron, Node.js, or any network calls. It reads and writes vault files directly via the filesystem, starts in sub-millisecond time by leveraging the OS page cache, and uses advisory file locking for safe concurrent access.
ZettelVault uses vlt instead of reading files directly for several reasons:
- Vault registry awareness. vlt understands Obsidian's vault registry, so vault names map to filesystem paths automatically. You pass a vault name, not a directory path.
- Native Obsidian parsing. Frontmatter extraction, wikilink resolution, and tag parsing are handled natively by vlt, so ZettelVault does not need to reimplement any of that logic.
- Multi-vault support. Working with multiple source vaults is trivial - just pass vault names and vlt handles the rest.
- Structured output. vlt produces JSON output suitable for pipeline consumption, making it easy to integrate with Python scripts.
- No internal configuration parsing. There is no need to understand or parse Obsidian's .obsidian/ directory structure.
Multiple source vaults are supported: pass space-separated vault names and ZettelVault merges them (first vault wins on title collision). No Obsidian desktop app is required for processing, but it is recommended for viewing results.
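The "first vault wins" rule can be sketched in a few lines. This is an illustration only: `merge_vaults` is a hypothetical helper, and it assumes each vault has already been loaded into a plain title-to-content mapping.

```python
def merge_vaults(vaults):
    """Merge {title: content} mappings; the first vault wins on a title collision."""
    merged = {}
    for notes in vaults:
        for title, content in notes.items():
            # setdefault keeps the existing entry, so earlier vaults take priority.
            merged.setdefault(title, content)
    return merged


vault_a = {"Inbox": "from A", "DSPy": "notes on DSPy"}
vault_b = {"Inbox": "from B", "Networking": "TCP basics"}
merged = merge_vaults([vault_a, vault_b])
print(merged["Inbox"])
```

Here the `Inbox` note from the first vault is kept, and the second vault only contributes titles the first one lacks.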
Once all notes are loaded, ZettelVault classifies each one into the PARA framework. This gives every note a clear place in the output vault's folder hierarchy. Each note receives:
- PARA bucket: Projects, Areas, Resources, or Archive
- Domain: a primary knowledge domain (e.g., "AI/ML", "Engineering", "Health")
- Subdomain: a specific area within the domain (e.g., "DSPy", "Networking", "Nutrition")
- Tags: 3-7 lowercase hyphenated tags
The classification uses a typed DSPy Signature with Literal type for PARA buckets, ensuring the model always produces a valid category.
Classification results are cached to classified_notes.json after every 50 notes for crash resilience. If the cache exists on a subsequent run, only new (uncached) notes are classified.
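A minimal sketch of this caching pattern, assuming the cache is a JSON object keyed by note title. `classify_with_cache` and its `classify` argument are hypothetical names for illustration, not ZettelVault's actual functions.

```python
import json
from pathlib import Path


def classify_with_cache(notes, classify, cache_path, every=50):
    """Classify only uncached notes, saving the cache every `every` notes."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    pending = [title for title in notes if title not in cache]
    for done, title in enumerate(pending, start=1):
        cache[title] = classify(notes[title])
        if done % every == 0:
            # Periodic checkpoint: a crash loses at most `every - 1` results.
            cache_path.write_text(json.dumps(cache, indent=2))
    # Final save covers the tail that did not land on a checkpoint boundary.
    cache_path.write_text(json.dumps(cache, indent=2))
    return cache
```

On a second run with the same cache file, `pending` contains only titles that were not classified before, so earlier LLM calls are not repeated.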
This is the heart of the pipeline. Each classified note is decomposed into atomic Zettelkasten notes, where "atomic" means one idea per note, heavily cross-linked to its siblings. To make this reliable, ZettelVault uses a three-level fallback strategy:
- dspy.RLM (primary) - the model writes Python code in a sandboxed REPL to programmatically analyze the note's structure, then generates atomic notes
- dspy.Predict with retry - direct LLM call with escalating temperature (0.1, 0.4, 0.7) and cache bypass
- Single-atom passthrough - guaranteed success; emits the original note as-is
Before decomposition begins, a concept index maps meaningful words to note titles, enabling cross-link suggestions. Each note receives a list of the most conceptually similar note titles as context, so the LLM can generate relevant wikilinks.
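One plausible way to build such an index is sketched below. The stopword list, the word pattern, and the scoring by shared-word count are all assumptions made for illustration; the project's actual heuristics may differ.

```python
import re
from collections import Counter, defaultdict

# Assumed stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "into"}


def meaningful_words(text):
    """Lowercase words of 3+ characters that are not stopwords."""
    words = re.findall(r"[a-z][a-z0-9-]{2,}", text.lower())
    return {word for word in words if word not in STOPWORDS}


def build_concept_index(notes):
    """Map each meaningful word to the set of note titles that contain it."""
    index = defaultdict(set)
    for title, content in notes.items():
        for word in meaningful_words(f"{title} {content}"):
            index[word].add(title)
    return index


def related_titles(title, notes, index, top_n=5):
    """Rank other notes by how many meaningful words they share with `title`."""
    scores = Counter()
    for word in meaningful_words(f"{title} {notes[title]}"):
        for other in index[word]:
            if other != title:
                scores[other] += 1
    return [other for other, _ in scores.most_common(top_n)]


notes = {
    "DSPy Basics": "signatures and modules in dspy",
    "RLM Notes": "dspy modules run code in a sandbox",
    "Sourdough": "flour water salt",
}
index = build_concept_index(notes)
print(related_titles("DSPy Basics", notes, index))
```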
The output format uses ===-delimited markdown blocks (not JSON - see Design Decisions for why).
Notes that fall back to Predict or passthrough are logged to fallback_notes.json with the note title, reason, and atom count. This log can be used to selectively reprocess those notes later (see make reprocess).
Decomposition results are checkpointed to atomic_notes.json after every note. If the cache exists on a subsequent run, already-decomposed notes (matched by source title) are skipped automatically, enabling progressive processing across multiple sessions.
With classification and decomposition complete, ZettelVault writes the atomic notes to the destination vault. Each note gets:
- YAML frontmatter (tags, domain, subdomain, source note, type)
- Non-conflicting original frontmatter fields preserved (aliases, cssclass, plugin-specific fields)
- Markdown body with a # Title heading
- A ## Related section with [[wikilinks]] to related notes
- Files organized into PARA_bucket/Subdomain/Title.md
- Collision handling: duplicate filenames get a _1, _2, etc. suffix
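The collision rule in the last bullet can be sketched as follows. `unique_path` is a hypothetical helper written for this example; the project's own writer may structure this differently.

```python
from pathlib import Path


def unique_path(folder, title):
    """Return folder/Title.md, or Title_1.md, Title_2.md, ... if taken."""
    candidate = Path(folder) / f"{title}.md"
    counter = 1
    while candidate.exists():
        candidate = Path(folder) / f"{title}_{counter}.md"
        counter += 1
    return candidate
```

Because the function only ever returns a path that does not exist yet, writing to it never overwrites an earlier note with the same title.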
A Map of Content (MOC) note per domain is generated in MOC/, containing [[wikilinks]] to all atomic notes in that domain. These MOC notes serve as navigational hubs once you open the vault in Obsidian.
The .obsidian configuration directory from the first source vault is copied to the destination (if it does not already exist), so plugins, themes, and settings carry over.
The final step cleans up the link graph. During decomposition, the LLM generates wikilinks to related notes, but not all of those targets actually exist in the output vault. ZettelVault scans the destination vault and resolves orphan [[wikilinks]] - links that point to notes that do not exist - using a four-tier strategy:
- Case-insensitive match: [[project planning]] resolves to Project Planning.md
- Fuzzy match: [[Proj Planning]] resolves to Project Planning.md if the similarity ratio is >= 0.85 (uses difflib.SequenceMatcher)
- Stub creation: orphan links referenced by 3+ notes get a stub note created in the most common PARA folder among referencing notes, with a "Referenced by" section
- Dead link removal: orphan links with only 1-2 references are stripped of brackets (converted to plain text)
The result is a clean, navigable link graph with no dangling references.
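The four tiers can be sketched as a single decision function. This is a simplified illustration, not the contents of resolve.py: `resolve_link` is a hypothetical name, and comparing lowercased strings in the fuzzy tier is an assumption.

```python
from difflib import SequenceMatcher


def resolve_link(target, existing_titles, reference_count,
                 threshold=0.85, stub_min_refs=3):
    """Return (action, title) for an orphan wikilink target."""
    # Tier 1: case-insensitive match against existing note titles.
    by_lower = {title.lower(): title for title in existing_titles}
    if target.lower() in by_lower:
        return ("match", by_lower[target.lower()])

    # Tier 2: fuzzy match using difflib's similarity ratio.
    best_title, best_ratio = None, 0.0
    for title in existing_titles:
        ratio = SequenceMatcher(None, target.lower(), title.lower()).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = title, ratio
    if best_ratio >= threshold:
        return ("fuzzy", best_title)

    # Tier 3: frequently referenced orphans get a stub note.
    if reference_count >= stub_min_refs:
        return ("stub", target)

    # Tier 4: rarely referenced orphans become plain text.
    return ("plain", target)


titles = ["Project Planning"]
print(resolve_link("project planning", titles, 1))
print(resolve_link("Proj Planning", titles, 1))
```

For "Proj Planning" against "Project Planning", 13 of the 29 total characters pair up on each side, giving a ratio of 26/29 (about 0.90), which clears the 0.85 threshold.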
The key insight behind this project is that document decomposition benefits enormously from programmatic analysis. Traditional dspy.Predict or dspy.ChainOfThought feed the entire document into the LLM's context window and ask it to generate decomposed output in a single pass. dspy.RLM takes a fundamentally different approach: the document content is never loaded into the LLM's primary context. Instead, it is stored as a variable (context) inside a sandboxed REPL environment, and the LLM writes Python code to access and process it programmatically.
This distinction matters: the LLM's context window is used for reasoning and code generation, not for holding the document. The document lives in the execution environment as data, accessible via code. This means RLM can handle documents far larger than the model's context window - the model only ever sees the slices it explicitly reads via code.
When decomposing a note, the RLM module:
- Stores the note content in a context variable inside a Deno/Pyodide WASM sandbox - not in the LLM's prompt
- The LLM writes Python code to access context and analyze the note's structure (headings, bullet points, paragraphs)
- The code executes in the sandbox; output is captured and shown to the LLM
- The LLM iterates - writing more code to refine its analysis, split content, generate titles and tags
- For semantic tasks, the LLM calls llm_query() from within the sandbox (e.g., "what is the main idea of this paragraph?")
- When done, the LLM calls SUBMIT(decomposed=...) to return the final result
Because the document is data in the REPL rather than tokens in the prompt, the model can:
- Process documents of arbitrary length (500K+ characters) without context window pressure
- Count sections and bullet points programmatically
- Split content at structural boundaries with precision
- Use regex to extract patterns (tags, links, metadata)
- Make sub-LM calls for semantic understanding of specific sections
- Self-correct by inspecting intermediate results
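To make this concrete, here is the kind of structural analysis an orchestrator model might write inside the REPL. It is a hand-written illustration rather than captured model output; the sample `context` value and the `split_sections` helper are invented for the example, and no sandbox or `llm_query()` call is involved.

```python
import re

# In the real sandbox, `context` would hold the note body; this is sample data.
context = """# Project Plan
Intro paragraph.

## Goals
- Ship v1
- Write docs

## Risks
- Scope creep
"""


def split_sections(text):
    """Split markdown at heading lines and return (heading, body) pairs."""
    parts = re.split(r"(?m)^(#{1,6} .+)$", text)
    # parts is [preamble, heading1, body1, heading2, body2, ...]
    return [
        (parts[i].lstrip("# ").strip(), parts[i + 1].strip())
        for i in range(1, len(parts) - 1, 2)
    ]


sections = split_sections(context)
# Count bullet lines per section to decide how finely to split each one.
bullet_counts = {
    heading: len(re.findall(r"(?m)^\s*[-*] ", body))
    for heading, body in sections
}
print(bullet_counts)
```

Counts like these are exact because they come from code, which is the advantage the list above describes: the model reasons over measured structure instead of estimating it from a long prompt.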
To quantify the difference, we tested both approaches on 4 source notes from an Obsidian vault, using qwen/qwen3.5-35b-a3b via OpenRouter (Parasail provider):
| Metric | dspy.Predict | dspy.RLM |
|---|---|---|
| Atomic notes produced | 17 | 23 |
| Notes with fallback | 1 (25%) | 0 (0%) |
| Success rate | 75% (3/4) | 100% (4/4) |
| Avg iterations per note | n/a | 5.5 |
| Sub-LM calls (total) | n/a | 3 |
| Provider-reported cost | ~$0.04 | ~$0.08 |
| Latency per note | ~5s | ~30s |
Key findings:
- RLM solved the deterministic failure case. A bullet-heavy note always failed with Predict across all temperatures (0.1 to 0.7) and retry counts. RLM decomposed it successfully on the first attempt by writing code to iterate over bullet points.
- RLM produces more atomic notes. 23 vs 17 - RLM splits more aggressively because it can programmatically identify section boundaries.
- RLM generates richer cross-links. The REPL allows the model to compare note content against the related titles list programmatically.
- Cost is 2x but still very low. At $0.08 for 4 notes, the cost per note is ~$0.02 with RLM. For vault restructuring (a batch operation run once), this is negligible.
| Use Case | Recommendation |
|---|---|
| Structured documents (headings, lists) | RLM - programmatic splitting is more reliable |
| Short, simple notes (1-2 paragraphs) | Predict - RLM overhead isn't justified |
| Notes with complex cross-references | RLM - can programmatically match against related titles |
| High-volume batch processing (1000+ notes) | RLM with cost monitoring - 2x cost may matter at scale |
| Real-time / interactive use | Predict - 5s vs 30s latency matters for UX |
| Notes that consistently fail with Predict | RLM - its code-based approach bypasses template collisions |
dspy.ChainOfThought adds a reasoning step before output generation. For decomposition tasks, this reasoning competes with the actual output for the model's output token budget - the model spends tokens planning what it will do rather than doing it. Whether this trade-off helps depends on the task and model; we chose not to use it for decomposition because the reasoning step doesn't produce actionable intermediate results the model can inspect or correct.
RLM is architecturally different - it doesn't add reasoning text, it adds execution. The model writes code that runs, producing concrete intermediate results it can inspect and refine. Each REPL iteration is a feedback loop, not a one-shot preamble.
Choosing the right model matters. We tested multiple open-source models as RLM orchestrators across two dimensions: quantitative metrics (cost, speed, convergence) and qualitative assessment (output quality, content fidelity, link richness). All tests used the same 4 source notes, run via OpenRouter.
These results reflect only the models we tested during the evaluation phase. Our focus during evaluation was on open-source models because many users want to run this locally via LM Studio, Ollama, or similar tools. After evaluation, we selected GLM-5 for the production run on our full vault (see Production Run (GLM-5)).
| Metric | Qwen 3.5-35B-A3B | MiniMax M2.5 | Kimi K2.5 |
|---|---|---|---|
| Atomic notes | 23 | 16 | 18 |
| RLM iterations | 21 (5.2 avg) | 17 (4.2 avg) | 15 (3.8 avg) |
| Sub-LM calls | 2 | 3 | 0 |
| LLM calls | 33 | 22 | 20 |
| Input tokens | 19K | 48K | 37K |
| Output tokens | 33K | 13K | 10K |
| Provider cost | $0.080 | $0.033 | $0.051 |
| Max iters hit | 0 | 0 | 0 |
Numbers only tell part of the story. We manually reviewed every atom produced by each model, comparing against the source notes for content fidelity, duplication, granularity, and cross-linking quality.
| Dimension | Qwen 3.5-35B-A3B | MiniMax M2.5 | Kimi K2.5 |
|---|---|---|---|
| Content preservation | Good - uses original phrasing | Poor - invents/paraphrases | Best - verbatim + source attribution |
| Content duplication | Near-duplicate atom pairs | Verbatim duplicate atoms | Zero duplicates |
| Granularity judgment | Over-splits (1 paragraph -> 7 atoms) | Mixed | Slightly over-splits (1 paragraph -> 4 atoms) |
| Cross-linking | Basic (links to source notes only) | Basic | Rich (links between sibling atoms + source notes) |
| Source attribution | None | None | Includes "From:" and "See also:" references |
Kimi K2.5 is the quality winner among the models we evaluated. It produces the most faithful content extraction with zero duplicates, the richest cross-link graph (it creates links between sibling atoms within the same source note, not just back to source notes), and proper source attribution. It also converges fastest (3.8 avg iterations) with the fewest total LLM calls.
Qwen 3.5-35B-A3B is a solid baseline that preserves original text well but over-splits aggressively - it will fragment a single paragraph into many single-sentence atoms, destroying the paragraph's coherence. MiniMax M2.5 converges fast and is the cheapest, but produces duplicate content and invents text not present in the source.
Kimi K2.5's strong showing during evaluation set the quality ceiling, and this informed our final model choice for the production run. See the next section for how GLM-5 performed at full vault scale.
We also tested different sub-LM models for llm_query() calls while keeping Qwen as the orchestrator:
| Metric | Same model | Liquid LFM-2 | Qwen Flash | Mercury-2 | Step-3.5-Flash |
|---|---|---|---|---|---|
| Atomic notes | 24 | 19 | 23 | 21 | 21 |
| RLM iterations | 26 (6.5 avg) | 23 (5.8 avg) | 42 (10.5 avg) | 42 (10.5 avg) | 42 (10.5 avg) |
| Sub-LM calls | 9 | 2 | 2 | 2 | 2 |
| Provider cost | $0.094 | $0.051 | $0.111 | $0.116 | $0.125 |
| Max iters hit | 0 | 0 | 2 | 2 | 2 |
The sub-LM currently has minimal influence - the orchestrator makes very few llm_query() calls (2-9 across 4 notes). A signature or prompt change that encourages more sub-LM usage could shift the economics of dual-model setups.
Some models (Kimi K2.5, DeepSeek R1, etc.) default to "thinking mode" where tokens go to an internal reasoning trace before producing content. This can cause DSPy to hang indefinitely - the model burns its token budget on reasoning and never emits content.
Fix for Kimi K2.5: Disable reasoning and use XMLAdapter (which matches Kimi's post-training format):
# config.local.yaml
model:
  id: "moonshotai/kimi-k2.5"
  provider: "openrouter"
  max_tokens: 32000
  adapter: "xml"
  reasoning:
    enabled: false

The reasoning config maps directly to OpenRouter's reasoning parameter in the request body. For Qwen 3.5, thinking mode doesn't cause issues because it returns reasoning within the content field rather than a separate reasoning field.
After evaluating Qwen, MiniMax, and Kimi K2.5 on 4 test notes, we chose GLM-5 from Zhipu AI for the full production run. GLM-5 was accessed via z.ai's OpenAI-compatible API endpoint, not through OpenRouter.
Two factors drove this decision. First, GLM-5's code generation capabilities proved strong enough for RLM's REPL-driven decomposition, matching the quality ceiling that Kimi K2.5 had established during evaluation. Second, z.ai's API endpoint is OpenAI-compatible, so no code changes were needed, just a config override pointing api_base at z.ai and setting the model ID.
The production run processed a real vault of 774 notes (merged from two source vaults containing 671 and 103 notes respectively), producing 5,805 atomic notes. Here are the key metrics:
| Metric | GLM-5 (Production) |
|---|---|
| Source notes | 774 |
| Atomic notes produced | 5,805 |
| Total REPL iterations | 3,981 (5.2 avg per note) |
| Sub-LM calls | 839 |
| Fallback to dspy.Predict | 35 / 769 (4.6%) |
| Runtime | ~51 hours |
| Provider cost | $0.00 (see note below) |
The 4.6% fallback rate means that 35 out of 769 notes (a few notes were filtered or deduplicated during merging) fell back from RLM to dspy.Predict. Notably, every one of those 35 fallback notes produced results that required no manual adjustment. This validates the three-level fallback strategy: even when RLM fails, Predict picks up the slack cleanly.
On cost: The cost report shows $0 because the author has a yearly subscription to z.ai at a fixed price, making the marginal cost of running GLM-5 effectively zero. If you use z.ai's API directly without a subscription, you will incur whatever per-token costs Zhipu charges. The cost tracker itself has no pricing data for z.ai (it is not in OpenRouter's model catalog), so it cannot calculate costs for this endpoint automatically.
It is worth noting that the evaluation data (Qwen, MiniMax, Kimi K2.5) used only 4 test notes, while the GLM-5 numbers come from the full 774-note vault. The evaluation was designed to compare model quality and choose a winner; the production run was designed to process everything. The 5.2 average REPL iterations per note in production aligns closely with the evaluation results (3.8 to 6.5 range), suggesting that the evaluation was representative of real-world behavior.
Although GLM-5 was excellent for the production run, Qwen 3.5-35B-A3B is a very strong contender for running the entire pipeline locally. An MLX version is available for Mac, making it practical to process a full vault without any API costs at all. For users who want full control and privacy, this is worth considering. Qwen performed well during evaluation (see Model Comparison), and running it locally eliminates both cost and rate-limiting concerns.
ZettelVault includes a standalone cost tracking module (pricing.py) that:
- Fetches real-time pricing from OpenRouter's /api/v1/models catalog at startup
- Tracks token usage per pipeline phase by inspecting DSPy's lm.history
- Reports both calculated and provider-reported costs for accuracy
======================================================================
COST REPORT: Qwen3.5-35B-A3B
Model: qwen/qwen3.5-35b-a3b
Pricing: $0.1625/M input, $1.3000/M output
Context window: 262,144 tokens
======================================================================
Phase Calls Input Output Cost
----------------------------------------------------------------------
classification 4 8,234 1,102 $0.002768
decomposition 8 45,123 12,456 $0.023526 [22 iters, 3 sub]
----------------------------------------------------------------------
TOTAL 12 53,357 13,558 $0.026294
TOTAL (provider) $0.082341
======================================================================
The "provider" total reflects OpenRouter's actual billing, which includes routing costs and provider markup. It is more accurate than the calculated total, which uses catalog pricing.
from pricing import CostTracker

tracker = CostTracker("qwen/qwen3.5-35b-a3b")

with tracker.phase("classification"):
    for note in notes:
        classify(note)

with tracker.phase("decomposition") as phase:
    decompose_all(notes)
    phase.rlm_iterations = total_iters  # optional RLM metrics
    phase.rlm_sub_calls = total_subs

tracker.report()

The tracker works by snapshotting lm.history length before each phase and summing token counts from new entries after. DSPy stores usage data in two places per history entry:
- entry["usage"] - dict with prompt_tokens, completion_tokens
- entry["response"].usage - LiteLLM ModelResponse object
The tracker checks both, preferring the dict. It also sums entry["cost"] (LiteLLM's per-call cost) for the provider-reported total.
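The snapshot-and-sum approach can be sketched without DSPy by treating the history as a plain list of dicts. `usage_of` and `phase_totals` are hypothetical helpers, not the API of pricing.py, and the entry layout follows only what is described above.

```python
def usage_of(entry):
    """Return (input_tokens, output_tokens), preferring the usage dict."""
    usage = entry.get("usage")
    if usage:
        return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)
    # Fall back to the usage attribute on the response object, if present.
    response_usage = getattr(entry.get("response"), "usage", None)
    if response_usage is not None:
        return (getattr(response_usage, "prompt_tokens", 0),
                getattr(response_usage, "completion_tokens", 0))
    return 0, 0


def phase_totals(history, start, input_price_per_m, output_price_per_m):
    """Sum tokens and costs for history entries added since index `start`."""
    new_entries = history[start:]
    tokens_in = sum(usage_of(entry)[0] for entry in new_entries)
    tokens_out = sum(usage_of(entry)[1] for entry in new_entries)
    calculated = (tokens_in * input_price_per_m
                  + tokens_out * output_price_per_m) / 1_000_000
    # Entries without a cost value contribute nothing to the provider total.
    provider = sum(entry.get("cost") or 0.0 for entry in new_entries)
    return {"input": tokens_in, "output": tokens_out,
            "calculated": calculated, "provider": provider}
```

A caller would record `start = len(lm.history)` before a phase and pass it in afterwards, so only that phase's calls are counted.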
Known limitation: In our tests, the provider-reported cost (summed from entry["cost"]) was consistently higher than the cost we calculated from token counts. The exact cause of this delta is unclear - it may be due to DSPy internal retries, adapter overhead, or provider-side billing differences. We recommend treating the provider-reported total as the ground truth for actual spend, and the per-phase calculated values as useful for relative comparison between phases.
Several design choices in ZettelVault are non-obvious and worth explaining. Each one was made after encountering a specific failure mode during development.
The decomposition output uses ===-delimited markdown blocks rather than JSON:
Title: Atomic Note Title
Tags: tag1, tag2, tag3
Links: Related Note A, Related Note B
Body:
The actual content of this atomic note...
===
Title: Another Note
...
JSON output failed consistently because:
- Models generate trailing commas, unescaped quotes, and invalid unicode
- Large outputs exceed the model's ability to maintain valid JSON structure
- Error recovery requires reparsing the entire output
Markdown-delimited output is forgiving: each section is parsed independently, malformed sections are skipped, and the regex parser handles model quirks (concatenated hashtags, bracket-wrapped links, .md extensions).
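A minimal parser for this format might look like the sketch below. It handles only the four fields shown above and skips malformed blocks; `parse_atoms` is a hypothetical name, and the quirk handling mentioned in the text (concatenated hashtags, bracket-wrapped links, .md extensions) is deliberately left out.

```python
import re

# Header fields appear one per line before the "Body:" marker.
FIELD = re.compile(r"^(Title|Tags|Links):[ \t]*(.*)$", re.MULTILINE)


def split_list(value):
    """Split a comma-separated field into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_atoms(output):
    """Parse ===-delimited blocks into dicts, skipping malformed blocks."""
    atoms = []
    for block in re.split(r"(?m)^===[ \t]*$", output):
        # Separate the header from the body so body text cannot shadow fields.
        parts = re.split(r"(?m)^Body:[ \t]*\n?", block, maxsplit=1)
        if len(parts) != 2:
            continue
        header, body = parts
        fields = {key.lower(): value.strip() for key, value in FIELD.findall(header)}
        if not fields.get("title"):
            continue
        atoms.append({
            "title": fields["title"],
            "tags": split_list(fields.get("tags", "")),
            "links": split_list(fields.get("links", "")),
            "body": body.strip(),
        })
    return atoms
```

Each block is parsed on its own, so one broken block costs one atom rather than invalidating the whole response, which is the property that made this format preferable to JSON here.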
Obsidian notes have two structural elements that must survive the pipeline intact:
- Wikilinks ([[note title]]) - the author's link graph. DSPy's template system uses [[ ## field ## ]] markers that collide with wikilink syntax. Stripping them would destroy the vault's cross-references.
- YAML frontmatter - metadata used by Obsidian plugins (Dataview, Templater, etc.). Properties like aliases, cssclass, publish, and custom plugin fields must not be discarded.
Wikilinks are escaped to Unicode guillemets (\u00ab / \u00bb) before sending to DSPy, and restored to [[brackets]] after parsing output. The roundtrip is lossless, including edge cases like [[Note (2024)]] and [[link|alias]].
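The escape and restore steps can be sketched with two regular expressions. This is an illustration rather than the code in sanitize.py, and it assumes the note text does not already contain guillemets; a source note that did would need extra handling to stay lossless.

```python
import re

OPEN, CLOSE = "\u00ab", "\u00bb"  # left and right guillemets


def escape_wikilinks(text):
    """Replace [[target]] with guillemet-wrapped text before calling the LLM."""
    return re.sub(
        r"\[\[([^\[\]]+)\]\]",
        lambda match: f"{OPEN}{match.group(1)}{CLOSE}",
        text,
    )


def restore_wikilinks(text):
    """Turn guillemet-wrapped targets back into [[target]] after parsing."""
    return re.sub(
        f"{OPEN}([^{OPEN}{CLOSE}]+){CLOSE}",
        lambda match: f"[[{match.group(1)}]]",
        text,
    )


original = "See [[Note (2024)]] and [[link|alias]]."
escaped = escape_wikilinks(original)
print(escaped)
print(restore_wikilinks(escaped) == original)
```

The link text itself is never altered, so parentheses and the alias pipe pass through both directions unchanged.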
Frontmatter is extracted from the source note before sanitization via extract_frontmatter(), carried through the pipeline alongside classification data, and merged into each atomic note's output. Generated fields (tags, domain, subdomain, source, type) take precedence; all other original properties are preserved as-is.
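The precedence rule amounts to a dictionary merge in which generated keys overwrite originals. The sketch below uses plain dicts and a hypothetical `merge_frontmatter` helper; YAML parsing and serialization are out of scope.

```python
def merge_frontmatter(original, generated):
    """Keep original properties, letting generated fields take precedence."""
    merged = dict(original)   # start from the author's properties
    merged.update(generated)  # generated fields overwrite on conflict
    return merged


original = {"aliases": ["PP"], "cssclass": "wide", "tags": ["old-tag"]}
generated = {
    "tags": ["project-planning"],
    "domain": "Engineering",
    "subdomain": "Planning",
    "source": "Project Planning",
    "type": "zettel",  # placeholder value for the example
}
merged = merge_frontmatter(original, generated)
print(merged["tags"], merged["aliases"])
```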
Reliability matters more than perfection when processing hundreds of notes in a batch. The fallback chain (RLM -> Predict with retry -> passthrough) ensures the pipeline never fails:
- RLM handles complex, structured notes with high fidelity
- Predict with retry catches cases where RLM fails (e.g., sandbox issues, timeout)
- Passthrough guarantees every note appears in the output, even if decomposition fails entirely
This means the pipeline can process an entire vault without manual intervention. Notes that fell back can always be reprocessed later with make reprocess.
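The control flow of the chain can be sketched independently of DSPy. In this sketch `rlm_decompose`, `predict_decompose`, and `is_valid` are hypothetical callables supplied by the caller; only the temperatures come from the description earlier in this document.

```python
def decompose_with_fallback(note, rlm_decompose, predict_decompose, is_valid,
                            temperatures=(0.1, 0.4, 0.7)):
    """Try RLM, then Predict at rising temperatures, then pass the note through."""
    try:
        result = rlm_decompose(note)
        if is_valid(result):
            return result, "rlm"
    except Exception:
        pass  # any RLM failure falls through to the next level

    for temperature in temperatures:
        try:
            result = predict_decompose(note, temperature=temperature)
            if is_valid(result):
                return result, f"predict@{temperature}"
        except Exception:
            continue

    # Guaranteed success: the original note becomes a single atom.
    return note, "passthrough"
```

Returning the path taken alongside the result is what makes a fallback log possible: the caller can record every note whose second value is not "rlm".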
Both classification and decomposition support progressive processing:
- Classification: classified_notes.json is saved every 50 notes. On restart, only uncached notes are classified.
- Decomposition: atomic_notes.json is saved after every single note. On restart, notes whose source title already appears in the cache are skipped.
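This design means a crash during decomposition after processing 400 of 800 notes loses at most 1 note's work; a crash during classification loses at most the notes classified since the last 50-note checkpoint. Combined with the three-level fallback, it makes large vault migrations practical even over unreliable connections or with rate-limited APIs.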
There is an important distinction between what vlt preserves and what the LLM decomposition step preserves. Understanding this distinction will save you from surprises, especially if your vault relies heavily on Obsidian plugins.
vlt uses a 6-pass inert zone masking system that preserves comments and metadata during scanning. Obsidian comments (%% ... %%), HTML comments (<!-- ... -->), code blocks, inline code, and math expressions are all masked (replaced with spaces preserving byte offsets) before link and tag scanning. This means content inside these zones is never modified by vlt during read operations. Frontmatter is always preserved by vlt's write operations (only the body is modified).
ZettelVault's pipeline sends note content through an LLM for decomposition. The LLM does not have vlt's inert zone awareness. This means:
- Comments may be lost. Both Obsidian comments (%% ... %%) and HTML comments (<!-- ... -->) inside note bodies may be dropped, rewritten, or misinterpreted by the LLM during decomposition.
- Plugin-specific syntax will likely not survive. Dataview queries, Templater commands, Excalidraw drawing data, and other plugin-specific content stored in the note body will probably not come through decomposition intact. The LLM treats all body content as natural language text.
- Frontmatter fields ARE preserved. ZettelVault extracts frontmatter before LLM processing and merges it back into each atomic note. Generated fields (tags, domain, subdomain, source, type) take precedence; all other original properties are kept as-is.
If your vault relies heavily on Dataview queries, Templater scripts, or other plugin-generated content embedded in note bodies, be aware that this content will likely need manual restoration after decomposition. Frontmatter-based plugin data (custom properties, aliases, cssclass) will survive without issues.
ZettelVault processes your entire knowledge base through an LLM pipeline, so data safety is a core design concern, not an afterthought. The protections fall into five layers: preventing damage before it happens, surviving failures mid-run, keeping files safe on disk, preserving information through the LLM round-trip, and making everything observable.
| Mechanism | What it does |
|---|---|
| Dry-run mode (--dry-run, make dry-run) | Runs the full classification and decomposition pipeline but prints a sample of results and exits before writing any files to disk. |
| Sample mode (--sample) | Selects a small representative subset of notes for testing the full pipeline without processing the entire vault. |
| Limit flag (--limit N) | Processes only the first N notes, useful for bounded testing before committing to a multi-hour full run. |
| LLM output validation | Every LLM response is validated before acceptance: minimum 100 characters, must contain a real title (5+ chars), must contain a Body: field, and template garbage ({decomposed}, ## ]]) is rejected. See decompose.py:is_valid_output(). |
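The validation rules in the last row of the table above can be sketched as a single predicate. This is a re-creation from the description, not the body of decompose.py:is_valid_output(); the function name `looks_valid` and the exact regular expressions are assumptions.

```python
import re


def looks_valid(output):
    """Apply the four checks: length, real title, Body field, no template debris."""
    if len(output) < 100:
        return False
    # Leaked template markers indicate the model echoed the prompt scaffold.
    if "{decomposed}" in output or "## ]]" in output:
        return False
    title = re.search(r"(?m)^Title:[ \t]*(.+)$", output)
    if title is None or len(title.group(1).strip()) < 5:
        return False
    return re.search(r"(?m)^Body:", output) is not None
```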
| Mechanism | What it does |
|---|---|
| Progressive checkpointing | Classification results are saved to classified_notes.json every 50 notes. Decomposition results are saved to atomic_notes.json after every single note. A crash after 400 of 800 notes loses at most 1 note's work. |
| Resume capability (make resume, make resume-all) | Reloads cached checkpoint data so a crashed or interrupted run continues where it left off rather than restarting from zero. |
| Three-level fallback | Decomposition uses a guaranteed-success chain: (1) RLM programmatic decomposition, (2) Predict with temperature retries, (3) passthrough that emits the original note unchanged. Every note always produces output. |
| Per-note error isolation | Exceptions during decomposition of a single note are caught, logged, and skipped. One problematic note never crashes the entire pipeline. |
| Fallback logging (fallback_notes.json) | Every note that used a degraded path (Predict or passthrough) is recorded with the reason, enabling targeted reprocessing later via make reprocess. |
| Mechanism | What it does |
|---|---|
| Filename collision handling | When a note title matches an existing file, a counter suffix (_1, _2, ...) is appended instead of overwriting. See writer.py:write_note(). |
| Filename sanitization | Unsafe characters (`< > : " / \ |
| Idempotent directory creation | All mkdir calls use parents=True, exist_ok=True, making folder creation safe to repeat. |
| Configuration layering | config.yaml (defaults) is overlaid by config.local.yaml (user overrides), with deep merge preserving unset keys. Misconfiguration in one layer doesn't destroy defaults. |
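The deep merge behind configuration layering can be sketched as a short recursive function. `deep_merge` is a hypothetical helper operating on already-parsed dicts; loading the YAML files is omitted.

```python
def deep_merge(defaults, overrides):
    """Overlay overrides on defaults, recursing into nested dicts."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections so unset keys keep their defaults.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"model": {"id": "qwen/qwen3.5-35b-a3b", "max_tokens": 32000,
                      "temperature": 0.1}}
local = {"model": {"id": "moonshotai/kimi-k2.5", "adapter": "xml"}}
config = deep_merge(defaults, local)
print(config["model"])
```

A shallow `dict.update` would replace the whole `model` section and silently drop `max_tokens` and `temperature`; the recursive merge keeps them.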
This is the most critical layer. Files can be intact on disk while the information inside them has been silently mangled by the LLM. ZettelVault addresses this at every stage of the pipeline.
| Mechanism | What it does |
|---|---|
| Source note tracking | Every atomic note carries a source_note field linking it back to the original note title, enabling audits and reconstruction. |
| Frontmatter preservation | YAML frontmatter is extracted before LLM processing (sanitize.py:extract_frontmatter()), carried through the pipeline, and merged back into each output note. Generated fields (tags, domain, subdomain, source, type) override; all other original properties (aliases, cssclass, plugin fields) are kept as-is. |
| Wikilink round-trip | [[wikilinks]] are escaped to Unicode guillemets (« / ») before sending to DSPy (avoiding collision with DSPy's [[ ## field ## ]] template markers), then restored to [[brackets]] after parsing. The round-trip is lossless. See sanitize.py. |
| Tag preservation | Classification tags and domain assignments are attached to every atom produced from a note, whether via RLM, Predict, or passthrough. |
| Zero-loss passthrough guarantee | If both RLM and Predict fail, the passthrough fallback emits the complete original content (not the truncated version sent to the LLM) as a single atomic note with all metadata intact. It is impossible for a note to enter the pipeline and not appear in the output. |
| Input truncation transparency | Long notes are truncated to max_input_chars (default 8000) before LLM processing, but the full original content is preserved separately. The passthrough fallback always uses the full content. |
| Link resolution with stub creation | Orphan wikilinks referenced by 3+ notes get stub notes created automatically (preserving the link graph). Dead links with fewer references are removed cleanly rather than left broken. See resolve.py. |
| Classification carry-forward | PARA bucket, domain, and subdomain assignments from the classification phase are embedded in every atomic note, so organizational context is never lost even if decomposition degrades. |
| Mechanism | What it does |
|---|---|
| Progress reporting with ETA | Percentage complete, elapsed time, and estimated time remaining are printed at regular intervals during both classification and decomposition phases. |
| Fallback audit trail | fallback_notes.json records which notes fell back and why, so you can assess quality and selectively reprocess. |
| Checkpoint inspection | classified_notes.json and atomic_notes.json are human-readable JSON files that can be inspected at any time, even mid-run. |
These are areas where data loss is theoretically possible and not yet mitigated:
- Link rewriting is non-atomic. resolve.py rewrites files in-place with write_text(). A crash mid-rewrite could leave a file in a partially written state. An atomic rename pattern would close this gap.
- No automatic git integration. The pipeline does not create git commits before destructive operations. Running inside a git repository and committing before a run is recommended but not enforced.
- MOC pages are overwritten. Map of Content pages are regenerated on each run without versioning. Domain-based naming makes accidental collisions unlikely, but previous MOC content is not preserved.
- No multi-file transaction. Notes are written individually. There is no mechanism to atomically write all notes in a batch or roll back a partial run's file output (though checkpoint-based resumption limits the blast radius).
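The atomic rename pattern mentioned in the first limitation could look roughly like this. It is a sketch using only the standard library, not code from resolve.py; `atomic_write_text` is a hypothetical helper name. The idea is to write the new content to a temporary file in the same directory and then swap it into place with `os.replace()`, so a reader sees either the old file or the new one, never a partial write.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text: str, encoding: str = "utf-8") -> None:
    """Replace `path` with `text` via a same-directory temp file and os.replace()."""
    path = Path(path)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap of the directory entry
    except BaseException:
        # On any failure, leave the original file untouched and clean up.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

This narrows the crash window for a single file only; it does not address the multi-file transaction limitation listed above.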
- Python >= 3.13
- uv (Python package manager)
- Deno (required for RLM's Pyodide WASM sandbox)
- vlt - a compiled Go binary that reads and writes Obsidian vault files directly via the filesystem, without requiring the Obsidian desktop app. ZettelVault uses vlt for vault discovery, note reading, and frontmatter/wikilink parsing. See Why vlt above for details.
- An OpenRouter API key (or a compatible API endpoint such as z.ai, LM Studio, or Ollama)
git clone https://github.com/RamXX/zettelvault.git
cd zettelvault
make install # runs uv sync
# Install Deno (macOS)
curl -fsSL https://deno.land/install.sh | sh
# Set your API key
echo 'OPENROUTER_API_KEY=sk-or-...' > .env
# Full pipeline: read from MyVault, write to ~/path/to/dest
make run SOURCE=MyVault DEST=~/path/to/dest
# Preview without writing files
make dry-run SOURCE=MyVault DEST=~/path/to/dest LIMIT=10
# Process multiple source vaults
make run SOURCE="VaultA VaultB" DEST=~/path/to/dest
All parameters are in config.yaml. Copy it to config.local.yaml to override settings without touching the tracked file (config.local.yaml is gitignored). You can also pass --config path/to/file.yaml on the command line.
# ── LLM Models ───────────────────────────────────────────────────────────────
# Primary model - used for classification and as the RLM orchestrator.
# Optional keys: adapter ("xml"|"json"), reasoning (passed to OpenRouter),
# route (OpenRouter provider routing), api_base, api_key_env.
model:
  id: "qwen/qwen3.5-35b-a3b"
  provider: "openrouter"
  max_tokens: 32000
  temperature: 0.1
  # adapter: "xml" # use XMLAdapter (recommended for Kimi K2.5)
  # api_base: "http://localhost:1234/v1" # for local models (LM Studio, Ollama)
  # api_key_env: "MY_API_KEY" # env var name for the API key (default: OPENROUTER_API_KEY)
  # reasoning: # OpenRouter reasoning params (for thinking models)
  #   enabled: false
  route: # OpenRouter provider routing (optional)
    only: ["Parasail"]
# Sub-LM - used inside RLM for llm_query() calls (semantic tasks).
# Can be a smaller/cheaper model to improve cost ratio.
# If not set, the primary model is used for sub-LM calls too.
sub_model:
  id: "qwen/qwen3.5-35b-a3b"
  provider: "openrouter"
  max_tokens: 32000
  route:
    only: ["Parasail"]
# ── RLM Settings ─────────────────────────────────────────────────────────────
rlm:
  max_iterations: 15 # REPL iterations before fallback
  max_llm_calls: 30 # sub-LM calls per decomposition
  max_output_chars: 15000 # truncation limit per REPL output
  verbose: false
# ── Pipeline Settings ────────────────────────────────────────────────────────
pipeline:
  max_retries: 3 # Predict fallback retry count
  max_input_chars: 8000 # content truncation for LLM input
  retry_temp_start: 0.1 # initial temperature for retries
  retry_temp_step: 0.3 # temperature increment per retry
  classify_checkpoint: 50 # save classification cache every N notes
  concept_min_word_len: 4 # minimum word length for concept index
  related_top_n: 20 # top-N related notes for decomposition
# -- Link Resolution -------------------------------------------------------
resolve:
  fuzzy_threshold: 0.85 # SequenceMatcher ratio for fuzzy wikilink matching
  stub_min_refs: 3 # orphan links with >= N references get a stub note
# -- Sampling ---------------------------------------------------------------
sample:
  size: 10 # default number of notes for --sample
  bullet_heavy_threshold: 0.40 # fraction of bullet lines for bullet-heavy classification
  heading_heavy_threshold: 0.15 # fraction of heading lines for heading-heavy classification
  prose_heavy_threshold: 0.70 # fraction of prose lines for prose-heavy classification
Local overrides merge deeply, so you only need to specify the keys you want to change:
# config.local.yaml - use Kimi K2.5 as orchestrator with reasoning disabled
model:
  id: "moonshotai/kimi-k2.5"
  adapter: "xml"
  reasoning:
    enabled: false

| Key | Type | Default | Description |
|---|---|---|---|
| model.id | string | "qwen/qwen3.5-35b-a3b" | Model ID (OpenRouter or LiteLLM format) |
| model.provider | string | "openrouter" | LiteLLM provider prefix |
| model.max_tokens | int | 32000 | Max output tokens |
| model.temperature | float | 0.1 | Base temperature for classification |
| model.adapter | string | none | DSPy adapter: "xml" or "json" |
| model.api_base | string | none | Custom API endpoint (for local models) |
| model.api_key_env | string | none | Env var name for API key |
| model.reasoning | dict | none | OpenRouter reasoning params (e.g., enabled: false) |
| model.route | dict | none | OpenRouter provider routing (e.g., only: ["Parasail"]) |
| model.top_p | float | none | Top-p sampling parameter |
| sub_model.* | -- | -- | Same keys as model, applied to the sub-LM |
| rlm.max_iterations | int | 15 | Max REPL iterations before fallback |
| rlm.max_llm_calls | int | 30 | Max sub-LM calls per decomposition |
| rlm.max_output_chars | int | 15000 | Truncation limit per REPL output |
| rlm.verbose | bool | false | Print RLM REPL traces |
| pipeline.max_retries | int | 3 | Predict fallback retry count |
| pipeline.max_input_chars | int | 8000 | Content truncation for LLM input |
| pipeline.retry_temp_start | float | 0.1 | Initial retry temperature |
| pipeline.retry_temp_step | float | 0.3 | Temperature increment per retry |
| pipeline.classify_checkpoint | int | 50 | Save classification cache every N notes |
| pipeline.concept_min_word_len | int | 4 | Minimum word length for concept index |
| pipeline.related_top_n | int | 20 | Top-N related notes passed to decomposition |
| resolve.fuzzy_threshold | float | 0.85 | SequenceMatcher ratio cutoff for fuzzy link matching |
| resolve.stub_min_refs | int | 3 | Minimum orphan link references before creating a stub note |
| sample.size | int | 10 | Default number of notes to sample |
| sample.bullet_heavy_threshold | float | 0.40 | Fraction of bullet lines to classify as bullet-heavy |
| sample.heading_heavy_threshold | float | 0.15 | Fraction of heading lines to classify as heading-heavy |
| sample.prose_heavy_threshold | float | 0.70 | Fraction of prose lines to classify as prose-heavy |
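Deep merging of config.local.yaml over config.yaml can be illustrated as follows. This is a sketch of the general technique under stated assumptions, not the code in config.py: `deep_merge` is a hypothetical helper, and the dictionaries stand in for the parsed YAML files.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested keys in `override` replace those in `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: recurse so sibling keys survive.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys overwrite wholesale.
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for config.yaml and config.local.yaml after YAML parsing.
defaults = {
    "model": {
        "id": "qwen/qwen3.5-35b-a3b",
        "provider": "openrouter",
        "max_tokens": 32000,
        "temperature": 0.1,
    },
    "rlm": {"max_iterations": 15},
}
local = {
    "model": {
        "id": "moonshotai/kimi-k2.5",
        "adapter": "xml",
        "reasoning": {"enabled": False},
    }
}

config = deep_merge(defaults, local)
```

Here `config["model"]["id"]` comes from the local file, while `config["model"]["max_tokens"]` and the whole `rlm` section keep their default values, which is why an override file only needs the keys that differ.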
RLM decomposes documents using two distinct roles, and you can assign different models to each:
- Primary model (orchestrator) - writes the Python code, reasons about structure, decides how to split. This model needs to be capable enough to write correct code and understand document semantics.
- Sub-LM (worker) - handles llm_query() calls from within the REPL. These are simpler semantic tasks like "summarize this paragraph" or "generate tags for this content". A smaller, cheaper model may work well here.
By setting sub_model to a smaller model, you can reduce the cost of the most frequent LLM calls (sub-queries) while keeping the orchestrator capable. This is configured via DSPy's sub_lm parameter on dspy.RLM.
ZettelVault works with any OpenAI-compatible API. To use a local model via LM Studio, Ollama, or similar:
# config.local.yaml - local model via LM Studio
model:
  id: "my-local-model"
  provider: "openai"
  api_base: "http://localhost:1234/v1"
  api_key_env: "LOCAL_API_KEY" # set to any non-empty string in .env
  max_tokens: 32000

| Target | Description |
|---|---|
| make help | Show all targets and variables |
| make run | Full pipeline (read -> classify -> decompose -> write -> resolve links) |
| make dry-run | Classify + decompose, no file writes (preview mode) |
| make sample | Select representative notes for pipeline preview (uses SAMPLE_SIZE) |
| make resume | Skip classification, reuse classified_notes.json |
| make resume-all | Skip classify + decompose, reuse atomic_notes.json |
| make reprocess | Re-run only the notes that fell back to Predict (reads fallback_notes.json) |
| make status | Show progress of caches and current run |
| make clean | Remove all caches (classified_notes.json, atomic_notes.json, fallback_notes.json, migration_log.txt) |
| make clean-all | Remove caches + all .md files in destination vault (preserves .obsidian) |
| make install | Create venv and install dependencies (uv sync) |
| make test | Unit tests only (no API key needed) |
| make test-all | Unit + integration tests (needs OPENROUTER_API_KEY) |
| make lint | Run ruff linter |
make run SOURCE=MyVault DEST=~/path/to/dest
make run SOURCE="VaultA VaultB" DEST=~/path/to/dest
make dry-run LIMIT=10
make run CONFIG=config.local.yaml

| Variable | Default | Description |
|---|---|---|
| SOURCE | -- | Source vault name(s), space-separated |
| DEST | -- | Destination vault path |
| LIMIT | 0 (all) | Process only the first N notes |
| CONFIG | auto-detect | Path to config YAML override |
| SAMPLE_SIZE | 10 | Number of notes to sample |
# Full run
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest
# Disable RLM (use Predict only)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --no-rlm
# Process only first 10 notes
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --limit 10
# Dry run (no writes)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --dry-run
# Skip classification (load from cache)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --skip-classification
# Skip decomposition (load from cache)
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --skip-decomposition
# Multiple source vaults
uv run --env-file .env -- python -m zettelvault VaultA VaultB ~/path/to/dest
# Sample 5 representative notes for preview
uv run -p 3.13 -- python -m zettelvault MyVault --sample --sample-size 5
# Run pipeline on the sample
uv run --env-file .env -p 3.13 -- python -m zettelvault "$(pwd)/_sample/MyVault" ~/path/to/preview
# Custom config
uv run --env-file .env -- python -m zettelvault MyVault ~/path/to/dest --config config.local.yaml

| Argument | Description |
|---|---|
| source_vault | One or more source vault names (as known to vlt) |
| dest_vault | Destination vault path (absolute or ~/...) |
| --dry-run | No file writes; preview only |
| --no-rlm | Disable RLM; use dspy.Predict for decomposition |
| --skip-classification | Load pre-classified notes from classified_notes.json |
| --skip-decomposition | Load atomic notes from atomic_notes.json (implies --skip-classification) |
| --limit N | Process only the first N notes (0 = all) |
| --sample | Select representative notes into a sample vault (no LLM calls needed) |
| --sample-size N | Number of notes to sample (default: 10, from config) |
| --sample-dir PATH | Output directory for sample vault (default: ./_sample) |
| --config FILE | Path to config YAML override |
make test # Unit tests, no API key needed
make test-all # Unit + integration tests (needs OPENROUTER_API_KEY)
make lint # Run ruff linter
Integration tests are marked with @pytest.mark.integration and require OPENROUTER_API_KEY to be set.

| Module | Tests | Coverage |
|---|---|---|
| _safe_filename | 7 | Unsafe chars, empty input, leading dots |
| sanitize_content | 4 | Frontmatter stripping, wikilink escaping |
| is_valid_output | 6 | Length, template garbage, placeholder detection |
| parse_atoms | 11 | Single/multi atoms, tag normalization, link cleaning, hashtag splitting |
| _build_content | 6 | Frontmatter generation, domain/subdomain, tags, links, heading |
| write_note | 6 | PARA paths, collision handling, unsafe titles |
| write_moc | 4 | Domain grouping, wikilinks, deduplication |
| vlt_run / helpers | 4 | JSON/plain fallback, non-md filtering, error handling |
| pricing.py | 16 | ModelRate, PhaseUsage, API fetch, history extraction, CostTracker |
| Integration | 3 | Real LLM classification (2), decomposition (1) |
zettelvault/
  zettelvault/
    __init__.py # Public API exports
    __main__.py # CLI entry point (python -m zettelvault)
    config.py # Configuration loading and access
    vault_io.py # vlt CLI integration (read/list/resolve vaults)
    sanitize.py # Content sanitization and wikilink escaping
    classify.py # PARA classification and concept indexing
    decompose.py # RLM/Predict decomposition and atom parsing
    writer.py # Note and MOC file writing
    resolve.py # Orphan wikilink resolution (Step 5)
    pipeline.py # Pipeline class (LM init, orchestration)
    sample.py # Representative sample vault generation
    pricing.py # Cost tracking module (OpenRouter API + DSPy history)
  config.yaml # Default configuration
  config.local.yaml # Local overrides (gitignored)
  Makefile # Build targets
  pyproject.toml # Project metadata and dependencies
  pytest.ini # Test markers (integration)
  .env # API keys (gitignored, not committed)
  .gitignore
  tests/
    test_zettelvault.py # Unit + integration tests for the pipeline
    test_pricing.py # Unit tests for cost tracking
    conftest.py # Shared test fixtures
| File | Contents |
|---|---|
| classified_notes.json | PARA classifications + note content |
| atomic_notes.json | Decomposed atomic notes |
| fallback_notes.json | Notes that fell back to Predict or passthrough |
| migration_log.txt | stdout log from the last run |
The current implementation is deliberately sequential for clarity - this is reference code meant to be read and adapted. Below are optimizations that would improve throughput for large vaults (500+ notes) without changing the pipeline logic.
Parallel classification. Each classify_note() call is a single independent dspy.Predict invocation with no shared mutable state. These can be parallelized with asyncio.gather() and a Semaphore to cap concurrency:
import asyncio

import dspy

sem = asyncio.Semaphore(4)  # cap the number of in-flight LLM calls

async def classify_one(title, content):
    async with sem:
        # dspy.asyncify runs the synchronous classify_note in a worker thread
        return await dspy.asyncify(classify_note)(title, content)

async def classify_all(notes):
    return await asyncio.gather(*[classify_one(t, c) for t, c in notes])

results = asyncio.run(classify_all(notes))
At ~15s/note sequentially, 4-way concurrency would cut classification from ~3.5 hours to ~50 minutes on a 770-note vault.
Parallel decomposition. Each decompose_note() is also independent - the concept index is read-only during decomposition. RLM spawns a Deno subprocess per call, so ThreadPoolExecutor is a better fit than asyncio:
import json
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()

def decompose_and_checkpoint(title, data, related):
    atoms, iters, subs, method = decompose_note(title, data, related)
    with lock:  # serialize updates to the shared list and the checkpoint file
        all_atomic.extend(atoms)
        ATOMIC_CACHE.write_text(json.dumps(all_atomic, indent=2))
    return atoms, method

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(decompose_and_checkpoint, t, d, r) for t, d, r in work]
    results = [f.result() for f in futures]  # re-raises any worker exception
Start with 2-3 workers and increase if the API doesn't rate-limit. At ~2.5 min/note, 3 workers would reduce a 35-hour decomposition pass to ~12 hours.
What cannot be parallelized:
- Steps within a single RLM decomposition (sequential REPL iterations by design)
- Classification must complete before decomposition (the concept index needs all notes)
Pipeline overlap. A more advanced optimization: begin decomposing early notes while classification is still running. This requires building the concept index incrementally, which adds complexity for marginal gain since classification is the faster step.
- DSPy 3.0 - framework for structured LLM applications
- Recursive Language Models (Zhang, Kraska, Khattab, 2025) - the paper behind dspy.RLM
- PARA Method - Projects, Areas, Resources, Archive organizational framework
- Zettelkasten - atomic note-taking methodology with heavy cross-linking
- OpenRouter - LLM routing API with per-model pricing
- GLM-5 - Zhipu AI's large language model (used for the production run via z.ai)
- Kimi K2.5 - Moonshot AI's multimodal agentic model (best quality in evaluation)
- Qwen3.5-35B-A3B - 35B MoE model (3B active), 262K context
- vlt - zero-dependency Obsidian vault CLI for AI agents, CI/CD, and shell scripting
MIT
