Convert Microsoft Word documents into structured JSON content suitable for web applications, content management systems, or RAG (Retrieval-Augmented Generation) pipelines.
- Automatic TOC extraction - Identifies chapters and sections from Table of Contents
- md2rag-compatible JSON output - Structured format with navigation links
- Image extraction - Extracts all images including WMF to PNG conversion
- Table processing - Preserves complex table structures
- Markdown export - Optional parallel Markdown output
- Smart title detection - Extracts book title from document content
# 1. Install dependencies
make install-deps
# 2. Configure your book (optional - auto-detects from document)
cp book_config.toml.example book_config.toml
# Edit book_config.toml with your book's details
# 3. Place your Word document
cp your-book.docx original-book.docx
# 4. Build
make build
Output is generated in export/ (JSON) and export_md/ (Markdown).
export/
├── {lang}/                        # Language folder (e.g., "eng")
│   └── {book_id}/                 # Book ID folder
│       ├── _book.toml             # Book manifest
│       ├── 01/                    # Chapter 1
│       │   ├── intro.json         # Chapter intro
│       │   ├── 01.json            # Section 1.1
│       │   └── 02.json            # Section 1.2
│       └── 02/                    # Chapter 2
└── pictures/                      # Pictures at root level
    └── {lang}/
        └── {book_id}/
            └── 01/                # Mirrors chapter/section numbers
                └── 01/
                    ├── 001.png
                    └── manifest.json
export_md/                         # Markdown export
└── {lang}/
    ├── README.md
    ├── style.css
    └── 01/                        # Chapter 1
        ├── intro.md
        ├── 01.md                  # Section 1.1
        └── 01_01.md               # Subsection 1.1.1
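To illustrate how a downstream consumer (for example a RAG indexer) might read the export/ layout, here is a minimal sketch. It assumes the `id`, `title`, and `content` fields of the section JSON format described in this README; the function name `iter_sections` and the choice to keep only `paragraph` blocks are illustrative assumptions, not part of the tool.

```python
import json
from pathlib import Path


def _reading_order(path):
    # Sort by chapter folder, then put intro.json ahead of numbered sections.
    return (path.parent.name, path.stem != "intro", path.stem)


def iter_sections(book_dir):
    """Yield (section_id, title, plain_text) for each section JSON file.

    book_dir is assumed to be export/{lang}/{book_id}; chapter folders are
    assumed to start with a digit (01, 02, ...), as in the layout above.
    """
    files = sorted(Path(book_dir).glob("[0-9]*/*.json"), key=_reading_order)
    for path in files:
        data = json.loads(path.read_text(encoding="utf-8"))
        # Keep only paragraph blocks; images and tables are skipped here.
        paragraphs = [
            block.get("text", "")
            for block in data.get("content", [])
            if block.get("type") == "paragraph"
        ]
        yield data.get("id"), data.get("title"), "\n\n".join(paragraphs)
```

A call such as `list(iter_sections("export/eng/my-book-title"))` would return one tuple per section in reading order, which can then be chunked or embedded as needed.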
canonical_id = "my-book-title"
language = "eng"
title = "My Book Title"
is_original = true
Each section JSON file contains:
{
"id": "my-book-title/01/01",
"title": "Section Title",
"section_id": "chapter_name/section_name",
"links": [
{"type": "previous", "target": "my-book-title/01/intro"},
{"type": "next", "target": "my-book-title/01/02"}
],
"content": [
{"type": "paragraph", "text": "Paragraph content..."},
{"type": "image", "path": "pictures/01/01/001.png", "alt": "", "caption": ""},
{"type": "table", "rows": [{"cells": [{"text": "Cell content"}]}]}
]
}
Copy from book_config.toml.example and customize:
# Unique identifier for cross-references between books
canonical_id = "my-book-title"
# ISO 639-2 language code
language = "eng"
# Book title (auto-detected from document if empty)
title = "My Book Title"
# Is this the original language version?
is_original = true
# For translations only:
# original_language = "eng"
# Where to store pictures: "root", "book", or "chapter"
pictures_location = "root"
If title is left empty, it will be extracted from:
- DOCX metadata (if available)
- First paragraph of the document
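The fallback described above can be sketched as a small function. This is only an illustration: the name `resolve_title` is hypothetical, and the order (configured title, then DOCX metadata, then first paragraph) is an assumption based on the order of the list, not a statement about how build_book.py is written.

```python
def resolve_title(configured, metadata_title, first_paragraph):
    """Return the first non-empty candidate title, or None if all are empty.

    Assumed precedence: book_config.toml title, DOCX metadata title,
    then the first paragraph of the document.
    """
    for candidate in (configured, metadata_title, first_paragraph):
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

For example, `resolve_title("", "Meta Title", "First line")` returns `"Meta Title"`, since the configured title is empty.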
Edit build_book.py to customize paths:
INPUT_DOCX = "original-book.docx"
MARKDOWN_DIR = "export_md"
ENABLE_MARKDOWN = True  # Set to False to disable Markdown export
- Python 3.8+ with python-docx
- ImageMagick 7+ - Image processing
- Ghostscript - PDF to PNG conversion
- LibreOffice - WMF to PDF conversion (for Windows Metafile images)
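As a rough illustration of what a dependency check for the external tools listed above could look like, the sketch below searches PATH for each one. The executable names are assumptions about typical installs (ImageMagick 7 usually provides `magick`, Ghostscript `gs`, and LibreOffice `libreoffice` or `soffice`); this is not the implementation behind `make check-deps`.

```python
import shutil
from typing import Callable, Dict, Optional, Sequence

# Candidate executable names per tool (assumed; adjust for your platform).
TOOL_CANDIDATES: Dict[str, Sequence[str]] = {
    "ImageMagick": ("magick",),
    "Ghostscript": ("gs",),
    "LibreOffice": ("libreoffice", "soffice"),
}


def find_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool to the first matching executable path, or None."""
    found: Dict[str, Optional[str]] = {}
    for tool, names in TOOL_CANDIDATES.items():
        # Try each candidate name in order and keep the first hit.
        found[tool] = next((path for path in map(which, names) if path), None)
    return found
```

Calling `find_tools()` returns a dictionary whose `None` values mark tools that were not found on PATH. The `which` parameter exists only so the lookup can be replaced in tests.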
macOS:
brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps
Linux (Ubuntu/Debian):
sudo apt-get install imagemagick ghostscript libreoffice python3-pip
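make install-deps
Important: Convert automatic numbering to fixed text before processing.
Automatic numbering in Word and LibreOffice is generated when the document is rendered rather than stored in the paragraph text, so the converter never sees those section numbers and the affected sections go missing from the output.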
Quick Fix:
- LibreOffice: Select All → Format → Lists → No List → Save
- Word: Select All → Ctrl+Shift+N → Numbering → None → Save
See DOCUMENT_PREPARATION_GUIDE.md for detailed instructions.
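To check whether a document still contains automatic numbering before building, a quick scan of the DOCX XML can help. The sketch below is an assumption-laden illustration using only the standard library: it counts paragraphs that carry a direct `<w:numPr>` element in word/document.xml. Numbering inherited from a paragraph style is not detected, so a count of zero does not guarantee the document is clean. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_auto_numbered_paragraphs(docx_path):
    """Count paragraphs with direct list numbering (<w:pPr>/<w:numPr>)."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    count = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # numPr sits inside the paragraph properties element when present.
        if paragraph.find(f"{{{W_NS}}}pPr/{{{W_NS}}}numPr") is not None:
            count += 1
    return count
```

If `count_auto_numbered_paragraphs("original-book.docx")` returns a non-zero value, the document probably still needs the conversion described above.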
make build # Build book content to export/
make rebuild-all # Clean and rebuild from scratch
make clean # Remove generated files
make check-deps # Verify dependencies installed
make verify # Check image integrity
make status # Show project status
make stats # Display content statistics
If your document has known numbering inconsistencies, create conf/exceptions.conf:
# Format: wrong_number = correct_number
10.7.7 = 10.7.5
21.4.3 = 21.2.3
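For readers who want to see how a file in this format could be consumed, here is a minimal parsing sketch. It assumes `#` starts a comment and that each remaining line has the form `wrong_number = correct_number`; the function names are hypothetical and the tool's own parser may behave differently (for example regarding inline comments).

```python
def parse_exceptions(text):
    """Parse 'wrong_number = correct_number' lines into a dict."""
    mapping = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        # Drop anything after '#' and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        wrong, sep, correct = line.partition("=")
        if not sep or not wrong.strip() or not correct.strip():
            raise ValueError(
                f"line {line_no}: expected 'wrong = correct', got {raw!r}"
            )
        mapping[wrong.strip()] = correct.strip()
    return mapping


def apply_exception(number, mapping):
    """Return the corrected section number, or the input if no rule matches."""
    return mapping.get(number, number)
```

With the example file above, `apply_exception("10.7.7", mapping)` would give `"10.7.5"`, while numbers without a rule pass through unchanged.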
# Check LibreOffice is accessible
libreoffice --version
# If not found on macOS
make setup-libreoffice
# Rebuild
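make rebuild-all
# Verify dependencies, then install any that are missing
make check-deps
make install-deps
# Remove generated files and build again
make clean
make build
- DOCUMENT_PREPARATION_GUIDE.md - Document preparation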
- WMF_CONVERSION_GUIDE.md - Image conversion guide
- MARKDOWN_GENERATION.md - Markdown output guide
- CONTRIBUTING.md - Contribution guidelines
GNU General Public License v3.0 (GPL-3.0) - See LICENSE file.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Run make verify to check integrity
- Submit a pull request