jy02140251/rag-document-loader

RAG Document Loader

Load and chunk documents for Retrieval-Augmented Generation (RAG).

Installation

npm install rag-document-loader

Supported Formats

  • PDF (.pdf)
  • Word (.docx)
  • HTML (.html)
  • Markdown (.md)
  • Text (.txt)
  • CSV (.csv)
  • JSON (.json)

Quick Start

import { DocumentLoader, RecursiveTextSplitter } from 'rag-document-loader';

// Load documents
const loader = new DocumentLoader();
const docs = await loader.load('./documents');

// Split into chunks
const splitter = new RecursiveTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const chunks = await splitter.split(docs);

// Each chunk has:
// - content: string
// - metadata: { source, page, type, ... }
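To make the interaction between `chunkSize` and `chunkOverlap` concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. It is not this library's implementation: the `Chunk` interface, the `chunkWithOverlap` function, and the `start` metadata field are hypothetical names used only for illustration, and sizes are counted in characters.

```typescript
// Hypothetical chunk shape mirroring the fields listed above; not the library's actual type.
interface Chunk {
  content: string;
  metadata: { source: string; [key: string]: unknown };
}

// Slide a window of `chunkSize` characters over the text, advancing by
// (chunkSize - chunkOverlap) so that consecutive chunks share `chunkOverlap` characters.
function chunkWithOverlap(
  text: string,
  source: string,
  chunkSize: number,
  chunkOverlap: number,
): Chunk[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in [0, chunkSize)');
  }

  const stride = chunkSize - chunkOverlap;
  const chunks: Chunk[] = [];

  for (let start = 0; start < text.length; start += stride) {
    chunks.push({
      content: text.slice(start, start + chunkSize),
      metadata: { source, start },
    });
    // Stop once the window has reached the end of the text,
    // so no trailing chunk consists purely of overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With `chunkSize: 1000` and `chunkOverlap: 200`, the window advances 800 characters at a time, so the last 200 characters of one chunk are repeated at the start of the next. That repetition helps keep sentences that straddle a boundary retrievable from at least one chunk.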

Chunking Strategies

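// Assumes these splitters are exported from the package root, like RecursiveTextSplitter above.
import {
  CharacterTextSplitter,
  TokenTextSplitter,
  SemanticTextSplitter,
  MarkdownHeaderSplitter,
} from 'rag-document-loader';

// By character count
const byCharacters = new CharacterTextSplitter({ chunkSize: 1000 });

// By token count (for LLMs)
const byTokens = new TokenTextSplitter({ chunkSize: 500, model: 'gpt-4' });

// By semantic similarity (openaiEmbeddings is an embeddings instance you supply)
const bySemantics = new SemanticTextSplitter({ embeddings: openaiEmbeddings });

// By Markdown headers
const byHeaders = new MarkdownHeaderSplitter();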
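As an illustration of the idea behind header-based splitting, the following self-contained sketch groups a Markdown string into sections, one per ATX heading (`#` through `######`). It is an assumption about how such a splitter could work, not the behavior of `MarkdownHeaderSplitter`; the `Section` interface and `splitByMarkdownHeaders` function are hypothetical names.

```typescript
// Hypothetical section shape, for illustration only.
interface Section {
  heading: string; // heading text, or '' for content before the first heading
  level: number;   // 1-6 for '#'..'######', 0 for the preamble
  content: string; // body text under the heading
}

function splitByMarkdownHeaders(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: '', level: 0, content: '' };
  let inFence = false;

  for (const line of markdown.split(/\r?\n/)) {
    // Track fenced code blocks so '#' comments inside code are not treated as headings.
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;

    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      // A new heading closes the section collected so far (if it has any substance).
      if (current.heading || current.content.trim()) sections.push(current);
      current = { heading: match[2].trim(), level: match[1].length, content: '' };
    } else {
      current.content += line + '\n';
    }
  }

  if (current.heading || current.content.trim()) sections.push(current);
  return sections;
}
```

Splitting on headings keeps each chunk aligned with a topic the author already delimited, which tends to suit structured documentation better than a fixed character count.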

Metadata Extraction

const loader = new DocumentLoader({
  // Extracts title, author, date, and keywords
  extractMetadata: true,
});
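The README does not say how these fields are obtained for each format. As one possible approach for Markdown files only, the sketch below reads `title`, `author`, `date`, and `keywords` from a simple `key: value` front-matter block delimited by `---` lines. The `BasicMetadata` interface and `extractFrontMatter` function are hypothetical, and this is not a full YAML parser.

```typescript
// Hypothetical metadata shape covering the four fields named above.
interface BasicMetadata {
  title?: string;
  author?: string;
  date?: string;
  keywords?: string[];
}

function extractFrontMatter(markdown: string): BasicMetadata {
  const meta: BasicMetadata = {};

  // Front matter must start on the first line and end with a closing '---' line.
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(markdown);
  if (!match) return meta;

  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(':');
    if (sep === -1) continue; // skip lines that are not 'key: value'

    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (!value) continue;

    switch (key) {
      case 'title':
        meta.title = value;
        break;
      case 'author':
        meta.author = value;
        break;
      case 'date':
        meta.date = value; // kept as a string; date parsing is left to the caller
        break;
      case 'keywords':
        // Comma-separated list, e.g. 'rag, chunking'
        meta.keywords = value.split(',').map((k) => k.trim()).filter(Boolean);
        break;
    }
  }
  return meta;
}
```

Other formats would need their own strategy (for example, document properties in PDF or DOCX files), which this sketch does not attempt.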

License

MIT

About

Load documents for RAG pipelines: PDF, DOCX, HTML, Markdown. Smart chunking, metadata extraction. LangChain compatible.
