Run the system using the scripts in /scripts. The corresponding implementation code is in the files under /src.
Process raw data from three different sources and create chunks:
python scripts/preprocess_data.py \
--pubmed_path "data/BioASQ/corpus_subset.json" \
--openfda_path "data/OpenFDA Drug data/OpenFDA_corpus.json" \
--kaggle_path "data/kaggle_drug_data/processed/extracted_docs.json" \
--output_dir data/processed \
--max_chunk_size 512 \
--overlap 50Output:
data/processed/documents.jsonl- Processed documentsdata/processed/chunks.jsonl- Document chunksdata/processed/drug_mapping.json- Drug name mappingsdata/processed/preprocessing_stats.json- Statistics
Build hybrid index (vector + BM25) for retrieval:
python scripts/build_index.py \
--chunks_path data/processed_for_our_rag/chunks.jsonl \
--embedding_model pritamdeka/S-PubMedBert-MS-MARCO \
--output_dir data/indicesOutput:
data/indices/bm25_index.pkl- BM25 indexdata/indices/index_metadata.json- Index metadatadata/vector_db/- Qdrant vector database
Test retrieval without generation:
python scripts/query.py \
--query "What are the side effects of aspirin?" \
--top_k 5 \
--fusion_method rrf \
--reranker_kind crossencoder \
--rerank_top_n 30 \
--cross_model cross-encoder/ms-marco-MiniLM-L-6-v2Options:
--query: Query text (required)--top_k: Number of results to return (default: 5)--fusion_method: 'rrf' or 'weighted' (default: 'rrf')--vector_weight: Weight for vector search (for weighted fusion)--bm25_weight: Weight for BM25 search (for weighted fusion)--output: Save results to JSON file--reranker_kind: none | simple | crossencoder. simple: cosine on current embedder (fast, no extra deps). crossencoder: pairwise Cross-Encoder scoring (best accuracy).--rerank_top_n: Size of candidate pool to rerank. Typical range: 20–50.--cross_model: encode model from Hugging Face or local
Run the complete RAG system with answer generation:
Using Template Generator (No LLM required):
python scripts/rag.py \
--query "What are the side effects of aspirin?" \
--top_k 5Using OpenAI (requires API key):
python scripts/rag.py \
--query "What is diabetes?" \
--use_llm \
--model_type openai \
--model_name gpt-3.5-turbo \
--api_key YOUR_API_KEY \
--reranker_kind crossencoder \
--cross_model cross-encoder/ms-marco-MiniLM-L-6-v2Using Anthropic Claude:
python scripts/rag.py \
--query "Treatment for hypertension" \
--use_llm \
--model_type anthropic \
--model_name claude-3-sonnet-20240229 \
--api_key YOUR_API_KEY \
--reranker_kind crossencoder \
--cross_model cross-encoder/ms-marco-MiniLM-L-6-v2Options:
--query: Query text (required)--use_llm: Use LLM for generation (otherwise use template)--model_type: 'openai', 'anthropic', 'huggingface', 'local', or 'template'--model_name: Model identifier--api_key: API key for cloud LLM services--temperature: Sampling temperature (default: 0.7)--max_tokens: Maximum tokens in response (default: 500)--top_k: Number of documents to retrieve (default: 5)--fusion_method: 'rrf' or 'weighted' (default: 'rrf')--reranker_kind: none | simple | crossencoder.--rerank_top_n: Size of candidate pool to rerank. Typical range: 20–50.--cross_model: encode model from Hugging Face or local--output: Save results to JSON file--verbose: Show detailed retrieval results
Evaluate the RAG system on test datasets and compare with baseline results:
python evaluation/comprehensive_evaluation.pyOutput Files:
results/openfda_rag_test_results.json- RAG system results for OpenFDAresults/kaggle_rag_test_results.json- RAG system results for Kaggleresults/bioasq_rag_test_results.json- RAG system results for BioASQ (if run separately)results/comprehensive_evaluation.json- Complete evaluation reportresults/comprehensive_evaluation_report.md- Markdown summary report
Run BioASQ Evaluation Separately: To run BioASQ evaluation separately (it has 200 queries and may take longer):
python evaluation/run_bioasq_evaluation.pyThis will generate results/bioasq_rag_test_results.json which will be automatically used by the comprehensive evaluation script.
Note: The evaluation script automatically runs RAG system evaluation if result files don't exist. To force re-evaluation, delete the existing result files in the results/ directory.
python baselines/faiss/FAISS_BioASQ.pypython baselines/faiss/openfda_faiss_test.pypython baselines/Qdrant/Qdrant_test.py