# Web Info Chatbot
The Web Info Chatbot is a Streamlit-based application that allows users to extract and query information from websites using AI-powered embeddings. The application:
- Scrapes website content, including JavaScript-rendered pages
- Generates text embeddings
- Stores embeddings locally in FAISS for efficient similarity search
- Provides a chat interface for querying website information
## Features
- Website Crawling: Extracts text from web pages, skipping login/signup pages.
- JavaScript Support: Uses Playwright to scrape JavaScript-rendered content.
- Embeddings Storage: Uses FAISS to store and retrieve website content efficiently.
- Chat Interface: Users can ask questions about the scraped website content.
- Session Management: Retains embeddings across interactions and supports reset functionality.
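The crawling behaviour above — skipping login/signup pages and keeping only visible text — can be sketched with stdlib-only code. The real app uses Playwright to render JavaScript and BeautifulSoup to parse HTML; the function and keyword names below are illustrative, not taken from the repo:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical skip list; the repo's actual filter may differ.
SKIP_KEYWORDS = ("login", "signin", "sign-in", "signup", "sign-up", "register")

def should_skip(url: str) -> bool:
    """Return True for login/signup-style pages that carry no useful content."""
    path = urlparse(url).path.lower()
    return any(keyword in path for keyword in SKIP_KEYWORDS)

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Flatten a page's markup into the plain text that gets embedded."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In the real pipeline, Playwright would fetch and render each page first; this sketch only covers what happens to the HTML once it is in hand.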
## Tech Stack
- Python (Backend)
- Streamlit (User Interface)
- Playwright (Web Scraping)
- BeautifulSoup (HTML Parsing)
- Ollama Mistral (Text Embeddings)
- FAISS (Vector Database)
- Asyncio (Asynchronous Programming)
- FastAPI (API)
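FAISS's role in the stack above is nearest-neighbour search over embedding vectors. A minimal pure-Python stand-in makes the idea concrete — the class and method names here are toy illustrations, not the FAISS API, and FAISS replaces this brute-force scan with indexed search at scale:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class VectorStore:
    """Toy in-memory vector index; FAISS plays this role in the app."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    def add(self, vector, text):
        """Store one embedded chunk of scraped page text."""
        self._vectors.append(vector)
        self._texts.append(text)

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query embedding."""
        scored = sorted(
            ((text, cosine_similarity(query, vec))
             for text, vec in zip(self._texts, self._vectors)),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return scored[:k]
```

In the app, the vectors come from Ollama's Mistral embeddings and the top-k matches are fed to the model as context for answering the user's question.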
## Installation
- Clone the repository:

  ```shell
  git clone https://github.com/sidjmishra/Scraper-GPT.git
  cd scraper-gpt
  ```

- Create a virtual environment and activate it:

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows use: venv\Scripts\activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Install Playwright and its browsers:

  ```shell
  playwright install
  ```
## Usage
Start the FastAPI application using:

```shell
uvicorn scraper_bot.app:app --port 8000
```

Start the Streamlit application using:

```shell
streamlit run scraper_bot/ui.py
```

- Enter a website URL in the input field and click "Submit".
- The app will scrape and process the content, showing a progress bar.
- Once completed, a chat interface will appear where you can ask questions about the website.
- Click "Reset Application" to clear stored embeddings and restart the process.
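Since the Streamlit UI talks to the FastAPI backend started above, the app can also be queried programmatically. The sketch below builds such a request with the stdlib; note that the `/chat` route and the payload shape are assumptions for illustration, not a documented API of this repo:

```python
import json
from urllib import request

# Backend address taken from the uvicorn command above.
API_URL = "http://localhost:8000"

def build_query(question: str, endpoint: str = "/chat") -> request.Request:
    """Build the JSON POST request a client might send to the backend.

    The endpoint path and {"question": ...} payload are hypothetical.
    """
    payload = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        API_URL + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (with the backend running) would be:
#     request.urlopen(build_query("What does this site do?"))
```

Check the routes defined in `scraper_bot/app.py` for the actual endpoint names and request schema before using this against a running instance.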
## Screenshots
Here are some screenshots of the application in action:
## Limitations
- May not fully handle dynamically loaded content that requires user interaction.
- The scraper is limited to a set number of pages (default: 10).
- Works best on text-heavy websites.
## Future Improvements
- Support for more complex crawling (multi-page navigation based on links).
- Additional filtering options to exclude specific sections of a website.
- Integration with other embedding models.



