Web Info Chatbot

Overview

The Web Info Chatbot is a Streamlit-based application that allows users to extract and query information from websites using AI-powered embeddings. The application:

Scrapes website content, including JavaScript-rendered pages
Generates text embeddings
Stores embeddings locally in FAISS for efficient similarity search
Provides a chat interface for querying website information
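
The retrieval step can be pictured with a small pure-Python stand-in. This is a minimal sketch and not the project's code: it swaps FAISS for a brute-force nearest-neighbour search (the same kind of exact L2 search a flat FAISS index performs), and it uses hand-written 2-D vectors in place of real embeddings. The names `TinyVectorStore`, `add`, and `search` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class TinyVectorStore:
    """Exact nearest-neighbour search over (vector, text) pairs.

    Illustrative stand-in for a vector index; not the FAISS API.
    """

    def __init__(self):
        self.vectors = []
        self.texts = []

    def add(self, vector, text):
        """Store one embedding together with the text it was computed from."""
        self.vectors.append(list(vector))
        self.texts.append(text)

    def search(self, query, k=1):
        """Return the k stored texts closest to the query, nearest first."""
        scored = sorted(
            zip(self.vectors, self.texts),
            key=lambda pair: l2_distance(pair[0], query),
        )
        return [text for _, text in scored[:k]]


# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
store = TinyVectorStore()
store.add([1.0, 0.0], "Pricing page text")
store.add([0.0, 1.0], "Contact page text")
store.add([0.9, 0.1], "Plans and billing text")

print(store.search([1.0, 0.05], k=2))
```

In a chat flow of this kind, the user's question would be embedded with the same model as the pages, and the nearest stored texts would be passed to the language model as context.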

Features

Website Crawling: Extracts text from web pages, skipping login/signup pages.
JavaScript Support: Uses Playwright to scrape JavaScript-rendered content.
Embeddings Storage: Uses FAISS to store and retrieve website content efficiently.
Chat Interface: Users can ask questions about the scraped website content.
Session Management: Retains embeddings across interactions and supports reset functionality.
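
As an illustration of the login/signup skipping mentioned above, a crawler might filter URLs by path before fetching them. This is a sketch under assumptions: the keyword list and the function name `is_auth_page` are hypothetical, and the project's actual rules may differ.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the project's actual rules may differ.
AUTH_KEYWORDS = ("login", "signin", "sign-in", "signup", "sign-up", "register")


def is_auth_page(url):
    """Return True if any path segment of the URL contains an auth keyword."""
    path = urlparse(url).path.lower()
    segments = [segment for segment in path.split("/") if segment]
    return any(
        keyword in segment
        for segment in segments
        for keyword in AUTH_KEYWORDS
    )


urls = [
    "https://example.com/login",
    "https://example.com/account/Sign-Up",
    "https://example.com/blog/post",
]
# Keep only the pages that do not look like login or signup forms.
print([url for url in urls if not is_auth_page(url)])
```

Substring matching is deliberately loose, so a path such as `/blog/registering-domains` would also be skipped; matching whole segments instead would trade those false positives for missed pages like `/user-login`.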

Technologies Used

Python (Backend)
Streamlit (User Interface)
Playwright (Web Scraping)
BeautifulSoup (HTML Parsing)
Ollama Mistral (Text Embeddings)
FAISS (Vector Database)
Asyncio (Asynchronous Programming)
FastAPI (API)

Installation

Clone the repository:

git clone https://github.com/sidjmishra/Scraper-GPT.git
cd Scraper-GPT

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate   # On Windows use: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```
Install the browser binaries that Playwright needs:
```
playwright install
```

Running the Application

Start the FastAPI application using:

uvicorn scraper_bot.app:app --port 8000

In a second terminal (the FastAPI server keeps running in the first), start the Streamlit application using:

streamlit run scraper_bot/ui.py
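
Both processes need to run at the same time. One possible way to start them from a single script is sketched below. It assumes `uvicorn` and `streamlit` are installed in the active environment and that the script is run from the repository root; `build_commands` and `run_both` are hypothetical helper names, not part of the project.

```python
import subprocess
import sys


def build_commands(port=8000):
    """Return the two launch commands from this section as argument lists."""
    api = [sys.executable, "-m", "uvicorn", "scraper_bot.app:app", "--port", str(port)]
    ui = [sys.executable, "-m", "streamlit", "run", "scraper_bot/ui.py"]
    return api, ui


def run_both(port=8000):
    """Start the API and the UI, then wait until both exit or Ctrl+C is pressed."""
    processes = [subprocess.Popen(command) for command in build_commands(port)]
    try:
        for process in processes:
            process.wait()
    except KeyboardInterrupt:
        pass
    finally:
        # Stop whichever process is still running.
        for process in processes:
            if process.poll() is None:
                process.terminate()
```

Calling `run_both()` launches the two commands with the same interpreter that runs the script, so they use the packages from the activated virtual environment.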

How to Use

Enter a website URL in the input field and click "Submit".
The app will scrape and process the content, showing a progress bar.
Once completed, a chat interface will appear where you can ask questions about the website.
Click "Reset Application" to clear stored embeddings and restart the process.

Output Screenshots

Here are some screenshots of the application in action:

Website Input Page

Loading State

Chat Interface (Question 1)

Chat Interface (Question 2)

Known Issues & Limitations

Content that loads only after user interaction (for example clicks, scrolling, or form input) may not be captured.
The scraper is limited to a set number of pages (default: 10).
Works best on text-heavy websites.
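
To make the page limit concrete, here is a minimal sketch of how a cap like this could be enforced in a breadth-first crawl. It is an assumption about the general technique, not the project's implementation: `crawl` and `get_links` are hypothetical names, and a dictionary stands in for a real website so the example runs offline.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=10):
    """Breadth-first crawl that stops after max_pages distinct pages.

    get_links(url) is a caller-supplied function returning the URLs linked
    from a page; a real scraper would fetch and parse the page here.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Queue each URL once, even if several pages link to it.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for a real site.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": [],
}

print(crawl("/", lambda url: site.get(url, []), max_pages=3))
```

With a cap of 3 the crawl visits the start page and its two direct links, then stops before reaching the deeper pages, which is why content far from the entered URL can be missing from the chat answers.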

Future Enhancements

Support for more complex crawling (multi-page navigation based on links).
Additional filtering options to exclude specific sections of a website.
Integration with other embedding models.
