
📰 Real-time Google News Scraper for Python. Extract headlines, sources, dates, and links via SERP API. No browser required.


Thordata/google-news-scraper-python


Google News Scraper for Python


Real-time Google News scraping via API. Extract headlines, sources, and dates instantly.
Powered by Thordata's high-speed SERP infrastructure.



🎯 Quick Start: AI News Briefing

Get the latest AI industry news with one command!

# One command to get latest AI news
python main.py --ai-brief

# Get AI breakthroughs only
python main.py --ai-breakthroughs --limit 10

# Export to CSV
python main.py --ai-brief --format csv --limit 30

This feature automatically searches multiple AI-related keywords and combines the results into a comprehensive briefing. Perfect for staying updated on the latest AI developments!
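The multi-keyword merge described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the keyword list, the `build_briefing` name, and deduplication by `link` are all assumptions, and `search` stands in for any function that returns a list of article dicts.

```python
from typing import Callable, Iterable

# Hypothetical keyword list; the keywords the project really uses are not documented here.
AI_KEYWORDS = ["artificial intelligence", "machine learning", "large language models"]


def build_briefing(
    search: Callable[[str], list],
    keywords: Iterable[str] = AI_KEYWORDS,
    limit: int = 20,
) -> list:
    """Run one search per keyword and merge the results, dropping duplicate links."""
    seen_links = set()
    briefing = []
    for keyword in keywords:
        for article in search(keyword):
            link = article.get("link")
            if not link or link in seen_links:
                continue  # already collected under another keyword, or unusable
            seen_links.add(link)
            briefing.append(article)
            if len(briefing) >= limit:
                return briefing
    return briefing
```

Deduplicating on the article link keeps the first occurrence of a story that matches several keywords, so the combined briefing does not repeat it.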


⚡ Features

  • 🤖 AI News Briefing: One-command access to the latest AI industry news and breakthroughs.
  • 📰 Real-Time Data: Get the latest news as it happens; pass --no-cache to bypass the cache when freshness matters.
  • ⚡ Smart Caching: Automatic response caching (5-minute TTL) for instant repeated queries.
  • 🔄 Auto Retry: Exponential backoff retries for reliable requests.
  • 📊 Progress Indicators: Visual feedback for long-running operations.
  • 🌍 Global Coverage: Support for any country (us, uk, jp, cn, etc.) and language.
  • 🚀 High Speed: Synchronous API responses (<3s on average); cached responses return in <0.1s.
  • 🧹 Clean Output: Automatically parses complex JSON into simple lists (JSON/CSV).
  • 🛡️ No Bans: Full proxy rotation and anti-bot handling managed by Thordata.
  • 🔧 Advanced API: Uses the latest SerpRequest and serp_search_advanced for better control.
  • 📱 Device Support: Specify device type (desktop, mobile, tablet) for different results.
  • 🌐 Language Control: Fine-tune language settings for localized results.

📦 Sample Output

[
  {
    "title": "OpenAI Announces GPT-5 with Revolutionary Capabilities",
    "source": "TechCrunch",
    "date": "2 hours ago",
    "snippet": "OpenAI has unveiled GPT-5, featuring unprecedented reasoning capabilities...",
    "link": "https://techcrunch.com/...",
    "thumbnail": "data:image/png;base64,..."
  },
  {
    "title": "Google DeepMind Breakthrough in Protein Folding",
    "source": "Nature",
    "date": "5 hours ago",
    "snippet": "New AI model predicts protein structures with 95% accuracy...",
    "link": "https://nature.com/...",
    "thumbnail": "data:image/png;base64,..."
  }
]
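As a small illustration of working with this output, the sketch below groups headlines by source. It assumes only the field names shown in the sample (`title`, `source`); the `headlines_by_source` helper is hypothetical and not part of the project.

```python
import json

# Two records in the shape of the sample output, inlined so the example runs as-is.
SAMPLE = """
[
  {"title": "OpenAI Announces GPT-5 with Revolutionary Capabilities",
   "source": "TechCrunch", "date": "2 hours ago"},
  {"title": "Google DeepMind Breakthrough in Protein Folding",
   "source": "Nature", "date": "5 hours ago"}
]
"""


def headlines_by_source(articles):
    """Group article titles under the source that published them."""
    grouped = {}
    for article in articles:
        source = article.get("source") or "unknown"
        grouped.setdefault(source, []).append(article.get("title", ""))
    return grouped


articles = json.loads(SAMPLE)
for source, titles in headlines_by_source(articles).items():
    print(f"{source}: {len(titles)} headline(s)")
```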

🚀 Installation & Setup

1. Get Your Token

Get your free scraping token from the Thordata Dashboard.

2. Install Dependencies

git clone https://github.com/Thordata/google-news-scraper-python.git
cd google-news-scraper-python
pip install -r requirements.txt

3. Configure

Copy .env.example to .env and fill in your token:

THORDATA_SCRAPER_TOKEN=your_token_here
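If you call the scraper from your own code, it can help to fail early when the token is missing. The sketch below assumes the variable is already present in the process environment (for example exported in your shell, or loaded from .env by the project); `load_token` is a hypothetical helper, not a function this repository is known to provide.

```python
import os
from typing import Mapping


def load_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Thordata token, or raise with a hint if it is not configured."""
    token = env.get("THORDATA_SCRAPER_TOKEN", "").strip()
    # Treat the placeholder from .env.example as "not configured".
    if not token or token == "your_token_here":
        raise RuntimeError(
            "THORDATA_SCRAPER_TOKEN is not set; copy .env.example to .env "
            "and fill in your token."
        )
    return token
```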

💡 Usage Examples

AI News Briefing (Featured!)

# Get comprehensive AI news briefing
python main.py --ai-brief

# Get AI breakthroughs only
python main.py --ai-breakthroughs --limit 15

# AI news with custom settings
python main.py --ai-brief --limit 50 --country uk --format csv

Basic Search

# Simple search
python main.py "Artificial Intelligence"

# Search with custom limit
python main.py "Crypto Market" --limit 50

Advanced Search

# Search with country and language
python main.py "Tesla News" --country uk --language en

# Search with device type
python main.py "AI Updates" --device mobile --no-cache

# Full example with all options
python main.py "Bitcoin Price" \
  --limit 100 \
  --country jp \
  --language ja \
  --device desktop \
  --format csv \
  --no-cache

📋 Command Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| query | Search topic (required unless using --ai-brief or --ai-breakthroughs) | - |
| --ai-brief | Get the latest AI industry news (one-command feature) | False |
| --ai-breakthroughs | Get the latest AI breakthroughs only | False |
| --limit | Maximum number of results | 20 |
| --country | Country code (us, uk, jp, cn, etc.) | us |
| --language | Language code (en, zh, ja, etc.) | Auto |
| --device | Device type (desktop, mobile, tablet) | Auto |
| --format | Output format (json, csv) | json |
| --no-cache | Bypass cache for fresh results | False |
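For readers who want to see how these arguments fit together, here is an argparse sketch that mirrors the table. It is an illustration only and may differ from the real main.py; in particular, using `None` to represent the "Auto" defaults is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Google News scraper (illustrative sketch)")
    parser.add_argument("query", nargs="?", help="Search topic")
    parser.add_argument("--ai-brief", action="store_true", help="Latest AI industry news")
    parser.add_argument("--ai-breakthroughs", action="store_true", help="AI breakthroughs only")
    parser.add_argument("--limit", type=int, default=20, help="Maximum number of results")
    parser.add_argument("--country", default="us", help="Country code (us, uk, jp, cn, ...)")
    # None stands in for the "Auto" default listed in the table (an assumption).
    parser.add_argument("--language", default=None, help="Language code (en, zh, ja, ...)")
    parser.add_argument("--device", choices=["desktop", "mobile", "tablet"], default=None)
    parser.add_argument("--format", choices=["json", "csv"], default="json")
    parser.add_argument("--no-cache", action="store_true", help="Bypass cache")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # A query is only optional when one of the AI briefing modes is selected.
    if not (args.query or args.ai_brief or args.ai_breakthroughs):
        parser.error("a query is required unless --ai-brief or --ai-breakthroughs is given")
    return args
```

For example, `parse_args(["Bitcoin Price", "--limit", "100", "--format", "csv"])` yields a namespace with `limit=100` and `format="csv"`, while the other options keep their defaults.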

🎨 Use Cases

1. Daily AI News Monitoring

# Run this daily to stay updated
python main.py --ai-brief --limit 30 --format csv

2. Research & Analysis

# Collect news for specific research topics
python main.py "machine learning research" --limit 100 --format csv

3. Market Intelligence

# Track industry news by country
python main.py "tech industry" --country us --limit 50
python main.py "tech industry" --country uk --limit 50

4. Content Aggregation

# Aggregate news from multiple sources
python main.py "climate change" --limit 50 --format json

5. Competitive Intelligence

# Monitor competitor news
python main.py "competitor name" --no-cache --limit 20

📁 Output Format

Results are saved to the output/ directory in your chosen format:

  • JSON: Structured data with all fields
  • CSV: Spreadsheet-friendly format

Each file is named based on your query: news_{query}.{format}
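The naming pattern can be sketched as below. The pattern `news_{query}.{format}` and the output/ directory come from this README; how the query is sanitized for the file system is not documented, so the rule used here (runs of non-alphanumeric characters become underscores) and the `output_path` helper are assumptions.

```python
import re
from pathlib import Path


def output_path(query: str, fmt: str = "json", out_dir: str = "output") -> Path:
    """Build output/news_{query}.{format} with a file-system-safe query slug."""
    # Assumed sanitization: collapse anything that is not a letter or digit into "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_") or "results"
    return Path(out_dir) / f"news_{slug}.{fmt}"


print(output_path("Bitcoin Price", "csv"))
```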


🔧 Advanced Configuration

Environment Variables

THORDATA_SCRAPER_TOKEN=your_token_here

Programmatic Usage

from src.scraper import GoogleNewsScraper
from src.ai_news import AINewsBriefing

# Basic search (with automatic caching)
scraper = GoogleNewsScraper()
results = scraper.search("AI", num=20, country="us")  # Cached for 5 minutes

# Bypass cache for fresh results
results = scraper.search("AI", num=20, no_cache=True)

# Clear cache manually
scraper.clear_cache()

# AI news briefing
ai_briefing = AINewsBriefing()
briefing = ai_briefing.get_latest_ai_news(num=30)

Performance Features

Caching:

  • Automatic caching of API responses
  • Default TTL: 5 minutes
  • Instant response for cached queries (<0.1s)
  • Manual cache control available
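The caching behavior listed above can be sketched as a small time-to-live cache. This is a minimal illustration under stated assumptions, not the scraper's internal cache: the `TTLCache` class and its method names are hypothetical, and only the 5-minute default comes from this README.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}  # key -> (expiry_time, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def clear(self):
        self._entries.clear()
```

Keying the cache on the full set of search parameters, for example `(query, country, num)`, keeps results for different countries or limits from colliding.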

Retry Mechanism:

  • Automatic retry on transient failures
  • Exponential backoff (1s, 2s, 4s delays)
  • Up to 3 retry attempts
  • Prevents cascading failures
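The retry schedule above (delays of 1s, 2s, and 4s across up to 3 retries) can be sketched as follows. The `with_retries` helper and the choice of which exceptions count as transient are assumptions made for illustration; the project's own retry code may differ.

```python
import time


def with_retries(
    func,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep=time.sleep,
    retry_on=(ConnectionError, TimeoutError),
):
    """Call func(), retrying transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s by default
            attempt += 1
```

Doubling the delay after each failure gives a struggling upstream service progressively more time to recover instead of hitting it again immediately.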

🌟 Why This Scraper?

Compared to Other Solutions

| Feature | This Scraper | Others |
| --- | --- | --- |
| AI News Briefing | ✅ One-command feature | ❌ Manual keyword setup |
| Smart Caching | ✅ Automatic (5min TTL) | ❌ No caching |
| Auto Retry | ✅ Exponential backoff | ⚠️ Single attempt |
| Progress Indicators | ✅ Visual feedback | ❌ No feedback |
| Real-time Data | ✅ <3s response, <0.1s cached | ⚠️ Varies |
| No Bans | ✅ Managed by Thordata | ⚠️ Risk of blocking |
| Global Coverage | ✅ 195+ countries | ⚠️ Limited |
| Easy Setup | ✅ 2 minutes | ⚠️ Complex |
| Output Formats | ✅ JSON + CSV | ⚠️ Limited |
| Error Handling | ✅ Robust with retries | ⚠️ Basic |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

MIT License. See LICENSE for details.


🙏 Acknowledgments

  • Powered by Thordata SERP API
  • Built with ❤️ by the Thordata Developer Team

📞 Support




📚 Additional Documentation


