This project showcases a simpler, more practical alternative to traditional RAG systems - demonstrating how modern search APIs combined with large context windows can eliminate the complexity of Retrieval-Augmented Generation for many documentation Q&A use cases.
As we enter 2025, there's growing evidence that search-first approaches are becoming more cost-effective and simpler than traditional RAG. With models like Gemini 2.5 Flash offering 5M token context windows at competitive prices, many developers are discovering: "Why build complex RAG pipelines when you can just search and load relevant content into context?"
This project provides a hands-on example of this approach - showcasing intelligent search with domain restrictions and organizational guardrails.
Perfect for organizations wanting to create internal knowledge assistants that stay within approved documentation boundaries without the overhead of traditional RAG infrastructure.
- 🎯 Smart Tool Selection: Automatically chooses between fast search and comprehensive scraping based on query needs
- 🔍 Domain-Restricted Search: Only searches approved organizational documentation websites
- 🧠 Web Scraping Fallback: Comprehensive page scraping when search results are insufficient
- 📝 Intelligent Summarization: Optional AI-powered result summarization reduces token usage by 60-80%
- 💰 Cost-Competitive: At $0.005-$0 8000 .075 per query, often cheaper than traditional RAG systems
- ⚡ Performance Optimized: Fast search for 90% of queries, deep scraping only when needed
- 🛡️ Data Security: No sensitive data sent to vector databases or training systems
- 📊 Transparent Sources: Every answer includes clear source attribution from official documentation
- 🔧 Easy Configuration: Simple CSV file controls which knowledge sources are accessible
- 💬 Conversation Memory: Maintains context across multiple questions in a session
- 🎮 Production Ready: FastAPI backend with proper error handling and logging
To configure which websites your agent can search, edit the sites_data.csv
file. This CSV defines your agent's knowledge boundaries and domains:
domain,site,description
AI Agent Frameworks,github.com/openai/swarm,OpenAI Swarm documentation for lightweight multi-agent orchestration
AI Operations,docs.agentops.ai,AgentOps documentation for testing debugging and deploying AI agents and LLM apps
AI Data Frameworks,docs.llamaindex.ai,LlamaIndex documentation for building LLM-powered agents over your data
CSV Structure:
- domain: The subject area or topic (e.g., "AI Agents", "Web Development", "Machine Learning")
- site: The actual website domain to search (e.g., "docs.langchain.com", "docs.python.org")
- description: A clear explanation of what the site contains and when to use it
Pro Tip: The description is crucial - it's what the agent uses to decide whether a particular site will be helpful for answering a user's question. Be specific about what topics and types of information each site covers.
- Go to tavily.com and sign up for a free account
- Navigate to your dashboard or API section
- Find your API key in the dashboard
- Tavily offers a generous free tier with thousands of searches per month
- Visit ai.google.dev (Google AI Studio)
- Sign in with your Google account
- Click "Get API Key" or navigate to the API keys section
- Create a new project if needed
- Generate your API key
- Google's Gemini API includes a substantial free tier
After obtaining both keys, add them to your .env
file:
TAVILY_API_KEY=your_tavily_key_here
GOOGLE_API_KEY=your_google_key_here
Security Note: Keep these keys secure and never commit them to public repositories. Both services offer excellent free tiers suitable for development and small-scale production use.
# Clone the repository
git clone https://github.com/javiramos1/qagent.git
cd qagent
# Setup environment and install dependencies
make install
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your API keys
# Run the application
make run
# Clone the repository
git clone https://github.com/javiramos1/qagent.git
cd qagent
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your API keys
# Run with Docker Compose
make docker-run
GOOGLE_API_KEY=your_google_api_key_here # Get from Google Cloud Console
TAVILY_API_KEY=your_tavily_api_key_here # Get from Tavily.com
# Search Configuration
MAX_RESULTS=10 # Maximum search results per query
SEARCH_DEPTH=basic # Search depth: basic or advanced
MAX_CONTENT_SIZE=100000 # Maximum content size per result
MAX_SCRAPE_LENGTH=10000 # Maximum content length for web scraping (characters)
ENABLE_SEARCH_SUMMARIZATION=false # Enable AI summarization of search results (reduces tokens 60-80%)
# LLM Configuration
LLM_TEMPERATURE=0.1 # Response creativity (0.0-1.0)
LLM_MAX_TOKENS=10000 # Maximum response length
# Timeout Configuration
REQUEST_TIMEOUT=30 # Request timeout in seconds
LLM_TIMEOUT=60 # LLM response timeout in seconds
# Web Scraping Configuration
USER_AGENT=QAgent/1.0 (Educational Search-First Q&A Agent) # Identifies your requests (prevents warnings)
Our analysis reveals that search-first approaches are now cost-competitive or even cheaper than traditional RAG systems:
# Fair comparison: Same model (Gemini 2.0 Flash), same token usage
# Search-First Approach (this project)
search_cost = $0.075 # 1M tokens input + 1K output
# No additional infrastructure needed
# Traditional RAG Approach
rag_llm_cost = $0.075 # Same LLM costs as search-first
rag_overhead = $0.002 # Embeddings + vector DB queries
rag_infrastructure = $0.001 # Hosting, maintenance, pipelines
total_rag_cost = $0.078 # 4% MORE expensive than search-first!
# Ultra-affordable option
gemini_lite_cost = $0.005 # 128K context with Gemini 2.0 Flash-Lite
- Gemini 2.0 Flash-Lite: $0.005 per query - 15x cheaper than RAG
- Gemini 2.0 Flash: $0.075 per query - same cost as RAG but no infrastructure
- Search-first eliminates: Vector databases, embeddings, chunking, maintenance overhead
- Always fresh: No stale embeddings or index updates needed
Model | Context Window | Token Pricing | Best For |
---|---|---|---|
Gemini 2.0 Flash-Lite | 128K tokens | $0.0375/1M input | Most Q&A scenarios |
Gemini 2.0 Flash | 1M tokens | $0.075/1M input | Complex documentation |
Gemini 2.5 Flash Preview | 1M tokens | $0.15/1M input | Reasoning-heavy tasks |
Gemini 2.5 Pro | 5M tokens | $1.25/1M input | Enterprise analysis |
Traditional RAG | Variable | $0.077/query | Legacy systems only |
Search-First Architecture (This Project):
graph TD
A[User Query] --> B[Search API]
B --> C[Relevant Results]
C --> D[LLM with Context]
D --> E[Response]
style B fill:#ccffcc
style D fill:#cceeff
Traditional RAG Architecture:
graph TD
A[User Query] --> B[Embedding Model]
B --> C[Vector Database]
C --> D[Similarity Search]
D --> E[Chunk Retrieval]
E --> F[Context Assembly]
F --> G[LLM Processing]
G --> H[Response]
I[Document Ingestion] --> J[Chunking]
J --> K[Embedding Generation]
K --> L[Vector Storage]
L --> C
style C fill:#ffcccc
style J fill:#ffcccc
style K fill:#ffcccc
Recent research (2024-2025) shows that search-first approaches often outperform RAG:
- No "lost in the middle" issues - Search returns most relevant content first
- Better context relevance - Search algorithms optimize for query relevance
- Faster iteration - No embedding regeneration when documents change
- Simpler debugging - Easy to 8000 see what content was retrieved and why
🥇 Primary Approach: Search-First (This Project)
- ✅ Public documentation - Use search APIs with large context windows
- ✅ Internal wikis - Search across approved domains with guardrails
- ✅ Cost optimization - 15x cheaper with Gemini 2.0 Flash-Lite
- ✅ Simplicity - No vector databases or embedding maintenance
- ✅ Always current - Real-time search results
🥈 Fallback: Hybrid RAG-Search
- 🔄 Private enterprise data with strict access controls
- 🔄 Fine-grained permissions on document chunks
- 🔄 Offline scenarios where search APIs aren't available
🥉 Legacy: Traditional RAG
⚠️ Specialized use cases requiring complex document relationships⚠️ Ultra-high volume (>100K queries/day) where infrastructure costs amortize
The Verdict: Search-first approaches have fundamentally changed the game in 2025. This project demonstrates: Search + Large Context > RAG for most organizational knowledge systems. 🚀
The system uses a search-first approach with intelligent fallback to web scraping for comprehensive information retrieval:
graph TD
A[User Query] --> B[LangChain Agent]
B --> C{Analyze Query}
C --> D[Select Relevant Sites]
D --> E[Tavily Search API]
E --> F{Search Results Sufficient?}
F -->|Yes| G[Generate Response]
F -->|No| H[Web Scraping Tool]
H --> J[Extract Page Content]
J --> K[Combine Search + Scraped Data]
K --> G
G --> L[Response with Sources]
M[sites_data.csv] --> D
N[Domain Restrictions] --> E
N --> H
style A fill:#e1f5fe
style B fill:#f3e5f5
style E fill:#e8f5e8
style H fill:#fff3e0
style G fill:#e0f2f1
style L fill:#f1f8e9
classDef searchPath stroke:#4caf50,stroke-width:3px
classDef scrapePath stroke:#ff9800,stroke-width:3px
classDef decision stroke:#2196f3,stroke-width:3px
class E,F,G searchPath
class H,J,K scrapePath
class C,F decision
- Domain-Restricted Agent: LangChain agent that only searches approved knowledge sources
- Tavily Search Integration: Fast, targeted search within specific documentation websites
- Web Scraping Tool: Chromium-based scraping for comprehensive page content extraction
- Site Restrictions: CSV-configured domains ensure searches stay within organizational boundaries
- Cost Control: Intelligent tool selection minimizes expensive operations
- Primary: Fast Search - Uses Tavily API to quickly search within approved documentation websites
- Fallback: Deep Scraping - When search results are insufficient, automatically scrapes entire pages for comprehensive content
The agent follows a smart escalation strategy:
- Analyze Query: Determine relevant documentation sites based on technologies mentioned
- Search First: Use fast Tavily search within selected domains
- Evaluate Results: Assess if search provides sufficient information
- Scrape if Needed: Only scrape entire pages when search results are incomplete
- Comprehensive Response: Combine information from both sources for detailed answers
This system strategically uses Gemini 2.0 Flash (non-thinking model) instead of reasoning-heavy models like o3:
Aspect | Gemini Flash (Non-Thinking) | o3-style (Thinking Models) |
---|---|---|
Cost | $0.075/1M tokens | $15-60/1M tokens (200-800x more) |
Speed | 2-5 seconds | 15-60 seconds |
Token Usage | Minimal overhead | Heavy reasoning chains |
Suitability | Perfect for tool-based workflows | Overkill for structured tasks |
ReAct Framework Replaces Internal Reasoning:
Human Query → Agent Thinks → Selects Tool → Executes → Observes → Responds
↑ ↑ ↑ ↑ ↑ ↑
Input ReAct Logic Tool Selection Search Results Answer
Key Advantages:
- Cost-Effective Reasoning: ReAct provides structured thinking at 1/200th the cost
- Transparent Logic: Every reasoning step is visible and debuggable
- Tool-Optimized: Designed specifically for search + scraping workflows
- Faster Responses: No internal chain-of-thought overhead
- Easier Boundaries: Explicit tool constraints prevent hallucination
The agent provides intelligent two-tier information retrieval through a simple REST API:
Session Management: The API uses secure HTTP cookies to maintain separate conversation memory for each user. When you make your first request, a unique session ID (UUID) is automatically generated and stored in a secure cookie. Each session ID creates its own agent instance with isolated memory, so your conversation history never mixes with other users - even if they're using the API simultaneously.
POST /chat
- Send a question to the agent (automatically uses search + scraping as needed)POST /reset
- Reset conversation memoryGET /health
- Detailed health check with system status
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"message": "How do I create a LangChain agent with custom tools?"}'
Example Response:
{
"status": "success",
"response": "Based on the LangChain documentation, here's how to create a custom agent..."
}
Enable intelligent search result summarization to reduce token usage and improve performance:
# Enable summarization in your .env file
ENABLE_SEARCH_SUMMARIZATION=true
- ✅ 60-80% token reduction while preserving key information
- ✅ 2-3x faster processing with smaller contexts
- ✅ Lower costs especially for high-volume deployments
- ✅ Better focus on query-relevant information
- ✅ Automatic fallback if summarization fails
- High-volume scenarios (>1000 queries/day)
- Cost-sensitive deployments requiring maximum efficiency
- Long documentation pages with lots of boilerplate content
- Latency-critical applications where speed matters most
- Uses Gemini 2.0 Flash-Lite for ultra-fast, cheap summarization ($0.0375/1M tokens)
- Preserves technical details, code examples, and source URLs
- Intelligent prompt focuses on query relevance
- Graceful degradation if summarization fails
This design choice makes the system practical for production deployment while maintaining high answer quality through structured tool usage rather than expensive internal reasoning.
This project demonstrates organizational AI safety through multiple layers:
# In search_tool.py
# The agent selects which website domains to search based on the user's query
search_params = {
"query": query,
"include_domains": [site_info["site"] for site_info in sites_info], # e.g., ["docs.langchain.com"]
"max_results": max_results,
"search_depth": search_depth
}
- Agent must use search tool for every question
- Questions outside configured knowledge sources trigger rejection responses
- Clear user guidance about available knowledge areas
- Topic Domains (CSV 'domain' column): Used for categorization and user communication
- Website Domains (CSV 'site' column): Used for actual search restrictions in Tavily API
- ✅ No data leakage - searches only approved documentation websites
- ✅ No hallucination - responses based only on real documentation
- ✅ Audit trail - all searches are logged and traceable
- ✅ Easy updates - modify
sites_data.csv
to change knowledge scope - ✅ Cost control - limited search scope reduces API usage
- Employee onboarding guides and company handbooks
- HR policy documentation and benefits information
- Technical documentation and API references
- Process and procedure manuals
- Intranet search solutions - Direct search across internal sites
- Product documentation and user guides
- FAQ resources and troubleshooting guides
- API documentation and developer resources
- Release notes and changelog information
- Departmental wikis - Search across team-specific documentation
- Project documentation - Access to project specs, requirements, and status updates
- Compliance and regulatory - Search through policy documents and guidelines
- Training materials - Access to learning resources and certification guides
- Regulatory documentation and compliance frameworks
- Safety procedures and emergency protocols
- Audit requirements and reporting guidelines
- Legal documentation and contract templates
Key Advantage: All these use cases can be implemented with simple search approaches rather than complex RAG pipelines.
For internal documentation where Tavily API access is limited, adapt the system to use Elasticsearch.
Enterprise Deployment Benefits:
- ✅ Complete data control - All searches stay within corporate network
- ✅ Security compliance - No external API calls for sensitive documents
- ✅ Unified search - Same agent interface for internal and external docs
- ✅ Permission integration - Leverage existing Elasticsearch security
- ✅ Cost predictability - No per-query API costs for internal searches
Migration Path: Start with Tavily for public documentation, add Elasticsearch for internal content as needed.
This project demonstrates how organizations can:
- ✅ Implement AI Guardrails - Prevent unauthorized knowledge access
- ✅ Create Safe AI Assistants - Domain-restricted organizational tools
- ✅ Use Search-First Architecture - Simpler alternative to RAG systems
- ✅ Build LangChain Agents - Structured chat agents with tools and constraints
- ✅ Deploy Production AI - FastAPI, Docker, and monitoring
- ✅ Manage AI Knowledge Scope - Configuration-driven domain control
- ✅ Ensure Response Reliability - Force tool usage to prevent hallucination
make help # Show all available commands
make install # Setup virtual environment and dependencies
make run # Run the application locally
make test # Run tests
make clean # Clean up temporary files
make docker-build # Build Docker image
make docker-run # Run with docker-compose
make docker-stop # Stop docker-compose services
make format # Format code with black
make lint # Run linting checks
-
Setup Development Environment
make install make dev-install # Install development dependencies
-
Make Changes
# Edit code make format # Format code make lint # Check code quality
-
Test Changes
make test # Run tests make run # Test locally
-
Docker Testing
make docker-build make docker-run make docker-logs # View logs
qagent/
├── main.py # FastAPI application entry point
├── qa_agent.py # Core Q&A agent implementation
├── search_tool.py # Tavily search tool implementation
├── scraping_tool.py # Web scraping tool implementation
├── sites_data.csv # Domain configuration
├── requirements.txt # Python dependencies
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose setup
├── Makefile # Development commands
├── .env.example # Environment variables template
├── .gitignore # Git ignore rules
└── README.md # This file
-
API Key Errors
- Ensure
.env
file exists with valid API keys - Check API key permissions and quotas
- Ensure
-
Import Errors
- Activate virtual environment:
source qagent_venv/bin/activate
- Install dependencies:
make install
- Activate virtual environment:
-
Docker Issues
- Ensure Docker is running
- Check port 8000 is available
- View logs:
make docker-logs
-
Search Not Working
- Verify domain configuration in
sites_data.csv
- Check Tavily API key and quota
- Verify domain configuration in
- Check the FastAPI documentation
- Review LangChain documentation
- Examine the logs for error details
This project showcases the intelligent dual-tool approach that's reshaping AI knowledge systems in 2025. By combining fast search with smart scraping, we've created a system that's:
- Simpler than RAG: No vector databases, embeddings, or chunking complexity
- Cheaper than RAG: 15x more cost-effective with Gemini 2.0 Flash-Lite
- More reliable: Official documentation sources with complete transparency
- Always current: Real-time search without stale embedding issues
- Production-ready: Built-in guardrails and organizational safety controls
- Quick Search: Instant results for 90% of queries via Tavily API
- Deep Scraping: Comprehensive extraction when search isn't enough
- Complete Transparency: Every answer traced to official documentation
- Zero Hallucination: Forced tool usage prevents made-up responses
- Organizational Control: CSV-driven knowledge boundaries
Perfect for: Internal knowledge assistants, customer support bots, technical documentation systems, and any scenario requiring reliable, traceable AI responses within defined knowledge boundaries.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributions are welcome! This project follows the Apache 2.0 license terms:
- ✅ Fork and experiment with the codebase
- ✅ Submit pull requests for improvements
- ✅ Use in commercial projects (with proper attribution)
- ✅ Create derivative works while maintaining license compliance
- ✅ Educational use encouraged for learning search-first AI development
Please ensure any contributions maintain the educational focus and include proper documentation.
- LangChain - Framework for building applications with large language models
- Google Gemini - Advanced language model capabilities with affordable pricing
- Tavily - Web search API with domain restriction capabilities
- FastAPI - Modern, fast web framework for building APIs
Note: This is an educational project demonstrating search-first AI assistant development as a simpler alternative to traditional RAG systems. Feel free to adapt and extend for your organizational needs while respecting the Apache 2.0 license terms.