# TopicMarker-RAG

A powerful Retrieval-Augmented Generation (RAG) backend for generating comprehensive lesson plans and educational content. This application leverages crawl4ai for web scraping and Google's Gemini LLM for content generation.

## Features
- Generate structured topic hierarchies for lesson plans
- Create MDX content from web sources
- Refine content with LLM assistance
- Direct crawling-to-LLM pipeline
- Multiple refinement options with web crawling integration
## Prerequisites

- Python 3.8+
- pip
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/TopicMarker-RAG.git
   cd TopicMarker-RAG
   ```
2. Create and activate a virtual environment:

   For Linux/macOS:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

   For Windows:

   ```bash
   python -m venv venv
   venv\Scripts\activate
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   The `requirements.txt` file contains all the necessary dependencies for this project:

   ```
   fastapi
   uvicorn[standard]
   python-dotenv
   pydantic
   pydantic-settings
   pinecone
   google-generativeai
   crawl4ai
   duckduckgo-search
   requests
   googlesearch-python
   langchain
   langchain-community
   langchain-openai
   openai
   ```
4. Create a `.env` file in the root directory with the following variables:

   ```
   GEMINI_API_KEY=your_gemini_api_key
   PINECONE_API_KEY=your_pinecone_api_key
   PINECONE_ENVIRONMENT=your_pinecone_environment
   PINECONE_INDEX_NAME=your_pinecone_index_name
   ```
## Running the Server

Start the FastAPI server:

```bash
uvicorn app.main:app --reload
```

The API will be available at http://localhost:8000.
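A quick way to confirm the server is running is to probe the root endpoint. The sketch below uses only the Python standard library; the helper name `server_is_up` is illustrative, not part of this project:

```python
from urllib import request, error

BASE_URL = "http://localhost:8000"  # default address used by `uvicorn --reload`

def server_is_up(base_url: str = BASE_URL) -> bool:
    """Return True if the backend answers on the root endpoint."""
    try:
        with request.urlopen(base_url, timeout=5) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        # Connection refused, timeout, etc. -- treat all as "not reachable".
        return False

print("backend reachable:", server_is_up())
```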
## API Endpoints

### `GET /`

- Returns a welcome message: `{"message": "Welcome to the Lesson Plan RAG Backend!"}`
### `POST /rag/search-topics`

- Input: `{"query": "string", "limit": int}` (default limit: 2)
- Returns: A structured list of main topics and subtopics suitable for a lesson plan
- Example: `{"status": "success", "data": {"topics": [...]}}`
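With the server running, this endpoint can be exercised from Python using only the standard library. The `post_json` helper below is an illustrative client-side sketch, not part of this project:

```python
import json
from urllib import request

def post_json(path, payload, base_url="http://localhost:8000"):
    """POST a JSON payload to the backend and return the decoded JSON response."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Request body for /rag/search-topics (limit defaults to 2 server-side).
payload = {"query": "photosynthesis", "limit": 2}

# With the server running, you would call:
# result = post_json("/rag/search-topics", payload)
# topics = result["data"]["topics"]
print(json.dumps(payload))
```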
### `POST /rag/single-topic`

- Input: `{"selected_topic": "string", "main_topic": "string", "num_results": int}` (default num_results: 2)
- Returns: Comprehensive MDX content for a single topic
- Example: `{"status": "success", "data": {"mdx_content": "string", "crawled_websites": [...]}}`
### `POST /rag/single-topic-raw`

- Input: `{"selected_topic": "string", "main_topic": "string", "num_results": int}` (default num_results: 2)
- Returns: Raw MDX content as plain text (not JSON)
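Because the `-raw` endpoints return the MDX body as plain text rather than a JSON envelope, a client should read the response body directly instead of decoding JSON. A sketch under that assumption (the helper `post_for_text` is illustrative):

```python
import json
from urllib import request

def post_for_text(path, payload, base_url="http://localhost:8000"):
    """POST a JSON payload and return the response body as plain text.

    Intended for the -raw endpoints, which do not wrap output in JSON.
    """
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8")

# Example request body (field values are illustrative).
payload = {"selected_topic": "Cell respiration",
           "main_topic": "Biology",
           "num_results": 2}

# With the server running, you would call:
# mdx = post_for_text("/rag/single-topic-raw", payload)
# then write `mdx` straight to a .mdx file.
```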
### `POST /rag/generate-mdx-llm-only`

- Input: `{"selected_topic": "string", "main_topic": "string"}`
- Returns: MDX content generated using only LLM knowledge (no web crawling)
- Example: `{"status": "success", "data": {"mdx_content": "string"}}`
### `POST /rag/generate-mdx-llm-only-raw`

- Input: `{"selected_topic": "string", "main_topic": "string"}`
- Returns: Raw MDX content generated using only LLM knowledge, as plain text (not JSON)
### `POST /rag/generate-mdx-from-urls`

- Input: `{"urls": ["string"], "selected_topic": "string", "main_topic": "string", "topic": "string" (optional), "use_llm_knowledge": bool}`
  - `urls`: 1 to 5 URLs to crawl
  - `selected_topic`: the subtopic to focus on
  - `main_topic`: the main topic that the selected topic belongs to
- Returns: MDX content generated from multiple URLs
- Example: `{"status": "success", "urls": [...], "selected_topic": "string", "main_topic": "string", "mdx_content": "string"}`
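Since the endpoint accepts between 1 and 5 URLs, a client can validate the list before posting. The helper below is a hypothetical client-side sketch, not part of this project:

```python
def build_mdx_from_urls_payload(urls, selected_topic, main_topic,
                                use_llm_knowledge=False):
    """Validate and assemble the body for POST /rag/generate-mdx-from-urls.

    The endpoint accepts between 1 and 5 URLs, so reject bad input early.
    """
    if not 1 <= len(urls) <= 5:
        raise ValueError("urls must contain between 1 and 5 entries")
    return {
        "urls": list(urls),
        "selected_topic": selected_topic,
        "main_topic": main_topic,
        "use_llm_knowledge": use_llm_knowledge,
    }

payload = build_mdx_from_urls_payload(
    ["https://en.wikipedia.org/wiki/Photosynthesis"],
    selected_topic="Light reactions",
    main_topic="Photosynthesis",
)
print(payload["urls"])
```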
### `POST /rag/generate-mdx-from-urls-raw`

- Input: `{"urls": ["string"], "selected_topic": "string", "main_topic": "string", "topic": "string" (optional), "use_llm_knowledge": bool}`
- Returns: Raw MDX content as plain text (not JSON)
### `POST /rag/refine-with-selection`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string"}`
- Returns: Refined content using the LLM with the selected text and topic context
- Example: `{"status": "success", "data": {"answer": "string"}}`
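A request body for this endpoint can be assembled as below; all field values are illustrative. The `selected_text` is the passage of the MDX document to rework and `question` is the instruction for the LLM:

```python
# Assemble the body for POST /rag/refine-with-selection.
mdx_document = "# Photosynthesis\n\nPlants convert light into chemical energy."

payload = {
    "mdx": mdx_document,
    "selected_text": "Plants convert light into chemical energy.",
    "selected_topic": "Light reactions",
    "main_topic": "Photosynthesis",
    "question": "Expand this sentence with a short explanation of ATP and NADPH.",
}

# Sanity check: the selection should come from the document being refined.
assert payload["selected_text"] in payload["mdx"]
```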
### `POST /rag/refine-with-selection-raw`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string"}`
- Returns: Raw refined content as plain text (not JSON)
### `POST /rag/refine-with-crawling`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string", "num_results": int}` (default num_results: 2)
- Returns: Refined content produced by first crawling relevant websites and then using the LLM
- Example: `{"status": "success", "data": {"answer": "string", "crawled_websites": [...]}}`
### `POST /rag/refine-with-crawling-raw`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string", "num_results": int}` (default num_results: 2)
- Returns: Raw refined content as plain text (not JSON)
### `POST /rag/refine-with-urls`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string", "urls": ["string"]}`
- Returns: Refined content produced by crawling specific URLs provided by the user
- Example: `{"status": "success", "data": {"answer": "string", "crawled_websites": [...]}}`
### `POST /rag/refine-with-urls-raw`

- Input: `{"mdx": "string", "selected_text": "string", "selected_topic": "string", "main_topic": "string", "question": "string", "urls": ["string"]}`
- Returns: Raw refined content as plain text (not JSON)
## Testing

The project includes a comprehensive test suite organized in the `tests` directory:

- `unit/` - Unit tests for individual components
- `api/` - Tests for the API endpoints
- `integration/` - Integration tests
- `html/` - HTML test files for manual testing
To run all tests:

```bash
cd tests
python run_tests.py --all
```
To run specific test categories:

```bash
# Run only unit tests
python run_tests.py --unit

# Run only API tests (requires the API server to be running)
python run_tests.py --api

# Run only integration tests
python run_tests.py --integration
```
To test the API endpoints through a browser interface:

```bash
cd tests
python serve_test_page.py
```

Then open your browser and navigate to http://localhost:8080/test_api.html.
## Dependencies

- FastAPI - Web framework
- Uvicorn - ASGI server
- Pydantic - Data validation
- Pinecone - Vector database
- Google Generative AI (Gemini) - LLM
- crawl4ai - Web crawling
- duckduckgo-search - Web search
- googlesearch-python - Google search library
- LangChain - LLM framework
- OpenAI - Embeddings for vector search