A powerful command-line interface (CLI) tool designed to quickly scan a project directory, generate a clean, structured report of its contents (folder tree + text file content), and optionally pass this report to an LLM for analysis, rendering the result in a local web page.
- Project Structure: Generates a visual tree representation of the project directory.
- Text File Contents: Includes the full content of identifiable text files within the project.
- Intelligent Filtering: Automatically ignores common directories (`node_modules`, `.git`, `dist`, build/cache folders, virtual environments, etc.) and specific noisy files (`package-lock.json`, `.env`, lock files, etc.).
- Binary/Non-Text Exclusion: Skips binary files, images, archives, media, and other non-text formats (unless specific parsers are available, as for PDF and Word documents).
- PDF Scanning: Extracts text from PDF files using the `pdf-parse` Node.js library.
- Word Document Scanning (.docx): Extracts text from modern Microsoft Word documents (`.docx`) using the `mammoth` library.
- YouTube Transcript Fetching: Automatically detects YouTube links in `.txt` files, fetches the video transcript (without timestamps), and includes it in the summary directly after the link (see the detection sketch after this list).
- Optional LLM Integration: Pass the generated summary directly to an OpenAI-compatible LLM API for automated analysis using the `--llm` flag.
- Customizable Prompting: Use a template file (`--prompt`) to control the instructions given to the LLM, injecting the project summary using a special tag (`{{SUMMARY}}`).
- Configurable LLM Settings: Easily adjust the LLM `model` and `temperature` via command-line options.
- Secure API Key Handling: Loads your OpenAI API key securely from a `.env` file.
- Rich Web Rendering: When using LLM integration, the Markdown response from the model is beautifully rendered in a local web page.
- Automatic Browser Opening: The generated web page is automatically opened in your default browser.
- Clipboard Integration: Copies the generated report to your clipboard (default behavior when not using `--llm`, or explicitly with `--copy`).
- Modular Design: New functionalities (LLM processing, web rendering) are kept in separate files for better organization.
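For the YouTube transcript feature mentioned above, the first step is simply spotting YouTube URLs in a text file. A minimal detection sketch (illustrative only; the tool's actual pattern and transcript-fetching code may differ):

```js
// Illustrative sketch: find YouTube video IDs in a .txt file's content.
// The regex below is an assumption, not the tool's actual pattern.
const YOUTUBE_URL_RE =
  /https?:\/\/(?:www\.)?(?:youtube\.com\/watch\?v=|youtu\.be\/)([\w-]{11})/g;

function findYouTubeLinks(text) {
  // Return the list of 11-character video IDs found in the text
  return [...text.matchAll(YOUTUBE_URL_RE)].map((m) => m[1]);
}

console.log(findYouTubeLinks('See https://youtu.be/dQw4w9WgXcQ for details.'));
// -> [ 'dQw4w9WgXcQ' ]
```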
When working with large language models (LLMs) for tasks like code explanation, refactoring, debugging, or generating documentation, providing the necessary context about your codebase is crucial. Copying individual files and explaining the structure manually is tedious and often incomplete.
This tool simplifies that process significantly:
- Comprehensive Context: The generated report gives the LLM (or a human reviewer) both the "map" (folder structure) and the "details" (file contents) in one place.
- Reduced Noise: By intelligently ignoring irrelevant files and directories, the report focuses only on the relevant parts, reducing token usage for LLMs and improving context clarity.
- Structured Format: The output is formatted with clear separators, making it easier for models (and humans) to parse.
- Direct LLM Integration: The `--llm` flag automates the process of sending this context to an LLM, bypassing manual copy/paste steps and immediately providing the LLM's analysis in an easy-to-read format.
- Customizable Workflow: Tailor the LLM's task using a specific prompt template.
It's also highly useful for:
- Onboarding new team members by quickly sharing a project overview.
- Generating documentation outlines.
- Getting a bird's-eye view of an unfamiliar codebase.
- Preparing code for sharing or review.
This tool is a Node.js CLI application. You will need Node.js installed on your system to run it.
- Install Node.js: If you don't have Node.js installed, download and install it from the official website: nodejs.org. We recommend Node.js v18.0.0 or later for compatibility with newer features and libraries.

  You can verify your installation by opening a terminal and running:

  ```bash
  node -v
  npm -v
  ```

  Make sure the Node.js version is 18.0.0 or higher.
- Install the CLI Tool: Once Node.js and npm (Node Package Manager) are installed, you can install the package globally. This makes the `summarize` command available in your terminal from any directory.

  Option A: Install from local directory (recommended): If you have cloned or downloaded this repository, navigate to the project directory and run:

  ```bash
  # First, install dependencies
  npm install

  # Then, install the package globally
  npm install -g .
  # OR use npm link for development
  npm link
  ```

  Option B: Install from npm registry: If the package has been published to npm (not available yet), you can install it directly:

  ```bash
  npm install -g summarize-code-base
  ```

  Note for macOS/Linux users: You might need to use `sudo` if you encounter permission errors:

  ```bash
  sudo npm install -g .
  # OR
  sudo npm link
  ```

  Troubleshooting: If you encounter a "Cannot find module" error when running the `summarize` command, make sure you've installed the dependencies first with `npm install` before installing the package globally.
- Setup for LLM Integration (Optional): If you plan to use the `--llm` functionality with OpenAI, you need an API key.
  - Get your OpenAI API key from the OpenAI Platform API Keys page.
  - In the directory where you installed the `summarize-code-base` code (if you cloned it), or in your project's root directory where you might run the command from, create a file named `.env`.
  - Add your API key to this file like this:

    ```
    OPENAI_API_KEY=YOUR_ACTUAL_OPENAI_API_KEY_HERE
    ```

    Important: Replace `YOUR_ACTUAL_OPENAI_API_KEY_HERE` with your actual secret key.
  - Security: Ensure you do not commit your `.env` file to version control (e.g., add `.env` to your `.gitignore`).
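To confirm the key is visible to Node the same way the tool loads it (via `dotenv`), you can run a quick sanity check from the project directory; this hypothetical snippet is just a one-off check, not part of the tool:

```js
// check_key.js — run with `node check_key.js` in the directory containing .env
require('dotenv').config(); // loads variables from .env into process.env
console.log(process.env.OPENAI_API_KEY ? 'API key loaded' : 'API key missing');
```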
Navigate to the directory you want to summarize, or run the command specifying the target directory path.
The basic command requires the path to the project directory:
```bash
# Summarize the current directory (default behavior: console output + clipboard)
summarize .

# Summarize a different directory (default behavior: console output + clipboard)
summarize /path/to/your/project
```
When not using the `--llm` flag (the default), the generated report is printed to your console and automatically copied to your clipboard. Example output:
```text
Project Code Summarizer for 'my-project' starts...

--- Section 1: Folder Structure ---
my-project
├── documents
│   ├── report.docx
│   └── manual.pdf
├── mobile_app
│   └── main_view.swift
├── notes.txt
├── package-lock.json  # Ignored file name, but shown in structure
├── package.json
├── project_summary.js
├── scripts
│   └── data_processor.py
└── src
    ├── components
    │   └── Button.js
    └── utils
        └── helpers.js

--- Section 2: File Contents (8 files) ---

--- File: documents/report.docx ---
This is the content of the Word document.
It might contain various text elements.
--- End of File: documents/report.docx ---

--- File: documents/manual.pdf ---
This is the extracted text from the PDF.
PDFs can have complex layouts, but we get the text.
--- End of File: documents/manual.pdf ---

--- File: mobile_app/main_view.swift ---
import SwiftUI

struct MainView: View {
    var body: some View {
        Text("Hello, Swift!")
    }
}
--- End of File: mobile_app/main_view.swift ---

--- File: notes.txt ---
This is a simple text file.
It contains some notes for the project.
--- End of File: notes.txt ---

--- File: package.json ---
{
  "name": "my-project",
  "version": "1.0.0",
  "description": "An example project",
  ...
}
--- End of File: package.json ---

--- File: project_summary.js ---
#!/usr/bin/env node
const fs = require('fs').promises;
...
--- End of File: project_summary.js ---

--- File: scripts/data_processor.py ---
def process_data(data):
    # Imagine complex data processing here
    return data.upper()

print(process_data("sample input"))
--- End of File: scripts/data_processor.py ---

--- File: src/components/Button.js ---
import React from 'react';

const Button = ({ children }) => {
  return <button>{children}</button>;
};

export default Button;
--- End of File: src/components/Button.js ---

Project Code Summarizer for 'my-project' ends.

✅ Summary copied to clipboard!
```
Use the `--llm` flag to send the summary to the LLM for analysis and render the response in a browser.
```bash
# Summarize current directory and send to LLM (requires .env with OPENAI_API_KEY)
summarize . --llm

# Summarize a different directory and send to LLM
summarize /path/to/your/project --llm
```
When using `--llm`:

- The extensive summary is not printed to the console.
- The summary is injected into the prompt template.
- The prompt is sent to the OpenAI API.
- The LLM's response (expected Markdown) is converted to HTML.
- A simple local web server starts temporarily.
- The HTML report is automatically opened in your default browser.
- The raw summary is not copied to the clipboard by default (use `--copy` to force it).
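The last three steps can be pictured with a short sketch (illustrative only, not the actual `web_renderer.js` source; it assumes `marked` for Markdown-to-HTML conversion and the CommonJS build of the `open` package, both of which are described later in this README):

```js
// Sketch of a render-and-serve step: convert Markdown to HTML, serve it
// on a random free port, and open the default browser.
const http = require('http');
const { marked } = require('marked');
const open = require('open'); // CommonJS build (v8); newer versions are ESM-only

function renderAndServe(markdownResponse) {
  const html = `<!DOCTYPE html><html><body>${marked.parse(markdownResponse)}</body></html>`;
  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
    res.end(html);
  });
  server.listen(0, async () => {
    const { port } = server.address(); // port 0 = any available port
    await open(`http://localhost:${port}`);
  });
}
```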
Here is an example of applying this project summarizer to this very GitHub repo (recursive!) and displaying the summary website with diagrams and tables:
You can customize the LLM processing using the following options with the `--llm` flag:
- `--prompt <path>` (Alias: `-p`): Specify the path to a custom prompt template file. Defaults to `prompt_template.txt` in the current directory. The template should contain the placeholder `{{SUMMARY}}` where the project summary will be injected (see the example template after this list).

  ```bash
  summarize . --llm --prompt ./my-prompts/analysis-template.txt
  ```

- `--model <model_name>` (Alias: `-m`): Specify the OpenAI model to use. Defaults to `gpt-4o` (or `gpt-3.5-turbo` if `gpt-4o` is not available or preferred).

  ```bash
  summarize . --llm --model gpt-3.5-turbo
  ```

- `--temperature <value>` (Alias: `-t`): Set the temperature for the LLM response (a number between 0.0 and 2.0). Defaults to `0.7`.

  ```bash
  summarize . --llm --temperature 1.0
  ```

- `--copy` (Alias: `-c`): Force copying the raw generated summary to the clipboard even when using the `--llm` flag. By default, `--copy` is true when `--llm` is false, and false when `--llm` is true.

  ```bash
  summarize . --llm --copy   # Use LLM AND copy the raw summary to clipboard
  summarize . --no-copy      # Don't copy to clipboard (only print to console)
  ```
You can combine these options:
```bash
summarize /path/to/project --llm --model gpt-4o --temperature 0.5 --prompt my_template.txt --copy
```
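For reference, a template file passed via `--prompt` can be as simple as the following (an illustrative example, not the contents of the bundled `prompt_template.txt`):

```text
You are a senior engineer reviewing a codebase.
Analyze the project below, describe its architecture, and point out risks.

{{SUMMARY}}

Respond in Markdown, using headings and tables where helpful.
```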
- Entry Point (`index.js`): This is the main script executed. It uses `yargs` to parse all command-line arguments (`directory`, `--llm`, `--prompt`, etc.). It also loads environment variables from `.env` using `dotenv`.
- Summary Generation (`project_summary.js`): The `index.js` script calls the `generateProjectSummary` function from `project_summary.js`. This function traverses the specified directory, applies the ignore rules, collects text file content, and formats the output into a single large summary string. The function returns the string; it no longer prints or copies it itself.
- Conditional Output: Based on the presence of the `--llm` flag (see the flow sketch after this list):
  - If `--llm` is NOT used: The `index.js` script prints the generated summary string to the console and, if `clipboardy` is available and `--copy` is enabled, copies it to the clipboard (replicating the original behavior).
  - If `--llm` IS used:
    - `index.js` retrieves the `OPENAI_API_KEY` from environment variables.
    - `index.js` calls the `processWithLLM` function from `llm_processor.js`, passing the summary string and the LLM configuration options (prompt path, model, temperature, API key).
    - LLM Processing (`llm_processor.js`): This module reads the specified prompt template, replaces the `{{SUMMARY}}` placeholder with the generated summary, initializes the OpenAI client, makes a request to the OpenAI API, and returns the LLM's text response.
    - `index.js` receives the LLM response and calls the `renderAndServe` function from `web_renderer.js`.
    - Web Rendering (`web_renderer.js`): This module takes the LLM's Markdown response, converts it into HTML using `marked`, embeds it in a simple HTML template with basic styling, starts a temporary local HTTP server to serve this HTML content on an available port, and uses the `open` package to automatically open the server's URL in the user's default browser.
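Putting the pieces together, the top-level control flow can be sketched roughly like this (simplified; the function signatures are assumptions based on the description above, not the actual source):

```js
// Simplified sketch of index.js's control flow (argument parsing and
// error handling omitted).
require('dotenv').config();
const { generateProjectSummary } = require('./project_summary');
const { processWithLLM } = require('./llm_processor');
const { renderAndServe } = require('./web_renderer');

async function main(argv) {
  const summary = await generateProjectSummary(argv.directory);

  if (!argv.llm) {
    // Default path: print, and copy to clipboard unless --no-copy was given
    console.log(summary);
    if (argv.copy) {
      const { default: clipboardy } = await import('clipboardy'); // ESM-only package
      await clipboardy.write(summary);
    }
    return;
  }

  // LLM path: inject the summary into the prompt, query the API, render the result
  const response = await processWithLLM(summary, {
    promptPath: argv.prompt,
    model: argv.model,
    temperature: argv.temperature,
    apiKey: process.env.OPENAI_API_KEY,
  });
  await renderAndServe(response);
}
```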
The tool comes with built-in lists of common directories and files to ignore, defined within `project_summary.js`. These are designed to focus the summary on relevant codebase files.
- `IGNORED_DIRS`: Contains directories like `node_modules`, `.git`, `dist`, build/cache folders for various languages/frameworks, virtual environments (`venv`, `env`), etc.
- `IGNORED_FILES`: Contains specific file names like `package-lock.json`, `.env`, `.env.local`, and various lock files (`poetry.lock`, `yarn.lock`, `composer.lock`, etc.).
- `NON_TEXT_EXTENSIONS`: Contains file extensions for binary files, images, archives, media, databases, fonts, etc. (Note: PDF files are processed using the `pdf-parse` library.)
These lists are quite comprehensive and cover many typical project setups.
Note: Currently, the tool does not support custom ignore patterns via command-line arguments or configuration files. The built-in lists are used.
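Conceptually, the filtering amounts to simple set lookups during directory traversal. A sketch with abbreviated lists (the real lists in `project_summary.js` are much longer):

```js
// Illustrative filtering logic with abbreviated lists (not the actual source).
const path = require('path');

const IGNORED_DIRS = new Set(['node_modules', '.git', 'dist', 'venv', 'env']);
const IGNORED_FILES = new Set(['package-lock.json', '.env', 'yarn.lock']);
const NON_TEXT_EXTENSIONS = new Set(['.png', '.zip', '.mp4', '.sqlite', '.woff']);

// Skip entire directories so their contents are never visited
function shouldDescendInto(dirName) {
  return !IGNORED_DIRS.has(dirName);
}

// Skip noisy files and anything with a known binary extension
function shouldIncludeFile(fileName) {
  if (IGNORED_FILES.has(fileName)) return false;
  return !NON_TEXT_EXTENSIONS.has(path.extname(fileName).toLowerCase());
}
```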
Contributions are welcome! If you have suggestions for improvements, bug fixes, or want to add more file/directory patterns to the ignore lists, feel free to open an issue or submit a pull request.
- Fork the repository.
- Clone your fork:
  ```bash
  git clone https://github.com/TomHuynhSG/code-base-summarizer-llm.git
  ```
- Install dependencies:
  ```bash
  npm install
  ```
- Link the package for local testing:
  ```bash
  npm link
  ```
  (You can now use `summarize` in your terminal from any directory, pointing to your local code.)
- Make your changes.
- Test thoroughly.
- Commit your changes and push to your fork.
- Create a pull request to the original repository.
The tool processes PDF files using the `pdf-parse` Node.js library. This library is included as a project dependency and installed automatically when you run `npm install`. No separate Python or external tool installation is required for PDF handling.

When a PDF file is encountered, `pdf-parse` attempts to extract its text content. The extracted text is included in the summary report.
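In general, `pdf-parse` usage looks like the following minimal sketch (how the extraction step works in principle, not a copy of this tool's code):

```js
// Minimal pdf-parse usage: read a PDF from disk and extract its text.
const fs = require('fs').promises;
const pdfParse = require('pdf-parse');

async function extractPdfText(filePath) {
  const buffer = await fs.readFile(filePath);
  const { text } = await pdfParse(buffer); // also returns page count, metadata, etc.
  return text;
}
```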
The tool supports scanning modern Microsoft Word documents (`.docx` files). Text is extracted using the `mammoth` Node.js library. This library is included as a project dependency and installed automatically when you run `npm install`. No separate external tool installation is required for `.docx` handling.
Legacy `.doc` files are no longer supported.
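The equivalent `mammoth` call is similarly small (again a minimal sketch, not the tool's actual code):

```js
// Minimal mammoth usage: extract the raw text of a .docx file.
const mammoth = require('mammoth');

async function extractDocxText(filePath) {
  const { value } = await mammoth.extractRawText({ path: filePath });
  return value; // plain text; formatting is discarded
}
```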
Here are some planned features and potential future directions for the `summarize-code-base` tool:

- Custom Ignore Patterns: Allow users to specify additional files, directories, or patterns to ignore via command-line arguments or a configuration file (e.g., `.summarizerc`).
- Support for Other LLMs/APIs: Extend LLM integration to support models from providers other than OpenAI.
- Multiple Output Formats: Add options to output the summary or LLM response in different formats (e.g., JSON, pure Markdown file).
- Output to File: Implement an option to save the generated report or LLM response directly to a specified file.
- Integrate with Local LLM (Concept): Explore integration with local Large Language Models (LLMs).
- Enhance PDF Processing: Add more options for PDF processing, such as controlling the level of detail or focusing on specific parts of PDFs.
- Integrate with Vision LLM for Images (Concept): Investigate using local Vision-Language Models (VLMs) to analyze image files (currently ignored) and generate text descriptions.
- Progress Indicator: For large projects, add a visual indicator to show the scanning progress.
- Huynh Nguyen Minh Thong (Tom Huynh) - tomhuynhsg@gmail.com