⚙️ Setup

🤖 Automated Fuzzing Corpus Generation 📁

AutoCorpus is a tool backed by a large language model (LLM) for automatically generating corpus files for fuzzing.

AutoCorpus works best when generating corpus files that are based on natural language, such as JSON, XML, or other config files.

⚙️ Setup

System Requirements

AutoCorpus utilizes the Mistral-7B-Instruct-v0.2 model and, where possible, offloads processing to your system's GPU. It is recommended to run AutoCorpus on a machine with a minimum of 16GB of RAM and a dedicated Nvidia GPU with at least 4GB of memory. However, it can run on lower-spec machines, albeit at a significantly slower pace.

AutoCorpus has been tested on Windows 11; however, it should be compatible with Unix and other systems.

Dependencies

AutoCorpus requires Nvidia CUDA for enhanced LLM performance. Follow the steps below:

Ensure your Nvidia drivers are up to date: Nvidia Drivers
Install the appropriate dependencies from here
Validate CUDA installation by running the following command and receiving a prompt: python -c "import torch; print(torch.rand(2,3).cuda())"

Python dependencies can be found in the requirements.txt file:

pip install -r requirements.txt

AutoCorpus can then be installed using the ./setup.py script as below:

python -m pip install .

Running

AutoCorpus can generate corpus files via three different scenarios:

A Single Prompt

For example asking AutoCorpus to generate an XML file would be as follows:

AutoCorpus.exe -o "out" -p "xml file"

Existing Corpus File(s)

AutoCorpus can base new corpus files off existing ones.

AutoCorpus.exe -i "input_folder" -o "out"

Both Existing Corpus Files And a Prompt.

Generation can be run by using both an existing corpus and a prompt.

AutoCorpus.exe -i "input_folder" -o "out" -p "xml file"

Usage

usage: AutoCorpus [-h] [--input_folder INPUT_FOLDER] [--output_folder OUTPUT_FOLDER] [--number_of_corpus_files NUMBER_OF_CORPUS_FILES] [--prompt PROMPT]
                  [--size SIZE] [--verbose]

A tool for automatically generating initial fuzzing input corpus test cases

optional arguments:
  -h, --help            show this help message and exit
  --input_folder INPUT_FOLDER, -i INPUT_FOLDER
                        The input folder to base generated corpus files off. If no prompt is given, the folder needs at least 1 file.
  --output_folder OUTPUT_FOLDER, -o OUTPUT_FOLDER
                        The folder to save generated corpus files to (will default to input folder).
  --number_of_corpus_files NUMBER_OF_CORPUS_FILES, -n NUMBER_OF_CORPUS_FILES
                        The number of corpus files to generate
  --prompt PROMPT, -p PROMPT
                        A sentence defining what the corpus files are for. This helps steer generation.
  --size SIZE, -s SIZE  Max size of tokens created by the LLM
  --verbose, -v         Provides verbose outputs

Examples

JSON Corpus Generation

Generates 5 corpus files solely on the prompt complex json files with varying data.

AutoCorpus.exe -o "out" -p "complex json files with varying data"

[{"id": 1, "name": "John Doe", "age": 30, "city": "New York"},

{"id": 2, "name": "Jane Smith", "age": 28, "city": "Los Angeles"},

{"id": 3, "name": "Mike Johnson", "age": 35, "city": "Chicago"},

{"id": 4, "name": "Emma Watson", "age": 27, "city": "London"}]

AWK Config Corpus Generation

Creates an AWK config based on existing example awk configs in the ..\corpus\awk\ directory along with the prompt config file for busybox awk.

AutoCorpus.exe -i ..\corpus\awk\ -p "config file for busybox awk" -n 10 -s 700

```bash
#!/usr/bin/awk -f

BEGIN {
  FS="\t"
  if (ARGC != 3) {
    print "Usage: awk-script.awk <file> <field> <delimiter>"
    exit 1
  }
  print "Input file:", ARGV[1]
  print "Field to print:", ARGC[2]
  print "Delimiter:", ARGC[3]

  FILENAME = ARGV[1]
  if (open(FILENAME, "r")) {
    while ((getline line < FILENAME) > 0) {
      gsub(/[[:space:]]+/, "", line) # remove whitespaces
      split(line, fields, FS)
      for (i = 1; i <= NF; i++) {
        if (length(fields[i]) > 0 && fields[i] ~ /\Q"[\"']"ARGV[2]"\Q/ && i != ARGC[3]) {
          next
        }
        if (i == ARGC[3] || i == NF) {
          print fields[i]
          break
        }
      }
    }
    close(FILENAME)
  } else {
    print "Error opening file:", FILENAME
    exit 1
  }
}```

🤖 Mistral-7B-Instruct-v0.2

Behind the scenes AutoCorpus uses the Mistral-7B-Instruct-v0.2 model from The Mistral AI Team - see here. The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2. More can be found on the model here!.

7.24B params
Tensor type: BF16
32k context window (vs 8k context in v0.1)
Rope-theta = 1e6
No Sliding-Window Attention

🙏 Contributions

AutoCorpus is an open-source project and welcomes contributions from the community. If you would like to contribute to AutoCorpus, please follow these guidelines:

Fork the repository to your own GitHub account.
Create a new branch with a descriptive name for your contribution.
Make your changes and test them thoroughly.
Submit a pull request to the main repository, including a detailed description of your changes and any relevant documentation.
Wait for feedback from the maintainers and address any comments or suggestions (if any).
Once your changes have been reviewed and approved, they will be merged into the main repository.

⚖️ Code of Conduct

AutoCorpus follows the Contributor Covenant Code of Conduct. Please make sure to review and adhere to this code of conduct when contributing to AutoCorpus.

🐛 Bug Reports and Feature Requests

If you encounter a bug or have a suggestion for a new feature, please open an issue in the GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the proposed feature. Your feedback is valuable and will help improve AutoCorpus for everyone.

📜 License

GNU General Public License v3.0

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
AutoCorpus		AutoCorpus
LICENSE		LICENSE
README.md		README.md
logo.gif		logo.gif
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚙️ Setup

System Requirements

Dependencies

Running

A Single Prompt

Existing Corpus File(s)

Both Existing Corpus Files And a Prompt.

Usage

Examples

JSON Corpus Generation

AWK Config Corpus Generation

🤖 Mistral-7B-Instruct-v0.2

🙏 Contributions

⚖️ Code of Conduct

🐛 Bug Reports and Feature Requests

📜 License

About

Languages

License

user1342/AutoCorpus

Folders and files

Latest commit

History

Repository files navigation

⚙️ Setup

System Requirements

Dependencies

Running

A Single Prompt

Existing Corpus File(s)

Both Existing Corpus Files And a Prompt.

Usage

Examples

JSON Corpus Generation

AWK Config Corpus Generation

🤖 Mistral-7B-Instruct-v0.2

🙏 Contributions

⚖️ Code of Conduct

🐛 Bug Reports and Feature Requests

📜 License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages