This Python repository contains a set of scripts that allow you to scrape a website, clean the data, organize it, chunk it, and then vectorize it. The resulting vectors can be used for a variety of machine learning tasks, such as similarity search or clustering. Recently, a script was added to consume PDFs and add them to the training data as well.
**Note: these scripts consume the files they process** (only in the `websites` and `pdfs` directories).
`cleaner.py`: Downloads a website using wget, reads and cleans the HTML files using Beautiful Soup, and saves the resulting text files in a specified directory.
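A minimal sketch of what the cleaning pass might look like; the directory names (`websites` in, `cleaned` out) and the `html.parser` backend are assumptions, and the real script may differ:

```python
# Minimal sketch of the cleaning step (paths and parser are assumed).
import os
from bs4 import BeautifulSoup

SOURCE_DIR = "websites"   # where wget saved the HTML files (assumed)
OUTPUT_DIR = "cleaned"    # where the plain-text output goes (assumed)

os.makedirs(OUTPUT_DIR, exist_ok=True)
for name in os.listdir(SOURCE_DIR):
    if not name.endswith(".html"):
        continue
    path = os.path.join(SOURCE_DIR, name)
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")
    # Strip the markup down to visible text only.
    text = soup.get_text(separator="\n", strip=True)
    out_path = os.path.join(OUTPUT_DIR, name.replace(".html", ".txt"))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    os.remove(path)  # the scripts consume the files they process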
`chunker.py`: Splits the text files into smaller chunks using a recursive character-based text splitter. The resulting chunks are saved in a JSONL file.
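Recursive character splitting tries a list of separators from coarsest to finest and recurses on any piece that is still too long. A minimal sketch, with the separator list, the 1000-character chunk size, and the input path all assumed rather than taken from the script:

```python
# Minimal sketch of recursive character-based splitting (sizes assumed).
import jsonlines

SEPARATORS = ["\n\n", "\n", " "]  # coarsest separator first
CHUNK_SIZE = 1000                  # max characters per chunk (assumed)

def split_recursive(text, separators=SEPARATORS):
    if len(text) <= CHUNK_SIZE:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if not piece.strip():
            continue
        if len(piece) <= CHUNK_SIZE:
            chunks.append(piece)
        else:
            chunks.extend(split_recursive(piece, rest))
    return chunks

with open("cleaned/example.txt", encoding="utf-8") as f:
    chunks = split_recursive(f.read())

with jsonlines.open("train.jsonl", mode="a") as writer:
    for chunk in chunks:
        writer.write({"text": chunk})
```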
`vectorizor.py`: Loads the JSONL file, creates embeddings using OpenAI's text-embedding-ada-002 model, and indexes the embeddings using Pinecone. This script requires API keys to run.
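A sketch of the embed-and-index step against the SDK generation implied by the `pinecone-client` requirement (the pre-1.0 `openai` package and the v2 Pinecone client); the index name `my-index` and the metadata layout are assumptions:

```python
# Minimal sketch of embedding + indexing (legacy openai / pinecone-client APIs).
import os
import jsonlines
import openai
import pinecone
from tqdm import tqdm

openai.api_key = os.environ["OPENAI_API_KEY"]
pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])
index = pinecone.Index("my-index")  # assumed name; must be 1536 dimensions

with jsonlines.open("train.jsonl") as reader:
    for i, record in enumerate(tqdm(reader)):
        resp = openai.Embedding.create(
            input=record["text"],
            model="text-embedding-ada-002",
        )
        vector = resp["data"][0]["embedding"]  # 1536 floats for ada-002
        index.upsert(vectors=[(str(i), vector, {"text": record["text"]})])
```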
`pdf-muncher.py`: Processes every PDF in the `pdfs` folder and adds the content to train.jsonl for vectorization.
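A sketch of the PDF step; the requirements below don't name a PDF library, so `pypdf` here is an assumption, as are the paths and record layout:

```python
# Minimal sketch of PDF extraction (pypdf is an assumed dependency).
import os
import jsonlines
from pypdf import PdfReader

PDF_DIR = "pdfs"

with jsonlines.open("train.jsonl", mode="a") as writer:
    for name in os.listdir(PDF_DIR):
        if not name.endswith(".pdf"):
            continue
        path = os.path.join(PDF_DIR, name)
        reader = PdfReader(path)
        # Extract the text of each page as its own record.
        for page in reader.pages:
            text = page.extract_text() or ""
            if text.strip():
                writer.write({"text": text, "source": name})
        os.remove(path)  # the scripts consume the files they process
```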
- Python 3.x
- OpenAI API key
- Pinecone API key
- `bs4` Python library
- `jsonlines` Python library
- `tqdm` Python library
- `tiktoken` Python library
- `pinecone-client` Python library
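For reference, a `requirements.txt` covering the libraries above might look like the following; `beautifulsoup4` is the pip package that provides `bs4`, and `openai` is implied by the embedding step even though it isn't listed above:

```
beautifulsoup4
jsonlines
tqdm
tiktoken
pinecone-client
openai
```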
- Clone the repository and navigate to the project directory.
- Install the required Python libraries using `pip install -r requirements.txt`.
- Set up your OpenAI and Pinecone API keys.
- Download the website using the wget command: `wget -r -np -nd -A.html,.txt,.tmp -P websites https://www.linkedin.com/in/sean-stobo/`
- Run `python cleaner.py` to clean the downloaded website data. This flattens the directory structure into a list of HTML documents.
- Run `python chunker.py` to split the text files into smaller chunks. This outputs train.jsonl in the root directory.
- Run `python pdf-muncher.py` to convert the contents of the `/pdfs/` folder to a serialized train.jsonl file in the root directory.
- Run `python vectorizor.py` to create embeddings and index them using Pinecone. This vectorizes train.jsonl.

Note: Before running `vectorizor.py`, make sure to set up a Pinecone index with 1536 dimensions (see the sketch below).
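Creating that index with the v2 `pinecone-client` might look like this; the index name and `cosine` metric are assumptions, while the 1536 dimension matches text-embedding-ada-002's output:

```python
# One-time setup: create a 1536-dimension Pinecone index (v2 client API).
import os
import pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])

# Name and metric are assumptions; the dimension must match the embedding model.
if "my-index" not in pinecone.list_indexes():
    pinecone.create_index("my-index", dimension=1536, metric="cosine")
```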
- Choose a site to scrape.
- Observe the `websites` folder filling up with files.
- Run the cleaner script.
- Files are normalized and cleaned up.
- Run the chunker script to chunk the website files.
- Run the PDF muncher script to process the PDFs in the pdfs folder.
- Verify that train.jsonl contains the DnD content.
- Run the vectorizor script to update the Pinecone DB.