This Python repository contains a set of scripts that allow you to scrape a website, clean the data, organize it, chunk it, and then vectorize it. The resulting vectors can be used for a variety of machine learning tasks, such as similarity search or clustering. Recently, a script was added to consume PDFs and add them to the training data as well.
**Note: these scripts consume the files they process** (only in the `websites` and `pdfs` directories).
`cleaner.py`: Downloads a website using wget, reads and cleans the HTML files using Beautiful Soup, and saves the resulting text files in a specified directory.
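A minimal sketch of what the cleaning pass might look like; the directory names (`websites` in, `cleaned` out) and the `html.parser` backend are assumptions, and the real script may differ:

```python
# Minimal sketch of the cleaning step (paths and parser are assumed).
import os
from bs4 import BeautifulSoup

SOURCE_DIR = "websites"   # where wget saved the HTML files (assumed)
OUTPUT_DIR = "cleaned"    # where the plain-text output goes (assumed)

os.makedirs(OUTPUT_DIR, exist_ok=True)
for name in os.listdir(SOURCE_DIR):
    if not name.endswith(".html"):
        continue
    path = os.path.join(SOURCE_DIR, name)
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")
    # Strip the markup down to visible text only.
    text = soup.get_text(separator="\n", strip=True)
    out_path = os.path.join(OUTPUT_DIR, name.replace(".html", ".txt"))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    os.remove(path)  # the scripts consume the files they process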
`chunker.py`: Splits the text files into smaller chunks using a recursive character-based text splitter. The resulting chunks are saved in a JSONL file.
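Recursive character splitting tries a list of separators from coarsest to finest and recurses on any piece that is still too long. A minimal sketch, with the separator list, the 1000-character chunk size, and the input path all assumed rather than taken from the script:

```python
# Minimal sketch of recursive character-based splitting (sizes assumed).
import jsonlines

SEPARATORS = ["\n\n", "\n", " "]  # coarsest separator first
CHUNK_SIZE = 1000                  # max characters per chunk (assumed)

def split_recursive(text, separators=SEPARATORS):
    if len(text) <= CHUNK_SIZE:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if not piece.strip():
            continue
        if len(piece) <= CHUNK_SIZE:
            chunks.append(piece)
        else:
            chunks.extend(split_recursive(piece, rest))
    return chunks

with open("cleaned/example.txt", encoding="utf-8") as f:
    chunks = split_recursive(f.read())

with jsonlines.open("train.jsonl", mode="a") as writer:
    for chunk in chunks:
        writer.write({"text": chunk})
```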
`vectorizor.py`: Loads the JSONL file, creates embeddings using OpenAI's text-embedding-ada-002 model, and indexes the embeddings using Pinecone. This script requires API keys to run.
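A sketch of the embed-and-index step against the SDK generation implied by the `pinecone-client` requirement (the pre-1.0 `openai` package and the v2 Pinecone client); the index name `my-index` and the metadata layout are assumptions:

```python
# Minimal sketch of embedding + indexing (legacy openai / pinecone-client APIs).
import os
import jsonlines
import openai
import pinecone
from tqdm import tqdm

openai.api_key = os.environ["OPENAI_API_KEY"]
pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])
index = pinecone.Index("my-index")  # assumed name; must be 1536 dimensions

with jsonlines.open("train.jsonl") as reader:
    for i, record in enumerate(tqdm(reader)):
        resp = openai.Embedding.create(
            input=record["text"],
            model="text-embedding-ada-002",
        )
        vector = resp["data"][0]["embedding"]  # 1536 floats for ada-002
        index.upsert(vectors=[(str(i), vector, {"text": record["text"]})])
```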
`pdf-muncher.py`: Processes every PDF in the `pdfs` folder and adds the content to train.jsonl for vectorization.
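A sketch of the PDF step; the requirements below don't name a PDF library, so `pypdf` here is an assumption, as are the paths and record layout:

```python
# Minimal sketch of PDF extraction (pypdf is an assumed dependency).
import os
import jsonlines
from pypdf import PdfReader

PDF_DIR = "pdfs"

with jsonlines.open("train.jsonl", mode="a") as writer:
    for name in os.listdir(PDF_DIR):
        if not name.endswith(".pdf"):
            continue
        path = os.path.join(PDF_DIR, name)
        reader = PdfReader(path)
        # Extract the text of each page as its own record.
        for page in reader.pages:
            text = page.extract_text() or ""
            if text.strip():
                writer.write({"text": text, "source": name})
        os.remove(path)  # the scripts consume the files they process
```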
- Python 3.x
- OpenAI API key
- Pinecone API key
- `bs4` Python library
- `jsonlines` Python library
- `tqdm` Python library
- `tiktoken` Python library
- `pinecone-client` Python library
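For reference, a `requirements.txt` covering the libraries above might look like the following; `beautifulsoup4` is the pip package that provides `bs4`, and `openai` is implied by the embedding step even though it isn't listed above:

```
beautifulsoup4
jsonlines
tqdm
tiktoken
pinecone-client
openai
```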
- Clone the repository and navigate to the project directory.
- Install the required Python libraries using `pip install -r requirements.txt`.
- Set up your OpenAI and Pinecone API keys.
- Download the website using the wget command: `wget -r -np -nd -A.html,.txt,.tmp -P websites https://www.linkedin.com/in/sean-stobo/`
- Run `python cleaner.py` to clean the downloaded website data. This flattens the directory structure into a list of HTML documents.
- Run `python chunker.py` to split the text files into smaller chunks. This outputs train.jsonl in the root directory.
- Run `python pdf-muncher.py` to convert the contents of the `/pdfs/` folder to a serialized train.jsonl file in the root directory.
- Run `python vectorizor.py` to create embeddings and index them using Pinecone. This vectorizes train.jsonl.

Note: Before running `vectorizor.py`, make sure to set up a Pinecone index with 1536 dimensions (see the sketch below).
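Creating that index with the v2 `pinecone-client` might look like this; the index name and `cosine` metric are assumptions, while the 1536 dimension matches text-embedding-ada-002's output:

```python
# One-time setup: create a 1536-dimension Pinecone index (v2 client API).
import os
import pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])

# Name and metric are assumptions; the dimension must match the embedding model.
if "my-index" not in pinecone.list_indexes():
    pinecone.create_index("my-index", dimension=1536, metric="cosine")
```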
- Choose a site to scrape.
- Observe the `websites` folder filling up with files.
- Run the cleaner script.
- Files are normalized and cleaned up.
- Run the chunker script to chunk the website files.
- Run the PDF muncher script to process the PDFs in the pdfs folder.
- Verify that train.jsonl contains the DnD content.
- Run the vectorizor script to update the Pinecone DB.