After cloning the repo, set up a virtual environment and install the dependencies:

```bash
python -m venv venv
source venv/bin/activate    # macOS/Linux
.\venv\Scripts\activate     # Windows
pip install -r requirements.txt
```
Create two env files: `key.env` and `.env`. In `.env`, store your Hugging Face access token (needed for access to Meta's Llama models) under `HF_TOKEN`. In `key.env`, store your Gemini Flash API key under `GOOGLE_API_KEY`.
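For example, the two files would look something like this (the values shown are placeholders for your own tokens):

```ini
# .env
HF_TOKEN=hf_your_token_here

# key.env
GOOGLE_API_KEY=your_gemini_api_key_here
```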
The dataset, stored in `lc_hard.json`, consists of Hard-difficulty problems from LeetCode. As it stands, there are 50 entries: the first 20 come from questions that are asked much more frequently in real-life technical interviews, and the remaining 30 are selected from the set of other Hard problems on LeetCode.
Each entry in the dataset follows the schema below:

```json
{
  "desc": "",
  "skeleton": "",
  "examples": [""],
  "ref": [""],
  "test": {"input": [[]], "output": []},
  "func": ""
}
```
Features:
- `desc` stores a string representation of the problem statement.
- `skeleton` stores a string representation of the code starting point.
- `examples` stores a list of problems with solutions for in-context learning. [This is currently not utilized, but room for it exists in the dataset.]
- `ref` stores a list of human-produced reference solutions to the problem.
- `test` stores a dictionary where `input` maps to a list of inputs for test cases and `output` maps to a list of outputs. A test case's input is represented as a list of inputs, even if there is only one. For example, if the input for a single test case is 4, it would be represented as `[4]`; if the inputs for a single test case were 4 and "foo", it would be represented as `[4, "foo"]`. (See the illustrative entry after this list.)
- `func` stores a string representation of the name of the function the model is completing. This is used for dynamic code execution, not model prompting.
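For illustration only, a toy entry might look like the following (this is not a real dataset entry, and the exact shape of `skeleton` and `ref` in the actual entries may differ); note how each test case's inputs are wrapped in a list:

```json
{
  "desc": "Given two integers a and b, return their sum.",
  "skeleton": "def add_two(a, b):",
  "examples": [""],
  "ref": ["def add_two(a, b):\n    return a + b"],
  "test": {"input": [[1, 2], [4, -4]], "output": [3, 0]},
  "func": "add_two"
}
```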
Because the dataset's JSON representation needs `desc`, `skeleton`, `ref`, and `func` to be single-line strings, it can be hard to type everything out while preserving the structure of the code. To address this, several scripts were written that take multiline strings, which are easy to paste in and read, convert them into a JSON-acceptable format, and write them to the JSON file.
To edit a feature for one or more entries, open `lc_hard_{feature}.py` and set `lcX`, where `X` is the problem number of interest, to your desired string. When you've made your changes, save the file and run the script to update the dataset (a conceptual sketch of this flow follows the note below).
! Note that the current version of these scripts is hard-coded for 50 entries. If entries are added, the value passed to `range` when gathering the global variables should be changed accordingly.
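Conceptually, each `lc_hard_{feature}.py` script does something like the sketch below; the variable names, file handling, and the assumption that the dataset is a list of dicts are illustrative, and the real scripts differ in their details:

```python
import json

# Multiline string that is easy to paste in and read (content is illustrative).
lc1 = """def hard_problem(nums):
    # starting point for problem 1
    pass
"""

with open("lc_hard.json", "r", encoding="utf-8") as f:
    data = json.load(f)  # assumed here to be a list of entry dicts

# The real scripts gather lc1 .. lc50 via a hard-coded range(50).
entries = [lc1]
for i, text in enumerate(entries):
    if text:
        data[i]["skeleton"] = text  # json.dump escapes the newlines into "\n"

with open("lc_hard.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```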
The evaluation in this project is three-pronged:
- Structural Similarity
- Accuracy
- Readability
Structural similarity and accuracy are calculated in `eval.ipynb`. ! This must be run before running `readability.ipynb`.
Structural similarity is calculated using the corpus BLEU score. ! It is important to note that because the model's generation is non-deterministic (no seed is set, and default top-p and temperature parameters are used), the BLEU score can vary significantly given the sensitivity of the metric.
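As a point of reference, corpus BLEU can be computed along these lines (a sketch using NLTK with toy data and naive whitespace tokenization; the notebook's actual tokenization and reference handling may differ):

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy data: one model output per problem, each paired with its list of human
# reference solutions (the "ref" field). Values here are made up.
generated = ["def f ( x ) : return x + 1"]
references = [["def f ( x ) : return x + 1", "def f ( x ) : return 1 + x"]]

hypotheses = [gen.split() for gen in generated]
ref_tokens = [[r.split() for r in refs] for refs in references]

print(f"Corpus BLEU: {corpus_bleu(ref_tokens, hypotheses):.4f}")
```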
Accuracy is calculated by dynamically executing the code generated by the model and then running the corresponding input/output pairs from the dataset against the produced function (see the sketch below).
It is reported as Average Pass Rate across all entries, Pass Rate for Popular Problems (first 20), and Pass Rate for Unpopular Problems (last 30).
! Note that Pass Rate for Unpopular Problems is calculated over all problems other than the first 20, so keep this in mind if you add entries.
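A minimal sketch of the exec-based evaluation described above, assuming each generated snippet defines a top-level function named by the entry's `func` field (the notebook's actual harness may differ, e.g. in error handling or timeouts):

```python
def pass_rate(generated_code: str, entry: dict) -> float:
    """Execute model-generated code and score it against the entry's test cases."""
    namespace: dict = {}
    exec(generated_code, namespace)          # defines the target function
    fn = namespace[entry["func"]]

    tests = entry["test"]
    passed = 0
    for args, expected in zip(tests["input"], tests["output"]):
        try:
            if fn(*args) == expected:        # each input is a list, even for one arg
                passed += 1
        except Exception:
            pass                             # runtime errors count as failures
    return passed / len(tests["output"])

# Illustrative entry, not taken from the dataset.
entry = {"func": "add_two", "test": {"input": [[1, 2], [4, -4]], "output": [3, 0]}}
print(pass_rate("def add_two(a, b):\n    return a + b", entry))  # -> 1.0
```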
Running the cells in `eval.ipynb` will produce the model's outputs in the `generated_outputs` folder as individual .txt files. After this is done, running the cells in `readability.ipynb` will store readability scores for each code snippet in `readability_scores.json`, produce a histogram of readability scores as judged by Gemini Flash, and report an average readability score.
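The Gemini Flash judging step presumably looks something like the sketch below; the exact prompt, model id, scoring scale, and how the key is loaded are assumptions, not the notebook's actual code:

```python
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv("key.env")  # provides GOOGLE_API_KEY
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-flash")  # "Gemini Flash"; exact model id assumed


def readability_score(code: str) -> int:
    """Ask Gemini Flash to rate a snippet's readability (illustrative prompt and scale)."""
    prompt = (
        "Rate the readability of this Python code from 1 to 10. "
        "Reply with only the number.\n\n" + code
    )
    response = model.generate_content(prompt)
    return int(response.text.strip())
```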
For ease of access, we pasted our results as printed by `eval.ipynb` into `stats.txt`.
As mentioned above, running `readability.ipynb` will yield a histogram of readability scores as judged by Gemini Flash. Currently, our other graphs are generated by the simple `make_graphs.ipynb`: paste the dict representing Pass Rate by Problem into `acc_by_snippet`, then run the notebook to produce graphs of Readability by Accuracy.
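A sketch of the kind of plot this produces, assuming matplotlib and a `readability_scores.json` keyed by problem id (the real notebook's data wiring may differ):

```python
import json

import matplotlib.pyplot as plt

# Pass Rate by Problem pasted from eval.ipynb output (these values are made up).
acc_by_snippet = {"1": 1.0, "2": 0.6, "3": 0.0}

with open("readability_scores.json", "r", encoding="utf-8") as f:
    readability = json.load(f)  # assumed: problem id -> readability score

keys = [k for k in acc_by_snippet if k in readability]
plt.scatter([readability[k] for k in keys], [acc_by_snippet[k] for k in keys])
plt.xlabel("Readability score (Gemini Flash)")
plt.ylabel("Pass rate")
plt.title("Readability by Accuracy")
plt.show()
```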
The `check_cuda.py` script exists to check whether your GPU is available.
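A check of this kind is typically just a call to `torch.cuda.is_available()`; the actual contents of `check_cuda.py` may differ:

```python
import torch

# Report whether PyTorch can see a CUDA-capable GPU.
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; running on CPU.")
```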