Multi-Programming Language Evaluation of Large Language Models of Code (MultiPL-E)

Please visit the website or read our paper for more information.

Evaluation on Discovery

These instructions run inference and evaluation on the Northeastern Discovery cluster. The scripts should be straightforward to adapt to other Slurm clusters.

Prerequisites

On a compute node, run

singularity pull docker://ghcr.io/nuprl/multipl-e-evaluation

This will create the file multipl-e-evaluation_latest.sif, which is the container image. The script cluster/discovery_evaluation.sh assumes that the file is saved as /work/arjunguha-research-group/arjun/containers/multipl-e-evaluation_latest.sif.
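If you pull the image to a different location, one option (a sketch, assuming you have write access to that group directory) is to move it into place:

    mkdir -p /work/arjunguha-research-group/arjun/containers
    mv multipl-e-evaluation_latest.sif /work/arjunguha-research-group/arjun/containers/

Otherwise, edit the container path in cluster/discovery_evaluation.sh to point at wherever you saved the .sif file.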

You also need an environment that has the MultiPL-E dependencies. On Discovery, you can use source ~a.guha/bin/gpuenv, which activates an appropriate Conda environment.
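If you are working on another cluster, you will need to build an equivalent environment yourself. A minimal, hypothetical sketch follows; the package names (torch, transformers, datasets) are assumptions about typical HuggingFace-based inference, so defer to the dependency list in the repository:

    conda create -n multipl-e python=3.10
    conda activate multipl-e
    pip install torch transformers datasets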

Running the Evaluation

You can do this on the login node or a compute node with limited resources.

  1. Activate an appropriate environment:

    source ~a.guha/bin/gpuenv
    
  2. Enter the root of the MultiPL-E repository:

    cd /work/arjunguha-research-group/arjun/repos/MultiPL-E
    
  3. Create a directory for experiment results:

    mkdir experiments
    

    You can re-use this directory to incrementally add new experiments.

  4. Create a file called experiments/inference.sh. Each line of the file should be a complete command that runs inference. For example:

    python -m inference --model-name inference.bigcode_mha --root-dataset humaneval --lang py --temperature 0.2 --batch-size 50
    

    We will not run this shell script directly. Instead, each line will be run on a separate GPU node. Therefore, ensure that no command spans multiple lines (i.e., do not use a trailing \) and do not include a #! line. An expanded sketch of such a file appears after this list.

  5. Run the pipeline:

    ./cluster/pipeline.sh experiments
    

    You will receive an email at your @northeastern.edu address when the jobs complete.

    The script puts all log files in experiments/logs.
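For reference, here is a hedged sketch of a multi-line experiments/inference.sh that sweeps one model across several languages. The model name inference.bigcode_mha is taken from the example above; the language codes and other flag values are illustrative assumptions, so substitute the ones your MultiPL-E checkout supports. Because every line is dispatched as its own GPU job, each line must be a complete, self-contained command:

    python -m inference --model-name inference.bigcode_mha --root-dataset humaneval --lang py --temperature 0.2 --batch-size 50
    python -m inference --model-name inference.bigcode_mha --root-dataset humaneval --lang lua --temperature 0.2 --batch-size 50
    python -m inference --model-name inference.bigcode_mha --root-dataset humaneval --lang rs --temperature 0.2 --batch-size 50

While the jobs run, you can monitor them with standard Slurm tools (e.g. squeue -u $USER) and watch the files accumulate in experiments/logs.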
