
dengliangjun/TextEmbCLLMs


Llama3 Experiment

  1. You can fetch the original model weights from Hugging Face: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct or ModelScope: https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct. For example, via ModelScope:
git lfs install
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B-Instruct.git
  2. code_train.xlsx is used for classification training, eval.xlsx for classification evaluation, and valid.xlsx for classification validation.
  3. You can fetch the training source code from the GitHub OpenChatKit project: https://github.com/togethercomputer/OpenChatKit
  4. Please adapt the training script to your hardware configuration. "start_phd_train.sh" is my training script; it has been adapted for both the Llama and GPT-NeoX models.
  5. Use data_loader to load the xlsx files as training datasets (a minimal loading sketch follows this list).
  6. The trained model can be fetched from https://huggingface.co/Misery-HaHa/SkyLlama38Code
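
Below is a rough illustration of step 5: loading the three xlsx files with pandas into (sample, label) pairs. The column names "code" and "label" are assumptions, not taken from the repository; adjust them, and the return format, to whatever the OpenChatKit data_loader actually expects.

import pandas as pd  # reading .xlsx files also requires openpyxl

def load_split(path):
    # Each row is assumed to hold one source-code sample and its class label.
    df = pd.read_excel(path)
    texts = df["code"].astype(str).tolist()   # assumed column name
    labels = df["label"].tolist()             # assumed column name
    return list(zip(texts, labels))

train = load_split("code_train.xlsx")
evaluation = load_split("eval.xlsx")
validation = load_split("valid.xlsx")
print(f"train={len(train)}, eval={len(evaluation)}, valid={len(validation)}")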

GPT-NeoX Experiment

  1. You can fetch the original model weights from GPT-NeoX: https://huggingface.co/EleutherAI/gpt-neox-20b or OpenChatKit GPT-NeoXT-Chat-Base-20B: https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B
  2. You can fetch the training source code from the GitHub OpenChatKit project: https://github.com/togethercomputer/OpenChatKit
  3. Please adapt the training script to your hardware configuration. "start_phd_train.sh" is my training script; it has been adapted for both the Llama and GPT-NeoX models.
  4. C_wasm_source_code_52000.rar contains 52,000 C-language samples from the OJClone dataset; you can fetch it from https://github.com/clonebench/BigCloneBench
  5. Use run_c2wasm.sh to compile the C source files to WebAssembly:
#!/bin/bash
# Compile every .c file in the current directory to a WebAssembly side module
# and count how many files compile successfully.
success=0
total=0
err_num=0
for file in *.c; do
    echo "------------compile wasm from C file: $file-----------------"
    emcc -Oz -ferror-limit=1 -s WASM=1 -s SIDE_MODULE=1 -s USE_BOOST_HEADERS=0 -s ASSERTIONS=0 -g0 \
        -Wmain-return-type -Wreturn-type \
        -Werror=implicit-function-declaration -Werror=deprecated \
        -o "$(basename "$file").wasm" "$file"
    ret=$?
    #wasm2wat -o "$(basename "$file").wast" "$(basename "$file").wasm"
    #wasmtime "$(basename "$file").wasm"
    total=$((total + 1))
    echo "compile result: $ret"
    if [ $ret -eq 0 ]; then
        success=$((success + 1))
        echo "success:$success/$total, fail:$err_num"
    else
        err_num=$((err_num + 1))
        echo "error files count: $err_num"
    fi
done
echo "error=$err_num, success=$success, total=$total" > log.txt
# count the generated .wasm files
ls -1 | grep -c '\.c\.wasm$'
  6. Use data_loader to load the wasm files as training datasets byte by byte, using each byte value as a token index; do not tokenize at the instruction level (a minimal sketch follows this list).
  7. Adjust the batch size and the number of model layers loaded by each GPU, then start the training. Refer to start_phd_train.sh for the startup script.
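
To make step 6 concrete, here is a minimal byte-level loading sketch: each compiled .wasm file is read as raw bytes and every byte value (0-255) is used directly as a token index, with no instruction-level tokenization. The directory layout and the optional id offset for special tokens are assumptions; align them with the vocabulary and interface of the actual data_loader.

from pathlib import Path

def wasm_to_token_ids(path, offset=0):
    # One token index per byte; `offset` is a hypothetical shift that
    # reserves the lowest ids for special tokens, if the vocabulary needs it.
    return [b + offset for b in Path(path).read_bytes()]

def load_wasm_dir(directory, offset=0):
    # One token sequence per compiled .wasm file in the directory.
    return {p.name: wasm_to_token_ids(p, offset)
            for p in sorted(Path(directory).glob("*.wasm"))}

if __name__ == "__main__":
    dataset = load_wasm_dir(".")  # directory containing the compiled .wasm files
    for name, ids in list(dataset.items())[:3]:
        print(name, len(ids), ids[:16])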
