Awesome-DataCentric-LLM Public
Trending projects & awesome papers about data-centric LLM studies.
Megatron-Sailor2 Public
Forked from sail-sg/Megatron-Sailor2
Megatron for Qwen
nanoverl Public
A collection of RLxLM experiments using minimal code
verl Public
Forked from volcengine/verl
veRL: Volcano Engine Reinforcement Learning for LLM
trafilatura Public
Forked from adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Python Apache License 2.0 Updated Dec 28, 2024
magpie Public
Forked from magpie-align/magpie
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
Python MIT License Updated Oct 23, 2024
temp-open-instruct Public
Forked from allenai/open-instruct
temp-fork
Python Apache License 2.0 Updated Sep 30, 2024
llm-swarm Public
Forked from huggingface/llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
Python MIT License Updated Jul 11, 2024
datatrove Public
Forked from huggingface/datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Python Apache License 2.0 Updated Jun 3, 2024
code-llm-contamination Public
Forked from yale-nlp/code-llm-contamination
Python MIT License Updated May 16, 2024
amber-train Public
Forked from LLM360/amber-train
Pre-training code for Amber 7B LLM
Python Apache License 2.0 Updated May 10, 2024
sailcraft Public
Forked from sail-sg/sailcraft
Data Toolkit for Sailor Language Models
Python Updated Apr 30, 2024
CodeQwen1.5 Public
Forked from QwenLM/Qwen2.5-Coder
CodeQwen1.5 is the code version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.
Python Updated Apr 16, 2024
LLaVA Public
Forked from haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Python Apache License 2.0 Updated Mar 8, 2024
prismatic-vlms Public
Forked from TRI-ML/prismatic-vlms
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
Python MIT License Updated Mar 5, 2024
TinyLlama Public
Forked from jzhang38/TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Python Apache License 2.0 Updated Feb 3, 2024
dspy Public
Forked from stanfordnlp/dspy
DSPy: The framework for programming, not prompting, foundation models
Python MIT License Updated Jan 29, 2024
mink Public
Forked from swj0419/detect-pretrain-code
This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins…
Python Apache License 2.0 Updated Nov 3, 2023
awesome-llm-powered-agent Public
Forked from hyp1231/awesome-llm-powered-agent
Awesome things about LLM-powered agents. Papers / Repos / Blogs / ...
MIT License Updated Oct 17, 2023
ebooks Public
Forked from kska32/ebooks
A collection of classic e-books on history, politics, psychology, philosophy, mathematics, and computer science (about 100,000 books)
JavaScript Updated Sep 28, 2023
open-interpreter Public
Forked from OpenInterpreter/open-interpreter
OpenAI's Code Interpreter in your terminal, running locally
Python MIT License Updated Sep 19, 2023
tacube Public
[EMNLP 2022] TaCube: Pre-computing Data Cubes for Answering Numerical-Reasoning Questions over Tabular Data
17 Updated May 17, 2023
openai-cookbook Public
Forked from openai/openai-cookbook
Examples and guides for using the OpenAI API
Jupyter Notebook MIT License Updated Mar 12, 2023
datasets Public
Forked from huggingface/datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Python Apache License 2.0 Updated Jun 1, 2022