- Beijing
- https://scholar.google.com/citations?user=j4EmuqkAAAAJ&hl=zh-CN
Stars
Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"
Generative Agents: Interactive Simulacra of Human Behavior
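Currently covers 213 large models, including commercial models such as chatgpt, gpt-4o, o3-mini, Google gemini, Claude3.5, Zhipu GLM-Zero, ERNIE Bot, qwen-max, Baichuan, iFlytek Spark, SenseTime senseChat, and minimax, as well as DeepSeek-R1, qwq-32b, deepseek-v3, qwen2.5, llama3.3, phi-4, glm4, gemma3, mistral, 书生in…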
Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
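🎉 Repo for LaWGPT, Chinese-Llama tuned with Chinese legal knowledge. A large language model based on Chinese legal knowledge.
MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, targeting the 40T of data used to train chatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing stories, chat logs, and more.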
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
GAOKAO-Bench is an evaluation framework that utilizes GAOKAO questions as a dataset to evaluate large language models.
BELLE: Be Everyone's Large Language model Engine (an open-source Chinese conversational large language model)
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tuning) together for easy use. We welcome open-source enthusiasts…
Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks
4-bit quantization of LLaMA using GPTQ
A collection of libraries to optimise AI model performance
A High-Performance PyTorch Implementation of face detection models, including RetinaFace and DSFD
An implementation of the EDA paper for Chinese corpora. An EDA data augmentation tool for Chinese text. NLP data augmentation. Paper reading notes.
FaRL for Facial Representation Learning [Official, CVPR 2022]
Synthetic Faces High Quality (SFHQ) Dataset. 425,000 curated 1024x1024 synthetic face images
State-of-the-Art Text Embeddings
Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in PyTorch
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
[CVPR 2021] Multi-Modal-CelebA-HQ: A Large-Scale Text-Driven Face Generation and Understanding Dataset
Free English to Chinese Dictionary Database
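ChineseSemanticKB, a Chinese semantic knowledge base: 12 categories of million-scale, commonly used semantic dictionaries for Chinese language processing, including an abstract-semantics lexicon of 340,000 entries, an antonym lexicon of 340,000 entries, and a synonym lexicon of 430,000 entries. It supports applications such as sentence expansion, rewriting, and event abstraction and generalization.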