🗣️ [ 中文 | English ]
Project Background
With the rapid development of deep learning technology, large language models such as ChatGPT have made substantial strides in natural language processing. However, these models still face several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project tackles these issues by launching an open-source, large-scale knowledgeable language model framework and releasing the corresponding models. The project's initial phase introduced a knowledge extraction LLM based on LLaMA, dubbed ZhiXi (智析, meaning intelligent analysis of data for knowledge extraction). To integrate Chinese understanding into the language model without compromising its inherent knowledge, we (1) first perform full-scale pre-training of LLaMA (13B) on Chinese corpora, which improves the model's understanding of Chinese and enriches its knowledge while retaining its original English and code capabilities; and then (2) fine-tune the model obtained in the first step on an instruction dataset, strengthening its understanding of human instructions for knowledge extraction.
- ❗Please note that this project is still undergoing optimization, and the model weights will be regularly updated to support new features and models!
The features of this project are as follows:
- Centered on knowledge and large models, full-scale pre-training of a large model such as LLaMA is conducted on the constructed Chinese & English pre-training corpus.
- Based on KG2Instructions technology, knowledge extraction tasks including NER, RE, and IE are optimized so that they can be completed using human instructions.
- Using the constructed Chinese instruction dataset (approximately 1400K examples), LoRA fine-tuning is applied to enhance the model's understanding of human instructions (see the sketch after this list).
- The weights of both the pre-trained model and the LoRA instruction fine-tuning are open-sourced.
- The full-scale pre-training code (covering conversion, construction, and loading of large corpora) and the LoRA instruction fine-tuning code are open-sourced (supporting multi-machine, multi-GPU training).
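For a rough sense of what the LoRA step above looks like in practice, here is a minimal sketch using `transformers` and `peft`. It is not the project's released training code; the repo id, target modules, and hyperparameters are assumptions to be replaced with the actual configuration from the open-sourced scripts.

```python
# Minimal LoRA instruction-tuning sketch (NOT the project's released training code).
# The model id, target modules, and hyperparameters are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "zjunlp/knowlm-13b-base-v1.0"  # assumed repo id; see the model table below
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
# ... train on the instruction data with a standard Trainer / training loop ...
```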
All weights and datasets have been uploaded to HuggingFace🤗. Click here to get started right away!
❗If you encounter any issues during the installation or use of KnowLM, please check FAQ or promptly submit an issue, and we will assist you with resolving the problem!
| Category | Base | Name | Version | Download Link | Note |
| --- | --- | --- | --- | --- | --- |
| Base Model | LLaMA1 | KnowLM-13B-Base | V1.0 | HuggingFace, WiseModel, ModelScope | Base Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-ZhiXi | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-IE | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA2 | KnowLM-7B-Ocean (OceanGPT) | V1.0 | HuggingFace, WiseModel | Ocean Model |
| Dialogue Model | LLaMA2 | OneKE | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
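To try one of the released dialogue models, a minimal inference sketch with `transformers` might look like the following; the repo id is an assumption, so use the actual download links in the table above.

```python
# Minimal inference sketch (the repo id is an assumption; see the table above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zjunlp/knowlm-13b-zhixi"  # assumed HuggingFace repo id for KnowLM-13B-ZhiXi
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "请列出下面句子中的人名和地名：\n约翰在杭州参观了西湖。"  # a simple Chinese NER-style instruction
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```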
| Instruction Dataset Name | Number | Download Link | Used by ZhiXi | Note |
| --- | --- | --- | --- | --- |
| KnowLM-CR (CoT & Reasoning, Chinese and English) | 202,333 | Google Drive, HuggingFace | Yes | |
| KnowLM-IE (Information Extraction, Chinese) | 281,860 | Google Drive, HuggingFace | Yes | Contains noise due to distant supervision. |
| KnowLM-Tool (Tool Learning, English) | 38,241 | Google Drive, HuggingFace | No | Will be used in the next version. |
| OceanBench (Benchmark, English) | 11,000 | HuggingFace | No | |
Data description:
1. Other data sources for information extraction include CoNLL, ACE, casis, DuEE, People Daily, DuIE, etc.
2. The KnowLM-Tool dataset comes from the paper "Making Language Models Better Tool Learners with Execution Feedback"; the GitHub repository can be found here.
3. The KnowLM-IE dataset comes from the paper "InstructIE: A Chinese Instruction-based Information Extraction Dataset"; the GitHub repository can be found here.
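If the downloaded instruction files are plain JSON records (an assumption; inspect the actual files behind the links above), they can be loaded for inspection with the `datasets` library:

```python
# Sketch for inspecting a downloaded instruction dataset.
# The file name and record format are assumptions -- check the actual files.
from datasets import load_dataset

ds = load_dataset("json", data_files="KnowLM-IE.json", split="train")
print(len(ds))  # should roughly match the "Number" column in the table above
print(ds[0])    # inspect one record before building training prompts
```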
- [March 2024] We release a new paper: "KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents".
- [February 2024] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction tuning dataset named IEPile.
- [February 2024] We release a new paper: "EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models" with an HF demo EasyInstruct.
- [January 2024] We release a new paper: "A Comprehensive Study of Knowledge Editing for Large Language Models" with a new benchmark KnowEdit.
- [August 2023] The full parameters have been released (omitting the parameter consolidation process).
- [July 2023] The instruction dataset has been released.
- [July 2023] Support instruction fine-tuning and vLLM for LLaMA-2.
- [June 2023] The project name has been changed from CaMA to KnowLM.
- [June 2023] Release the first version of pre-trained weights and the LoRA weights.
This is an overview of KnowLM, which mainly consists of three technical features:
- Knowledge Prompting: It generates knowledge prompts based on structured data such as knowledge graphs and utilizes knowledge augmentation constraints to address knowledge extraction and reasoning issues.
- Knowledge Editing: It aligns outdated, incorrect, and biased knowledge within large models using knowledge editing techniques to tackle knowledge fallacy problems (English Tutorial).
- Knowledge Interaction: It enables dynamic knowledge interaction and feedback to achieve tool-based learning and multi-agent collaboration, resolving the problem of embodied cognition in LLMs (English Tutorial).
The tools corresponding to these three technologies are EasyInstruct, EasyEdit, and EasyAgent (under development). We will soon provide use cases for knowledge prompting and knowledge editing based on the KnowLM framework.
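As a toy illustration of the knowledge prompting idea (not the actual KnowLM or EasyInstruct API), knowledge-graph triples can be rendered into a prompt that constrains what the model extracts:

```python
# Toy knowledge-prompting sketch (hypothetical; not the KnowLM / EasyInstruct API):
# KG triples are rendered as natural-language facts and prepended to an
# extraction instruction, constraining the model's output space.
def build_knowledge_prompt(triples, sentence):
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in triples)
    return (
        "Known facts from the knowledge graph:\n"
        f"{facts}\n\n"
        "Using only relations consistent with these facts, extract all "
        f"(head, relation, tail) triples from the sentence:\n{sentence}"
    )

triples = [("Hangzhou", "located in", "Zhejiang"), ("Zhejiang", "part of", "China")]
print(build_knowledge_prompt(triples, "Hangzhou is the capital of Zhejiang Province."))
```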