🗣️ [ 中文 | English ]
Project Background
With the rapid development of deep learning technology, large language models such as ChatGPT have made substantial strides in natural language processing. However, these models still face several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project tackles these issues by launching an open-source, large-scale knowledgeable language model framework and releasing the corresponding models. The project's initial phase introduced a knowledge extraction LLM based on LLaMA, dubbed ZhiXi (智析, meaning intelligent analysis of data for knowledge extraction). To integrate Chinese understanding into the language model without compromising its inherent knowledge, we (1) first perform full-scale pre-training of LLaMA (13B) on Chinese corpora, which improves the model's understanding of Chinese and enriches its knowledge while retaining its original English and code capabilities; and then (2) fine-tune the model obtained in the first step on an instruction dataset, strengthening its understanding of human instructions for knowledge extraction.
- ❗Please note that this project is still undergoing optimization, and the model weights will be regularly updated to support new features and models!
The features of this project are as follows:
- Centered on knowledge and large models, full-scale pre-training of a large model such as LLaMA is conducted on the constructed Chinese & English pre-training corpus.
- Based on KG2Instructions technology, knowledge extraction tasks including NER, RE, and IE are optimized so that they can be completed using human instructions.
- Using the constructed Chinese instruction dataset (approximately 1400K examples), LoRA fine-tuning is applied to enhance the model's understanding of human instructions (see the sketch after this list).
- The weights of both the pre-trained model and the LoRA instruction fine-tuning are open-sourced.
- The full-scale pre-training code (covering conversion, construction, and loading of large corpora) and the LoRA instruction fine-tuning code are open-sourced (supporting multi-machine, multi-GPU training).
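For a rough sense of what the LoRA step above looks like in practice, here is a minimal sketch using `transformers` and `peft`. It is not the project's released training code; the repo id, target modules, and hyperparameters are assumptions to be replaced with the actual configuration from the open-sourced scripts.

```python
# Minimal LoRA instruction-tuning sketch (NOT the project's released training code).
# The model id, target modules, and hyperparameters are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "zjunlp/knowlm-13b-base-v1.0"  # assumed repo id; see the model table below
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
# ... train on the instruction data with a standard Trainer / training loop ...
```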
All weights and datasets have been uploaded to HuggingFace🤗. Click here to get started right away!
❗If you encounter any issues during the installation or use of KnowLM, please check FAQ or promptly submit an issue, and we will assist you with resolving the problem!
| Category | Base | Name | Version | Download Link | Note |
| --- | --- | --- | --- | --- | --- |
| Base Model | LLaMA1 | KnowLM-13B-Base | V1.0 | HuggingFace, WiseModel, ModelScope | Base Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-ZhiXi | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-IE | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA2 | KnowLM-7B-Ocean (OceanGPT) | V1.0 | HuggingFace, WiseModel | Ocean Model |
| Dialogue Model | LLaMA2 | OneKE | V1.0 | HuggingFace, WiseModel, ModelScope | Information Extraction Model |
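To try one of the released dialogue models, a minimal inference sketch with `transformers` might look like the following; the repo id is an assumption, so use the actual download links in the table above.

```python
# Minimal inference sketch (the repo id is an assumption; see the table above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zjunlp/knowlm-13b-zhixi"  # assumed HuggingFace repo id for KnowLM-13B-ZhiXi
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "请列出下面句子中的人名和地名：\n约翰在杭州参观了西湖。"  # a simple Chinese NER-style instruction
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```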
| Instruction Dataset Name | Number | Download Link | Used by ZhiXi | Note |
| --- | --- | --- | --- | --- |
| KnowLM-CR (CoT & Reasoning, Chinese and English) | 202,333 | Google Drive, HuggingFace | Yes | |
| KnowLM-IE (Information Extraction, Chinese) | 281,860 | Google Drive, HuggingFace | Yes | Contains noise due to distant supervision. |
| KnowLM-Tool (Tool Learning, English) | 38,241 | Google Drive, HuggingFace | No | Will be used in the next version. |
| OceanBench (Benchmark, English) | 11,000 | HuggingFace | No | |
Data description:
1. Other data sources for information extraction include CoNLL, ACE, casis, DuEE, People Daily, DuIE, etc.
2. The KnowLM-Tool dataset comes from the paper "Making Language Models Better Tool Learners with Execution Feedback"; the GitHub repository can be found here.
3. The KnowLM-IE dataset comes from the paper "InstructIE: A Chinese Instruction-based Information Extraction Dataset"; the GitHub repository can be found here.
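If the downloaded instruction files are plain JSON records (an assumption; inspect the actual files behind the links above), they can be loaded for inspection with the `datasets` library:

```python
# Sketch for inspecting a downloaded instruction dataset.
# The file name and record format are assumptions -- check the actual files.
from datasets import load_dataset

ds = load_dataset("json", data_files="KnowLM-IE.json", split="train")
print(len(ds))  # should roughly match the "Number" column in the table above
print(ds[0])    # inspect one record before building training prompts
```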
- [March 2024] We release a new paper: "KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents".
- [February 2024] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction tuning dataset named IEPile.
- [February 2024] We release a new paper: "EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models" with an HF demo EasyInstruct.
- [January 2024] We release a new paper: "A Comprehensive Study of Knowledge Editing for Large Language Models" with a new benchmark KnowEdit.
- [August 2023] The full parameters have been released (omitting the parameter consolidation process).
- [July 2023] The instruction dataset has been released.
- [July 2023] Support instruction fine-tuning and vLLM for LLaMA-2.
- [June 2023] The project name has been changed from CaMA to KnowLM.
- [June 2023] Release the first version of pre-trained weights and the LoRA weights.
This is an overview of KnowLM, which mainly consists of three technical features:
- Knowledge Prompting: It generates knowledge prompts based on structured data such as knowledge graphs and utilizes knowledge augmentation constraints to address knowledge extraction and reasoning issues.
- Knowledge Editing: It aligns outdated, incorrect, and biased knowledge within large models using knowledge editing techniques to tackle knowledge fallacy problems (English Tutorial).
- Knowledge Interaction: It enables dynamic knowledge interaction and feedback to achieve tool-based learning and multi-agent collaboration, resolving the problem of embodied cognition in LLMs (English Tutorial).
The tools corresponding to these three technologies are EasyInstruct, EasyEdit, and EasyAgent (under development). We will soon provide use cases for knowledge prompting and knowledge editing based on the KnowLM framework.
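As a toy illustration of the knowledge prompting idea (not the actual KnowLM or EasyInstruct API), knowledge-graph triples can be rendered into a prompt that constrains what the model extracts:

```python
# Toy knowledge-prompting sketch (hypothetical; not the KnowLM / EasyInstruct API):
# KG triples are rendered as natural-language facts and prepended to an
# extraction instruction, constraining the model's output space.
def build_knowledge_prompt(triples, sentence):
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in triples)
    return (
        "Known facts from the knowledge graph:\n"
        f"{facts}\n\n"
        "Using only relations consistent with these facts, extract all "
        f"(head, relation, tail) triples from the sentence:\n{sentence}"
    )

triples = [("Hangzhou", "located in", "Zhejiang"), ("Zhejiang", "part of", "China")]
print(build_knowledge_prompt(triples, "Hangzhou is the capital of Zhejiang Province."))
```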