$\text{LLM}\times \text{MapReduce}$: Simplified Long-Sequence Processing using Large Language Models

📖 Introduction | ⚡️ Getting Started | 📊 Experiment Results | 📝 Citation

📃 V1 Paper | 📃 V2 Paper | 📚 SurveyEval | 📃 Chinese README

🎉 News

  • 2025-04-09: Introduced the $\text{LLM}\times \text{MapReduce}$-V2 framework to support long-to-long generation, and released the V2 paper on arXiv.
  • 2025-02-21: Added support for both the OpenAI API and OpenAI-compatible APIs (e.g., vLLM). 🚀
  • 2024-10-12: Released our V1 paper on arXiv. 🎇
  • 2024-09-12: Introduced the $\text{LLM}\times \text{MapReduce}$ framework, which delivers strong performance on long-sequence benchmarks and is compatible with various open-source LLMs. 🎊

📖 Introduction

$\text{LLM}\times \text{MapReduce}$-V2 was jointly proposed by the THUNLP group at Tsinghua University, OpenBMB, and the 9#AISoft team.

The $\text{LLM}\times \text{MapReduce}$-V1 README can be found here.

Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generation has received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose $\text{LLM}\times \text{MapReduce}$-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, $\text{LLM}\times \text{MapReduce}$-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines.

Figure: the $\text{LLM}\times \text{MapReduce}$-V2 framework.

⚡️ Getting Started

The following steps apply to $\text{LLM}\times \text{MapReduce}$-V2. If you want to use $\text{LLM}\times \text{MapReduce}$-V1, refer to the V1 documentation here.

To get started, ensure all dependencies listed in requirements.txt are installed. You can do this by running:

cd LLMxMapReduce_V2
conda create -n llm_mr_v2 python=3.11
conda activate llm_mr_v2
pip install -r requirements.txt
python -m playwright install --with-deps chromium

Before running the evaluation, you first need to download punkt_tab:

import nltk
nltk.download('punkt_tab')

Environment Configuration

Please set OPENAI_API_KEY and OPENAI_API_BASE in your environment variables before starting the pipeline. If you use Miniconda, replace anaconda3 with miniconda3 in LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=${HOME}/anaconda3/envs/llm_mr_v2/lib/python3.11/site-packages/nvidia/nvjitlink/lib:${LD_LIBRARY_PATH}
export PYTHONPATH=$(pwd):${PYTHONPATH}
export OPENAI_API_KEY="your-openai-key"        # required when infer_type is OpenAI
export OPENAI_API_BASE="your-openai-base-url"
export GOOGLE_API_KEY="your-google-cloud-key"  # required when infer_type is Google
export SERP_API_KEY="your-serp-api-key"        # get a SERP API key from https://serpapi.com

We provide both English and Chinese versions of the prompts. The default is English. To use the Chinese version, set this environment variable:

export PROMPT_LANGUAGE="zh"

Model Setup

⚠️ We strongly recommend using the Gemini Flash models; other models may trigger unknown errors. The pipeline also issues a high volume of concurrent API calls, so locally deployed models are not recommended.

The models used in the generation process are configured in the ./LLMxMapReduce_V2/config/model_config.json file. Currently, we support both the OpenAI API and the Google API. You can specify the API to be used in the infer_type key. Additionally, you need to specify the model name in the model key.
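
For illustration, here is a minimal Python sketch that writes such a config. Only the infer_type and model keys come from the description above; any other structure, including the example values, is an assumption rather than the shipped schema. It assumes you run it from the LLMxMapReduce_V2 directory.

import json

# Hypothetical sketch: only the "infer_type" and "model" keys are documented
# above; the example values are placeholders, not the shipped schema.
model_config = {
    "infer_type": "OpenAI",   # "OpenAI" or "Google", per the supported APIs
    "model": "gpt-4o-mini",   # placeholder model name; use the model you run
}

# Assumes the current working directory is LLMxMapReduce_V2.
with open("config/model_config.json", "w", encoding="utf-8") as f:
    json.dump(model_config, f, indent=2)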

The crawling process also requires large language model (LLM) inference; you can configure its model in a similar manner in the ./LLMxMapReduce_V2/src/start_pipeline.py file.

Start LLMxMapReduce_V2 pipeline

Follow the instructions below to generate a report. The generated Markdown files are written to ./output/md.

cd LLMxMapReduce_V2
bash scripts/pipeline_start.sh TOPIC output_file_path.jsonl

If you wish to use your own data, set --input_file in the scripts.

Each input record must contain at least the following fields:

{
  "title": "The article title you wish to write",
  "papers": [
    {
      "title": "The material title",
      "abstract": "The abstract material. Optional, if not, part of the full text will be excerpted",
      "txt": "The reference material full content"
    }
  ]
}
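
For illustration, a minimal Python sketch that assembles one such record into a .jsonl file (one JSON object per line); the topic and reference contents are placeholders:

import json

# Sketch: build one record matching the schema above and append it to a
# .jsonl file (one JSON object per line), suitable for --input_file.
record = {
    "title": "A Survey on Long-Form Text Generation",  # hypothetical topic
    "papers": [
        {
            "title": "Some Reference Paper",
            "abstract": "Optional abstract of the reference.",
            "txt": "Full text of the reference material...",
        }
    ],
}

with open("my_input.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")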

You can use this script to convert the generated .jsonl data into multiple .md files; a minimal illustrative version is sketched below.
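
For illustration only (the repository links its own conversion script above), a minimal converter might look like this; the title and content field names of the output records are assumptions, not the confirmed output schema.

import json
from pathlib import Path

# Illustrative sketch only: the real script is linked above. The "title"
# and "content" field names in each output record are assumptions.
def jsonl_to_md(jsonl_path, out_dir="output/md"):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            title = record.get("title", f"article_{i}")
            body = record.get("content", "")
            safe_name = title.replace("/", "_")[:100]
            (Path(out_dir) / f"{safe_name}.md").write_text(
                f"# {title}\n\n{body}\n", encoding="utf-8"
            )

jsonl_to_md("output_file_path.jsonl")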

📃 Evaluation

The following steps apply to $\text{LLM}\times \text{MapReduce}$-V2. If you want to use $\text{LLM}\times \text{MapReduce}$-V1, refer to the V1 documentation here.

Follow the steps below to set up the evaluation:

1. Download the Dataset

Before running the evaluation, download the test split of the SurveyEval dataset and store it in a .jsonl file.
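
As a sketch, assuming the test split is hosted on Hugging Face (see the SurveyEval link above), you could export it to .jsonl as follows; the dataset ID below is a placeholder, not the real identifier.

import json
from datasets import load_dataset  # pip install datasets

# Placeholder dataset ID: substitute the actual SurveyEval repository ID
# from the link above; hosting on Hugging Face is an assumption.
dataset = load_dataset("PLACEHOLDER/SurveyEval", split="test")

with open("survey_eval_test.jsonl", "w", encoding="utf-8") as f:
    for example in dataset:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")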

2. Run the Evaluation

Execute the script to evaluate the generated results.

cd LLMxMapReduce_V2
bash scripts/eval_all.sh output_data_file_path.jsonl

Note that the evaluation process consumes a large number of tokens, so make sure your API account has sufficient balance.

📊 Experiment Results

Our experiments demonstrate the improved performance of LLMs using the $\text{LLM}\times \text{MapReduce}$-V2 framework on SurveyEval. Detailed results are provided below.

| Methods | Struct. | Faith. | Rele. | Lang. | Crit. | Num. | Dens. | Prec. | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 94.44 | 96.43 | 100.00 | 96.50 | 37.11 | 78.75 | 74.64 | 25.48 | 26.46 |
| + Skeleton | 98.95 | 97.03 | 100.00 | 95.95 | 41.01 | 135.15 | 72.96 | 62.60 | 65.11 |
| AutoSurvey | 86.00 | 93.10 | 100.00 | 92.90 | 68.39 | 423.35 | 31.97 | 50.12 | 51.73 |
| LLMxMapReduce_V2 | 95.00 | 97.22 | 100.00 | 94.34 | 71.99 | 474.90 | 52.23 | 95.50 | 95.80 |

📑 ToDo

  • Support autonomous termination
  • Open-source the crawler for searching papers

📝 Citation

If you use the content of this repository, please cite our papers and leave a star :).

@misc{wang2025llmtimesmapreducev2entropydrivenconvolutionaltesttime,
      title={$\text{LLM}\times \text{MapReduce}$-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources}, 
      author={Haoyu Wang and Yujia Fu and Zhu Zhang and Shuo Wang and Zirui Ren and Xiaorong Wang and Zhili Li and Chaoqun He and Bo An and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2504.05732},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.05732}, 
}

@misc{zhou2024llmtimesmapreducesimplifiedlongsequenceprocessing,
      title={$\text{LLM}\times \text{MapReduce}$: Simplified Long-Sequence Processing using Large Language Models}, 
      author={Zihan Zhou and Chong Li and Xinyi Chen and Shuo Wang and Yu Chao and Zhili Li and Haoyu Wang and Rongqiao An and Qi Shi and Zhixing Tan and Xu Han and Xiaodong Shi and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2410.09342},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.09342}, 
}
