News Event Clustering 📰 🗞️ 👋

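Project pipeline:

Raw news documents ---> extract 5W1H, organization, category ---> store in the neo4j database ---> Louvain algorithm ---> add properties to the clustered events ---> store the day's events in the database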
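Louvain community detection works by greedily maximizing modularity. As a rough illustration of that objective only (this is not the project's code, and the function and variable names are hypothetical), the modularity of a given partition of an undirected, unweighted graph can be computed like this:

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs.
    community_of: dict mapping each node to a community id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by one bridge edge (hypothetical event ids).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
split = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
merged = {node: 0 for node in split}
print(modularity(edges, split) > modularity(edges, merged))
```

Grouping each triangle into its own community scores higher than putting every node in one community, which is the kind of partition Louvain searches for.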

✋ Usage:

1️⃣ Clone the project:

git clone https://github.com/dawnai/Louvain.git

2️⃣ Create a virtual environment with conda; Python 3.10 is recommended:

conda create -n py10 python=3.10

3️⃣ Prepare the news to be extracted and place it under the data/waite_to_extrac/ folder:

|-data
|---waite_to_extrac
|------news1.json
|------news2.json

4️⃣ Run:

python main.py

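When the run finishes, you can inspect the extraction results in neo4j.

⚠️ Notes:

1. The input news must be provided as a JSON document in the following format: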

[
  {
    "id": 1,
    "title": "",
    "created_at": "",
    "text": ""
  },
  {
    "id": 2,
    "title": "",
    "created_at": "",
    "text": ""
  }
]
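A quick check of an input file against this format before running the pipeline might look like the sketch below. It is not part of the project; `validate_news` and `load_news` are hypothetical helpers, and the only assumption is that every item must carry the four fields shown above.

```python
import json

REQUIRED_FIELDS = ("id", "title", "created_at", "text")


def validate_news(records):
    """Return a list of problems; an empty list means the input looks well-formed."""
    if not isinstance(records, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"item {index}: expected an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"item {index}: missing {', '.join(missing)}")
    return problems


def load_news(path):
    """Load a news JSON file and raise ValueError if it does not match the format."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    problems = validate_news(records)
    if problems:
        raise ValueError("; ".join(problems))
    return records
```

For example, `load_news("./data/waite_to_extrac/news1.json")` would return the list of news items or raise an error naming the offending entries.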

2. Configuration in main.py:

CONFIG = {
    "input_json": "./data/waite_to_extrac/2-3.json",  # path to the day's news JSON file
    "output_files": {
        "excel": "./data/waite_to_neo4j/xlsx/2-3.xlsx",  # where the extraction results are written
        "json": "./data/waite_to_neo4j/json/2-3.json"
    },
    "api": {  # LLM settings
        "key": "sk-d0c3b3fe823c4fcfbe6a56a8a13c946c",
        "base_url": "https://llm.jnu.cn/v1",
        "model": "Qwen2.5-72B-Instruct",
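        "retries": 3,  # number of retries on failure
        "timeout": 30  # request timeout
    },
    "processing": {  # strategy for extracting multiple 5W1H records
        "batch_size": 5,
        "max_events_per_news": 3,  # maximum number of 5W1H records extracted per news item
        "request_interval": 1,  # interval between extraction requests
        "text_truncate_length": 1000  # length at which the text is truncated
    },
    "neo4j_config":{  # neo4j database settings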
        "uri": "bolt://172.20.77.180:7687",
        "user": "neo4j",
        "password": "neo4j@openspg",
        "database": "today",  # database for the current day
        "alldatabase":"allday"  # cumulative database
    },
    "target_columns":['what', 'where', 'when', 'who', 'why', 'how', 'title','organization','news_id'],  # columns of the xlsx sheet that should be extracted
    "config_louvain":{
        "semantic_threshold": 0.8,  # semantic similarity threshold
        "embedding_uri":'https://embedding.jnu.cn/v1',  # ollama address: http://172.20.71.112:11434; JNU: https://embedding.jnu.cn/v1
        "embedding_name":'bge-m3'  # embedding model
    }
}
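To give a feel for what `semantic_threshold` could control, here is a minimal sketch that links two events when the similarity of their embeddings reaches the threshold. The README does not say which similarity measure the project uses or how the threshold is applied, so cosine similarity, the edge-building step, and the names `cosine_similarity` and `similarity_edges` are all assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(embeddings, threshold=0.8):
    """Return (id_a, id_b, score) for every pair whose score is at or above threshold.

    embeddings: dict mapping an event id to its embedding vector.
    """
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


# Toy 2-dimensional vectors; real bge-m3 embeddings are much longer.
toy = {"e1": [1.0, 0.0], "e2": [0.9, 0.1], "e3": [0.0, 1.0]}
linked = similarity_edges(toy, threshold=0.8)
```

A higher threshold yields fewer, tighter links between events, and a lower one yields a denser graph for the clustering step to work on.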

3. The project uses multithreading for entity extraction and data processing.
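The general pattern can be sketched with the standard library's thread pool. This is an illustration only: `extract_one` is a hypothetical stand-in for the real LLM extraction call, and the worker count of 5 is an arbitrary choice rather than a value taken from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(news):
    # Placeholder for the real extraction request; it only echoes the id
    # and the text truncated to 1000 characters.
    return {"news_id": news["id"], "text": news["text"][:1000]}


def extract_all(news_items, max_workers=5):
    """Run extract_one over all items in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(extract_one, news_items))


sample = [{"id": 1, "text": "a" * 1500}, {"id": 2, "text": "b"}]
results = extract_all(sample)
```

Threads suit this workload because most of the time is spent waiting on network responses rather than on computation.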
