python_extractor

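A web spider implemented in Python that extracts the main text of general web pages using the line-block distribution function. The spider keeps its breadth-first queue in a Python list (search depth limited to 3), applies the generic web page body extraction algorithm based on the line-block distribution function to pull out each page's main text (accuracy of about 95%), and finally extracts the English content by checking the UTF-8 code numbers of the characters in the body text.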
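The three steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in NetSpider.py: the names crawl_bfs, extract_main_text, keep_english, get_links, block_width and threshold are all hypothetical, the extraction step is a simplified variant of the line-block idea, and the English filter assumes that "checking the UTF-8 code numbers" means keeping characters in the ASCII range.

```python
import re


def crawl_bfs(start, get_links, max_depth=3):
    """Breadth-first crawl with a plain list as the queue; returns visit order.

    get_links is a caller-supplied (hypothetical) function: URL -> iterable of URLs.
    """
    queue = [(start, 0)]              # list used as a FIFO queue
    seen = {start}
    order = []
    while queue:
        url, depth = queue.pop(0)
        order.append(url)
        if depth >= max_depth:
            continue                  # pages at the depth limit are not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


def extract_main_text(page, block_width=3, threshold=80):
    """Simplified line-block extraction: return the densest run of text lines."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", page)
    page = re.sub(r"(?s)<!--.*?-->", "", page)
    text = re.sub(r"<[^>]+>", "", page)          # drop tags, keep line breaks
    lines = [re.sub(r"\s+", " ", ln).strip() for ln in text.split("\n")]
    # blocks[i] = number of characters in lines i .. i + block_width - 1
    blocks = [sum(len(ln) for ln in lines[i:i + block_width])
              for i in range(len(lines))]
    best = (0, 0, 0)                  # (characters, first line, end line)
    start = None
    for i, size in enumerate(blocks + [0]):      # trailing 0 closes the last run
        if start is None and size > threshold:
            start = i                 # a dense block opens a candidate region
        elif start is not None and size == 0:
            chars = sum(len(ln) for ln in lines[start:i])
            if chars > best[0]:
                best = (chars, start, i)
            start = None              # an empty block closes the region
    _, first, end = best
    return "\n".join(ln for ln in lines[first:end] if ln)


def keep_english(text):
    """Assumed filter: keep ASCII characters (code points below 128) only."""
    ascii_only = "".join(ch if ord(ch) < 128 else " " for ch in text)
    return re.sub(r"[ \t]+", " ", ascii_only).strip()
```

In this sketch the link source is passed in as a function, so the queue logic can be tried on an in-memory link graph before it is wired to real HTTP requests. The threshold and block width are arbitrary starting values that would need tuning on real pages.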

Files (2 commits):
NetSpider.py
README.md

xiaoyu698/python_extractor
