An asynchronous crawler proxy pool, built on Python asyncio and designed to take full advantage of Python's async performance.
The project uses sanic, an asynchronous web framework (a Flask version is also provided), so a Python 3.5+ environment is recommended. Note that sanic does not support Windows; Windows users (such as myself 😄) can consider Ubuntu on Windows.
The project uses Redis as its database. Redis is an open-source (BSD licensed), in-memory data structure store that can be used as a database, cache, and message broker. Please make sure Redis is properly installed in your environment; see the official guide for installation instructions.
$ git clone https://github.com/chenjiandongx/async-proxy-pool.git
Install the dependencies using requirements.txt
$ pip install -r requirements.txt
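Before starting the client, it can help to confirm that Redis is actually reachable. Below is a minimal check using the redis-py package, assuming the default host and port from config.py:

import redis

# Defaults from config.py; adjust if you changed REDIS_HOST / REDIS_PORT / REDIS_PASSWORD
client = redis.StrictRedis(host="localhost", port=6379, password=None)
print(client.ping())  # True means Redis is up and reachable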
The configuration file config.py holds every configuration item the project uses. You can change the values below to suit your needs, or simply keep the defaults.
#!/usr/bin/env python
# coding=utf-8

# Request timeout (seconds)
REQUEST_TIMEOUT = 15
# Request delay (seconds)
REQUEST_DELAY = 0

# Redis host
REDIS_HOST = "localhost"
# Redis port
REDIS_PORT = 6379
# Redis password
REDIS_PASSWORD = None
# Redis set key
REDIS_KEY = "proxies:ranking"
# Maximum number of connections in the Redis connection pool
REDIS_MAX_CONNECTION = 20

# Maximum Redis score
MAX_SCORE = 10
# Minimum Redis score
MIN_SCORE = 0
# Initial Redis score
INIT_SCORE = 9

# Web server host
SERVER_HOST = "localhost"
# Web server port
SERVER_PORT = 3289
# Whether to enable access logging
SERVER_ACCESS_LOG = True

# Number of proxies validated per batch
VALIDATOR_BATCH_COUNT = 256
# Validator test website; change it to the site you actually want to crawl, e.g. Sina, Zhihu, etc.
VALIDATOR_BASE_URL = "https://httpbin.org/"
# Validator run cycle (minutes)
VALIDATOR_RUN_CYCLE = 15
# Crawler run cycle (minutes)
CRAWLER_RUN_CYCLE = 30

# Request headers
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
}
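If you prefer not to edit config.py directly, the same os.environ.get pattern used by test_proxy.py further down can be applied to read values from the environment. This is only an illustrative sketch, not the project's actual config code:

import os

# Fall back to the defaults shown above when the variable is unset
VALIDATOR_BASE_URL = os.environ.get("VALIDATOR_BASE_URL") or "https://httpbin.org/"
SERVER_PORT = int(os.environ.get("SERVER_PORT") or 3289)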
Run the client to start the collector and the validator.
# The validator target site can be set via an environment variable: set/export VALIDATOR_BASE_URL="https://example.com"
$ python client.py
2018-05-16 23:41:39,234 - Crawler working...
2018-05-16 23:41:40,509 - Crawler √ http://202.83.123.33:3128
2018-05-16 23:41:40,509 - Crawler √ http://123.53.118.122:61234
2018-05-16 23:41:40,510 - Crawler √ http://212.237.63.84:8888
2018-05-16 23:41:40,510 - Crawler √ http://36.73.102.245:8080
2018-05-16 23:41:40,511 - Crawler √ http://78.137.90.253:8080
2018-05-16 23:41:40,512 - Crawler √ http://5.45.70.39:1490
2018-05-16 23:41:40,512 - Crawler √ http://117.102.97.162:8080
2018-05-16 23:41:40,513 - Crawler √ http://109.185.149.65:8080
2018-05-16 23:41:40,513 - Crawler √ http://189.39.143.172:20183
2018-05-16 23:41:40,514 - Crawler √ http://186.225.112.62:20183
2018-05-16 23:41:40,514 - Crawler √ http://189.126.66.154:20183
...
2018-05-16 23:41:55,866 - Validator working...
2018-05-16 23:41:56,951 - Validator × https://114.113.126.82:80
2018-05-16 23:41:56,953 - Validator × https://114.199.125.242:80
2018-05-16 23:41:56,955 - Validator × https://114.228.75.17:6666
2018-05-16 23:41:56,957 - Validator × https://115.227.3.86:9000
2018-05-16 23:41:56,960 - Validator × https://115.229.88.191:9000
2018-05-16 23:41:56,964 - Validator × https://115.229.89.100:9000
2018-05-16 23:41:56,966 - Validator × https://103.18.180.194:8080
2018-05-16 23:41:56,967 - Validator × https://115.229.90.207:9000
2018-05-16 23:41:56,968 - Validator × https://103.216.144.17:8080
2018-05-16 23:41:56,969 - Validator × https://117.65.43.29:31588
2018-05-16 23:41:56,971 - Validator × https://103.248.232.135:8080
2018-05-16 23:41:56,972 - Validator × https://117.94.69.166:61234
2018-05-16 23:41:56,975 - Validator × https://103.26.56.109:8080
...
Run the server to start the web service.
$ python server_sanic.py
[2018-05-16 23:36:22 +0800] [108] [INFO] Goin' Fast @ http://localhost:3289
[2018-05-16 23:36:22 +0800] [108] [INFO] Starting worker [108]
$ python server_flask.py
* Serving Flask app "async_proxy_pool.webapi_flask" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: on
* Restarting with stat
* Debugger is active!
* Debugger PIN: 322-954-449
* Running on http://localhost:3289/ (Press CTRL+C to quit)
The project's main modules are the crawler module, the storage module, the validator module, the scheduler module, and the API module.
Crawler module: crawls proxy websites and stores the proxies it finds in the database; every proxy starts with an initial score of INIT_SCORE.
Storage module: wraps the Redis operations the project needs and provides a Redis connection pool.
Validator module: checks whether each proxy IP still works. A working proxy has its score increased by 1 up to MAX_SCORE; a failing proxy has its score decreased by 1, and once the score reaches 0 the proxy is removed from the database (a sketch of this logic follows this list).
Scheduler module: schedules the runs of the crawler and the validator.
API module: exposes the web API, implemented with sanic.
/
Welcome page
$ http http://localhost:3289/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 42
Content-Type: application/json
Keep-Alive: 5
{
"Welcome": "This is a proxy pool system."
}
/pop
Randomly returns one proxy, trying up to three times (a sketch of this selection logic follows the example response below):
- First, try to return a proxy whose score is MAX_SCORE, i.e. the most recently validated proxies.
- Next, try to return a proxy with a random score between MAX_SCORE - 3 and MAX_SCORE.
- Finally, try to return a proxy with a score between 0 and MAX_SCORE.
$ http http://localhost:3289/pop
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 38
Content-Type: application/json
Keep-Alive: 5
{
"http": "http://46.48.105.235:8080"
}
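A rough sketch of the three-step selection described above, again against a Redis sorted set via redis-py (names are illustrative, not the project's actual code):

import random

import redis

from config import REDIS_KEY, MAX_SCORE  # assuming config.py is importable

client = redis.StrictRedis(host="localhost", port=6379)

def pop_proxy():
    # 1. Prefer proxies holding the maximum score
    proxies = client.zrangebyscore(REDIS_KEY, MAX_SCORE, MAX_SCORE)
    if not proxies:
        # 2. Fall back to scores between MAX_SCORE - 3 and MAX_SCORE
        proxies = client.zrangebyscore(REDIS_KEY, MAX_SCORE - 3, MAX_SCORE)
    if not proxies:
        # 3. Finally accept any score between 0 and MAX_SCORE
        proxies = client.zrangebyscore(REDIS_KEY, 0, MAX_SCORE)
    return random.choice(proxies).decode() if proxies else None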
/get/<count:int>
返回指定数é‡çš„代ç†ï¼Œæƒå€¼ä»Žå¤§åˆ°å°æŽ’åºã€‚
$ http http://localhost:3289/get/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 393
Content-Type: application/json
Keep-Alive: 5
[
{
"http": "http://94.177.214.215:3128"
},
{
"http": "http://94.139.242.70:53281"
},
{
"http": "http://94.130.92.40:3128"
},
{
"http": "http://82.78.28.139:8080"
},
{
"http": "http://82.222.153.227:9090"
},
{
"http": "http://80.211.228.238:8888"
},
{
"http": "http://80.211.180.224:3128"
},
{
"http": "http://79.101.98.2:53281"
},
{
"http": "http://66.96.233.182:8080"
},
{
"http": "http://61.228.45.165:8080"
}
]
/count
Returns the total number of proxies in the pool.
$ http http://localhost:3289/count
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5
{
"count": "698"
}
/count/<score:int>
返回指定æƒå€¼ä»£ç†æ€»æ•°
$ http http://localhost:3289/count/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5
{
"count": "143"
}
/clear/<score:int>
Deletes all proxies whose score is less than or equal to score.
$ http http://localhost:3289/clear/0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json
Keep-Alive: 5
{
"Clear": "Successful"
}
Add your own crawl method in crawler.py.
class Crawler:

    @staticmethod
    def run():
        ...

    # Add your own crawl method
    @staticmethod
    @collect_funcs  # the decorator registers the function so it gets run
    def crawl_xxx():
        # crawling logic
        ...
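For illustration only, a concrete crawl method might look like the sketch below. collect_funcs is the project decorator from the snippet above and is assumed to be in scope; the listing URL, the table layout, and the choice of yielding proxies are all hypothetical and must be adapted to a real proxy site and to how crawler.py actually consumes results:

import requests
from bs4 import BeautifulSoup  # any HTML parser works; bs4 is just an example

class Crawler:

    @staticmethod
    @collect_funcs  # register the function so the scheduler runs it
    def crawl_example_site():
        # Hypothetical listing page with an ip/port table; adapt to a real site
        html = requests.get("http://example-proxy-list.com/", timeout=15).text
        soup = BeautifulSoup(html, "html.parser")
        for row in soup.select("table tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) >= 2:
                yield "http://{}:{}".format(cells[0], cells[1])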
This project uses Sanic, but developers are free to choose any other web framework according to their needs. The web module is completely independent, so replacing the framework will not affect the normal operation of the project.
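As a rough illustration of how small that web layer is, a bare-bones Flask replacement for the two simplest endpoints could look like this; pop_proxy here is a stub standing in for the project's storage interface, not the project's actual code:

from flask import Flask, jsonify

app = Flask(__name__)

def pop_proxy():
    # Stub: wire this up to the storage module's pop logic
    return "http://127.0.0.1:8080"

@app.route("/")
def index():
    return jsonify({"Welcome": "This is a proxy pool system."})

@app.route("/pop")
def pop():
    return jsonify({"http": pop_proxy()})

if __name__ == "__main__":
    app.run(host="localhost", port=3289)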
Use wrk to stress-test the server: a 30-second benchmark with 12 threads and 400 concurrent HTTP connections.
Test http://127.0.0.1:3289/pop
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/pop
Running 30s test @ http://127.0.0.1:3289/pop
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 350.37ms 118.99ms 660.41ms 60.94%
Req/Sec 98.18 35.94 277.00 79.43%
33694 requests in 30.10s, 4.77MB read
Socket errors: connect 0, read 340, write 0, timeout 0
Requests/sec: 1119.44
Transfer/sec: 162.23KB
Test http://127.0.0.1:3289/get/10
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/get/10
Running 30s test @ http://127.0.0.1:3289/get/10
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 254.90ms 95.43ms 615.14ms 63.51%
Req/Sec 144.84 61.52 320.00 66.58%
46538 requests in 30.10s, 22.37MB read
Socket errors: connect 0, read 28, write 0, timeout 0
Requests/sec: 1546.20
Transfer/sec: 761.02KB
The performance is decent. Next, test http://127.0.0.1:3289/, which involves no Redis operations.
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 127.86ms 41.71ms 260.69ms 55.22%
Req/Sec 258.56 92.25 520.00 68.90%
92766 requests in 30.10s, 13.45MB read
Requests/sec: 3081.87
Transfer/sec: 457.47KB
⭐️ Requests/sec: 3081.87
With sanic access logging turned off, test http://127.0.0.1:3289/
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 34.63ms 12.66ms 96.28ms 58.07%
Req/Sec 0.96k 137.29 2.21k 73.29%
342764 requests in 30.10s, 49.69MB read
Requests/sec: 11387.89
Transfer/sec: 1.65MB
⭐️ Requests/sec: 11387.89
test_proxy.py is used to test the performance of the actual proxies.
$ cd test
$ python test_proxy.py
# Configurable environment variables
TEST_COUNT = os.environ.get("TEST_COUNT") or 1000
TEST_WEBSITE = os.environ.get("TEST_WEBSITE") or "https://httpbin.org/"
TEST_PROXIES = os.environ.get("TEST_PROXIES") or "http://localhost:3289/get/20"
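Conceptually the test just fetches a batch of proxies from the pool and counts how many requests made through them succeed. A rough, self-contained sketch of that idea (not the project's actual test_proxy.py):

import os
import random

import requests

TEST_COUNT = int(os.environ.get("TEST_COUNT") or 1000)
TEST_WEBSITE = os.environ.get("TEST_WEBSITE") or "https://httpbin.org/"
TEST_PROXIES = os.environ.get("TEST_PROXIES") or "http://localhost:3289/get/20"

proxies = requests.get(TEST_PROXIES).json()  # e.g. [{"http": "http://1.2.3.4:8080"}, ...]
success = 0
for _ in range(TEST_COUNT):
    try:
        requests.get(TEST_WEBSITE, proxies=random.choice(proxies), timeout=15)
        success += 1
    except requests.RequestException:
        pass

print("Success rate:", success / TEST_COUNT)

Sample results: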
Proxy source: http://localhost:3289/get/20
Test website: https://httpbin.org/
Attempts:     1000
Successes:    1000
Failures:     0
Success rate: 1.0

Proxy source: http://localhost:3289/get/20
Test website: https://taobao.com/
Attempts:     1000
Successes:    984
Failures:     16
Success rate: 0.984

Proxy source: http://localhost:3289/get/20
Test website: https://baidu.com
Attempts:     1000
Successes:    975
Failures:     25
Success rate: 0.975

Proxy source: http://localhost:3289/get/20
Test website: https://zhihu.com
Attempts:     1000
Successes:    1000
Failures:     0
Success rate: 1.0
As you can see, the proxies actually perform very well, with extremely high success rates. 😉
import random

import requests

# Make sure the sanic server is running first

# Fetch a batch of proxies and pick one at random
try:
    proxies = requests.get("http://localhost:3289/get/20").json()
    req = requests.get("https://example.com", proxies=random.choice(proxies))
except:
    raise

# Or pop a single proxy
try:
    proxy = requests.get("http://localhost:3289/pop").json()
    req = requests.get("https://example.com", proxies=proxy)
except:
    raise
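Because individual proxies come and go, a small retry wrapper is often useful in practice. A sketch, assuming the same /get/20 endpoint; get_with_retry is a hypothetical helper, not part of the project:

import random

import requests

def get_with_retry(url, tries=3):
    # Fetch a fresh batch of proxies and retry with a different one on failure
    proxies = requests.get("http://localhost:3289/get/20").json()
    for _ in range(tries):
        try:
            return requests.get(url, proxies=random.choice(proxies), timeout=15)
        except requests.RequestException:
            continue
    return None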
The whole project is built on top of aiohttp, the asynchronous networking library. In aiohttp's documentation, proxies are introduced like this:
Key quote: aiohttp supports HTTP/HTTPS proxies
However, it does not actually support HTTPS proxies at all; in its source code the comment reads:
Key quote: Only http proxies are supported
My feelings about this were, let's say, mixed. 😲 Still, having only HTTP proxies works well enough and does not make much practical difference; see the test data above.
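For reference, this is roughly how a proxy is passed to an aiohttp request; only http:// proxy URLs are accepted, which is exactly the limitation quoted above (the proxy address here is illustrative):

import asyncio

import aiohttp

async def fetch(url, proxy):
    async with aiohttp.ClientSession() as session:
        # aiohttp only honours http:// proxies, per the note above
        async with session.get(url, proxy=proxy) as resp:
            return await resp.text()

# Example: route a request through one of the pool's proxies
asyncio.get_event_loop().run_until_complete(
    fetch("https://httpbin.org/ip", "http://127.0.0.1:8080")
)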
✨🍰✨
MIT ©chenjiandongx