reversi-alpha-zero

About

Reversi reinforcement learning by AlphaGo Zero methods.

Environment

  • Python 3.6.3
  • tensorflow-gpu: 1.3.0
    • tensorflow==1.3.0 (CPU-only) also works, but training is very slow. For play_gui, the CPU version is fast enough.
  • Keras: 2.0.8

Modules

Reinforcement Learning

This AlphaGo Zero implementation consists of three workers: self, opt, and eval (a dispatch sketch follows the list).

  • self is Self-Play: it generates training data through self-play using the BestModel.
  • opt is Trainer: it trains the model and produces next-generation models.
  • eval is Evaluator: it evaluates whether the latest next-generation model is better than the BestModel and, if so, replaces the BestModel.
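
A minimal sketch of how run.py might dispatch these workers. The helper functions here are illustrative assumptions, not the repository's actual internals; only the worker names self, opt, and eval and the --type flag come from this README.

```python
# Hypothetical dispatcher sketch for run.py (helper names are illustrative).
import argparse

def start_self_play(config_type): ...   # generate training data with the BestModel
def start_optimize(config_type): ...    # train and save next-generation models
def start_evaluate(config_type): ...    # promote a next-generation model if it wins

WORKERS = {"self": start_self_play, "opt": start_optimize, "eval": start_evaluate}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("worker", choices=sorted(WORKERS))
    parser.add_argument("--type", default="normal", help="config preset, e.g. 'mini'")
    args = parser.parse_args()
    WORKERS[args.worker](args.type)

if __name__ == "__main__":
    main()
```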

Evaluation

For evaluation, you can play reversi against the BestModel.

  • play_gui is Play Game: play against the BestModel using wxPython.

Data

  • data/model/model_best_*: BestModel.
  • data/model/next_generation/*: next-generation models.
  • data/play_data/play_*.json: generated training data.
  • logs/main.log: log file.

If you want to train the model from scratch, delete the above directories and files.
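
If you prefer a script over deleting by hand, a minimal sketch (run from the repository root; the paths are taken from the list above):

```python
# Minimal sketch: reset training state by removing generated data (run from repo root).
import shutil
from pathlib import Path

for path in ["data/model", "data/play_data", "logs/main.log"]:
    p = Path(path)
    if p.is_dir():
        shutil.rmtree(p)   # remove models and self-play data
    elif p.is_file():
        p.unlink()         # remove the log file
```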

How to use

Setup

Install libraries

pip install -r requirements.txt

Install libraries with Anaconda

cp requirements.txt conda-requirements.txt
  • Comment out the lines for the jedi, Keras, parso, python-dotenv, tensorflow-tensorboard, and wxPython libraries
  • Replace '-' with '_' in the names of the ipython-genutils, jupyter-*, and prompt-toolkit libraries
conda env create -f environment.yml
source activate reversi-a0
conda install --yes --file conda-requirements.txt

If you want to use the GPU, install the GPU build of TensorFlow:

pip install tensorflow-gpu

Set environment variables

Create a .env file containing the following line:

KERAS_BACKEND=tensorflow
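
Since python-dotenv appears in the requirements, the entry point presumably loads this file at startup, roughly like the following sketch (the repository's exact loading code may differ):

```python
# Sketch: load .env so Keras sees KERAS_BACKEND before keras is imported
# (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ
assert os.environ.get("KERAS_BACKEND") == "tensorflow"
```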

Download Trained BestModel (if needed)

Download the example trained BestModel:

sh ./download_best_model.sh

Basic Usage

To train a model, run the Self-Play, Trainer, and Evaluator workers.

Self-Play

python src/reversi_zero/run.py self

When executed, Self-Play starts using the BestModel. If no BestModel exists, a new model with random weights is created and becomes the BestModel.
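
A sketch of this load-or-create behavior; the helper name and path argument are hypothetical:

```python
# Hypothetical sketch of BestModel loading in the self worker.
import os

def load_best_model(model, best_model_path):
    """Load BestModel weights; if none exist, fresh random weights become the BestModel."""
    if os.path.exists(best_model_path):
        model.load_weights(best_model_path)   # load the existing Keras weights
    else:
        model.save_weights(best_model_path)   # save the randomly initialized model
    return model
```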

options

  • --new: create a new BestModel
  • --type mini: use the mini config for testing (see src/reversi_zero/configs/mini.py)

Trainer

python src/reversi_zero/run.py opt

When executed, training starts. The base model is loaded from the latest saved next-generation model; if none exists, the BestModel is used. The trained model is saved as a next-generation model every 2000 steps (mini-batches).
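
A sketch of that save cadence; the function names are illustrative, and only the 2000-step interval comes from this README:

```python
# Hypothetical sketch of the trainer's checkpoint cadence.
SAVE_EVERY_STEPS = 2000  # from the README: save every 2000 mini-batches

def train(model, batches, save_next_generation):
    for step, (x, y_policy, y_value) in enumerate(batches, start=1):
        model.train_on_batch(x, [y_policy, y_value])   # one mini-batch = one step
        if step % SAVE_EVERY_STEPS == 0:
            save_next_generation(model, step)          # write a next-generation model
```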

options

  • --type mini: use the mini config for testing (see src/reversi_zero/configs/mini.py)
  • --total-step: specify the total number of steps (mini-batches); the total step count affects the learning-rate schedule.

Evaluator

python src/reversi_zero/run.py eval

When executed, evaluation starts. The BestModel and the latest next-generation model play about 200 games against each other. If the next-generation model wins, it becomes the BestModel.
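
A sketch of the evaluation logic. The 55% promotion threshold is an assumption borrowed from AlphaGo Zero (the winning percentages in the table below are all above 55%, which suggests it, but the repository's exact threshold is not stated here):

```python
# Hypothetical sketch of the eval worker; the threshold is assumed, not confirmed.
GAME_COUNT = 200       # from the README: about 200 games
WIN_THRESHOLD = 0.55   # assumed AlphaGo Zero-style promotion threshold

def should_promote(play_game, next_gen, best):
    wins = sum(play_game(next_gen, best) for _ in range(GAME_COUNT))  # 1 if next_gen wins
    return wins / GAME_COUNT >= WIN_THRESHOLD  # True: next_gen replaces the BestModel
```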

options

  • --type mini: use the mini config for testing (see src/reversi_zero/configs/mini.py)

Play Game

python src/reversi_zero/run.py play_gui

Note: Mac pyenv environment

play_gui uses wxPython, which cannot run if your Python was built without Framework support. Try the following pyenv install option:

env PYTHON_CONFIGURE_OPTS="--enable-framework" pyenv install 3.6.3

For Anaconda users:

conda install python.app
pythonw src/reversi_zero/run.py play_gui

When executed, an ordinary reversi board is displayed and you can play against the BestModel. After the BestModel moves, numbers are displayed on the board.

  • The top-left number (1) is the visit count N(s,a) from the last search.
  • The bottom-left number (2) is the Q value Q(s,a) from the AI's perspective for the last state and move, multiplied by 100 (see the sketch after this list).
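
In AlphaGo Zero-style MCTS, Q(s,a) is the mean value of the simulations through (s,a), i.e. W(s,a)/N(s,a). A sketch of the displayed values under that standard bookkeeping:

```python
# Sketch: the per-move statistics shown on the board (standard AlphaGo Zero bookkeeping).
def display_stats(n, w):
    """n: visit count N(s,a); w: total action value W(s,a) accumulated by search."""
    q = w / n if n > 0 else 0.0   # Q(s,a) = W(s,a) / N(s,a)
    return n, round(q * 100)      # the board shows N and Q scaled by 100
```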

Tips and Memo

GPU Memory

My GeForce GTX 1080 has about 8GB of memory, so it sometimes runs out. Usually a memory shortage only causes warnings, not errors. If an error does occur, try changing per_process_gpu_memory_fraction in src/worker/{evaluate.py,optimize.py,self_play.py}:

tf_util.set_session_config(per_process_gpu_memory_fraction=0.2)
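
In TensorFlow 1.x this presumably maps to a session config along the following lines; a sketch under that assumption, not necessarily the repository's actual tf_util code:

```python
# Sketch of a TF 1.x session config that caps GPU memory (assumed tf_util internals).
import tensorflow as tf
from keras import backend as K

def set_session_config(per_process_gpu_memory_fraction=0.2, allow_growth=True):
    config = tf.ConfigProto(
        gpu_options=tf.GPUOptions(
            per_process_gpu_memory_fraction=per_process_gpu_memory_fraction,
            allow_growth=allow_growth,  # allocate memory lazily instead of all at once
        )
    )
    K.set_session(tf.Session(config=config))  # make Keras use the configured session
```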

A smaller batch_size reduces the memory usage of opt; try changing TrainerConfig#batch_size in NormalConfig.

Training Speed

  • CPU: 8-core i7-7700K @ 4.20GHz
  • GPU: GeForce GTX 1080
  • 1 game in Self-Play: about 47 sec.
  • 1 game in Evaluation: about 50 sec.
  • 1 step(mini-batch, batch size=512) in Training: about 2.3 sec.

Model Performance

The following table records the best models. To gauge model performance, I sometimes play the best model against an iOS app (https://itunes.apple.com/ca/app/id574915961).

It takes about 2-3 hours to evaluate one model in my environment, so dividing the Time Spent by 3 gives the approximate number of evaluation rounds.

| best model generation | date | winning percentage vs BestModel | Time Spent (hours) | note |
|---|---|---|---|---|
| 1 | - | - | - | |
| 2 | 2017/10/24 | 94.1% | - | |
| 3 | 2017/10/24 | 63.4% | 13 | |
| 4 | 2017/10/25 | 62.0% | 3 | |
| 5 | 2017/10/25 | 56.7% | 8 | |
| 6 | 2017/10/25 | 67.3% | 7 | |
| 7 | 2017/10/25 | 59.0% | 3 | |
| 8 | 2017/10/26 | 59.7% | 6 | |
| 9 | 2017/10/26 | 59.4% | 3 | |
| 10 | 2017/10/26 | 55.7% | 5 | |
| 11 | 2017/10/26 | 57.9% | 9 | |
| 12 | 2017/10/27 | 55.6% | 5 | |
| 13 | 2017/10/27 | 56.5% | 7 | |
| 14 | 2017/10/28 | 58.4% | 20 | |
| 15 | 2017/10/28 | 62.4% | 3 | |
| 16 | 2017/10/28 | 56.0% | 11 | |
| 17 | 2017/10/29 | 64.9% | 17 | |
| 18 | 2017/10/30 | 55.2% | 19 | |
| 19 | 2017/10/31 | 57.2% | 33 | |
| 20 | 2017/11/01 | 55.7% | 12 | |
| 21 | 2017/11/01 | 59.7% | 20 | |
| 22 | 2017/11/02 | 57.8% | 19 | |
| 23 | 2017/11/03 | 55.8% | 15 | |
| 24 | 2017/11/03 | 64.2% | 12 | |
| 25 | 2017/11/04 | 55.4% | 21 | |
| 26 | 2017/11/04 | 56.7% | 6 | |
| 27 | 2017/11/05 | 57.5% | 11 | |
| 28 | 2017/11/06 | 58.5% | 15 | |
| 29 | 2017/11/06 | 55.3% | 5 | |
| 30 | 2017/11/06 | 55.0% | 8 | |
| 31 | 2017/11/06 | 56.9% | 5 | |
| 32 | 2017/11/07 | 56.1% | 9 | |
| 33 | 2017/11/08 | 55.7% | 22 | |
| 34 | 2017/11/08 | 56.1% | 3 | |
| 35 | 2017/11/08 | 59.0% | 3 | |
| 36 | 2017/11/08 | 59.4% | 3 | |
| 37 | 2017/11/08 | 56.2% | 9 | |
| 38 | 2017/11/10 | 55.4% | 52 | Won the app LV9, LV10 |
| 39 | 2017/11/12 | 57.2% | 29 | |
| 40 | 2017/11/12 | 55.1% | 12 | Won the app LV11 |
| 41 | 2017/11/13 | 55.7% | 14 | Won the app LV12, 13, 14, 15, 16, 17. I can't win anymore. |
| 42 | 2017/11/15 | 57.8% | 18 | Won the app LV18, 19 |
| 43 | 2017/11/15 | 55.8% | 16 | |
| 44 | 2017/11/16 | 57.5% | 8 | |
| 45 | 2017/11/16 | 56.2% | 3 | Won the app LV20 |
| 46 | 2017/11/18 | 55.6% | 49 | |
| 47 | 2017/11/19 | 55.9% | 34 | |
| 48 | 2017/11/19 | 59.4% | 9 | |
| 49 | 2017/11/20 | 55.9% | 6 | |
| 50 | 2017/11/22 | 56.0% | 44 | |
| 51 | 2017/11/26 | 55.8% | 112 | 11/25 morning, changed c_puct from 3 to 1.5 |
| 52 | 2017/11/26 | 59.7% | 6 | |
| 53 | 2017/11/28 | 56.2% | 33 | Won the app LV21 |
| 54 | 2017/11/29 | 59.0% | 24 | |
| 55 | 2017/12/01 | 56.6% | 58 | |
| 56 | 2017/12/03 | 58.1% | 49 | self-play: always save policy of tau=1 |
| 57 | 2017/12/04 | 55.1% | 24 | |
| 58 | 2017/12/05 | 55.9% | 35 | |
| 59 | 2017/12/06 | 55.4% | 6 | |
| - | 2017/12/06 | - | - | implement mokemokechicken#13 |
| 60 | 2017/12/07 | 61.7% | 25 | |
| 61 | 2017/12/07 | 58.1% | 3 | Won the app LV21, 22 |
| 62 | 2017/12/07 | 57.8% | 11 | |
| - | 2017/12/07 | - | - | fix bug about virtual loss W |
| 63 | 2017/12/08 | 57.5% | 9 | |
| 64 | 2017/12/08 | 56.0% | 9 | |
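
The c_puct notes above refer to the exploration constant in the PUCT selection rule of AlphaGo Zero-style MCTS, which trades off the Q value against the policy prior. A sketch of the standard formula (not the repository's exact code):

```python
# Sketch: standard AlphaGo Zero PUCT score; c_puct scales exploration (changed 3 -> 1.5 above).
import math

def puct_score(q, p, n_parent, n_child, c_puct=1.5):
    """q: Q(s,a); p: prior P(s,a); n_parent: total visits at s; n_child: N(s,a)."""
    u = c_puct * p * math.sqrt(n_parent) / (1 + n_child)
    return q + u  # the search descends into the action maximizing Q + U
```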
