
Re: [Bug-gnubg] GNU Backgammon overview/background

From: Øystein Johansen
Subject: Re: [Bug-gnubg] GNU Backgammon overview/background
Date: Thu, 13 Nov 2003 03:50:50 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.2.1) Gecko/20021130

Thomas Hauk wrote:

About the training:
Temporal difference learning was only used in the first stages of the program. With this kind of training the program reached an intermediate playing strength. Later the nets were trained with supervised training on positions from a position database. The desired outputs for the training positions in the database are based on rollouts. The latest contact and crashed nets are trained with a slightly different technique: in effect, the net is trained to play like the 2-ply evaluation.

So... let me get this straight.

TD learning was employed to train a NN, which then produced a position database along with what the NN thought to be the best move for that position. Then a second NN was trained with only this resulting position DB? In other words, the second NN is trained with expertly-labelled data, where the expert is an older NN trained only with self-play?

Eehhh, no.....

(I must be terribly bad at explaining things, since it doesn't seem like you got this correctly. I'm really sorry... Here's my new attempt.)

Long story.......

TD learning was the first thing. I'm not sure what version number this net had, but let's call this contact net 0.0.

Contact net 0.0 wasn't the best player in the world, and things weren't helped by the fact that it played together with a bad race net. Let's call the first race net 0.0. Together I estimate they would have had a ~1600 FIBS rating.

One of the problems was that the nets didn't match each other either. The program refused to break contact when it was up in the race. And sometimes it also refused to hold contact when it was down in the race. The race net also refused to bear in chequers in a race in some games; this was because the race net gave a better GWC (game-winning chance) than the bearoff database. It was almost funny to watch these games.

Joseph started his experiments. One of the first things he tried was to split the contact neural net into a contact net and a PBG(?) net for backgames and priming games. This split didn't work very well. Joseph had a hard time training these nets (still using TD?), and they were terribly out of sync with each other, causing the same problem as described in the paragraph above, except that the PBG net and contact net were more out of sync than the race and contact net, so the split caused more trouble than it gained in strength. (Correct me if I'm wrong, Joseph.)

Joseph also found that some neural net inputs didn't contribute to the learning, so he removed some inputs, and adjusted some others.

So, Joseph rejected the PBG net, and started supervised training instead. The theory was simple enough: if a position has a cubeless equity of X ppg at 0-ply, and the same position has a cubeless equity of Y ppg at 1-ply, then if abs(X-Y) > a threshold value, there is reason to believe that the 0.0 contact net does not understand this position (and position type) well, and it needs more training. (The threshold value was set to 0.1, I think.) Net 0.0 was set to play against itself, and for each position it reached, it calculated the 0-ply equity and the 1-ply equity (both cubeless of course; cube algorithms weren't even implemented yet). If the criterion abs(X-Y) > threshold was met, the position was stored in a database. Just the position, nothing else. After some number of games, the database had lots of positions where the net needed more training, and the self-play was stopped. Joseph may know how many positions there are in this position database.
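The selection criterion above can be sketched in a few lines of Python. This is only an illustration, not gnubg's actual code (which is in C); the evaluator callbacks are hypothetical stand-ins for the net's 0-ply and 1-ply cubeless evaluations.

```python
# Sketch of the position-harvesting criterion described above.
# evaluate_0ply / evaluate_1ply are hypothetical stand-ins for the
# net's cubeless equity evaluations at the two search depths.

THRESHOLD = 0.1  # the equity-difference threshold mentioned in the text

def harvest_positions(positions, evaluate_0ply, evaluate_1ply, threshold=THRESHOLD):
    """Collect positions where the shallow and deeper evaluations disagree,
    i.e. positions the net presumably does not understand well yet."""
    database = []
    for pos in positions:
        x = evaluate_0ply(pos)   # cubeless equity at 0-ply
        y = evaluate_1ply(pos)   # cubeless equity at 1-ply
        if abs(x - y) > threshold:
            database.append(pos)  # store just the position, nothing else
    return database
```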

Now, for each position in the database, a cubeless rollout was performed to estimate the 'real' ('real' = at least better) probabilities and equity of the position. The rollouts were done with the same 0.0 net. The rollout results were added to the database of positions. In this way Joseph developed a database of positions with rollout results for each position.
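At its core a cubeless rollout is just a Monte Carlo average over many self-played games; here is a minimal sketch under that assumption, with `play_game` as a hypothetical stand-in for letting the 0.0 net play a position to completion and return the cubeless outcome in ppg.

```python
import random

def cubeless_rollout(position, play_game, n_trials=1296, seed=0):
    """Estimate a position's cubeless equity by averaging the outcomes of
    many games self-played from that position.  play_game(position, rng)
    is a hypothetical callback that plays one game to the end and returns
    the outcome (e.g. +1/-1, or more for gammons and backgammons)."""
    rng = random.Random(seed)  # seeded for reproducible results
    total = 0.0
    for _ in range(n_trials):
        total += play_game(position, rng)
    return total / n_trials
```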

So, this database was used for training the next version of the neural net: normal backpropagation for each position in the database, with the rollout result as the desired output. Running through each position in the position database and updating the weights with backpropagation based on the rollout results is called an epoch. Several epochs are run over the database, starting with high alpha values (alpha is the learning rate, a factor for how much the weights should be adjusted) and decreasing as the weight values converge. The order of the positions was also randomized for each epoch. The resulting neural net was called 0.10, IIRC.
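The epoch loop with shuffling and a decaying alpha can be sketched like this. For brevity the gradient computation is a hypothetical callback rather than a real multi-layer backpropagation; the loop structure (shuffle each epoch, update per position, decay alpha) is the part taken from the text.

```python
import random

def train_epochs(database, weights, grad, n_epochs=10, alpha0=0.3, decay=0.8, seed=0):
    """Supervised epoch training as described: shuffle the database each
    epoch, nudge the weights toward the rollout targets, and decrease the
    learning rate alpha as the weights converge.
    grad(weights, position, target) is a hypothetical stand-in for the
    gradient that backpropagation would compute for one training example."""
    rng = random.Random(seed)
    alpha = alpha0
    data = list(database)
    for _ in range(n_epochs):
        rng.shuffle(data)                    # randomize the order each epoch
        for position, target in data:
            g = grad(weights, position, target)
            weights = [w - alpha * gi for w, gi in zip(weights, g)]
        alpha *= decay                       # start high, decrease over time
    return weights
```

As a toy check, fitting a one-weight linear model with a squared-error gradient converges to the obvious solution, which shows the loop is doing ordinary stochastic gradient descent.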

This net improved the strength of the program a lot, but the race net still sucked. To work around this, Joseph made an OSR-based race evaluator, written in C++ (you can still see the "race-c++" branch in the CVS; this branch was never merged into the main branch, but it was added to the fibs2html tool), and used this with mgnutest. Joseph also developed a small neural net for pruning moves before deeper evaluations. This net had only 5 hidden nodes, and was therefore very fast. The fast 5-hidden-node pruning net, combined with 2-ply search and the OSR race evaluator, reached a FIBS rating of about 1850-1900.

Still, something had to be done about the race net. I remember we discussed how to solve this problem, and several schemes were discussed. I thought about a neural net with the chequer positions for one player as the input and the distribution of the number of rolls to get off as the output, and someone else had other ideas, and while we discussed this problem on the list, Joseph suddenly released a new race network! I remember I was partly shocked! Wow! With this new race net he used 14 input nodes for borne-off checkers, one input for each checker off. I remember I asked why, and he replied something like: "I don't know, but it works!". This network that he had bred was trained on OSR evaluations, and it was released together with the contact net 0.10; together they were called 0.10. (I'm taking this from memory now, so the numbering may be wrong.)
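"One input for each checker off" reads like a thermometer encoding: input i fires when at least i+1 checkers are borne off (14 inputs suffice because the 15th checker off ends the game). That reading is my assumption, not something the mail confirms; under it the encoding is just:

```python
def borne_off_inputs(n_off):
    """Encode the number of borne-off checkers as 14 binary inputs,
    one per checker: input i is 1.0 if at least i+1 checkers are off.
    (A 'thermometer' encoding -- my assumed reading of the text; the
    15th checker off ends the game, so it needs no input.)"""
    assert 0 <= n_off <= 15
    return [1.0 if i < n_off else 0.0 for i in range(14)]
```

Compared to feeding the raw count as one scaled input, this gives the net an independent weight per checker, so it can learn a non-linear response to each extra checker off, which may be why it simply "works".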

The first GamesGrid players, GGraccoon and GGbeaver, were based on these two networks, the contact 0.10 net and the race 0.10 net. (David added the reduced search algorithm which was used by GGbeaver, and this change was implemented in the gnubg project. Actually there was a reduced search algorithm before David's as well, but David's algorithm was better and faster.)

In February 2001, Joseph made some new changes to the evaluation algorithm, and a new set of weights was released. He called this set of weights 0.15 in fibs2html, and the same weights were called 0.11 in gnubg. I'm not sure of the differences between this net and the 0.10 net; Joseph knows the details. (A better cache system was added to the evaluations, and a neural net speed-up trick was also implemented.)

In late 2001 (November/December) Joseph split the contact network into a crashed net and the ordinary contact net. I'm not sure how he trained these nets, but I assume he used supervised training from the position database. This net was merged into the GNU Backgammon project in January 2002. This was the really great update! It was amazing to watch mgnutest play on FIBS with this net. It really played a master game. It showed that it was able to find moves with long-term plans, and I would already at that stage say that it would outplay JellyFish in most positions and play equal to Snowie 3.2 or better in most positions. (It would of course outplay Snowie in bear-in situations, but we all know that Snowie sucked at bearing in.) The new neural net trio was numbered 0.12. I'm not sure about the training method, but I assume it was based on the database of positions and rollouts. (Maybe the rollouts were rerun or something... Joseph knows.)

The strange thing with the 0.12 nets was that even though there was now a class of its own for crashed positions, specially trained on crashed positions, the program still played crashed positions badly. The gain was really in the contact net and not the crashed net! It's possible that the contact net got better since it no longer had to learn about crashed positions; some of the brain capacity was freed to learn the contact positions better. The out-of-sync problem between the crashed and contact nets wasn't a big problem either, since a game very seldom goes from contact to crashed and then back to contact again. Joseph ran some analysis of the occurrence of each position class from mgnutest's FIBS play, and he reported that 79% of all evaluations were contact evaluations, 8% were crashed evaluations, 7% were race and 6% were bearoff evaluations.

Joseph then spent some months improving the crashed net. I guess he used the same technique as usual, described on his page: a database of positions and rollout results, trained over several epochs. We got nets called 0.12a and 0.12b, and the crashed net slowly improved.

Now a new problem arose: it's not a problem to breed new neural nets, but how can you say that one net is better than another? Two good nets need millions of games to give a significant answer about which is best. For this purpose Joseph started the gnubg training project. Volunteers over the internet, especially Ian Shaw, had big files of positions collected from online games, and rolled them out on a big scale. (We had our own tool for rolling the positions out, called sagnubg.) The rollout results were used as reference positions, or a benchmark, and the net that played best according to all the rollout results in the reference positions was considered the best neural net. I guess we rolled out about half a million positions (?).
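One plausible way to score a net against such a rollout benchmark is average equity loss: for each reference position, compare the rollout equity of the move the net would pick with the rollout equity of the best move. This sketch assumes that scoring scheme (the mail doesn't spell out the exact metric); `net_eval` and the `reference` mapping are hypothetical.

```python
def benchmark_loss(net_eval, reference):
    """Score a net against a benchmark of rolled-out positions.
    reference maps each position to {move: rollout_equity}.
    net_eval(position, move) is the net's own score for a move.
    Returns the average equity given up per position; lower is better."""
    total = 0.0
    for pos, move_equities in reference.items():
        # The move the net would actually play...
        chosen = max(move_equities, key=lambda m: net_eval(pos, m))
        # ...versus the best move according to the rollouts.
        best = max(move_equities.values())
        total += best - move_equities[chosen]
    return total / len(reference)
```

Note the key property mentioned in the text: the reference positions are only ever used for scoring, never for training, so the benchmark stays an honest held-out test.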

Now Joseph started to use a slightly different technique for training. He wanted to let the neural net learn "concepts". He let the computer play games against itself or on FIBS, and he analysed moves instead of positions. If a 2-ply evaluation chooses another move than a 0-ply evaluation, there is a "concept" in the position that the net doesn't understand. So if 0-ply and 2-ply disagreed on a move, both resulting positions were added to the training database. Then these positions in the database were rolled out, and a new series of supervised epoch training was performed. If the new resulting net scored better against the reference database, it was considered better. In this way the static evaluation was trained to play like the 2-ply evaluation. Note that the positions in the reference database were never used for training, just for reference.
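The move-disagreement harvesting step can be sketched directly; again all callbacks are hypothetical stand-ins for the real move selection and move application.

```python
def harvest_disagreements(positions, best_move_0ply, best_move_2ply, apply_move):
    """For each position reached in play, compare the move chosen at 0-ply
    with the move chosen at 2-ply.  On disagreement, store BOTH resulting
    positions (as the text describes) for later rollout and supervised
    training."""
    database = []
    for position in positions:
        m0 = best_move_0ply(position)
        m2 = best_move_2ply(position)
        if m0 != m2:                          # the net misses a "concept" here
            database.append(apply_move(position, m0))
            database.append(apply_move(position, m2))
    return database
```

Storing both resulting positions matters: the rollouts then reveal not just that the 0-ply choice was worse, but by how much, and the net is trained on accurate targets for both alternatives.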

The training method in the above paragraph gave some new versions of the 0.12 nets; all the nets were gradually improved: the race net, the crashed net and the contact net. These were named 0.12c, 0.12d and so on. In December 2002 or January 2003 the 0.13 nets were released. They were real world class! Really amazing! We started to roll out contact positions for reference-position benchmarking as well, and just as the last benchmark rollout was completed, Joseph released the 0.14 nets. These nets are probably the best neural nets for backgammon playing in the universe right now. (I can only think of mloner and zbot as real competitors, since I strongly believe that the 0.14 net has an edge over Snowie 4.)

(It's late, I have to finish this mail....)

If this mail sounds like a tribute and hail to Joseph and his amazing work on the neural nets, I must say it actually _is_ a tribute. Without his intelligence, his insight, his computer wizardry, his push and go-ahead spirit, the GNU Backgammon project would still be a poor player with a FIBS rating of about 1600. The backgammon community all over the world owes him a lot. THANK YOU, PEPE!

Thomas, I hope this gives you some of the story and some info on the training techniques used in GNU Backgammon.

Strongest in the World?
I actually feel more confident in the analysis done by Michael Depreli at GammOnLine. Search the net for his results. He's probably reading this mail, so I guess he can mail you his methods and final results.

I hope he does. I don't subscribe to GammOnLine, so someone will have to help me out here ;-)

Here is some data:

(Sorry for the typos, it's late here in Norway.)
