[Bug-gnubg] Re: Training neural nets: How does size matter?

From: pepster
Subject: [Bug-gnubg] Re: Training neural nets: How does size matter?
Date: Mon, 02 Sep 2002 19:20:17 +1200

(Using web mail - no spelling checker - sorry for the numerous spelling errors)

Douglas Zare writes:
Quoting Øystein O Johansen <address@hidden>:

About the above statement: 1000K parameters? 250K parameters? This
sounds like a lot to me. The networks gnubg is using, we have 250
input nodes and 128 hidden nodes. That's 32640 weights. Is that
what you call parameters?

Basically. It does mean the weight files I'm using are too large to fit on a floppy disk.

A curious criterion for this day and age. If you tell me your whole program fits on one floppy I resign now!
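
The 32640 figure quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a fully connected net with one hidden layer, 5 outputs, and no bias terms, none of which is stated in the thread. The function name `count_weights` is made up for illustration.

```python
def count_weights(n_inputs, n_hidden, n_outputs, biases=False):
    """Count the weights of a fully connected net with one hidden layer.

    Assumption: every input feeds every hidden node and every hidden
    node feeds every output. Bias terms are counted only on request.
    """
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    if biases:
        weights += n_hidden + n_outputs
    return weights


# 250 inputs and 128 hidden nodes are from the message above;
# the 5 outputs are an assumption.
print(count_weights(250, 128, 5))  # 32640
```

Under the same assumptions, a net with 250K or 1000K parameters is roughly 8 to 30 times larger than this one.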
There is not enough data to even hazard a guess.
- Did the two nets differ only in the number of hidden nodes, or also in the number of inputs?
- Were they trained on the same data set?
- Of what size?
- Which training method?

My totally baseless hunch would be that the bigger net did not "mature" or did not "mature gracefully". In other words, a big net can match a data set at many arbitrary points without developing the "right concepts". But again, this assumes either that the data set was too small or that there was not enough training, which I understand is unlikely because you have both the computing power and innovations in training.
Many years ago, I spoke to Fredrik Dahl. He doesn't say much about
the JellyFish development, but the one thing he said was that there
wasn't much point in having too many nodes -- the training process
will just be slower.

It is valuable on multiple levels to be able to evaluate the network rapidly. However, more weights does not necessarily mean slower evaluations.

I wonder what you mean by that? Does it mean you use a sparse net topology? Otherwise it seems to me that more weights must result in a slower net.
I much prefer the crispness of Jellyfish's rapid play to Snowie's sluggishness, particularly given that Snowie does not seem to have a big advantage in money play. (I look forward to seeing more data on this.) I think FD mentioned that Jellyfish uses about 20K weights. Computing power has improved quite a bit since then, of course.
I think this is Joseph's experience as well. When he started to work
on the gnubg networks he actually removed some of the input nodes
that he believed didn't contribute to the training. I have also asked
him about adding specific input nodes, but after some training with
these input nodes, he concludes that the new input nodes don't
contribute or the weights connected to this input don't converge.

Do you mean training from scratch using the new inputs, or adding the new input with initially low weight to an existing trained network?

I have tried both in the past. My concern (which I assume is the same as yours) is that starting with an existing net will discriminate against the new inputs. I now believe this is not a problem for me, given my training method. However, if I believe an input *should* contribute, I will try both. As I noted in the past, I think it is very hard to prove an input does not contribute.

Check also the history of eval.c [ref. 2] and look for changes made
by Joseph.
> First, roughly what level of improvement do you expect with mature
> networks of different numbers of hidden nodes?
No idea! I have never seen a ppg vs. hidden node chart either. I think
Tesauro gradually increased the number of hidden nodes: he started at
only 40 hidden nodes, increased this to 80 hidden nodes, and then used
160 hidden nodes in TD-Gammon 3.1 [ref. 1].

I'm familiar with his descriptions in earlier articles. However, I don't know what the corresponding improvements in playing strength are supposed to have been (the performance in short sessions is inconclusive, of course), and whether he felt that the networks were fully trained. It would probably be worth training networks with only a few hidden nodes and a fixed input set to see how well they perform. It wouldn't take much computing time to train the networks, but I had hoped that you all had already done it.
> The quality of a neural net is hard to quantify abstractly, so one
> could pin it down to, say, correct absolute evaluations in
> non-contact positions for the racing net, or elo, or cubeless ppg
> against a decent standard.
Yes, this is one of the problems, yes!

This is more medicine than science. I think one should pick a few benchmarks and use them, and if they aren't enough, add more. Which benchmarks are you set up to use so far?

This is exactly what I am trying to establish for the crashed net, and whether it works for the contact net as well. There can't be one benchmark, but several in succession. I know you disapprove of my method, but I am not convinced, and definitely not aware of anything better. Chances are I am wrong, but I have to learn it from my own experience.

> I don't think Snowie 3's nets were mature, but if they
> and Snowie 4's nets are, then how much of an improvement should one
> expect to see if Snowie 4 has neural nets with twice as many hidden
> nodes?
Same answer as above: I have no idea! Maybe Joseph has an idea.

I don't believe the assumptions, but my guess is that the answer is more than 0.02 ppg.
> Second, how many fewer nodes can you use for the same quality, if
> you release the net from predicting what is covered in the racing
> database?
You don't train a network to evaluate something it is not supposed to
evaluate in the future, do you?

Of course. You do, too, from what you write below.

Again I am not sure what Doug is saying here. Of course the race net is not trained on bearoff positions, and is not burdened with such knowledge. However, it might be worthwhile to check if adding some "limiting" cases would improve the transition gaps. (limiting - i.e. race positions arrived at by one play from a contact position)

I noticed a jump in the performance of the contact network after the
crashed positions were separated, and the network was only trained on
"contact" positions. It was like some brain capacity was released, and
this brain capacity was used to improve the game in the contact

> Third, Tesauro mentions that a neural network seems to learn a
> linear regression first. Are there other describable qualitative
> phases that one encounters? For example, does a neural network with
> 50 nodes first imitate the linear regression, then a typical mature
> 5 node network, then 10 node?
I have no idea whatsoever!

It's probably worth taking some time to understand these smaller nets. From the time of TD-Gammon onwards, backgammon programs have been better than almost all of the human players, restricting the critiques of their play too much. However, a network with intermediate play can be analyzed in a helpful fashion by any human expert.
> It might be wishful thinking, but if it is the case, it might be
> possible to retain most of the information by training a smaller
> network to imitate the larger network's evaluations. The smaller
> network might be faster to train, and then one could pass the
> information back.

(Again) it is not clear to me what "passing back the info" means. What I intend to do is add as many inputs as we can come up with, train, then try to prune the non-contributing ones, and then see which is the smallest net that can model the data. In other words, start big and work backwards.
This is roughly what Joseph is doing in fibs2html and mgnu_zp and other
friends. He has a very small network with only 5 hidden nodes. This
network is not only faster to train, but of course also faster to
evaluate. As I understand it, this net is used to prune candidates for
the real network. Joseph says this is a huge speed improvement.
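
The pruning idea described above can be sketched as follows. This is a minimal illustration and not the actual fibs2html or gnubg code: the function name `prune_then_evaluate`, the `keep` parameter, and the two evaluator callables are all hypothetical, and candidates are assumed to be scored so that higher is better.

```python
def prune_then_evaluate(candidates, fast_eval, full_eval, keep=5):
    """Pick a move using a cheap net to shortlist and a full net to decide.

    fast_eval and full_eval are callables returning a score (higher is
    better). Only the `keep` best candidates by fast_eval are passed to
    the more expensive full_eval.
    """
    if not candidates:
        raise ValueError("no candidate moves to evaluate")
    # Rank all candidates with the small, fast evaluator.
    shortlist = sorted(candidates, key=fast_eval, reverse=True)[:keep]
    # Spend the expensive evaluation only on the shortlist.
    return max(shortlist, key=full_eval)


# Toy usage with integers standing in for candidate moves.
best = prune_then_evaluate(
    list(range(20)),
    fast_eval=lambda m: -abs(m - 10),  # rough guess: the best move is near 10
    full_eval=lambda m: -abs(m - 9),   # accurate answer: the best move is 9
    keep=3,
)
print(best)  # 9
```

The speed gain comes from calling the full evaluator `keep` times instead of once per candidate. The cost is that a move the small net ranks too low is never seen by the full net.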

That's what Jellyfish level 3 is (though not specifically 5 nodes), right? Though its play seems laughable to me now, playing primarily against JF level 3 took me from the novice level (I learned that an opening 6-1 should not be played 13/6 in July or August of 1999) to 1800 on FIBS in a few months. I don't think that I would have learned as quickly from a slower program that played more accurately.
> Are there thresholds for the number of nodes necessary with one
> hidden layer before particular backgammon concepts begin to be
> understood? Again, in the Tesauro article [Ref. 1], he writes:
   "The largest network examined in the raw encoding experiments had
   40 hidden units, and its performance appeared to saturate after
   about 200,000 games. This network achieved a strong intermediate
   level of play approximately equal to Neurogammon."

That doesn't say that the raw encoding (which is already quite clever, including a lot of backgammon understanding) understood the same concepts that Neurogammon understood. Further, I'm more interested in the performance of networks that have more complicated inputs than the raw encoding. In my experience intermediate level play is achievable in a few GHz-minutes.
> In chess, people say that with enough lookahead, strategy becomes
> tactics, but how many nodes do you need before the timing issues
> of a high anchor holding game are understood by static evaluations?
> How many for a deep anchor holding game?
Hard to say. But I must also say, I believe (and I might be wrong)
this does not necessarily depend on the size of the net that much,
but rather depend on the training of it.

I think it clearly does depend on the size of the net, though of course a huge net might not play optimally for its size. By the size of the net, I don't just mean possible increases in size, but also possible decreases. So I mean that if you shrink the net too much, it won't be able to understand, e.g., what a safe contact bearoff structure looks like, or how to build 3 stacked points into a prime on 0-ply. Of course, asymptotically, perfect play is achievable with sufficiently many nodes. It may well be the case that gnu's network is large enough to play much better than it does, and that training is the key to improvement.
Now to my questions:

I'm going to skip most of these. Suffice it to say that I and others working with me have introduced some innovations to most of these, but I don't want to describe them yet.
How do you evaluate races? Do you have different inputs for your
race network? Or do you use a race database? If you use a NN,
how did you train this net? TD is completely useless here?

I'll answer here. First, from one-sided databases we have constructed some lookup tables that are used as inputs. For example, the lookup table can include the exact chances of winning at DMP against a pure n-roll position for each n. Second, one can apply these lookup tables even in contact positions.
How do you benchmark your nets?

I use a variety of methods. One is through checking evaluations of reference positions. Another is the level of disagreement between plies (see my "Bot Confusion" column). I expect to include rollouts of positions of one-sided errors and variance-reduced play against opponents of fixed strength soon.
Douglas Zare
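
One simple way to measure disagreement between plies is the mean absolute difference between shallow and deeper evaluations over a set of reference positions. The sketch below assumes exactly that; it is not necessarily the measure used in the "Bot Confusion" column, and the names `mean_ply_disagreement`, `eval_0ply`, and `eval_1ply` are hypothetical.

```python
def mean_ply_disagreement(positions, eval_0ply, eval_1ply):
    """Average absolute gap between two evaluation depths.

    eval_0ply and eval_1ply are callables mapping a position to an
    equity estimate. A lower result means the static evaluation agrees
    more closely with its own one-ply lookahead.
    """
    if not positions:
        raise ValueError("need at least one reference position")
    total = sum(abs(eval_0ply(p) - eval_1ply(p)) for p in positions)
    return total / len(positions)


# Toy usage: numbers stand in for positions, and the "1-ply" evaluator
# is just the "0-ply" evaluator shifted by a constant.
gap = mean_ply_disagreement(
    [0, 1, 2],
    eval_0ply=lambda p: 0.1 * p,
    eval_1ply=lambda p: 0.1 * p + 0.05,
)
print(round(gap, 3))  # 0.05
```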

I would like to stress again my current view about the importance of the data set used to train the net. I think it is rarely considered a problem, but nowadays I tend to see it as crucial. It takes me longer to generate it than to train the net. -Joseph
