bug-gnubg
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-gnubg] redesign gnubg to master/slave


From: Olivier Baur
Subject: Re: [Bug-gnubg] redesign gnubg to master/slave
Date: Tue, 23 Dec 2003 00:38:46 +0100

Hi everybody!

I'm finally back from Africa -- back to the gnubg world!

Le 21 déc. 03, à 20:54, Joern Thyssen a écrit :

One of the recent topics has been multi-threading. Instead of
multi-threading Jim suggested splitting gnubg into a GUI (master) and an
engine (slave) which communicates over sockets.

Besides avoiding the problem of making the engine of gnubg thread-safe
it also follows the Unix design principles
(http://www.catb.org/~esr/writings/taoup/html), i.e.,
http://www.catb.org/~esr/writings/taoup/html/ch11s06.html.

As Holger pointed out, I have written a thread-safe version of gnubg; more precisely, a version of gnubg that can compute rollouts in multiple threads, taking advantage of multi-processor hosts. It can also communicate with other gnubg's on a TCP/IP network, extending the notion of multi-threaded "gnubg computing units" to "remote gnubg computing units".

The main parts of the work here:
1/ make gnubg thread-safe (at least in its rollout part and GUI part) -- that is , mainly hunting down "bad" behaved globals and other static variables that would mix up between different thread contexts; 2/ rewrite the rollout function so it now divides its rollout job in smaller 1-game rollout tasks, with are then handled and dispatched by a "task engine", which collects results back to the rollout function for result accumulation; the bad news is this is several-month old, and I think some new features have been added to rollouts, like rollout interrupting/resuming. 3/ write TCP/IP exchange mechanism (and network slave/master notification and automatic joining)
4/ write a GTK GUI (and get a running thread-safe GTK version of gnubg

For those who will have a look at cvs branch-multi, I have set up my implementation notes at the following address (plain ascii text file): http://mapage.noos.fr/gnubgosx/processingunits.txt (different than the user-oriented HTML version that Holger already mentioned: http://mapage.noos.fr/gnubgosx/procunits.html) It describes more technically the concepts I used, how I made gnubg thread-safe, and how I implemented multi-threading and task handling.


Le 22 déc. 03, à 19:42, Jim Segrave a écrit :

A single instance of gnubg should be capable of doing all of play,
analyse and rollout without having to have any external
processes. This is conceptually easiest for users - there's one
program and it does everything they expect.

I also agree with this

There should be a stripped down version of gnubg, call it
gnubg-analyser, which should be able to do the following:

1) Given a position, a set of dice rolls and the evaluation rules,
   select the best move, if any exists, for each roll - sort of a
   FindnSaveBestMove() over a more limited set.

Or

2) given a position, a rollout setting, and a trial number, rollout a
   single game and return the results - more or less
   BasicCubefulRollout()

Or

3) given a position, a dice roll and a move or just a cube decision,
   and an analysis context, do an AnalyseMove()

This is what I called "slave processing units" ; processing units are threads of gnubg (typically, one per processor, on one or several hosts) ; and there's the master gnubg thread, that drives the GUI (GTK) and uses the available slave processing units, either local -- running on the host, one per proc --, or remote -- running on remote hosts, one per processor on remote host --) . The version of gnubg I wrote starts up in master mode (the usual gnubg), but can be switched to "slave" mode, ready to join a master. This can be done either in TTY or GTK.

Please note there are also some non-trivial data to transmit along, like random number generator info (like seed roots or RNG state) -- which game will be rolled out with wich dice is of utmost importance to variance reduction.

gnubg itself should allow the user to identify client gnubg-analysers
which it can use to help with evaluations. These would be
interconnected with TCP/IP sockets. They could be on localhost or on
remote machines. At startup, gnubg would attempt to open connections
to the gnubg-analysers so that it would know how many were available
at any given time. An analyser would only server a single gnubg, once
it accepts a connection from that gnubg, any other gnubg will be
unable to connect. Users with multiple machines who wanted to have two
gnubg's going and 4 analysers would have to choose which analysers
would be available to which machines.

I've also written this.
Slaves can notifiy their presence either to all available master hosts on the local TCP/IP network, or to a single, user-chosen master. When a master spots a self-notifying slave, it automatically adds it as co-operating gnubg to its list of processing units. A master can also be set to connect to an arbitrary list of gnubg slaves, for exemple by feeding it with a startup file containing gnubg "pu" instructions (type "help pu" , "help get pu", "help set pu" in gnubg to get the details, or some older mail I posted about it on the list some months ago; I need to find the references back)

During startup, the analysers should inform gnubg about their
approximate computing power, this allows the master to make some
guesses about load balancing between machines. Say someone has a
couple of old P-300s on a home network and a 2.4G main machine. You
would not want to try to split the work 33%-33%-33%, you'd probably
want to do something like 10% - 10% - 80% when doing analysis or
rollouts.

I didn't go to these ends, but for now you can define, on the master, different queue sizes for each remote slave; the queue size basically relates to the number of tasks that are sent at the same time to a remote host; so if you're using 2 remote hosts, one being twice as fast as the other one, you would set the queue size of the faster to 2 and the slower to 1, or maybe 8 and 4 if they happen to compute so fast that network overhead becomes too important as compared to real computations) ; basically, it's hard to precompute how you are going to split the job: maybe your 10%-10%-80% ratio is not correct; maybe one host will crash and you'll have to split the remaining; maybe some new hosts might join in the middle of a long rollout; I think the best policy is "serve a soon as finished": feed a slave as soon as it has returned the results of all the tasks it was given; work on the fly. The "ratio" issue only pertains when there are only a few tasks remaining to compute in the whole job, but then it gets a very hard work to make a good job splitting decision: who do you give these remaining tasks to? do you give them all to process A who is ready to compute? or do you keep some for process B, which runs very fast, and "should" be ready soon? I chose the simple solution (feed slave as soon as available, with some refinment in the "end of job near" case), with some "simple" queue parameter (which the user can adjust according to remote host processing speed --increase queue size if host is fast-- and network overhead --don't use queue size so small that network communication gets noticeable--)

The connections should have keep-alives as part of their protocol, so
that the loss of one of the gnubg-analysers would be noticed before
the main gnubg has waited too long for answers that won't be
forthcoming.

I still have to add better "timeout" handling.
For now, gnubg masters can safely handle slave disappearing from the network (network problem or slave crash/killed/returned to master mode)

I can see a few different modes of working:

During ordinary play, if gnubg is set to some fairly demanding
standard - I often play against gnubg on 2 ply supremo settings, then
individual moves could be shared among analysers. I'm not sure how
cube decisions would be partitioned, but given a dice roll, I'd think
you could do something like:

gnubg builds a list of legal moves. It then divides that list up and
lets the gnubg-analysers pick the best moves out of a subset of the
legal ones, while gnubg does some for itself. When it completes it's
selection, it then waits for the others to complete, gathers the
results and makes the final decision.
 For 0 ply (and quite possibly 1 play), there's
probably no point in using any remote analysis. For 2 ply you assume
that each legal move will take the same amount of time to analyse, so
in the situation postulated before of 2 slow analysers, given 12 legal
moves, gnubg would do 10 and pass 1 to each to the other machines.


I didn't write anything for n-ply evaluation yet (only rollouts, see below) I agree with you that it's no use doing anything with 0-ply and 1-ply evals. I don't think it's a good idea to split the work before starting to compute. Rather, divide the work (eg, quickly said, divide a 2-ply eval into, say, 18 1-ply evals if there are 18 legal moves, and have these 18 tasks given to available threads/processes (processing units in my terminlogy) -- you can't know beforehand the time that n-ply evaluations will take (it depends on number of available moves in deeper plies), so it's hard to fairly dispatch the tasks. One main thread should do the "job management" (this I called the "task engine" in my implementation, run by the master thread: send tasks, collect results), and one more "computing thread" per processor crunches numbers; plus processes running on remote hosts and communicating with master thread (actually with other threads that run locally on the master host and handle the individual TCP connections and exchange protocol)

The main disadvantage to this is
that caching is largely defeated, since the forward evaluations of the
selected move are likely to have been made on a remote machine. Here
gnubg must decide how much work to hand to each analyser and how much
to keep for itself.

I'm not sure cache impact would be so great; I think that cache can be mostly useful as you "travel down" the tree of possibilities (plies), since positions might reappear on deeper layers of the tree; I'd think there is less to share among different (parallel) parts of the tree when the first node is already different. I'm not sure if I what I mean is clear...

A different option for play would be to let gnubg handle all aspects
of play. But it could run a background task, particularly when it's
waiting for user input, where it passes any unanalysed moves to
gnubg-analysers and installes their results in the moverecords. If a
user then goes back in the game list and changes a play, this is no
major issue. If one of the moves which has now been undone, as it
were, has been passed to a gnubg-analyser, then when the answer is
returned, gnubg checks and finds that the moverecord is now gone and
simply discards the answer. The idea is that gnubg will have a match
analysed not long after it has been completed - it may even be
possible for the user to see analysis of earlier moves while the game
is still in progress.

Yes, this would be a particularily nice feature to implemented, which would benefit all users, not only those who have multi-processor machines or several computers available on a TCP/IP network.

When anlaysing a match or game, the same procedure as the above
paragraph would be followed - gnubg would build the moverecords, then
begin handing them out to the remote analysers. The only difference
would be that the user would have given the analyse game/session/match
command, so gnubg itself would know that it could also do some of the
analysis itself - it would include itself in the list of available
analysers.

Yes, agreed.

As soon as an analyser completes a move, it is given the next
unprocessed one. In this case there's no need to guess at which
machine is the most powerful, they will each work to full capacity as
long as there are uncompleted moves.

Agreed too. This meets the "feed slave as soon as ready" strategy I propose for position evals and that I implemented in rollouts.

But I think cache will be important here; that is, if you have already analysed play N in the match with one processing unit, then you'd better give it play N+1 to analyse, because there surely will be plenty of useful data in the cache! I think we'll need some task dispatch algorithm that strive to give processing units as many consequent plays to evaluate as possible, without precomputing the job splitting. Interesting problem... :-)


Rollouts are the most interesting problem, because we want to be able
to interrupt and resume rollouts. Gnubg begins by deciding it will do
the first trial, then passing the next n trials to the
gnubg-analysers. During it's rollout, it checks for analysers having
completed. If they have, it stores the results of that one rollout and
issues the next trial to the analyser. When gnubg completes a rollout,
it then takes the results of that rollout and any following ones in
the series (not including gaps) and builds the cumulative result.

Agreed, this is also my "feed as soon as ready" philosophy.

here's a picture or a rollout where gnubg is faster than analyser 1
and much faster than analyser 2. Trial 1 would be the first rollout a
single gnubg would do, trial 1 would be the second, etc. So, for a
rollout of a single move, these would be the same as the games rolled
out. When rolling out two moves, or a cube decisions, trial 1 is move
1, game 1, trial 2 is move 2, game 1, trial 3 is move 1, game 2, etc.



        Time
      0         1         2         3         4         5         6
0.........0.........0.........0.........0.........0.........0.....
gnubg | trial 1  |  trial 4  | trial 7  | trial 9  | trial 11 |

anal1 | trial 2      | trial 5       | trial 8       | trial 12      |

anal2 | trial 3          | trial 6          | trial 10         |


Time    action
0       initiate the rollouts
11      gnubg completes trial 1. It can put this into the cumulative
        results and begin trial 4
15      anal1 completes a result. This is stored and anal1 begins
        trial 5
19 anal2 completes trial 3. This is stored and anal2 begins trial 6
23      gnubg completes trial 4. It can put trial 2, 3, and 4 into the
        cumulative results and begin trial 7
31 anal1 completes trial 5. This is stored and anal1 begins trial 8
34      gnubg complets trial 7. Trial 5 can be added to the cumulative
        results, but trial 7 must be stored. gnubg begins trial 9
38 anal2 completes trial 6, this is stored and anal2 begins trial 10
45      gnubg completes trial 9. Trial 6 and 7 are added to the
        cumulative results, gnubg stores trial 9 and begins trial 11
47 anal1 completes trial 8. This is stored and anal1 begins trial 12 56 gnubg completes trial 11. Trials 8 and 9 are added to the cumulative
        results.

...
If the rollout were interrupted between 45 and 47, then we'd only have
usable results for trials 1..7, if it were between 48 and 56, then
we'd have trials 8 and 9 as well. In general, this would keep all the
processes working more or less flat out (it should even be possible to
add or subtract servers during a rollout).

That's what my "task engine" is responsible for: assigning tasks (and reassigning failed tasks ) and collecting results.

The potential for people with a home network or friends who are
willing to lend their machines for overnight rollouts or whatever

Right. This I implemented too: slaves joining or quitting a long rollout without disturbing the whole job -- it basically comes free with the "serve as soon as..." strategy :-) One thing: this can also be done on the Internet (why limit to a home network?) One can even imagine "gnubg slave farms" to compute big rollouts or maybe help in training new neural nets (I don't know if training algorithms can be easily split down into subtasks to dispatch among cooperating slaves, like the SETI screensaver)

Other thoughts:

To deal with firewall issues, it should be possible to specify the
ports to be used for both outbound connections from gnubg and inbound
to the analyser.
It would be a good idea to link to libwrap

The analyser should be allowed to work either from stdin/stdout (so it
can be invoked from inetd or with a command line to listen on a
specific port.

All the exchanges should be done in ASCII, numerical values can be
exchanged in %.18g or similar format.

It's probably easiest, though not the most efficient, to exchange the
full analysis or rollout contexts with each service request and to
pass them back again as part of the results.

Agreed to. I've also implemented the above, to some degree. Details can be found in the txt document I pointed at the beginning of this mail.


Bottom line
I'm not sure it is necessary to go through a full rewrite of gnubg in master/slave at the GUI / engine level, as I've already done it at master thread / slave thread level, with only the master thread having access to GUI, and a single gnubg process being able to be switched to/from master/slave state. Please have a look at branch-multi, and try and compile it on your platforms. I know that it works on Macs (multi-threading and remote processing), on PC's (multithreading only, still got problems with winsock for remote processing), and I think it should work with most POSIX-compliant flavours of UNIX/Linux.


Best

-- Olivier




reply via email to

[Prev in Thread] Current Thread [Next in Thread]