This presentation will briefly touch on my motivation and basic methods for doing a parallel simulation but will focus mainly on a "toy" problem that demonstrates how to organize J code to allow it to be run in parallel in order to take advantage of a multi-core machine.
I was inspired to work on a simulation of various aspects of poker by reading about the treatment of the game in a seminal book on game theory, shown here. The book devoted quite a few pages to this game in particular but used a simplified version of it in order to treat it more clearly with the tools of this theory. There are even simpler versions I've seen on the web, stripped down to just two players with a deck of only three cards.
The game itself is quite complex - in some ways more so than chess - in part because it is a game of incomplete information. This means, unlike a game like chess, not all the players have the same information available to them. This makes reasoning about an opponent's play subject to considerable uncertainty and gives rise to the necessity of bluffing. Also, players continually adapt to the strategies of their opponents, further increasing the complexity of the game.
The general method I chose is what is called "agent-based" simulation. This treats each poker player as an autonomous agent and allows us to define a player's behavior with a few parameters. The initial focus of the study is to figure out how different bluffing strategies work in conjunction with each other. I hope to be able to answer questions about the effectiveness of "conservative" or "loose" strategies in games with different combinations of players following one or the other mode of play.
The example I will show is based on the idea of "coarse-grained" or "task-based" parallelism for a few reasons. One consideration, illustrated on the left below, is that a finer-grained parallelism in which each player is modeled by a separate process is more complex to code: there are quite a few interfaces - shown by the red "gears" - to handle in order to coordinate the game properly. The other main reason is that the coarser grain is better suited to mapping parallel tasks onto the separate cores of a modern computer.
Overview of Method
Here is an overview of the general method by which I parallelize my J code: I customize a generic "template" file to produce distinct task-level files, each of which I then launch as a separate process. The results from these files are collected into a central file for subsequent analysis.
This is an example of a blank template, indicating the major sections by comments. The comment about setting the random seed is included because it's necessary to do this to ensure that each simulation is different. This code is structured to allow the customizer to simply append lines to the end of the template. We could also customize the template by replacing specially designated key words, e.g. something in curly braces, with values. Appending to the end seems a bit neater.
This slide illustrates the initialization section of the template we'll be using. The code shown here is not general: the first two anonymous verbs are specific to my own personal environment and should be replaced by a user's own code to set up the desired run-time environment. The "setSeed" example is customized for a 64-bit system; the 32-bit version would use 32-bit integers, taking the seed modulo 2,147,483,647 (i.e. <:2^31) instead of the <:2^63 shown here.
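As an analogous sketch of the seeding idea - in Python rather than J, purely for illustration, with the helper name invented - each task can derive a distinct seed so that no two parallel runs repeat the same random stream; the modulus mirrors the 64-bit case described above:

```python
import os
import random
import time

def task_seed(task_id):
    """Derive a distinct seed per task from the clock, the process id,
    and the task number, reduced modulo 2^63 - 1 (the 64-bit case; a
    32-bit system would use 2^31 - 1 = 2,147,483,647 instead)."""
    return (int(time.time() * 1e6) + os.getpid() + task_id) % (2**63 - 1)

# Each parallel task gets its own generator, seeded differently.
rngs = [random.Random(task_seed(i)) for i in range(4)]
```

The task id is mixed in so that children forked in the same clock tick still diverge.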
The work to be done in this toy example is simply matrix inversion, a common benchmark task as was noted in Joey Tuttle's talk. Since we're not interested in the actual results of this inversion, we collect only the timing information. An actual job to be done would also collect the results of the work.
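As a rough analogue of this benchmark - sketched in Python rather than J, with an explicit Gauss-Jordan helper standing in for J's built-in matrix inverse - we can time repeated inversions and keep only the timings, discarding the inverses themselves:

```python
import random
import time

def invert(m):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination."""
    n = len(m)
    # Augment each row with the corresponding row of the identity matrix.
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]  # the right half is now the inverse

timings = []
for _ in range(5):
    m = [[random.random() for _ in range(30)] for _ in range(30)]
    t0 = time.perf_counter()
    invert(m)                       # result discarded: only the time matters
    timings.append(time.perf_counter() - t0)
```

A real job would keep the results as well as the timings, exactly as the slide notes.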
The end of this section also raises the question of how we signal that a job is complete. We could do this by monitoring the expected output file but in this example we monitor the processes we fork off in parallel.
Finally, at the end of our template, it's useful to include a comment reminding us what we expect the customization lines to look like. Note that, if we weren't using this "append to the end" method, these lines could be uncommented and the keywords in curly braces replaced with the unique values for each task. Appending is neater, but keyword replacement is the more general method of customization.
Note that we explicitly log out each task with the J "quit" command "2!:55 ''" - the importance of this will become more apparent in the digression on the effects of forking copies of the J interpreter in a later section.
Now we look at the first and main part of the customization script. The code in "buildTask" creates the custom lines per task, appends them to the end of the template, writes the customized task to a uniquely named file, and returns the name of this file to be run.
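The shape of "buildTask" might look like the following sketch - in Python rather than J, with the template body, variable names, and file names invented for illustration; the actual J verb appends J sentences instead:

```python
import os
import tempfile

TEMPLATE = "NB. generic template body goes here\n"

def build_task(task_num, out_dir):
    """Append per-task customization lines to the template, write the
    result to a uniquely named task file, and return that file's name."""
    custom = (
        f"TASKNUM=: {task_num}\n"
        f"RESULTFILE=: 'results{task_num}.txt'\n"   # hypothetical names
    )
    path = os.path.join(out_dir, f"task{task_num}.ijs")
    with open(path, "w") as f:
        f.write(TEMPLATE + custom)
    return path

out_dir = tempfile.mkdtemp()
task_files = [build_task(i, out_dir) for i in range(3)]
```

Returning the file name lets the caller immediately build the command line that runs the task.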
This section illustrates loading the customizer script. As a note on style, many of my top-level scripts include output on what I usually do upon loading them. This reminds me of what steps there are and allows me to copy the lines, modify them, and run them in my J session with minimal effort. Here we see how loading "runMultiple.ijs" outputs six commonly-used lines based on this script file.
Below this output, we see examples of running the code to set the number of processes to run, then using this number to generate the customized code files and the command lines to run them. The executables are distinctly named copies of jconsole.exe, so we can distinguish the parallel tasks we start from the session under which we're running. This name distinction is not necessary under Windows 7, as that version of Task Manager lets us see the invocation command line. However, names distinct from our parent console session are useful for monitoring the processes we've launched so we can tell when they've finished.
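Generating those command lines might look like this sketch - Python for illustration; the executable stem "J7Pll" mirrors the naming pattern seen later in the talk, and the quoting convention is an assumption:

```python
def command_lines(task_files, exe_stem="J7Pll"):
    """Pair each task file with a distinctly named copy of the console
    executable so the children are easy to spot in Task Manager."""
    return [f'{exe_stem}{i}.exe "{path}"' for i, path in enumerate(task_files)]

cmds = command_lines(["task0.ijs", "task1.ijs"])
```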
Here we take a look at the contents of one of the task files we generated to see that it contains the lines we expect. We see that the customized name of the results file is coordinated to match the name of its corresponding J script file.
We also compare one of our distinctly-named executables with jconsole.exe to confirm that it's the same.
Digression: Peculiarities of Forking Processes in J
In the following three slides, we illustrate some of J's odd behavior when we spin off parallel instances of the interpreter. There seems to be some sort of environment shadowing, as shown by the change in "ARGV_z_" from the initial command line we used to begin our session to the command line we forked off to run the tasks we created. Note how the "quit" foreign conjunction " 2!:55 '' " seems not to exit J as it ought to. It appears to (eventually) remove one of the shadowing child processes instead.
Here we see how another attempt to exit the J session is followed instead by the result of the "watchTillDone" verb that monitors the "J7Pll*.exe" processes. We subsequently see that our "ARGV_z_" global now has its original value, indicating we are now back to our original J session. We can now check to see if the results files have been created - they have - and if the contents of one of them is as expected - it is.
Here we look at what happens when we run a simple script via a forked invocation of the J console: the script's output is seen in the active session and we seem to be in the child process's environment even after that script ends. This is because we did not include an exit command ( 2!:55 '' ) in the script file shown below. Once we enter this exit command in our interactive session, we close out the child environment and revert to the parent one.
So far, this behavior makes some kind of sense. However, if the child script errors out, the behavior seems a bit odd. Here we see that, after the error, we're in the parent environment. However, when we exit this session, we see the DOS prompt but are dropped back into the child's J session upon hitting enter!
Example: Monitoring Processes
Now that we know to expect this puzzling behavior, we can return to our main exposition. Here we see the "watchTillDone" verb that monitors a set of processes until they are no longer running; this is how we tell when all our parallel children have finished. Waiting this way synchronizes the asynchronous processing of the child tasks. The synchronization is not strictly necessary, but it simplifies our subsequent processing for this example.
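The same polling idea can be sketched in Python - here we poll process handles we launched ourselves, which avoids the external "pslist" utility the real verb parses:

```python
import subprocess
import sys
import time

def watch_till_done(procs, interval=0.1):
    """Poll a list of child processes until every one has exited,
    synchronizing the parent with its asynchronous children."""
    while any(p.poll() is None for p in procs):
        time.sleep(interval)
    return [p.returncode for p in procs]

# Launch a few short-lived children and wait for them all.
children = [subprocess.Popen([sys.executable, "-c", "pass"]) for _ in range(3)]
codes = watch_till_done(children)
```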
The "pslist" command invoked by this code is an external Windows utility that returns a list of processes currently running on our machine.
Here we look at an alternate way we might monitor the child processes: watching for changes in their log files. If we planned to work this way, we might want to have all the parallel tasks write to the same log file to simplify our checking. If we do write to the same file from multiple processes, we'll need to use the locking routines in the "mutex.ijs" script (included above but not shown) to ensure that one task does not overwrite another task's output.
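The idea behind such mutex routines can be sketched as follows - in Python, using an exclusive-create lock file rather than the internals of mutex.ijs, which the slides don't show:

```python
import os
import tempfile
import time

def append_locked(path, line, retry=0.05):
    """Append a line to a shared file, guarded by a lock file created
    with O_EXCL so only one process can hold it at a time."""
    lockpath = path + ".lock"
    while True:
        try:
            fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break                      # we own the lock
        except FileExistsError:
            time.sleep(retry)          # another process holds it; wait
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
    finally:
        os.close(fd)
        os.remove(lockpath)            # release the lock

log = os.path.join(tempfile.mkdtemp(), "run.log")
append_locked(log, "task 0 done")
append_locked(log, "task 1 done")
```

Because the lock is an ordinary file, it works across independent processes, which is exactly the situation our forked tasks are in.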
At the end of this slide we see an example of how we consolidate the results from numerous files into a single one for simpler post-processing.
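Consolidation reduces to concatenating the per-task files; a Python sketch, with hypothetical file names:

```python
import glob
import os
import tempfile

def consolidate(pattern, out_path):
    """Concatenate all files matching 'pattern' into a single file
    for simpler post-processing."""
    with open(out_path, "w") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name) as f:
                out.write(f.read())
    return out_path

# Demonstrate on a couple of throwaway result files.
d = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(d, f"results{i}.txt"), "w") as f:
        f.write(f"timing {i}\n")
combined = consolidate(os.path.join(d, "results*.txt"), os.path.join(d, "all.txt"))
```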
Timings for Different Numbers of Processes in Parallel
In this last section, we'll look at how parallelism affects the timing of our processing.
Morten Kromberg of Dyalog observed that a common limiting factor with multiple processes is not so much the memory as the memory bandwidth: even multi-core processors are limited to about 4 GB of memory accessible at one time, which may be reflected in the low memory usage seen here.
* Verbs to monitor running processes by parsing the result of the (Windows-specific) command-line routine pslist.
* "Mutual Exclusion" (mutex) routines to allow multiple processes to access a common file without stepping on each other.