[Nauty] Parallelizing generation

Brendan McKay <Brendan.McKay at anu.edu.au>
Mon Jan 18 21:54:32 AEDT 2021


Hi Tomas,

Compile the program with SPLITTEST defined.  Then run it with 
non-trivial splitting.  It will report the splitting level, the number 
of nodes on that level, and the time.  For example, with parameters "8 8 
2/3" I get

    >Z 25102903 splitting cases at level 6; cpu=1.43 sec

The splitting level does not depend on res/mod, but you have to specify 
some res/mod for this feature to work.  (I should fix that.)
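
If you want to collect these reports programmatically, the >Z line is 
easy to parse.  A rough Python sketch (it assumes a binary compiled 
with SPLITTEST, here called "genbg_split" as a placeholder, and the 
line format shown above):

    import re, subprocess

    # Run a SPLITTEST-compiled genbg and pull out the splitting report.
    # "./genbg_split" and the parameters are placeholders for your setup.
    proc = subprocess.run(["./genbg_split", "8", "8", "2/3"],
                          capture_output=True, text=True)
    m = re.search(r">Z (\d+) splitting cases at level (\d+); cpu=([\d.]+)",
                  proc.stderr + proc.stdout)
    if m:
        cases, level, cpu = int(m.group(1)), int(m.group(2)), float(m.group(3))
        print(f"level {level}: {cases} splitting cases, {cpu} sec")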

There is an undocumented option -X#, where # is an integer, that moves 
the default splitting level up by #. The value of # can be negative.  
For example, "-X-1 8 8 2/3" gives

    >Z 834126 splitting cases at level 5; cpu=0.09 sec

You can use -X to experiment with different splitting levels and see 
whether one of them suits your application.  Note that if the number of 
splitting cases is not much larger than mod, the parts probably won't 
be of similar sizes.
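
If it helps, here is one way to drive the parts from a script.  This 
is an untested Python sketch; the genbg path, the options and class 
parameters, and the part count are placeholders for your own setup:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    GENBG = "./genbg"          # placeholder: path to your genbg binary
    ARGS = ["-X-1", "8", "8"]  # placeholder: options and class parameters
    MOD = 32                   # number of parts, e.g. one per core

    def run_part(res):
        # Part res/MOD generates only the subtrees numbered res (mod MOD).
        p = subprocess.run([GENBG, *ARGS, f"{res}/{MOD}"],
                           capture_output=True, text=True, check=True)
        return p.stdout

    with ThreadPoolExecutor(max_workers=MOD) as pool:
        chunks = list(pool.map(run_part, range(MOD)))

    with open("all_parts.g6", "w") as f:
        for chunk in chunks:
            f.write(chunk)

The same pattern works with a batch scheduler; the only requirement is 
that every part runs with identical parameters apart from res/mod.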

Brendan.

On 18/1/21 9:09 pm, Tomas Peitl wrote:
> Hi Brendan,
>
> Thanks for the detailed explanation!
>
> How is the splitting level determined? Can I set it manually, or is it 
> simply the first level that has size >= mod? And can I diagnose this, 
> for instance by seeing where the splitting occurred, and possibly count 
> how much time was spent on pruning and generation before and after that 
> point?
>
> It seems that the space is split into roughly even parts, so a 
> plausible explanation is indeed that generating the shared tree takes 
> too long. But on the other hand, I'm not trying to use that many cores 
> (30-50), and I already see behavior suggesting it may take as much as 
> 300 seconds for the shared tree, which I can't quite believe. (This is 
> for bicoloured 12 + 11 vertices with pruning and a matching on the 12 
> vertices, using up to 32 cores; the output has ~140k graphs, and 
> running times and output sizes are roughly even across the parts.)
>
> Best,
> Tomas
>
> On 16.01.21 04:03, Brendan McKay wrote:
>> Hi Tomas,
>>
>> I suspect you are using parameters for which the splitting is not good.
>>
>> Consider the case of generating all bicoloured graphs with 8 vertices 
>> on each side, in total 14,685,630,688 graphs.
>>
>> One part:    3872 sec
>> 10 parts:    392.9 sec each (factor 9.85)
>> 100 parts:   41.0 sec each  (factor 94.4)
>> 1000 parts:  5.33 sec each  (factor 726)
>> 10000 parts: 1.74 sec each  (factor 2220)
>>
>> This is close to linear speedup up to around 1000 parts.  I'll 
>> explain how it is done.  Remember that a part is specified as a pair 
>> res/mod, where 0<=res<=mod-1.
>>
>> The generation has the form of a tree, with output graphs as leaves.  
>> Draw a line across the tree at some level (the splitting level) and 
>> number the nodes on that level modulo mod (starting at 0).  Part 
>> res/mod corresponds to the portion of the tree descended from those 
>> nodes numbered res.
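>>
>> In toy form it might look like this (a Python illustration only, not 
>> the actual genbg code; here the "generation tree" is just a binary 
>> tree of fixed depth):
>>
>>     MAX_LEVEL = 6      # leaves at this depth stand for output graphs
>>     SPLIT_LEVEL = 3    # the splitting level
>>
>>     def generate(prefix, level, res, mod, counter, out):
>>         if level == SPLIT_LEVEL:
>>             index = counter[0]      # running number of this node
>>             counter[0] += 1
>>             if index % mod != res:
>>                 return              # subtree belongs to another part
>>         if level == MAX_LEVEL:
>>             out.append(prefix)      # a leaf, i.e. one output graph
>>             return
>>         for bit in "01":            # toy "extensions" of the object
>>             generate(prefix + bit, level + 1, res, mod, counter, out)
>>
>>     leaves = []
>>     for res in range(3):            # parts 0/3, 1/3, 2/3 together
>>         generate("", 0, res, 3, [0], leaves)
>>     print(len(leaves))              # 64 = 2**MAX_LEVEL, each leaf once
>>
>> Because every run numbers the nodes at the splitting level in the 
>> same order, the parts are disjoint and together cover the whole tree.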
>>
>> Note that the entire tree must be generated as far as the splitting 
>> level, which is an overhead for every part. In the example above, it 
>> is about 1.35 seconds, which is why the parallelism gets poor when 
>> the time per part isn't much more than the overhead. Check that the 
>> time per part is close to 1.35 + 3870/N for N parts.
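>> Indeed, plugging in the numbers above: N=10 gives 1.35 + 387 = 388.4 
>> (measured 392.9); N=100 gives 1.35 + 38.7 = 40.1 (measured 41.0); 
>> N=1000 gives 1.35 + 3.87 = 5.2 (measured 5.33); and N=10000 gives 
>> 1.35 + 0.39 = 1.74 (measured 1.74).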
>>
>> In the case of different classes of bipartite graphs, various things 
>> can go wrong.  If the shape of the tree is very irregular, the size 
>> of the parts can vary a lot, which matters if your application has to 
>> wait for the slowest part.  If the tree is strongly pruned in
>> the levels above the splitting level, it could mean that the 
>> splitting overhead becomes a larger fraction of the total. Sometimes 
>> there is no splitting level that gives both acceptable evenness of 
>> splitting and acceptable overhead.
>>
>> I have used two splitting levels together on occasion. This feature 
>> is not in the code but it is something to try if changing the 
>> splitting level is not enough to solve the problem.
>>
>> Brendan.
>>
>>
>> On 16/1/21 2:15 am, Tomas Peitl wrote:
>>> Hi everyone,
>>>
>>> I've got a question about parallelization in generators such as 
>>> genbg (I'm using a modified version of genbg that Brendan provided 
>>> some time ago, but I hope that doesn't make a difference).
>>>
>>> I noticed that the speedup from parallelization using the res/mod 
>>> feature pretty much plateaus at around 5-6 cores, and it doesn't 
>>> really make too much sense to use more than that. Is this more or 
>>> less correct, or am I missing something? To be fair, the speedup 
>>> does grow a bit even beyond 6 cores (it depends on the particular 
>>> case, I guess), but the dependence on the number of jobs looks more 
>>> logarithmic than linear (I didn't really try to fit a curve). So if 
>>> you have more than one generation task, it makes sense to use fewer 
>>> cores per task, and instead run several tasks in parallel.
>>>
>>> I have some aggressive custom pruning, but on the other hand each 
>>> job generates roughly the same number of results, and the running 
>>> times are also more or less uniformly distributed, so the splitting 
>>> of the search space seems OK.
>>>
>>> How does the res/mod feature work? Does each job still have to 
>>> generate everything, just with the advantage of quickly skipping 
>>> over graphs that belong to other jobs? Or can it run in "O(N/j)"?
>>>
>>> Many thanks,
>>> Tomas