Parallel Matrix Elimination
I released a new version of M4RI today which contains a parallel implementation for matrix elimination. Below I reproduce some timings for this code to give a rough idea of the performance of this code.
| Matrix Dimension |
Magma 2.14-13 (64-bit, 1 core) |
M4RI (64-bit, 1 core) |
M4RI (64-bit, 4 cores) |
|---|---|---|---|
| 10,000 x 10,000 | 3.283 | 2.509 | 1.064 |
| 16,384 x 16,384 | 11.204 | 10.741 | 3.918 |
| 20,000 x 20,000 | 16.911 | 19.776 | 7.216 |
| 32,000 x 32,000 | 57.761 | 86.071 | 32.420 |
| 64,000 x 64,000 | 355.477 | 640.742 | 307.213 |
The examples hfe25_5, hfe30_5 and hfe35_5 from the M4RI website take 1.44, 9.29 and 51.56 seconds respectively.
Note that this is work in progress and that the algorithm still has worse complexity than the one implemented in Magma. Also note that the speed-up is far from linear and that the speed-up decreases with the size. This is probably because each thread falls out of L2 more often and the threads clog each other.

