LQUP vs. PLUQ
At SD16 Clément Pernet and myself have been working on improving the asymptotically fast PLUQ factorisation over GF(2) in M4RI. As mentioned earlier, one of the main problems is that column swaps are pretty expensive compared to many other operations we do. Eventually, we settled for LQUP over PLUQ since it has fewer column swaps overall since it does not compress U. We also improved the base case both w.r.t. to sparse matrices and in general (more Gray code tables are used now) and the column swap performance overall (cf. SD 16 Wiki). The result is noticeable, but we are not quite there yet:
There are still some places which could be improved so this should get better eventually. Also, we might have another strategy to deal with these sparse-ish/structured matrices. Anyway, the new PLUQ code is at least as fast as M4RI for the structured HFE examples on the M4RI website on my Core2Duo 2.33Ghz notebook (and of course much faster on random examples and on other platforms) The new code is available on BitBucket.
Thu, 18. Jun 2009More OpenMP Experiments with M4RI
Motivated by a thread on [mpir-dev] I played around with OpenMP again today. The performance does not scale linearly … but hey it scales at all. I guess eventually I’ll have to get serious about this and sit down to make this proper. Anyway, here are the timings (on geom.math.washington.edu)
| n | M4RI 1 thread |
PLUQ 1 thread |
M4RI 4 threads |
PLUQ 4 threads |
M4RI 16 threads |
PLUQ 16 threads |
PLUQ 16 threads cutoff=2048 |
|---|---|---|---|---|---|---|---|
| 10,000 | 1.72 | 0.85 | 1.03 | 0.86 | 0.58 | 0.80 | 0.77 |
| 16,384 | 13.75 | 5.76 | 4.78 | 4.23 | |||
| 20,000 | 27.02 | 5.45 | 7.35 | 5.48 | 3.27 | 3.68 | |
| 32,000 | 112.74 | 21.96 | 30.51 | 22.02 | 13.78 | 13.91 | 12.95 |
| 64,000 | 227.80 | 157.03 | 104.94 | 75.95 | 66.54 | ||
| 100,000 | 1078.72 | 429.32 | 869.43 | 596.51 | 428.08 | 260.99 | 231.01 |
For some reason which I don’t understand yet is PLUQ slower for 16,384 than 20,000 on this machine. The code is on bitbucket.
Wed, 03. Jun 2009Geometric XL
The paper by Sean Murphy and Maura Paterson has been around for quite some time now (same for my toy implementation). Gröbner basis algorithms and related methods like XL are algebraic in nature. In particular, their complexity is not invariant under a linear change of coordinates. For example consider Cyclic-6
sage: P.<a,b,c,d,e,f,h> = PolynomialRing(GF(32003))
sage: I = sage.rings.ideal.Cyclic(P,6).homogenize(h)
sage: J = Ideal(I.groebner_basis())
The generators of J are a Gröbner basis and we can use this property to find a common root for these generators. Now, consider the same equations but permute the variables in the ring.
sage: P.<a,b,c,d,e,f,h> = PolynomialRing(GF(32003),order='lex')
sage: I = sage.rings.ideal.Cyclic(P,6).homogenize(h)
sage: J = Ideal(I.groebner_basis())
sage: R = PolynomialRing(GF(32003),P.ngens(),list(reversed(P.variable_names())),order='lex')
sage: H = Ideal([R(str(f)) for f in J.gens()])
The generators of H do not form a Gröbner basis in R which is just P with its variables reversed. If we are only trying to solve a system of equations choosing the right permutation of variables might have a significant impact on the performance of our Gröbner basis algorithm:
sage: t = cputime()
sage: gb = H.groebner_basis('libsingular:std')
sage: gb[-1].degree()
19
sage: cputime(t) # output random-ish
25.36...
While in this example it is easy to see which variable permutation is the cheapest one, this is not necessarily the case in general. The GeometricXL algorithm is invariant under any linear change of coordinates and has the following property:
Let D be the degree reached by the XL algorithm to solve a given system of equations under the optimal linear change of coordinates. Then GeometricXL will also solve this system of equations for the degree D, without applying this optimal linear change of coordinates first.
The above behaviour holds under two assumptions:
- the characteristic of the base field K is bigger than D and
- the system of equations has one over “very few” solution.
To demonstrate this behaviour, consider the synthetic benchmark available here which is a Gröbner basis under a linear change of coordinates:
sage: e,h = random_example(n=6)
e is the original easy system while h is the “rotated” system
sage: e.basis_is_groebner()
True
sage: max([f.total_degree() for f in e.gens()])
2
sage: h.basis_is_groebner()
False
sage: max([f.total_degree() for f in h.gens()])
2
GeometricXL recovers linear factors and thus candidates for common roots at D=2:
sage: hH = h.homogenize()
sage: f = GeometricXL(hH, D=2); f.factor(False)
0.0...s -- 1. D: 2
...
(-2684)
* (-1056*x5 - 2964*x4 - 177*x3 + 6206*x2 + 376*x1 + 6257*x0 + h)
* (2957*x5 - 792*x4 - 4323*x3 - 14408*x2 - 2750*x1 - 8823*x0 + h)
While any Gröbner basis algorithm would have to reach at least degree 64:
sage: gb = h.groebner_basis('libsingular:slimgb')
sage: gb[-1].degree()
64
While my toy implementation is neither robust not efficient, the following table (using the same synthetic benchmark as above) should give some indication that for some problems GeometricXL is a good choice:
| n | GeometricXL degree | GeometricXL time in seconds | Singular 3-0-4 time in seconds | Magma 2.14 time in seconds | Gröbner basis degree |
|---|---|---|---|---|---|
| 2 | 2 | 0.01 | 0.00 | 0.00 | 4 |
| 3 | 2 | 0.07 | 0.00 | 0.00 | 8 |
| 4 | 2 | 0.18 | 0.00 | 0.00 | 16 |
| 5 | 2 | 0.37 | 0.02 | 0.01 | 32 |
| 6 | 2 | 0.73 | 0.12 | 0.04 | 64 |
| 7 | 2 | 1.36 | 0.94 | 0.09 | 128 |
| 8 | 2 | 2.35 | 9.59 | 0.56 | 256 |
| 9 | 2 | 4.24 | 117.43 | 3.79 | 512 |
| 10 | 2 | 7.34 | — | 28.68 | 1024 |
| 11 | 2 | 12.08 | — | 205.05 | 2048 |
| 12 | 2 | 20.03 | — | — | — |
| 13 | 2 | 35.53 | — | — | — |
| 14 | 2 | 53.98 | — | — | — |
| 15 | 2 | 88.36 | — | — | — |
| 16 | 2 | 143.16 | — | — | — |
While the benchmark is synthetical and unrealistic, there are a few problems in multivariate quadratic public key cryptography which might be tackled using an idea similar to the GeometricXL algorithm. Also, note that constructing a GeometricMatrixF5 algorithm is straight forward by replacing the first step of the algorithm.
Wed, 20. May 2009F4-Style F5 — Next Attempt
Since the F4-style F5 I mentioned a while ago wasn’t really that “F4-style” I’ve pushed my next attempt into the public “algebraic attack” bitbucket repository. This version swaps the two outer loops, i.e. it proceeds by degree in the outer loop instead of the index of the polynomials. Of course, we are talking about a toy implementation here to understand the algorithm and not an attempt to implement F5 efficiently.
SSH Paper Available Online
Our paper on plaintext recovery attacks against SSH is now available online both locally and on Kenny’s website. Btw. contrary to popular belief the paper is not on an implementation error in OpenSSH but on a design flaw in the SSHv2 specification which enables a variety of attacks against various implementations of the standard.
Tue, 07. Apr 2009Funny
It might very well be just a conincidence, but superficially it looks like Mathematica is targeting Sage users with ads on Google.
Source: [sage-devel] … scroll down to see Harald Schilly debunk the whole thing. Still, it’s funny.
Sun, 22. Mar 2009Windows Sage 0.3
As you might have heard, there is a Windows port of Sage in the making. While the current version doesn’t look like Sage proper, it already ships quite a few building blocks. Since, incidently the Information Security Group is now part of the “Microsoft Developer Academic Alliance” — which grants students free access to their products — I gave it a quick spin. See below.

Installation was straight forward once you have Visual Studio 2008 installed, just call build.bat and wait an hour or so.
Fri, 20. Mar 2009M4RI API Change and Big Matrices
Finally, I found some time to work on M4RI again: I changed the internal representation of matrices to support more than one malloc() call per matrix, i.e. each matrix is split into blocks of some maximal size. This allows to deal with much bigger matrices under Linux because the kernel often won’t just give you 8GB mapped to consecutive addresses but it will give you 1GB eight times. This limitation bugged me quite some time now, since this also limited the kind of systems I could solve using PolyBoRi with M4RI enabled. The result is available at http://bitbucket.org/malb/m4ri/ but bear in mind that the API changed quite a bit for this (e.g., I renamed the packedmatrix to mzd_t) on the way).
| Matrix Dimension |
Memory (expected) |
M4RI/M4RI (64-bit, 1 core) |
M4RI/PLUQ (64-bit, 1 core) |
|---|---|---|---|
| 100,000 x 100,000 | > 1.16 GB | 1078.72 s | 429.32 s |
| 200,000 x 200,000 | > 4.65 GB | — | 2298.30 s |
| 256,000 x 256,000 | > 7.63 GB | 8979.33 s | 3709.25 s |
The above table gives the time to compute the rank of random matrices of the given dimension using the given algorithms on http://geom.math.washington.edu.
Tue, 10. Mar 2009A Simple Bitslice Implementation of PRESENT
I had a hunch the other day which required $2^{36}$ plaintext—ciphertext pairs to be tested. While the hunch turned out to be wrong, at least I got a simple bitslice implementation of PRESENT (in pure C99) out of it. The implementation is far from optimal:
- it is pure C but critical paths should be pushed down to assembly,
- it doesn’t use SSE2’s 128-bit instructions but it should, and
- the S-Box is hardly optimised (it is simply ANF).
Still, it is better than what I found on the web via a quick Google search. On my 2.33 Ghz Core2Duo Macbook Pro it seems to run at 28 cycles per byte for long messages (not counting the key schedule). For comparison, I think the current speed of AES in bit slice mode is like 10 cycles per byte on the Core2. If I need some performance in the future, I might sit down and improve the implementation (such that it doesn’t totally suck) but for now I have to attend to other projects.
Mon, 02. Mar 2009Attacking Cryptographic Schemes Based on “Perturbation Polynomials”
“We show attacks on several cryptographic schemes that have recently been proposed for achieving various security goals in sensor networks. Roughly speaking, these schemes all use “perturbation polynomials” to add “noise” to polynomial-based systems that offer information-theoretic security, in an attempt to increase the resilience threshold while maintaining efficiency. We show that the heuristic security arguments given for these modified schemes do not hold, and that they can be completely broken once we allow even a slight extension of the parameters beyond those achieved by the underlying information-theoretic schemes. Our attacks apply to the key predistribution scheme of Zhang et al. (MobiHoc~2007), the access-control schemes of Subramanian et al. (PerCom~2007), and the authentication schemes of Zhang et~al. (INFOCOM~2008).”
That’s the abstract of a paper of Craig Gentry, Shai Halevi, Jonathan Katz and myself which just appeared on eprint. The source code for the experiments conducted in that work is available at bitbucket.
Mon, 02. Feb 2009Sage hits Debian/unstable
As jzmer just pointed out on #sage-devel, Sage hit Debian/unstable today-ish. Of course, this also means that the wealth of math software Sage ships (e.g., Singular) is now also available in Debian/unstable. Well done Tim!
Update:Carl — also on IRC — points out that there are issues with the binary shipped with Debian as of now. Hopefully, this will get resolved quickly.
Wed, 28. Jan 2009Semi-Regular Bits and Pieces
John Perry’s slides from his talk at Sage Days 12 in San Diego are available on the Sage wiki.
In my Diplomarbeit I had an algorithm for computing Gröbner bases of block ciphers called “Gröbner surfing”: “Instead of computing a reduced Gröbner basis for all rounds $rgb_F$ it computes the reduced Gröbner basis $rgb_{i+1}$ up to round $i + 1$ recursively as $$rgb_{i+1} = rgb(gb_i + round_{i+1})$$ with $rgb_0 = rgb(round_0)$ where $$rgb(round_i)$$ denotes any algorithm returning a reduced Gröbner basis for a given finite set of polynomials $round_i$.” This obviously is a tiny subset of what $F_5$ does: computing the Gröbner basis for $\langle f_i,\dots,f_m \rangle$ from the Gröbner basis of $\langle f_{i+1},\dots,f_m \rangle$. So nothing new here, move along.
I’m going to the Fast Software Encryption (FSE) 2009 workshop in Leuven to present our paper “Algebraic Techniques in Differential Cryptanalysis”. The full list of accepted papers is available online.
Sun, 25. Jan 2009F4-Style F5
When I asked Jean-Charles Faugère a while back why he doesn’t publish his F4-style F5 he answered that there would be nothing to publish since it is straight forward. I don’t know about that but I was actually quite surprised how quickly John Perry and I could come up with an F4-style F5 here at Sage Days 12 (pictures). Btw. Till Stegers calls this F4.5 but I don’t know how Jean-Charles Faugère refers it.
I’ve uploaded the toy implementation to bitbucket. It seems to behave like the polynomial F5 implementation in the same file, so at least it is bug by bug compatible with that. Speaking of behaviour: When computing Cyclic-6 over $\mathbb{F}_{32003}$ w.r.t. degrevlex I noticed that my F4 implementation only goes up to degree 16 while my F5 implementations need to consider degree 18. Although no matrix is constructed for degree 18, the F4-style F5 does construct and eliminate a matrix for degree 17 … puzzling.
Fri, 16. Jan 2009Bitslicing and the Method of the Four Russians over Larger Finite Fields
Tom Boothby’s and Robert Bradshaw’s paper on the “Method of the Four Russian” multiplication algorithm over $\mathbb{F}_3$, $\mathbb{F}_5$, $\mathbb{F}_7$, $\mathbb{F}_{2^2}$ and $\mathbb{F}_{2^3}$ is available as pre-print on the arXiv. If you’re into fast exact linear algebra I highly recommend reading it since it has some really nice ideas in it and is well written.
Abstract. “We present a method of computing with matrices over very small finite fields of size larger than 2. Specifically, we show how the Method of Four Russians can be efficiently adapted to these larger fields, and introduce a row-wise matrix compression scheme that both reduces memory requirements and allows one to vectorize element operations. We also present timings which confirm the efficiency of these methods and exceed the speed of the fastest implementations the authors are aware of.”
Mon, 12. Jan 2009F5 Sans Rewriting
In his PhD thesis Justin Gash writes: “Rewritten gives us information to be used as an additional criterion for eliminating critical pairs. … In short, we could remove all discussion of rules and Rewritten and F5 would work fine. (But it would work much more slowly.)”
This motivated an implementation of F5SansRewriting which removes all mentions of Rewriting to make it easier to solely focus on the F5 criterion first. Of course the result is horribly inefficient. Cyclic-5 over $Z_{32003}$ takes 0.15 seconds using F5 and 81520.64 seconds using F5SansRewriting … well, at least it gives correct results and hopefully helps somehow.
Fri, 09. Jan 2009“What Every Programmer Should Know About Memory”
is the title of an interesting paper by Ulrich Drepper of Rad Hat, Inc. The paper is — as you might have guessed — about the memory architecture in modern CPUs and what you — dear programmer — can do to improve performance of your code based on this knowledge. The paper has a section on tools for performance tuning which also discusses OProfile. OProfile is an interface to the CPU’s event counters (which vary across platforms). These counters can be queried to investigate the behaviour of a given piece of code on a given CPU. Intel’s Intel 64 and IA-32 Architectures Optimization Reference Manual suggests a list of event counters to look at. For instance the Intel engineers write:
“Clocks Per Instruction Retired Ratio (CPI): CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. The Intel Core microarchitecture is capable of reaching CPI as low as 0.25 in ideal situations. But most of the code has higher CPI The greater value of CPI for a given workload indicate it has more opportunity for code tuning to improve performance. The CPI is an overall metric, it does not provide specificity of what microarchitectural sub-system may be contributing to a high CPI value.”
To take it for a spin I chose the obvious target: the libM4RI library. Below is the CPI for bench_elimination 20000 20000 pluq.
== UNIFORMLY RANDOM ==
symbol CPI % of overall time
------------------------------------------------------
_mzd_mul_m4rm 0.94 (49.68)
mzd_process_rows 1.76 (13.68)
.plt 0.62 (10.47)
mzd_process_rows2_pluq 1.88 (10.07)
mzd_make_table 0.76 ( 8.31)
mzd_copy 1.85 ( 1.80)
_mzd_mul_naive 1.10 ( 1.38)
mzd_apply_p_right_trans 0.99 ( 1.14)
_mzd_trsm_lower_left_even 0.75 ( 0.94)
mzd_apply_p_right 0.81 ( 0.93)
mzd_combine 3.17 ( 0.61)
Note that mzd_process_rows() and mzd_process_rows2_pluq() (which are both called from the PLUQ MMPF base case) have a rather high CPI compared to the rest of the code, indicating that I should sit down and implement using more than two tables (multiplication uses eight tables). I don’t understand the high CPI for mzd_copy() though. Half rank matrices behave quite differently, but surprisingly the CPI for mzd_apply_p_right_trans() is rather low.
== HALF RANK ==
symbol CPI % of overall time
------------------------------------------------------
_mzd_mul_m4rm 0.93 (40.01)
mzd_apply_p_right_trans 0.79 (10.05)
mzd_apply_p_right 0.78 ( 7.44)
mzd_make_table 0.75 ( 7.26)
mzd_find_pivot 1.44 ( 6.00)
.plt 0.68 ( 5.64)
_mzd_pluq_mmpf 1.58 ( 4.58)
_mzd_pluq_submatrix 0.90 ( 3.88)
mzd_col_swap 1.20 ( 3.58)
mzd_process_rows 1.76 ( 2.85)
mzd_row_add_offset 1.61 ( 2.24)
mzd_process_rows2_pluq 1.88 ( 2.10)
_mzd_mul_naive 1.19 ( 1.49)
mzd_copy 2.17 ( 1.32)
mzd_combine 3.31 ( 0.89)
_mzd_trsm_lower_left_even 0.75 ( 0.25)
_mzd_trsm_upper_left_even 0.67 ( 0.13)
mzd_init_window 1.24 ( 0.07)
My preliminary understanding of those high .plt values is that compiling the benchmarketing software with -static would improve the timings (or that I should inline more?).
Wed, 07. Jan 2009PhD Thesis on F5
May I point my dear reader’s attention to Justin Gash’s PhD thesis on F5. It is mainly referred to because it points out an error in the proof of termination in the original F5 paper. I started reading the thesis today (in preparation of SD12 in San Diego) and it already helped me a lot to understand the algorithm. As a consequence I added a bunch of comments from Justin’s thesis to my toy F5 implementation which should explain a bit what is going on.
However, since Justin’s pseudocode is not identical with John Perry’s pseudocode (on which my implementation is based) things don’t match up perfectly. For instance, I still need to figure out what the differences in critical_pair exactly come down to.
Mon, 05. Jan 2009M4RI-20090105 Released
Sources are available for download. See release notes for details about what changed since the last version and some timings (1, 2, 3, 4) to get an idea of the performance.
Sat, 03. Jan 2009New M4RI Release Coming Up
I will probably cut a new M4RI release within a week or so. The main changes are:
- Asymptotically Fast PLUQ Factorisation [mzd_pluq() and _mzd_pluq_mmpf()] due to Clément Pernet and myself: This enables asymptotically fast rank computation, row echelon forms and rank profiles. However, don’t hold your breath yet because the code is not really optimised yet. While it is faster than M4RI for random dense matrices it is slower for sparse-ish and structured matrices (see image below).
- System Solving [mzd_solve_left()]: Jean-Guillaume Dumas wrote a high-level wrapper around the PLUQ and TRSM routines to provide linear system solving.
- M4RI Performance Improvement [mzd_echelonize_m4ri()]: A bug was fixed in M4RI which resulted in poor performance of M4RI for sparse-ish matrices (see my blog post for details).
- Refactoring: A few functions where added, deleted and renamed. This release will break source compatibility.
On a related note: Marco Bodrato came up with a new Strassen-like sequence for multiplying and squaring matrices which provides a small (linear) speed-up for squaring. He also provides a drop-in strassen.c replacement for M4RI-20080521 which implements his new sequence.
Tue, 23. Dec 2008M4RI’s and MMPF’s Sensitivity to Density
The last couple of days I’ve been working on improving libM4RI for sparse-ish matrices. The matrices I am talking about here are still represented as dense matrices but have non-uniformly distributed entries. While PLUQ factorisation is still very very (very very) slow for e.g. half rank matrices, things are getting better for M4RI matrix elimination.
The updated sources are available on bitbucket. I will probably cut a new release very early next year.
Sat, 13. Dec 2008Bit Bucket
About three weeks ago I discovered Bitbucket: a web service (with free and paid-for plans) for hosting Mercurial repositories. So far it has everything I would want:
- a free plan with 150MB of disk space, one private repository and as many public repositories as you like;
- the offer to host your open-source project even if it is bigger than 150MB;
- optional issue trackers and wikis (which are also under revision control) for each repository;
- convenient online source code browsing, viewing/comparison of changesets, downloads;
- push/pull via SSH (with public keys) and HTTPS;
- straight-forward management of read/write access control for each repository;
- and all kinds of third party service integrations (twittering your changesets and such).
I’m now hosting the main M4RI repository on bitbucket and so far it is a very smooth experience. Speaking of M4RI, it now contains an implementation of PLUQ factorisation inspired by Greg’s M4RI algorithm dubbed “Method of Many People Factorisation” in brilliantrussian.c. While this implementation seems to do its job, we still have iron out a couple of bugs in the recursive PLUQ factorisation codebase.
Finally, some random code snippets I posted on this blog are now also available on bitbucket (e.g., F4, F5, Matrix F5, DES equation system generators, Present equation system generators, an ANF to CNF converter).
Reactions to CPNI-957037: Vulnerability in SSH
- OpenSSH published a “Statement on ‘Plaintext Recovery Attack Against SSH’ (CPNI-957037)” and committed a first fix. Both the statement and the bugfix only address the OpenSSH specific attack from the advisory. “A variant of the attack against OpenSSH in the standard configuration can verifiably recover 14 bits of plaintext with probability $2^{-14}$.” (CPNI-957037)
- SunSSH’s Jan Pechanec writes: “For the first time we increased SunSSH version in OpenSolaris just because of a security vulnerability, to 1.3”. However, it seems they only fixed the $2^{-14}$ attack. Sun also issued a security advisory.
- SSH.com issued a security advisory and acknowledged that their products are vulnerable. However, no probabilities are given. The company claims to have fixed the issue in their latest line of products. I don’t know what their fix is.
- WinSSHD acknowledged that their product is vulnerable and issued an update which successfully (as far as I know) prevents the attack: “Our mitigation in WinSSHD 5.03 attempts to thwart this attack by denying the attacker any means of distinguishing a successful attempt from an unsuccessful one. This only protects data flowing in the direction to WinSSHD (e.g. the client’s password). Clients which do not implement similar mitigation can still allow this attack to succeed, when CBC is used, for data flowing from WinSSHD.”
- Dropbear’s 0.52 release added “counter mode cipher support, which avoids some security problems with the standard CBC mode.” on November 12th.
Update:The US-Cert vulnerability note is VU#958563.
Thu, 27. Nov 2008Talk on Matrix F5
“We will describe the ‘Matrix F5’ algorithm. Although not presented as an algorithm there, this algorithm appears on the first pages of Jean-Charles Faugere’s famous F5 paper to explain the main idea of F5 itself. It was later described in several French PhD theses but never formally published in English. The talk is titled ‘for the working cryptographer’ because we are going to start from the relatively well known XL algorithm and extend it to Matrix F5.”
Fri, 14. Nov 2008CPNI-957037: Vulnerability in SSH
Abstract:Vulnerability found in the network protocol SSH (Secure Shell), that allows data to be exchanged using a secure channel between two networked devices. A design flaw in the SSH specification allows an attacker with control over the network to recover up to 32 bits of plaintext from an SSH-protected connection in the standard configuration. The success probability in recovering 32 plaintext bits is $2^{-18}$ when attacking the OpenSSH implementation of the SSH RFCs. A variant of the attack against the OpenSSH implementation verifiably recovers 14 plaintext bits with probability $2^{-14}$. The recovered bits come from an arbitrary, attacker-selected block of ciphertext. The success probabilities for other implementations are unknown (but are potentially much higher).
See also: SSH.com
Update: Added “verifiably” for the 14-bit attack to point out that this attack is strictly better than guessing 14 bits. CPNI will probably update the advisory too.
Wed, 12. Nov 2008“Efficient Multiplication of Dense Matrices over GF(2)”
“We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (GF(2)). In particular we present our implementation - in the M4RI library - of Strassen-Winograd matrix multiplication and the “Method of the Four Russians” multiplication (M4RM) and compare it against other available implementations. Good performance is demonstrated on on AMD’s Opteron and particulary good performance on Intel’s Core 2 Duo. The open-source M4RI library is available stand-alone as well as part of the Sage mathematics software.
In machine terms, addition in GF(2) is logical-XOR, and multiplication is logical-AND, thus a machine word of 64-bits allows one to operate on 64 elements of GF(2) in parallel: at most one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations over GF(2) are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.”
Related News: My shiny new version of Magma 2.14-17 seems to perform better than Magma 2.14-14 for matrix multiplication over $\mathbf{F}_2$ on the Core 2 Duo. So I updated the performance data on the M4RI website. However, the changelog doesn’t mention any improvements in this area. Btw. searching for “Magma 2.14” returns the M4RI website first for me, which feels wrong on so many levels. Finally, M4RI is being packaged for Fedora Core.
Thu, 06. Nov 2008Yet Another Talk on Sage
Today, I gave a talk on Sage to the ISG PhD seminar at Royal Holloway. I think it went alright, although people around here just don’t get excited about computation that much. Anyway, I’ve uploaded the slides and the demo (pdf, worksheet).
Thu, 30. Oct 2008Algebraic Attacks against Block Ciphers
That’s the title of the 15 minute talk I gave yesterday at Cambridge to Part 3 students. It is basically a refurbished version of a talk I gave in March 2007 in Seattle. Also, I was asked to forward this:
“Beyond Part III will take place over 16-18 April 2009 at the Centre for Mathematical Sciences, University of Cambridge.
Our aim is to develop a stronger network between mathematics graduate communities in the UK and internationally. We would like to emphasise that Beyond Part III is open to all mathematical researchers at the start of their career, whether or not they did Part III. The majority of participants are likely to be those who began a PhD in mathematics within the last five years, but this is a guideline only.
A dozen sessions covering different research areas will be run for one and a half days. Most participants will present work and each session will also include a keynote talk from a more senior researcher in that field. Beyond Part III will enable young mathematicians working in similar areas to form new links, and so plenty of social events are also planned. …”
Sun, 26. Oct 2008Matrix F5
I finally fixed my Matrix $F_5$ implementation. I’ll also give a talk about Matrix $F_5$ to the PhD seminar at ISG on November 27th. I’ll the post the slides around then for those interested.
Mon, 20. Oct 2008Bits and Pieces
I spent last week at Sage Days 10 (pictures and more pictures) in Nancy, France which turned out to be a very nice event. I (together with Simon King, Michael Brickenstein and Ludovic Perret) spent most of my time working on various toy implementations of F5 in order to understand the algorithm (better). We also conversed with John Perry who’s pseudcode, Singular code and description of F5 was incredibly helpful (and motivated me to work on this project in the first place, btw). My toy implementation of the polynomial version of F5 is available online and so is Simon King’s Cython implementation for Sage. John Perry now also provides a non-homogeneous version of F5 based on my Sage implementation.
I also implemented a toy version of F5/Matrix which indeed avoids a fair number of zero reductions and returns a Groebner basis if $d_{max}$ is big enough. I don’t think it does avoid as many reductions as the polynomial version which indicates a problem in my code. Note, that F5/Matrix is not described in detail in English literature (if you speak French and want to translate some short notes on this algorithms to English, please let me know).
I didn’t work on M4RI during Sage Days 10 but Clement Pernet and Jean-Guillaume Dumas did. We will probably release a new version with tried and tested TRSM code soon. Btw. I also gave a contributed talk about M4RI at Sage Days 10. My project for a train ride right after Sage Days 10 was to improve univariate polynomials over $\mathbb{F}_2$ in Sage to improve modular composition of polynomials.
I’m off to Santander on Wednesday for the 2nd Workshop on Mathematical Cryptology and once I’m back I’ll give a talk at the Graduate Studies Elsewhere Open Afternoon in Cambridge on “Algebraic Attacks against Block Ciphers”.
One last thing: The DES generator is broken due to a bug in Sage, a fix is available on Trac.
Update:fixed a bug in the F5/Matrix code and removed a nonsense statement about the rank of the matrices.
Thu, 18. Sep 2008Equation System Generator for DES
I’ve uploaded an equation system generator for the “Data Encryption Standard” (DES) hoping to spare others the tedious work of writing one. As far as I know only equation systems but no flexible generators are floating around the net. I implemented three S-Box representations (following Courtois’ and Bard’s paper Algebraic Cryptanalysis of DES): fully cubic equations, quadratic equations with added variables (Matthew Kwan’s bitslice representation) and the former with pre-computed Gröbner bases.
# loading the file
sage: attach des.py
# two rounds only, see DES? for help on the constructor
sage: des = DES(Nr=2)
# default S-Box representation is opns_gb
sage: F,s = des.polynomial_system()
Pre-computing Groebner bases for S-Boxes, this might take a moment.
sage: %time gb = F.groebner_basis()
CPU times: user 15.39 s, sys: 0.03 s, total: 15.42 s
Wall time: 15.78 s
# verify correctness
sage: gb[:14]
[k55, k54, k52, k51 + 1, k50 + 1, k49, k48 + 1, k47 + 1, k46 + 1, k45 + 1, k44 + 1, k43, k42, k41 + 1]
sage: s[des.k[55]]
0
sage: s[des.k[51]]
1
Let me know if you run into any trouble.
Upcoming Workshops
I’m going to attend Sage Days 10 in Nancy, France (October 10-15, 2008). At SD10 I’ll give a contributed talk about matrix multiplication over $\mathbb{F}_2$, i.e. the M4RI library. I’m also going to attend the Second Workshop on Mathematical Cryptology in Santander, Spain (October 23-25, 2008).
Tue, 16. Sep 2008New Papers on eCrypt
Ann Hibner Koblitz (Women and Gender Studies Program, Arizona State University), Neal Koblitz (Dept. of Mathematics, Univ. of Washington), and Afred Menezes (Dept. of Combinatorics & Optimization, Univ. of Waterloo) published an interesting history of elliptic curve cryptography (ECC) titled: Elliptic Curve Cryptography: The Serpentine Course of a Paradigm Shift. They briefly introduce the math but the main part is their presentation and discussion how ECC became widely accepted (which apparently is not just because researchers agreed on a scientific basis). I only skimmed it, but it seems certainly worth a read. Speaking of Neal Koblitz and Alfred Menezis: Here are their papers on “provable security” which caused a fair amount of controversy: Another Look at “Provable Security” and Another Look at “Provable Security” II
Itai Dinur and Adi Shamir put a pre-print of their much hyped and discussed paper Cube Attacks on Tweakable Black Box Polynomials on ePrint. See also David Wagner’s summary.
Fri, 22. Aug 2008Gröbner Bases over $ZZ$
Vanilla Sage does not compute Gröbner bases over $ZZ$ by any definition. However, this feature has been requested several times. The earliest account I could find quickly is this post by Joe Wetherell. Below is a list of options for Gröbner bases over $ZZ$ in Sage.
- Singular’s upcoming release will feature Gröbner bases over rings. In fact, the feature is present in the current Singular release but not enabled by default. An SPKG with that functionality enabled can be found here. ring r = integers,(x,y),lp; declares a ring over the integers where some things work and some things don’t. Note that this ring declaration is not final, i.e. the name integers may change. Also, this SPKG has issues and crashes on me for some operations. We’re working on tracking that issue down.
- Macaulay2 has support for Gröbner bases over rings and a decent Sage interface supporting that functionality. Macaulay2 1.1 is available as an experimental SPKG. First, one needs to install boehm_gc-7.1.p0 and gdbm-1.8.3. Then, since the version in experimental didn’t compile for me, try my new SPKG.
- If you are lucky enough to have Magma installed, I.groebner_basis(“magma:GroebnerBasis”) does the job. If you don’t have Magma installed try magma_free.
- Ginv (also available as a optional package) also supports Gröbner bases over $ZZ$. However, for the example Joe gave in his e-mail it crashes on me and I’ve contacted upstream about it.
- Last and least: I have a toy implementation of the $d$-Gröbner basis algorithm from the Becker-Weispfenning. Don’t hold your breath, it is dead slow.
Hopefully, due to the upcoming Singular release the situation will improve soon and we’ll finally have Gröbner bases over $ZZ$ in Sage.
Mon, 18. Aug 2008Parallel Matrix Elimination
I released a new version of M4RI today which contains a parallel implementation for matrix elimination. Below I reproduce some timings for this code to give a rough idea of the performance of this code.
| Matrix Dimension |
Magma 2.14-13 (64-bit, 1 core) |
M4RI (64-bit, 1 core) |
M4RI (64-bit, 4 cores) |
|---|---|---|---|
| 10,000 x 10,000 | 3.283 | 2.509 | 1.064 |
| 16,384 x 16,384 | 11.204 | 10.741 | 3.918 |
| 20,000 x 20,000 | 16.911 | 19.776 | 7.216 |
| 32,000 x 32,000 | 57.761 | 86.071 | 32.420 |
| 64,000 x 64,000 | 355.477 | 640.742 | 307.213 |
The examples hfe25_5, hfe30_5 and hfe35_5 from the M4RI website take 1.44, 9.29 and 51.56 seconds respectively.
Note that this is work in progress and that the algorithm still has worse complexity than the one implemented in Magma. Also note that the speed-up is far from linear and that the speed-up decreases with the size. This is probably because each thread falls out of L2 more often and the threads clog each other.
Mon, 11. Aug 2008GCC 4.3 and -O3
I recently upgraded an Opteron server to Debian/Lenny to get GCC 4.3 for OpenMP reasons. It turns out that my code, namely matrix multiplication as implemented in the M4RI library, ran much slower than when compiled with GCC 4.1. For instance, to multiply two $20,000 \times 20,000$ random matrices took 18.38 seconds with GCC 4.1 but 21.00 seconds with GCC 4.3.1 and to multiply two $32,000 \times 32,000$ random matrices took 70.24 seconds with GCC 4.1 but 80.00 second with GCC 4.3.1. Eventually, I checked the highlevel changelog and found: “The -ftree-vectorize option is now on by default under -O3. In order to generate code for a SIMD extension, it has to be enabled as well: use -maltivec for PowerPC platforms and -msse/-msse2 for i?86 and x86_64.” However, we don’t use SSE2 on the Opteron since it is slower than the standard instruction set for this application. Passing -no-tree-vectorize to the compiler fixed the problem. However, to my surprise -O2 didn’t come with a speed penalty either, so I settled for this. The final timings on my Opteron server are:
| Matrix Dimension |
M4RI GCC 4.3 (64-bit, 4 cores) |
M4RI GCC 4.3 (64-bit, 1 core) |
M4RI GCC 4.1 (64-bit, 1 core) |
Magma 2.14-13 (64-bit, 1 core) |
|---|---|---|---|---|
| 20000x20000 | 6.36 | 17.81 | 18.38 | 18.35 |
| 32000x32000 | 26.65 | 68.01 | 70.24 | 68.01 |
I suppose the moral of the story is: -O3 isn’t necessarily better than -O2 just because 3>2.
Fri, 18. Jul 2008ISSAC
I’m going to this year’s International Symposium on Symbolic and Algebraic Computation (ISSAC) starting on Sunday. William is giving a plenary lecture on Sage titled “Can There be a Viable Free Open Source Alternative to Magma, Maple, Mathematica, and Matlab?” a.k.a. “Computer algebra community meet Sage; Sage meet the computer algebra community”.
Tue, 08. Jul 2008Scapy and Sage
“Scapy is a powerful interactive packet manipulation program. It is able to forge or decode packets of a wide number of protocols, send them on the wire, capture them, match requests and replies, and much more. It can easily handle most classical tasks like scanning, tracerouting, probing, unit tests, attacks or network discovery (it can replace hping, 85% of nmap, arpspoof, arp-sk, arping, tcpdump, tethereal, p0f, etc.). It also performs very well at a lot of other specific tasks that most other tools can’t handle, like sending invalid frames, injecting your own 802.11 frames, combining technics (VLAN hopping+ARP cache poisoning, VOIP decoding on WEP encrypted channel, …)”
At the end of the day Scapy is one (one!) Python file so it couldn’t be easier to use it from within Sage. As an example let’s assume we have sniffed an SSH connection establishment including a Diffie-Hellmann Group Exchange as described in RFC 4419. Scapy can do live packet capture and injection but that would require root privileges, so I’m working with a pcap file in this example:
from scapy import rdpcap, TCP, IP
SSH2_MSG_KEX_DH_GEX_GROUP = 31
# read packets
packets = [p[IP] for p in rdpcap("/home/malb/example.pcap") \
if p[TCP] and len(p[TCP]) > 32]
# find correct package & payload
for packet in packets:
try:
pl = [ord(e) for e in packet[TCP].payload.load]
if pl[5] == SSH2_MSG_KEX_DH_GEX_GROUP:
break
except AttributeError:
pass
def get_uint(pl, length):
# this is not as generic as it should be since it doesn't work
# with negative numbers
value = ZZ(0)
for i in range(length):
value += pl[i] * 2**(8*(length - i - 1))
return value, pl[length:]
packet_length, pl = get_uint(pl, 4)
padlen, pl = get_uint(pl, 1)
packet_type, pl = get_uint(pl, 1)
assert(packet_type == SSH2_MSG_KEX_DH_GEX_GROUP)
# p
p_length, pl = get_uint(pl, 4)
p, pl = get_uint(pl, p_length)
# g
g_length, pl = get_uint(pl, 4)
g, pl = get_uint(pl, g_length)
assert(len(pl) == padlen)
assert(p.is_prime())
Zp = GF(p)
g = Zp(g)
e = g**ZZ.random_element(0,p)
e.log(g) # yeah, right ;-)
Happy hacking.
Fri, 20. Jun 2008XOR for Fun and Profit
I just gave a talk on linear algebra over GF(2), optimisation techniques and applications to algebraic cryptanalysis. Slides are available online.
libM4RI in Debian Unstable
malb@XXX:~$ apt-cache search m4ri libm4ri-dev - Method of the Four Russians library, development files libm4ri0 - Method of the Four Russians library, shared library
Big thanks to Tim for making that happen!
QOTD
“Mathematics is the art of reducing problems to linear algebra.” Bill Hart was introduced to this definition when he started his undergrad.

