This software is a public-domain implementation of the techniques from my paper on implementing AES using vector permute instructions. Doing so has several advantages:
The code is written in GNU C and assembly language for x86-64. It can be ported to other architectures, but I haven't gotten to that yet and might not for a while. Here are performance numbers from the three machines I tested:
CPU family | mode | 128-bit | 192-bit | 256-bit |
---|---|---|---|---|
Nehalem | ctr | 9.45 | 11.23 | 13.65 |
Nehalem | cbc | 9.33 | 11.55 | 12.89 |
Nehalem | cbc⁻¹ | 11.07 | 13.91 | 15.25 |
Penryn | ctr | 11.64 | 13.74 | 15.86 |
Penryn | cbc | 11.35 | 13.49 | 15.53 |
Penryn | cbc⁻¹ | 13.80 | 16.41 | 19.18 |
Conroe | ctr | 19.00 | 25.78 | 29.97 |
Conroe | cbc | 21.33 | 25.69 | 30.03 |
Conroe | cbc⁻¹ | 25.89 | 31.05 | 36.30 |
As you can see, this code is quite fast on recent machines: much faster than OpenSSL, and comparable to Crypto++. The exception here is Conroe, which has a very slow shuffler. vpaes doesn't (yet) implement CTR-mode caching, so CTR mode is not any faster than CBC mode. Furthermore, encryption is faster than decryption due to the more complex MixColumns matrix for decryption.
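To make the last point concrete, here is a minimal scalar sketch of MixColumns and its inverse on a single state column. It is a reference-style illustration only, not the vectorized code in vpaes, and the function names (`gf_mul`, `mul_column`, `mix_column`, `inv_mix_column`, `mix_column_selftest`) are made up for this example. The forward matrix uses the coefficients 1, 2 and 3, which need at most one doubling in GF(2^8); the inverse uses 9, 11, 13 and 14, which need three doublings each, so a straightforward implementation does more work per byte when decrypting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 (0x11b), using shift-and-add. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        /* Double a, reducing if the top bit falls off. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* Multiply one 4-byte column by the circulant matrix whose first
 * row is c[0..3]; row i is that row rotated right by i positions. */
static void mul_column(const uint8_t c[4], const uint8_t in[4], uint8_t out[4])
{
    for (int i = 0; i < 4; i++) {
        uint8_t acc = 0;
        for (int j = 0; j < 4; j++)
            acc ^= gf_mul(c[(j - i + 4) % 4], in[j]);
        out[i] = acc;
    }
}

/* First rows of the forward and inverse MixColumns matrices. */
static const uint8_t MIX[4]     = {0x02, 0x03, 0x01, 0x01};
static const uint8_t INV_MIX[4] = {0x0e, 0x0b, 0x0d, 0x09};

static void mix_column(const uint8_t in[4], uint8_t out[4])
{
    mul_column(MIX, in, out);
}

static void inv_mix_column(const uint8_t in[4], uint8_t out[4])
{
    mul_column(INV_MIX, in, out);
}

/* Check a commonly cited test column and the round trip.
 * Returns 1 on success; the asserts abort on failure. */
static int mix_column_selftest(void)
{
    const uint8_t in[4]     = {0xdb, 0x13, 0x53, 0x45};
    const uint8_t expect[4] = {0x8e, 0x4d, 0xa1, 0xbc};
    uint8_t mixed[4], back[4];

    mix_column(in, mixed);
    assert(memcmp(mixed, expect, 4) == 0);

    inv_mix_column(mixed, back);
    assert(memcmp(back, in, 4) == 0);
    return 1;
}
```

Calling `mix_column_selftest()` from a `main` of your own exercises both directions. A vector-permute implementation evaluates these products differently, but the inverse matrix still has heavier coefficients, which is consistent with the slower cbc decryption figures in the table.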
There are still several "to do" items in vpaes:
I'd also like to try these tricks with Camellia and Fugue, which use the AES core.
Download: vpaes (312 KB).
Note that this is a preliminary release, is only minimally tested, comes with no warranty, etc. Please send questions or comments to Mike Hamburg.
Changelog: