[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Tinycc-devel] vectorize the curent hash implementation
From: |
Vladimir Vissoultchev |
Subject: |
Re: [Tinycc-devel] vectorize the curent hash implementation |
Date: |
Sat, 23 Apr 2016 11:26:26 +0300 |
Hi,
I did some test with the original rolling hash slightly tweaked to this:
#define TOK_HASH_INIT 0
#define TOK_HASH_PRIME 263
#define TOK_HASH_FUNC(h, c) ((h)*TOK_HASH_PRIME + (c))
... and was able to replace this hash calculating loop:
h = TOK_HASH_INIT;
for(i=0;i<len;i++)
h = TOK_HASH_FUNC(h, ((unsigned char *)str)[i]);
h &= (TOK_HASH_SIZE - 1);
... with this equivalent SSE2 calculation with step of 8 bytes:
short *pp = hash_primepow + STRING_MAX_SIZE - (len);
__m128i vh = _mm_setzero_si128(), v;
for (i = 0; i < len; i += 8) {
v = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)(str + i)),
_mm_setzero_si128());
vh = _mm_add_epi32(vh, _mm_madd_epi16(_mm_loadu_si128((const __m128i
*)(pp + i)), v));
}
vh = _mm_add_epi32(vh, _mm_srli_si128(vh, 8));
vh = _mm_add_epi32(vh, _mm_srli_si128(vh, 4));
h = _mm_cvtsi128_si32(vh) & (TOK_HASH_SIZE - 1);
... where `hash_primepow` is declared as
static short hash_primepow[STRING_MAX_SIZE + 9] = { 0 };
... and is initialized in `preprocess_new` like this
/* init rolling hash prime powers table in reverse */
for (h = 1, i = STRING_MAX_SIZE - 1; i >= 0; i--) {
hash_primepow[i] = h;
h = h * TOK_HASH_PRIME & (TOK_HASH_SIZE - 1);
}
(Note: SSE calc loop oversteps input string so file buffer has to
overallocated by 8 bytes in `tcc_open_bf`)
Unfortunately there was no noticable improvement, which leads me to beleave
that hash calculation is not the main culprit for the performance of
`next_nomacro1`
cheers,
</wqw>
-----Original Message-----
From: Tinycc-devel [mailto:address@hidden
On Behalf Of Sergey Korshunoff
Sent: Saturday, April 23, 2016 10:38 AM
To: address@hidden
Subject: [Tinycc-devel] vectorize the curent hash implementation
Hi!
> IMO, the way to go w/ performance improvement is to vectorize the curent
hash implementation. Instead of computing the hash one character at a time,
take next 4 characters, process these in parallel and compute 4 hashes
How to implement this?
There is some hash optimization for the tcc compiled by gcc.
https://github.com/seyko2/tinycc/commit/b89f0d63af4f494a83c91fb0360d7d37c0f6
f9a3
_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel