bug#39266: Finalization thread hits wrong-type-arg on weak vector (AArch64)

From: Ludovic Courtès
Subject: bug#39266: Finalization thread hits wrong-type-arg on weak vector (AArch64)
Date: Mon, 09 Mar 2020 15:38:45 +0100
Ludovic Courtès <address@hidden> wrote:

> While building the “guix-system.drv” derivation on AArch64, I got this
> crash (not fully deterministic but quite frequent).  Here the
> finalization thread gets a wrong-type-arg in ‘scm_i_weak_car’ (i.e.,
> accessing a one-element weak vector):

With 3.0.1, I can reproduce the bug on x86_64.  With rr (thanks, Andy!),
I found this (starting from the point where the type cell of the weak
vector is zeroed, and reverse-continuing until it gets its original
value of 0x10f):

--8<---------------cut here---------------start------------->8---
(rr) frame 40
#40 0x00007ffff7f2e66d in scm_i_weak_car (pair=0x7fffe15af690) at 
190       return SCM_CAR (x);
(rr) down
#39 0x00007ffff7f2f576 in scm_c_weak_vector_ref (wv=<optimized out>, k=k@entry=0) at weak-vector.c:193
193       SCM_VALIDATE_WEAK_VECTOR (1, wv);
#38 0x00007ffff7ea7ba0 in scm_wrong_type_arg_msg (
    subr=subr@entry=0x7ffff7f56f00 <s_scm_weak_vector_ref> "weak-vector-ref", 
    bad_value=0x7fffec472b90, szMessage=szMessage@entry=0x7ffff7f56e80 "weak vector") at error.c:282
282           scm_error (scm_arg_type_key,
(rr) p *((void**)0x7fffec472b90)
$1 = (void *) 0x0
(rr) watch *((void**)0x7fffec472b90)
Hardware watchpoint 1: *((void**)0x7fffec472b90)
(rr) reverse-cont

Thread 1 received signal SIGCONT, Continued.
[Switching to Thread 27074.27074]
__lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:101
101     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.

Thread 1 hit Hardware watchpoint 1: *((void**)0x7fffec472b90)

Old value = (void *) 0x0
New value = (void *) 0x10f
__memset_avx2_unaligned_erms () at 
259     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
(rr) bt
#0  __memset_avx2_unaligned_erms () at 
#1  0x00007ffff7f1d499 in set_vtable_access_fields 
(vtable=vtable@entry=0x7fffeb48ee80) at struct.c:143
#2  0x00007ffff7f1dd8d in scm_i_struct_inherit_vtable_magic 
    obj=obj@entry=0x7fffeb48ee80) at struct.c:215
#3  0x00007ffff7f1dfea in scm_c_make_structv (vtable=0x7ffff4e32fa0, 
n_tail=<optimized out>, n_init=8, 
    init=0x7fffffff50d0) at struct.c:364
#4  0x00007ffff7f1e0b9 in scm_make_struct_no_tail (vtable=0x7ffff4e32fa0, 
init=0x304) at struct.c:491
--8<---------------cut here---------------end--------------->8---
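
To make the watchpoint values easier to read, here is a minimal sketch
of how the word 0x10f can be decoded.  It assumes that the low 7 bits of
the first word hold the type tag, that the weak-vector tag is 0x0f, and
that the length sits in the bits above bit 8.  That layout is an
assumption inferred from the 0x10f value of a one-element weak vector,
and the helper names are hypothetical, not Guile API:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag value for weak vectors (low 7 bits of the first word).  */
#define ASSUMED_WVECT_TAG 0x0fU

/* Hypothetical helper: does this first word look like a weak vector?  */
int
looks_like_weak_vector (uintptr_t word)
{
  return (word & 0x7fU) == ASSUMED_WVECT_TAG;
}

/* Hypothetical helper: length field, assumed to live above bit 8.  */
uintptr_t
assumed_weak_vector_length (uintptr_t word)
{
  return word >> 8;
}
```

Under these assumptions, 0x10f decodes as a weak vector of length 1,
whereas a zeroed word no longer carries the weak-vector tag, which is
consistent with the type check at weak-vector.c:193 failing.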

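Bingo!  There’s a mismatch in struct.c: the buffer is allocated with
‘bitmask_size’ bytes, but ‘memset’ clears ‘bitmask_size * sizeof
(*unboxed_fields)’ bytes, so it writes past the end of the allocation
(which is how the weak vector’s type cell got zeroed):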

--8<---------------cut here---------------start------------->8---
  bitmask_size = (nfields + 31U) / 32U;
  unboxed_fields = scm_gc_malloc_pointerless (bitmask_size, "unboxed fields");
  memset (unboxed_fields, 0, bitmask_size * sizeof(*unboxed_fields));
--8<---------------cut here---------------end--------------->8---

Pushed a fix as 7c17655cd3d859bf0c5a86d9782a7788205fc05a.
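
The size mismatch can be illustrated with a small standalone sketch.  It
assumes 32-bit bitmask words and uses plain ‘malloc’ as a stand-in for
‘scm_gc_malloc_pointerless’; all function names are hypothetical, and
this is one way to make the two sizes agree, not necessarily what the
commit above does:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Number of 32-bit words needed for one bit per field, as in the
   snippet from struct.c.  */
size_t
bitmask_words (size_t nfields)
{
  return (nfields + 31U) / 32U;
}

/* Bytes requested by the mismatched allocation: a word count is
   passed where a byte count is expected.  */
size_t
mismatched_alloc_bytes (size_t nfields)
{
  return bitmask_words (nfields);
}

/* Bytes written by the memset call in the snippet.  */
size_t
cleared_bytes (size_t nfields)
{
  return bitmask_words (nfields) * sizeof (uint32_t);
}

/* Hypothetical helper: allocate and clear the bitmask using the same
   byte count for both steps.  Returns NULL if allocation fails.  */
uint32_t *
make_unboxed_bitmask (size_t nfields)
{
  size_t bytes = bitmask_words (nfields) * sizeof (uint32_t);
  uint32_t *unboxed_fields = malloc (bytes);

  if (unboxed_fields != NULL)
    memset (unboxed_fields, 0, bytes);
  return unboxed_fields;
}
```

For example, with 8 fields the mismatched version requests 1 byte but
clears 4, and the gap grows with the number of fields.  Computing the
byte count once and using it for both the allocation and the ‘memset’
keeps the two from drifting apart.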

Thanks, rr!  You made my day!  :-)

Now testing Guix builds on x86_64, i686, ARMv7, and AArch64 to see if
that addresses seemingly related issues.


