74 Comments

The kernel.s390x scratch build failure appears to be an infrastructure issue:

$ git clone -n https://src.fedoraproject.org/rpms/kernel.git /var/lib/mock/f40-build-side-86585-49916704-5958020/root/chroot_tmpdir/scmroot/kernel
Cloning into '/var/lib/mock/f40-build-side-86585-49916704-5958020/root/chroot_tmpdir/scmroot/kernel'...
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

https://kojipkgs.fedoraproject.org//work/tasks/4215/115504215/checkout.log

The fedora-ci.koji-build.tier0.functional failure cannot be diagnosed because the URL is broken.

Hmph, I see it. I misinterpreted the nature of the ns-slapd bug, and the upstream workaround I pushed does not actually work around it. Hmmph. I guess I'll need another fix.

Before the workaround, glibc had this loop:

 220       while (cmp (run_ptr, tmp_ptr, arg) < 0)
 221         tmp_ptr -= size;

https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/qsort.c;h=ad110e8a892a66e1fc90f850b828e1a2d09e2ac5;hb=HEAD#l220

The loop is known to terminate if the comparison function is correct because eventually, run_ptr == tmp_ptr, and cmp must return zero. If that never happens, we eventually run into non-allocated memory regions. The only access to that memory is from the cmp function here, not from the qsort implementation, so that crash will happen in the comparison callback.
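To illustrate the termination argument, here is a minimal sketch of an insertion pass with an explicit lower bound, so a misbehaving comparator cannot walk tmp_ptr below the start of the array. This is not the glibc code and not necessarily the fix that will be used; guarded_insertion_sort, int_cmp, never_zero_cmp, and the 64-byte element limit are hypothetical names and assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*cmp_fn)(const void *, const void *, void *);

/* Well-behaved three-way comparator for int: returns 0 for equal values. */
static int int_cmp(const void *a, const void *b, void *arg)
{
    (void)arg;
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Deliberately broken comparator: always claims "less than". */
static int never_zero_cmp(const void *a, const void *b, void *arg)
{
    (void)a; (void)b; (void)arg;
    return -1;
}

/* Insertion sort whose inner loop is bounded by the start of the array.
   Hypothetical helper; assumes 0 < size <= 64 bytes per element. */
static void guarded_insertion_sort(void *base, size_t n, size_t size,
                                   cmp_fn cmp, void *arg)
{
    unsigned char tmp[64];
    char *first = base;

    assert(size > 0 && size <= sizeof tmp);
    for (size_t i = 1; i < n; i++) {
        char *run_ptr = first + i * size;
        char *tmp_ptr = run_ptr;

        /* Move left while the predecessor compares greater, but never
           step past the first element, whatever cmp returns. */
        while (tmp_ptr > first && cmp(run_ptr, tmp_ptr - size, arg) < 0)
            tmp_ptr -= size;

        if (tmp_ptr != run_ptr) {
            /* Rotate the element at run_ptr into its slot at tmp_ptr. */
            memcpy(tmp, run_ptr, size);
            memmove(tmp_ptr + size, tmp_ptr, (size_t)(run_ptr - tmp_ptr));
            memcpy(tmp_ptr, tmp, size);
        }
    }
}
```

With int_cmp the array ends up sorted. With never_zero_cmp the resulting order is meaningless, but every access stays inside the array, which is the property the unguarded loop does not have.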

Thanks. The comparison function can never return zero: https://github.com/389ds/389-ds-base/blob/main/ldap/servers/plugins/cos/cos_cache.c#L2933

This is clearly a 389-ds-base bug. The old qsort implementation in glibc did not tickle it because it rarely called the comparison function with equal pointer arguments. We have already worked around similar application problems elsewhere in the new implementation, so we can probably do the same in the insertion sort phase.
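For reference, a small sketch of the comparator contract in question. It is not the 389-ds-base code; struct entry, broken_cmp, fixed_cmp, and is_reflexive are hypothetical names used only to show the difference between a comparator that can never return zero and one that reports equality.

```c
#include <stdbool.h>

/* Hypothetical element type, standing in for whatever is being sorted. */
struct entry {
    int priority;
};

/* Broken: equal elements (even the same pointer) compare as "greater". */
static int broken_cmp(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return x->priority < y->priority ? -1 : 1;
}

/* Fixed: three-way result, returns 0 when the keys are equal. */
static int fixed_cmp(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* A comparator suitable for qsort must return 0 for cmp(e, e). */
static bool is_reflexive(int (*cmp)(const void *, const void *),
                         const void *elem)
{
    return cmp(elem, elem) == 0;
}
```

A check like is_reflexive in an application's test suite would catch this class of bug without depending on how a particular qsort implementation happens to call the callback.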

With the new approach:

# ipa-getkeytab -p HTTP/x0.cockpit.lan -k /etc/cockpit/krb5.keytab 
Keytab successfully retrieved and stored in: /etc/cockpit/krb5.keytab

I'll do another build, so that the AnyConnect users can test it as well.

But I think I see what's wrong with the current ELF destructor ordering approach. I'll experiment with something else.

This is a Fedora 38 cloud image with some extra packages installed, so you can install debug symbols, run gdb, etc.

Thank you. I got to this point and could reproduce the assert, but the VM with ipa-getkeytab does not have a default route. Any idea how to fix that? DHCP assigns 172.27.0.2 for the eth0 interface, but no default route.

This backtrace is more interesting:

Stack trace of thread 1959:
#0  0x00007f81c8ab0884 __pthread_kill_implementation (libc.so.6 + 0x8e884)
#1  0x00007f81c8a5fafe raise (libc.so.6 + 0x3dafe)
#2  0x00007f81c8a4887f abort (libc.so.6 + 0x2687f)
#3  0x00007f81c8a4879b __assert_fail_base.cold (libc.so.6 + 0x2679b)
#4  0x00007f81c8a58187 __assert_fail (libc.so.6 + 0x36187)
#5  0x00007f81c9030323 krb5int_key_delete (libkrb5support.so.0 + 0x6323)
#6  0x00007f81c86f0e8b gssint_mechglue_fini (libgssapi_krb5.so.2 + 0xee8b)
#7  0x00007f81c91f50f2 _dl_call_fini (ld-linux-x86-64.so.2 + 0x10f2)
#8  0x00007f81c91f8e5e _dl_fini (ld-linux-x86-64.so.2 + 0x4e5e)
#9  0x00007f81c8a621e6 __run_exit_handlers (libc.so.6 + 0x401e6)
#10 0x00007f81c8a6232e exit (libc.so.6 + 0x4032e)
#11 0x00005622a14583bc main (ipa-getkeytab + 0x63bc)
#12 0x00007f81c8a49b8a __libc_start_call_main (libc.so.6 + 0x27b8a)
#13 0x00007f81c8a49c4b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27c4b)
#14 0x00005622a1459bb5 _start (ipa-getkeytab + 0x7bb5)

It has _dl_fini in it, so it's very likely it's caused by the changes in this update.

@adamwill @martinpitt How can I create a VM (or set of VMs) that reproduces this issue? Thanks.

@adamwill How can we reproduce this in an environment where we can run the failing process under a debugger, or with certain environment variables configured? Thanks.

The bz699724 test was added recently and is apparently still under development, so I'm not particularly worried about it. It still needs porting to Python 3.

@adamwill Which failure specifically worries you? I have trouble finding it in the results.