FEDORA-2019-7730ed2edf

bugfix update in Fedora 30 for 389-ds-base

Status: obsolete

Bump version to 1.4.1.7

Comments (12)

This update has been submitted for testing by mreynolds.

This update's test gating status has been changed to 'waiting'.

This update's test gating status has been changed to 'ignored'.

A FreeIPA test failed. It looks like the web UI errors out immediately after logging in as a regular domain user (though it works OK in a previous test step when logged in as the administrator). I'll try and look into this in more detail in a bit.

karma: -1

This update has been obsoleted.

Note: same failure on prod and staging, so it doesn't look like a flake.

So, I fiddled with openQA a bit and got logs from the server. Note that I did get one successful run of the test, so it seems the bug doesn't happen every time. I now have 4 fails to 1 pass, though.

Here's the /var/log tarball from the server. It seems that the web UI error is due to the directory server not being available, which is kinda what I expected. The journal shows these errors for dirsrv@DOMAIN-LOCAL.service:

Sep 17 18:06:18 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:06:18.258867527 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:06:22 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:06:22.943741546 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:11:04 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:11:04.640917031 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:11:35 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:11:35.299014906 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Main process exited, code=killed, status=11/SEGV
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Failed with result 'signal'.

Note the failed web UI access attempt happens just five seconds after the service decides to shut down. So there's a timing element here; presumably that's why the earlier web UI access as admin works.

Oh, yikes, I just noticed it's actually crashing, isn't it?

Sep 17 18:14:21 ipa001.domain.local audit[7901]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=7901 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Sep 17 18:14:21 ipa001.domain.local kernel: show_signal_msg: 56 callbacks suppressed
Sep 17 18:14:21 ipa001.domain.local kernel: ns-slapd[7932]: segfault at 303735d1 ip 00007f76bd8e5db4 sp 00007f76a5df5cd8 error 4 in libslapd.so.0.1.0[7f76bd8d1000+b2000]
Sep 17 18:14:21 ipa001.domain.local kernel: Code: 8d 35 6d e0 09 00 e8 db 50 ff ff b8 ff ff ff ff e9 cb fd ff ff e8 8c 56 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa <8b> 87 a0 00 00 00 21 f0 c3 0f 1f 00 f3 0f 1e fa 09 b7 a0 00 00 00
Sep 17 18:14:21 ipa001.domain.local named-pkcs11[8339]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed
Sep 17 18:14:21 ipa001.domain.local named-pkcs11[8339]: ldap_syncrepl will reconnect in 60 seconds
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Main process exited, code=killed, status=11/SEGV
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Failed with result 'signal'.

It doesn't seem like we caught a core dump, though, unfortunately. I do have the openQA tests set up to try to capture any core dump recorded by coredumpctl or abrt, but they don't seem to have found one. I'll have to poke at it some more tomorrow and see if I can get hold of the dump.
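
For reference, checking by hand for a dump on the server looks roughly like this; the match string and output path are just examples, nothing magic about them:

# see whether systemd-coredump recorded anything for ns-slapd
coredumpctl list ns-slapd
# if it did, extract the core for later analysis
coredumpctl dump ns-slapd -o /var/tmp/ns-slapd.core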

It's definitely crashing, and I think I know which issue it might be, but we do need a core file to verify. Ideally, if you could get us the stack trace from the system after the crash, that would be easier: attach gdb to the core file and run "thread apply all bt full", or attach gdb to ns-slapd before the test is run and catch the crash live (then run "thread apply all bt full").
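
For reference, that works out to roughly the following; the core file path is just an example, point it at wherever the dump actually lands (and having the matching debuginfo packages installed makes the "bt full" output much more useful):

# against a saved core file
gdb /usr/sbin/ns-slapd /var/tmp/ns-slapd.core -ex "thread apply all bt full"
# or attach before the test runs, let the server continue, and grab the
# backtrace once gdb stops on the SIGSEGV
gdb -p $(pidof ns-slapd)
(gdb) continue
(gdb) thread apply all bt full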

Current status: @mreynolds did a scratch build with a potential fix, but it still crashed. I am now knee-deep in hacking up the test to try and attach gdb to ns-slapd and get a backtrace out of it.
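
The rough shape of what I'm trying is a batch-mode gdb attach that sits on the server process and dumps a full backtrace if it stops; the output path and exact flags here are just a sketch, not necessarily what the test ends up doing:

# attach to the running server in batch mode; when it stops (e.g. on the SIGSEGV),
# dump a full backtrace of every thread, then exit and detach
gdb -p "$(pidof ns-slapd)" --batch \
    -ex "continue" \
    -ex "thread apply all bt full" \
    > /var/tmp/ns-slapd-backtrace.txt 2>&1 &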

OK, so it got weirdly harder to reproduce the crash - I had a bunch of passing tests with both the original update and the scratch build - but I finally got a crash with a full backtrace:

That's with the later scratch build, 37740210.


Content Type: RPM
Status: obsolete
Test Gating: ignored
Submitted by: mreynolds
Update Type: bugfix
Update Severity: unspecified
Karma: -1 (stable threshold: 1, unstable threshold: -1)
Autopush (karma): Enabled
Autopush (time): Enabled
Dates: submitted a month ago
