FEDORA-2019-7730ed2edf created by mreynolds 3 months ago for Fedora 30 (status: obsolete)

Bump version to 1.4.1.7

This update has been submitted for testing by mreynolds.

3 months ago

This update's test gating status has been changed to 'waiting'.

3 months ago

This update's test gating status has been changed to 'ignored'.

3 months ago
adamwill commented & provided feedback 3 months ago

A FreeIPA test failed. It looks like the web UI errors out immediately after logging in as a regular domain user (though it works OK in a previous test step when logged in as the administrator). I'll try and look into this in more detail in a bit.

This update has been obsoleted.

3 months ago

Note: same failure on prod and staging, so it doesn't look like a flake.

So, I fiddled with openQA a bit and got logs from the server. Note that I did get one successful run of the test, so it seems the bug doesn't happen every time. I now have 4 fails to 1 pass, though.

Here's the /var/log tarball from the server. It seems that the web UI error is due to the directory server not being available, which is kinda what I expected. The journal shows these errors for dirsrv@DOMAIN-LOCAL.service:

Sep 17 18:06:18 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:06:18.258867527 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:06:22 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:06:22.943741546 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:11:04 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:11:04.640917031 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:11:35 ipa001.domain.local ns-slapd[7901]: [17/Sep/2019:21:11:35.299014906 -0400] - ERR - dna-plugin - dna_update_shared_config - Unable to update shared config entry: dnaHostname=ipa001.domain.local+dnaPortNum=389,cn=posix-ids,cn=dna,cn=ipa,cn=etc,dc=domain,dc=local [error 21]
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Main process exited, code=killed, status=11/SEGV
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Failed with result 'signal'.

Note the failed web UI access attempt happens just five seconds after the service decides to shut down. So there's a timing element here; presumably that's why the earlier web UI access as admin works.

Oh, yikes, I just noticed it's actually crashing, isn't it?

Sep 17 18:14:21 ipa001.domain.local audit[7901]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=7901 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Sep 17 18:14:21 ipa001.domain.local kernel: show_signal_msg: 56 callbacks suppressed
Sep 17 18:14:21 ipa001.domain.local kernel: ns-slapd[7932]: segfault at 303735d1 ip 00007f76bd8e5db4 sp 00007f76a5df5cd8 error 4 in libslapd.so.0.1.0[7f76bd8d1000+b2000]
Sep 17 18:14:21 ipa001.domain.local kernel: Code: 8d 35 6d e0 09 00 e8 db 50 ff ff b8 ff ff ff ff e9 cb fd ff ff e8 8c 56 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa <8b> 87 a0 00 00 00 21 f0 c3 0f 1f 00 f3 0f 1e fa 09 b7 a0 00 00 00
Sep 17 18:14:21 ipa001.domain.local named-pkcs11[8339]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed
Sep 17 18:14:21 ipa001.domain.local named-pkcs11[8339]: ldap_syncrepl will reconnect in 60 seconds
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Main process exited, code=killed, status=11/SEGV
Sep 17 18:14:21 ipa001.domain.local systemd[1]: dirsrv@DOMAIN-LOCAL.service: Failed with result 'signal'.

It doesn't seem like we caught a core dump, though, unfortunately - I do have the openQA tests set up to try and capture any core dump caught by coredumpctl or abrt, but it doesn't seem to have found one. I'll have to poke it some more tomorrow and see if I can get a hold of the dump.
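One possible explanation for the missing dump is that cores aren't being routed to systemd-coredump on the test VM, or the core size limit is zero. A quick sanity check, using standard kernel/systemd locations (nothing here is specific to this setup):

```shell
# Where the kernel sends core dumps. If this starts with
# "|/usr/lib/systemd/systemd-coredump", systemd-coredump handles the dump
# and `coredumpctl list ns-slapd` should show the crash; otherwise the core
# file lands wherever this pattern says (or nowhere, if the limit is 0).
cat /proc/sys/kernel/core_pattern

# Per-process core size limit for this shell; "0" means dumps are disabled.
ulimit -c
```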

It's definitely crashing, and I think I know which issue it might be, but we do need a core file to verify. Ideally, if you could get us the stack trace from the system after the crash, that would be easier: attach gdb to the core file and run "thread apply all bt full". Alternatively, attach gdb to ns-slapd before the test is run, catch the crash live, and then run "thread apply all bt full".
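The live-attach approach above could be scripted roughly like this. This is a sketch, not the exact setup used in the openQA test; it assumes gdb and the 389-ds-base debuginfo packages are installed on the server, and that ns-slapd is already running:

```shell
# Write a gdb batch script: let ns-slapd keep running until it receives
# SIGSEGV, then dump a full backtrace of every thread and detach.
cat > /tmp/ns-slapd-bt.gdb <<'EOF'
continue
thread apply all bt full
detach
quit
EOF

# Attach to the running server BEFORE kicking off the failing test step;
# gdb blocks at "continue" until the crash, then writes the backtrace:
#   gdb -p "$(pidof ns-slapd)" -batch -x /tmp/ns-slapd-bt.gdb > /tmp/ns-slapd-bt.txt
echo "gdb script ready at /tmp/ns-slapd-bt.gdb"
```

The same batch script also works against a core file, if one turns up: `gdb /usr/sbin/ns-slapd /path/to/core -batch -ex 'thread apply all bt full'`.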

Current status: @mreynolds did a scratch build with a potential fix, but it still crashed. I am now knee-deep in hacking up the test to try and attach gdb to ns-slapd and get a backtrace out of it.

OK, so it got weirdly harder to reproduce the crash - I had a bunch of passed tests with both the original update and the scratch build - but I finally got a crash with a full backtrace:

That's with the later scratch build, 37740210.



Metadata
Type: bugfix
Karma: -1
Signed
Content Type: RPM
Test Gating

Settings
Unstable by Karma: -1
Stable by Karma: 1
Stable by Time: 7 days

Dates
Submitted: 3 months ago
