Comments

Thanks a lot for all the work on this (and the continuing work in the github issue)! Sorry I went AWOL after that last comment, I didn't manage to get the package list on that day and then I was on vacation. Seems like you got along fine without me, though. :D

Oh, sorry, to answer your questions:

"Is it possible to keep the testing virtual machine running after the failed test so we can connect there and debug the problem in the testing environment? Or start it manually somehow?"

The answer is sort of "theoretically yes, practically it's difficult". We can leave the test 'open' (at least for a couple of hours; after that it would time out, though I think there's a setting to override that), but the tricky part is connecting to it. The official openQA instance runs inside Fedora's infrastructure, which is quite heavily firewalled; it would be rather difficult to get things configured such that you could route a VNC connection all the way in there, I think. RH IT is in charge of some of the firewalling too, so getting any changes made involves filing a ticket and waiting for weeks.

The other option is to reproduce the bug on an instance of openQA sited somewhere more convenient for connecting to, but setting up a pet openQA deployment is kind of a drag, and it's especially a drag for these tests, where we have a server/client setup going on (because it involves special networking configuration). I used to keep a pet deployment in my home office, but I got rid of it last year. @lruzicka has one, I think, but I don't know if his is set up to do the client/server tests.

What I usually wind up doing instead is just rewriting the openQA tests on the fly to add whatever debugging steps we need. If you can suggest some stuff you'd like done from the broken state that might help identify what the problem is, I can usually manage to tweak the test environment to do it and get the info out, so just let me know.
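For example, these are hypothetical commands, just to illustrate the kind of thing that's easy to wire into the test and run from the broken state:

# dump service state and recent httpd journal entries from the broken state
systemctl status httpd.service ipa.service --no-pager
journalctl -u httpd.service -b --no-pager
# see what httpd processes (if any) are still hanging around
ps auxwf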

"Is there any way how to get the list of installed packages to compare them with my environment?"

Yeah, that we can do quite easily, I'll grab it for you later today.
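For the record, something like this run inside the test VM would give a list you can diff against your environment (the filenames here are just placeholders):

rpm -qa | sort > openqa-packages.txt

and then on your side:

rpm -qa | sort > local-packages.txt
diff openqa-packages.txt local-packages.txt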

I suppose the other possibility is that we do need to run the client tests as well. The client tests do include one which exercises the FreeIPA web UI, and the shutdown issue is in httpd, so it's plausible we don't hit the bug unless the web UI gets exercised.

The openQA client test that exercises the web UI is the slightly misleadingly-named realmd_join_cockpit (it does enrol via Cockpit, but then it also goes on to do the web UI tests). Both the freeipa_webui and freeipa_password_change modules exercise the web UI to some extent. If you can add those steps to your reproducer attempt, it may help. You may not need an actual second client VM; you may be able to do it all on the server, though I haven't tried that.

@churchyard sorry, had some other fire to deal with today. The test case is probably more up to date and to the point: https://fedoraproject.org/wiki/QA:Testcase_freeipa_trust_server_installation . The test pretty much automates that.

You can see the code for any openQA "test module" (the individual bits of a test, like "_console_wait_login" and "role_deploy_domain_controller") by clicking on them on the test overview screen; e.g. https://openqa.fedoraproject.org/tests/1059611/modules/role_deploy_domain_controller/steps/1/src shows you the code of role_deploy_domain_controller. Sometimes these call library functions that the web UI can't show you, though. The test repo is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/tree/master ; there you can find all the test code under tests/ and the libraries of shared functions under lib/. In this case, though, pretty much everything is in the test code itself, and you should see it matches the test case quite closely (it's perl code, but most of what it's doing is just running console commands, so it should be relatively easy to follow).

"role_deploy_domain_controller_check" - the module that fails - first checks ipa.service is running after deployment, then waits for the clients to complete; the next thing it does after that is the thing that fails (systemctl stop ipa.service).
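Condensed down, and leaving out the openQA plumbing, the flow of that module amounts to something like this (a paraphrase of the console commands it runs, not the actual perl):

systemctl is-active ipa.service    # sanity check that the server deployed OK
# ...wait for the client test jobs to report in...
systemctl stop ipa.service         # the step that currently times out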

This release is broken and shouldn't be pushed stable; a 2.1.5 will follow soonish.

Well, the test deploys a server and a client, runs through some tests, and then stops the service on the server. I don't know whether it's necessary to have a client and do some stuff with it, or whether the bug would reproduce just by deploying FreeIPA, letting it start up, and then trying to stop it...
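If anyone wants to try that minimal version, I'd expect it to look roughly like this on a single VM (untested; the realm, domain and passwords are placeholders, and the host needs a resolvable FQDN first):

ipa-server-install -U --realm TEST.EXAMPLE --domain test.example -p DMpassword123 -a ADMINpassword123
time systemctl stop ipa.service    # does this hang for ~90 seconds?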

The automated tests here are failing on FreeIPA shutdown. It appears to be a real failure - it's reproducible, and tests of other updates are not hitting it. The shutdown does eventually complete, but it takes far longer than it should.

The holdup seems to be to do with httpd. In a sample test, the test runs systemctl stop ipa.service at 03:10:04, and the service doesn't finish stopping until 03:11:39 (so it takes 95 seconds to stop). The big jump in the logs is here:

Nov 09 11:10:07 ipa001.test.openqa.fedoraproject.org systemd[1]: pki-tomcatd@pki-tomcat.service: Consumed 41.396s CPU time.
Nov 09 11:11:19 ipa001.test.openqa.fedoraproject.org sssd_be[10225]: GSSAPI client step 1
Nov 09 11:11:19 ipa001.test.openqa.fedoraproject.org sssd_be[10225]: GSSAPI client step 1
Nov 09 11:11:19 ipa001.test.openqa.fedoraproject.org sssd_be[10225]: GSSAPI client step 1
Nov 09 11:11:19 ipa001.test.openqa.fedoraproject.org sssd_be[10225]: GSSAPI client step 2
Nov 09 11:11:36 ipa001.test.openqa.fedoraproject.org systemd[1]: httpd.service: State 'stop-sigterm' timed out. Killing.
Nov 09 11:11:36 ipa001.test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 9800 (httpd) with signal SIGKILL.
Nov 09 11:11:36 ipa001.test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 9802 (httpd) with signal SIGKILL.
Nov 09 11:11:36 ipa001.test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 9803 (httpd) with signal SIGKILL.

From that it looks like systemd tries to stop httpd with SIGTERM around 11:10:06 and waits 90 seconds for that to work; when it doesn't, it kills httpd with SIGKILL at 11:11:36.
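For what it's worth, 90 seconds is systemd's default stop timeout, so the timing looks like stock behaviour rather than anything FreeIPA-specific. You can check what httpd.service is actually using with something like:

systemctl show httpd.service -p TimeoutStopUSec
# with defaults this should print: TimeoutStopUSec=1min 30s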

/var/log/httpd/error_log ends with this line:

[Tue Nov 09 06:10:06.707303 2021] [mpm_event:notice] [pid 9800:tid 9800] AH00492: caught SIGWINCH, shutting down gracefully

which I think matches the SIGTERM at 11:10:06 in the journal - 11:10:06 is UTC (I used journalctl --utc), 06:10:06 is system local time (the system is on US Eastern time). But it looks like httpd didn't actually shut down gracefully, and I can't find any messages indicating why.

I checked the timestamps on a test of a different update, also for F34; in that test, systemctl stop ipa.service returns after 8 seconds.

Uh, why is this update down to a single package which has nothing to do with the update description? I'm unpushing it until this is clarified.

openQA tests are failing because this breaks lorax. @bcl, to fix this properly, a build of lorax with fa2e465d51039e6172e5118c432b715316a70a48 needs to be added to the update.

@fuller @geraldosimiao are your issues resolved now?