Update of Python and python3-docs to 3.9.9. Fixed python3-typed_ast for this Python version.
Updates may require up to 24 hours to propagate to mirrors. If the following command doesn't work, please retry later:
sudo dnf upgrade --refresh --advisory=FEDORA-2021-e152ce5f31
This update has been submitted for testing by thrnciar.
This update's test gating status has been changed to 'failed'.
This update's test gating status has been changed to 'waiting'.
This update's test gating status has been changed to 'failed'.
churchyard edited this update.
The automated test failures here are failing on FreeIPA shutdown. It appears to be a real failure - it's reproducible, and tests of other updates are not hitting it. The shutdown does eventually work, but it's taking way longer than it should.
The holdup seems to be to do with httpd. In a sample test, the test runs
systemctl stop ipa.service
at 03:10:04, and it doesn't finally finish stopping until 03:11:39 (so it takes 95 seconds to stop). The big jump in the logs is here. From that it looks like it tries to stop httpd with SIGTERM around 11:10:06 and waits 90 seconds for that to work, but it doesn't, so it stops it with SIGKILL at 11:11:36.
/var/log/httpd/error_log
ends with a line which I think matches the SIGTERM at 11:10:06 in the journal - 11:10:06 is UTC (I used
journalctl --utc
), 06:10:06 is system local time (the system is in US Eastern time). But it looks like it didn't actually shut down gracefully, though I can't find any messages indicating why. I checked the timestamps on a test of a different update, also for F34; in that test,
systemctl stop ipa.service
returns after 8 seconds. @churchyard for info.
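The 90-second wait before the SIGKILL matches systemd's default stop timeout (TimeoutStopSec, 90 s by default): SIGTERM first, then SIGKILL if the unit doesn't exit. A tiny stand-alone illustration (hypothetical script, nothing httpd-specific) of a process that shrugs off SIGTERM the way httpd is doing here:

```python
import os
import signal

# Ignore SIGTERM, the way a wedged httpd worker effectively does.
signal.signal(signal.SIGTERM, signal.SIG_IGN)

# Deliver SIGTERM to ourselves; with SIG_IGN installed, nothing happens.
os.kill(os.getpid(), signal.SIGTERM)

# systemd, seeing the unit still alive after TimeoutStopSec, would
# escalate to SIGKILL -- which cannot be caught or ignored.
print("still alive after SIGTERM")
```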
This update has been pushed to testing.
Bodhi is disabling automatic push to stable due to negative karma. The maintainer may push manually if they determine that the issue is not severe.
churchyard edited this update.
Thanks for the notice. I've disabled autopush on f33 as well.
Will also need to include https://src.fedoraproject.org/rpms/python3-typed_ast/pull-request/2
churchyard edited this update.
New build(s):
Karma has been reset.
This update has been submitted for testing by churchyard.
Adam, could you please provide a reproducer for the FreeIPA shutdown failure? Ideally over bugzilla. Thanks.
Well, the test deploys a server and a client, runs through some tests, and then stops the service on the server. I don't know whether it's necessary to have a client and do some stuff with it, or whether the bug would reproduce just by deploying FreeIPA, letting it start up, and then trying to stop it...
Where can I see how the test does that? I have no idea how to deploy a server and a client. I am looking at https://fedoraproject.org/wiki/FreeIPA_step_by_step but it feels quite old and it gives very ambiguous instructions like "make sure that the host names are set properly" -- it seems I need to know what I am trying to accomplish when I want to set up FreeIPA, and unfortunately I don't.
This update has been pushed to testing.
I'm looking into it. No results so far. I'm trying to find someone from the IPA team to help me with debugging.
@churchyard sorry, had some other fire to deal with today. The test case is probably more up to date and to the point: https://fedoraproject.org/wiki/QA:Testcase_freeipa_trust_server_installation . The test pretty much automates that. You can see the code for any openQA "test module" (the individual bits of a test, like "_console_wait_login" and "role_deploy_domain_controller") by clicking on them at the test overview screen; e.g. https://openqa.fedoraproject.org/tests/1059611/modules/role_deploy_domain_controller/steps/1/src shows you the code of
role_deploy_domain_controller
. Sometimes these call library functions that the web UI can't show you, though. The test repo is https://pagure.io/fedora-qa/os-autoinst-distri-fedora/tree/master ; there you can find all the test code under tests/
and the libraries of shared functions under lib/
. In this case, though, pretty much everything is in the test code itself, and you should see it matches the test case quite closely (it's Perl code, but most of what it's doing is just running console commands, so it should be relatively easy to follow). "role_deploy_domain_controller_check" - the module that fails - first checks ipa.service
is running after deployment, then waits for the clients to complete, then the next thing it does is the thing that fails (systemctl stop ipa.service
). I've filed an upstream issue (https://pagure.io/freeipa/issue/9034) and I'm gonna try to reproduce the problem in the meantime.
I've tried to reproduce the issue in a virtual machine with IPA installed as described on the wiki page, but I was not successful. Starting the service took ~40 s and stopping it took ~6 s, no matter which version of Python I had installed. I'm gonna try it once more, carefully following the test case.
Ok, so, now I have exactly the same environment installed, following the steps in
role_deploy_domain_controller
. I've compared the output of the commands in my VM with the screenshots and they are almost the same. And I'm still not able to reproduce the issue; stopping the service takes around 6 seconds with both the old and the new Python.

I'm out of ideas. Is it possible to keep the testing virtual machine running after the failed test so we can connect to it and debug the problem in the testing environment? Or start it manually somehow? Is there any way to get the list of installed packages, to compare them with my environment? I might have something different.
I suppose the other possibility is that we do need to run the client tests as well. The client tests do include one which exercises the FreeIPA web UI, and the shutdown issue is in httpd, so it's plausible we don't hit the bug unless the web UI gets exercised.
The openQA client test that exercises the web UI is the slightly misleadingly-named
realmd_join_cockpit
(it does enrol via Cockpit, but then it also goes on to do the web UI tests as well). Both the freeipa_webui
and freeipa_password_change
modules exercise the web UI to some extent. If you can add those steps to your reproducer attempt it may help. You may not need an actual second client VM; you may be able to just do it all on the server, though I haven't tried that.

Oh, sorry, to answer your questions:
"Is it possible to keep the testing virtual machine running after the failed test so we can connect there and debug the problem in the testing environment? Or start it manually somehow?"
The answer is sort of "theoretically yes, practically it's difficult". We can leave the test 'open' (at least for a couple of hours - after that it would time out, I think there's a setting to override that though), but the tricky part is connecting to it. The official openQA instance runs inside Fedora's infrastructure, which is quite heavily firewalled, it would be rather difficult to get things configured such that you could route a VNC connection all the way in there, I think. RH IT is in charge of some of the firewalling, too, so getting any changes made involves filing a ticket and waiting for weeks.
The other option is to reproduce the bug on an instance of openQA sited somewhere more convenient for connecting to, but setting up a pet openQA deployment is kind of a drag, and it's especially a drag for these tests where we have a server/client setup going on (because it involves special networking configuration). I used to keep a pet deployment in my home office but I got rid of it last year. @lruzicka has one, I think, but I don't know if his is set up to do the client/server tests.
What I usually wind up doing instead is just rewriting the openQA tests on the fly to add whatever debugging steps we need. If you can suggest some stuff you'd like done from the broken state that might help identify what the problem is, I can usually manage to tweak the test environment to do it and get the info out, so just let me know.
"Is there any way how to get the list of installed packages to compare them with my environment?"
Yeah, that we can do quite easily, I'll grab it for you later today.
Thanks for your help. Now I'm able to reproduce the issue. The only thing I need to do is log out and back in in the FreeIPA web UI, and then it takes more than 90 seconds to stop the webserver, even when I do it directly via the
systemctl stop httpd
or apachectl stop
commands. Unfortunately, the httpd log does not provide anything useful, because it does not generate any messages after the SIGKILL signal.
I'm gonna try strace.
I'm still not sure where the problem is and I'm reaching my debugging limits, but this strace output seems to be important:
Old Python:
New Python:
The strace logs above are switched; there is no deadlock in the old Python, of course, sorry for that.
strace logs from
strace -t -fff apachectl stop
are available here:

It's unclear from the web server log when the shutdown was requested. Do you see in those logs a request to stop, with the next log line being ~90 seconds later? We sort of see that in the apachectl log, but it's waiting on a poll() so it isn't particularly enlightening. It appears to be just waiting for Apache to stop.
I did a time-consuming bisection of commits in Python between 3.9.7 and 3.9.8, and now I can point to the specific commit causing this issue. The winner is: https://github.com/python/cpython/commit/1ecb641b887af2feb026a266e2fceedee0815ca8
I don't understand mod_wsgi well enough to see the connection between the commit and the issue it causes but I'm a bit closer to the finish.
@vstinner is the author of the code so we might be able to debug it together.
Seems like Python 3.9.8 introduced a regression in argparse which causes complex command-line tools to fail to recognize sub-commands properly. You can find more info here: https://bugs.python.org/issue45235
Upstream has already published the hotfix release 3.9.9. I think we shouldn't ship 3.9.8 at all, and should instead do a 3.9.9 update where we can revert the commit causing FreeIPA to fail. WDYT?
I've tried to bisect the commit again because I wasn't able to reproduce the issue with reverted commit 1ecb641b887af2feb026a266e2fceedee0815ca8 and now it seems that the issue is caused by an earlier commit: https://github.com/python/cpython/commit/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46
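For context (my reading of the linked commits, not something established in this thread): they touch CPython's threading shutdown path, where Thread.join() blocks on an internal per-thread state lock, and a bug there would manifest as exactly this kind of never-finishing stop. A plain-threading sketch of that join path, with nothing FreeIPA- or mod_wsgi-specific in it:

```python
import threading
import time

def worker():
    # Stand-in for whatever work a daemon-process thread does.
    time.sleep(0.2)

t = threading.Thread(target=worker)
t.start()

# join() blocks until the interpreter releases the thread's internal
# "tstate lock"; if that release/re-acquire logic is racy, join() can
# hang and the whole process appears stuck during shutdown.
t.join(timeout=5)
print("joined:", not t.is_alive())
```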
Also, it seems that the only thing one needs to do to reproduce the issue is: start the webserver → login to the FreeIPA web UI → log out from the web UI → stop the webserver.
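Those reproduction steps can be wrapped in a small timing harness to see whether a run is "bad" (~90 s) or "good" (~6 s). This is a hypothetical helper; the commented-out `systemctl` line is what you'd time on the affected machine, and a `sleep` stands in for it so the sketch runs anywhere:

```python
import subprocess
import time

def timed_run(cmd):
    """Run a command and return how long it took, in seconds (hypothetical helper)."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

# On the affected machine, after the web UI login/logout, you would time:
#   elapsed = timed_run(["systemctl", "stop", "httpd"])
# Here a sleep stands in for the real command:
elapsed = timed_run(["sleep", "1"])
print(f"stop took {elapsed:.0f} s")
```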
It makes sense to try Python 3.9.9, even though I think this problem has nothing to do with argparse. I'm gonna take a look at the changed files in the new release; I might find something else interesting as well.
By the way, I've also tried to set up a simple Python WSGI application, and I'm not able to reproduce the issue with it. From the FreeIPA configuration it also seems that it does not use threads in mod_wsgi but does some threading in its own code, so the issue might be related to FreeIPA and not to mod_wsgi in general.
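For reference, a "simple Python WSGI application" of the kind mentioned above can be as small as this (a generic hello-world sketch, not FreeIPA's actual app; mod_wsgi looks for a module-level callable named `application` by default):

```python
def application(environ, start_response):
    """Minimal WSGI app: answers every request with a plain-text greeting."""
    body = b"Hello from a minimal WSGI app\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    # WSGI apps return an iterable of byte strings.
    return [body]
```

Outside mod_wsgi, the same callable can be served with the standard library's `wsgiref.simple_server.make_server("", 8000, application)` for quick local testing.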
After one more day of debugging, I've created an upstream issue because it does not seem to me that the problem is in FreeIPA: https://github.com/GrahamDumpleton/mod_wsgi/issues/730
Thanks a lot for all the work on this (and the continuing work in the github issue)! Sorry I went AWOL after that last comment, I didn't manage to get the package list on that day and then I was on vacation. Seems like you got along fine without me, though. :D
churchyard edited this update.
New build(s):
Removed build(s):
Karma has been reset.
This update has been submitted for testing by churchyard.
churchyard edited this update.
New build(s):
Removed build(s):
Karma has been reset.
This update's test gating status has been changed to 'passed'.
This update has been pushed to testing.
churchyard edited this update.
We have reverted the problematic commit. The test passed.
I've verified it also manually.
wfm
This update can be pushed to stable now if the maintainer wishes
Works for me.
This update has been submitted for stable by bodhi.
This update has been pushed to stable.