When running with race detection enabled, several tests have recently
suffered from timeouts, with no obvious commit to blame.
Let's use a common constant and make it larger.
The ci-kubernetes-local-e2e job has been flaky (~40-45% success rate)
with intermittent DNS/service connectivity failures. The root cause is
that bridge CNI requires br_netfilter and bridge-nf-call-iptables
kernel settings, which don't work reliably in docker-in-docker.
This switches to ptp (point-to-point) CNI, which creates direct veth
pairs between pods and host namespace. No bridge means no br_netfilter
dependency. This is the same approach KIND uses and it works reliably.
Changes:
- Replace bridge CNI with ptp CNI plugin
- Configure kernel network parameters for DIND (route_localnet,
  arp_ignore, ip_forward) required for ptp and iptables-based kube-proxy
- Remove CoreDNS pod delete/restart workaround from 1168b11875 that was
  masking the underlying networking issues (no longer needed)
- Add CoreDNS log capture during cleanup for debugging DNS issues
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
When aborting an integration test with CTRL-C while it runs,
the current test fails and etcd exits. But additional tests were still being
started and then failed slowly because they couldn't connect to etcd.
It's better to fail additional tests in ktesting.Init when the test run has
already been interrupted.
While at it, also make it a bit more obvious that testing was interrupted by
logging it, update one comment about this, and clean up the naming of
contexts in the code.
Example:
$ go test -v ./test/integration/quota
...
I1106 11:42:48.857162 147325 etcd.go:416] "Not using watch cache" resource="events.events.k8s.io"
I1106 11:42:48.857204 147325 handler.go:286] Adding GroupVersion events.k8s.io v1 to ResourceManager
W1106 11:42:48.857209 147325 genericapiserver.go:765] Skipping API events.k8s.io/v1beta1 because it has no resources.
^C
INFO: canceling test context: received interrupt signal
{"level":"warn","ts":"2024-11-06T11:42:48.984676+0100","caller":"embed/serve.go:160","msg":"stopping insecure grpc server due to error","error":"accept tcp 127.0.0.1:44177: use of closed network connection"}
...
I1106 11:42:50.042430 147325 handler.go:142] kube-apiserver: GET "/apis/rbac.authorization.k8s.io/v1/clusterroles" satisfied by gorestful with webservice /apis/rbac.authorization.k8s.io/v1
test_server.go:241: timed out waiting for the condition
--- FAIL: TestQuota (11.45s)
=== RUN TestQuotaLimitedResourceDenial
quota_test.go:292: testing has been interrupted: received interrupt signal
--- FAIL: TestQuotaLimitedResourceDenial (0.00s)
=== RUN TestQuotaLimitService
quota_test.go:418: testing has been interrupted: received interrupt signal
--- FAIL: TestQuotaLimitService (0.00s)
FAIL
When cleaning up the progress channel properly (stop signal delivery, closing
the channel), the loop dumping progress reports no longer needs to check for
the separate shutdown context. Instead, it can distinguish between "signal
received" and "channel closed".
The signal context was getting cleaned up by canceling it, but a channel is
better because it avoids the slightly misleading "received interrupt signal"
cancellation when the test was only shutting down.
The "received interrupt signal" message is also useful when running "go test"
without -v because it shows that the shutdown has started.
But more important is that a progress report gets shown because that feature is
useful in particular when "go test" produces no output while it runs.