This leverages ktesting as a wrapper around Ginkgo and testing.T so that all
helper code needed to deploy a DRA driver becomes available to Go unit
tests and thus also to integration tests.
How to proceed with unifying helper code for integration and E2E testing is
still open. This is just a minimal first step in that direction. Ideally, such
code should live in separate packages where usage of Ginkgo, e2e/framework
and gomega.Expect/Eventually/Consistently is forbidden.
While at it, the builder gets extended to make cleanup optional.
This will be needed for upgrade/downgrade testing with sub-tests.
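As a rough, hypothetical sketch of the optional cleanup (the builder and method
names here are assumptions, not the actual test/e2e/dra API):

    package drivers

    import "k8s.io/kubernetes/test/utils/ktesting"

    // Builder is a stand-in for the real driver deployment builder.
    type Builder struct {
        skipCleanup bool
    }

    // WithoutCleanup (assumed name) keeps the driver running after the test,
    // which upgrade/downgrade sub-tests need.
    func (b *Builder) WithoutCleanup() *Builder {
        b.skipCleanup = true
        return b
    }

    func (b *Builder) Deploy(tCtx ktesting.TContext) {
        // ... create the DRA driver deployment ...
        if !b.skipCleanup {
            // Cleanup is available because ktesting.TContext provides the
            // usual testing.TB-style methods.
            tCtx.Cleanup(func() {
                // ... remove the deployment again ...
            })
        }
    }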
(cherry picked from commit 7c7b1e1018)
It turned out that ginkgo.GinkgoT() wasn't as cheap as it should have been (a fix
is coming in Ginkgo 2.27.5). When it was instantiated once per framework.Framework
instance during init by all workers at the same time, the resulting spike in
overall memory usage within the container caused OOM killing of workers in Prow
jobs like ci-kubernetes-e2e-gci-gce, which have very tight memory limits.
Even with the upcoming fix in Ginkgo it makes sense to set the TB field only
while it really is needed, i.e. while a test runs. This is conceptually similar
to setting and unsetting the test namespace. It may help to flush out incorrect
usage of TB outside of tests.
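Conceptually (a simplified fragment; the exact field and where it lives are
assumptions):

    // Populate the TB field only while a spec is running, similar to how the
    // test namespace is set and unset; using it outside a test then fails fast.
    ginkgo.BeforeEach(func() {
        f.TB = ginkgo.GinkgoT()
    })
    ginkgo.AfterEach(func() {
        f.TB = nil
    })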
This makes it possible to call helper packages which expect a TContext from E2E
tests.
The implementation uses GinkgoT as TB and supports registering cleanup
callbacks which expect a context. These callbacks then run with a context that
comes from ginkgo.DeferCleanup, just as if they had called that directly.
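For reference, the direct Ginkgo call that these callbacks effectively end up in
looks roughly like this (a sketch, not the framework code itself):

    // Ginkgo passes a SpecContext to DeferCleanup callbacks whose first
    // argument is a context; the TContext cleanup registration builds on
    // exactly this behavior.
    ginkgo.DeferCleanup(func(ctx ginkgo.SpecContext) {
        // ... tear down test resources, honoring ctx for cancellation ...
    })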
(cherry picked from commit 47b613eded)
Refactoring the DRA upgrade/downgrade testing such that it runs as a Go test
depended on supporting ktesting in the E2E framework. That change worked during
presubmit testing, but broke some periodic jobs. Therefore the relevant commits
from https://github.com/kubernetes/kubernetes/pull/135664/commits get reverted:
c47ad64820 DRA e2e+integration: test ResourceSlice controller
047682908d ktesting: replace Begin/End with TContext.Step
de47714879 DRA upgrade/downgrade: rewrite as Go unit test
7c7b1e1018 DRA e2e: make driver deployment possible in Go unit tests
65ef31973c DRA upgrade/downgrade: split out individual test steps
47b613eded e2e framework: support creating TContext
The last one is what must have caused the problem, but the other commits depend
on it.
This leverages ktesting as a wrapper around Ginkgo and testing.T so that all
helper code needed to deploy a DRA driver becomes available to Go unit
tests and thus also to integration tests.
How to proceed with unifying helper code for integration and E2E testing is
still open. This is just a minimal first step in that direction. Ideally, such
code should live in separate packages where usage of Ginkgo, e2e/framework
and gomega.Expect/Eventually/Consistently is forbidden.
While at it, the builder gets extended to make cleanup optional.
This will be needed for upgrade/downgrade testing with sub-tests.
Previously it was necessary to use the Ginkgo wrappers when
using any of the custom arguments like WithSlow(). Now Ginkgo's
hook for modifying arguments is used, so that e.g. the original
ginkgo.It also works.
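Roughly, for an existing spec (a sketch):

    // Before: the framework wrapper was required for custom arguments.
    framework.It("should handle a large payload", framework.WithSlow(), func(ctx context.Context) {
        // ...
    })

    // After: the original ginkgo.It accepts the same arguments because the
    // framework now registers a hook in Ginkgo that rewrites them.
    ginkgo.It("should handle a large payload", framework.WithSlow(), func(ctx context.Context) {
        // ...
    })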
This makes it possible to call helper packages which expect a TContext from E2E
tests.
The implementation uses GinkgoT as TB and supports registering cleanup
callbacks which expect a context. These callbacks then run with a context that
comes from ginkgo.DeferCleanup, just as if they had called that directly.
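A hypothetical usage sketch (the constructor and helper names are assumptions;
only the overall flow is described by this change):

    // Inside a Ginkgo spec, obtain a ktesting.TContext so that helper packages
    // written against TContext can be reused from E2E tests.
    ginkgo.It("deploys the driver via shared helpers", func(ctx context.Context) {
        tCtx := framework.NewTContext(f) // assumed constructor name
        deployDriver(tCtx)               // hypothetical helper expecting a TContext
    })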
Example:
I1208 16:01:05.852628 243 upgradedowngrade_test.go:239] get source code version: bring up v1.34: cluster is running, use KUBECONFIG=/var/run/kubernetes/admin.kubeconfig to access it
I1208 16:01:05.869679 243 reflector.go:446] "Caches populated" type="*v1.ServiceAccount" reflector="k8s.io/client-go/tools/watch/informerwatcher.go:162"
The first line is printed via framework.Logf, which is meant to emulate the
format used by the klog text logger in the second line. The difference is that
klog formats the pid with 7 characters, padding on the left with spaces.
Consistency trumps brevity here, so let's format exactly as in klog.
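In Go terms (a minimal sketch, not the actual framework.Logf code), the prefix
boils down to something like:

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    // logPrefix builds a klog-style "Immdd hh:mm:ss.uuuuuu     pid file:line] "
    // header. The "%7d" verb pads the pid on the left with spaces, which is the
    // detail framework.Logf now reproduces.
    func logPrefix(now time.Time, file string, line int) string {
        return fmt.Sprintf("I%02d%02d %02d:%02d:%02d.%06d %7d %s:%d] ",
            int(now.Month()), now.Day(),
            now.Hour(), now.Minute(), now.Second(), now.Nanosecond()/1000,
            os.Getpid(), file, line)
    }

    func main() {
        fmt.Println(logPrefix(time.Now(), "upgradedowngrade_test.go", 239) + "example message")
    }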
This fixes some issues found in Kubernetes (data race in ginkgo CLI, gomega
formatting) and helps with diagnosing OOM killing in CI jobs (exit status of
processes).
The modified gomega formatting shows up in some of the output tests for the E2E
framework. They get updated accordingly.
This avoids the risk of a slow test being started towards the end of a run,
which would then cause the run to take longer. When started early, slow tests
can run in parallel to other tests. In serial runs it doesn't matter.
The implementation maps the Slow label to the new ginkgo.SpecPriority. The
default is 0. Tests with priority 1 run first.
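Sketched (not the actual code; the exact SpecPriority usage is assumed):

    // While processing spec arguments, translate the Slow label into an
    // additional ginkgo.SpecPriority(1) decorator; everything else keeps the
    // default priority 0 and therefore starts after the slow specs.
    func withSlowPriority(args []interface{}) []interface{} {
        for _, arg := range args {
            if labels, ok := arg.(ginkgo.Labels); ok && labels.MatchesLabelFilter("Slow") {
                return append(args, ginkgo.SpecPriority(1))
            }
        }
        return args
    }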
This reverts commit cff07e7551.
The commit caused several kubeadm jobs to fail while executing all conformance
tests (including slow ones) in parallel. Sometimes execution took longer and
ran into the overall timeout, sometimes there was this failure:
[FAILED] Expected
<int>: 440
to be ==
<int>: 400
In [It] at: k8s.io/kubernetes/test/e2e/apimachinery/chunking.go:202
It looks like the tests are flaky and/or reveal a real bug when all the slow
tests run in parallel at the same time.
This should work, but doesn't right now, so let's revert until that problem is fixed.
This avoids the risk of a slow test being started towards the end of a run,
which would then cause the run to take longer. When started early, slow tests
can run in parallel to other tests. In serial runs it doesn't matter.
The implementation maps the Slow label to the new ginkgo.SpecPriority. The
default is 0. Tests with priority 1 run first.
This reports and fixes the following errors. For test/e2e:
ERROR: E2E suite initialization was faulty, these errors must be fixed:
ERROR: apimachinery/mutatingadmissionpolicy.go:184: full test name is not unique: "[sig-api-machinery] MutatingAdmissionPolicy [Privileged:ClusterAdmin] [Feature:MutatingAdmissionPolicy] [FeatureGate:MutatingAdmissionPolicy] [Beta] [Feature:OffByDefault] should support MutatingAdmissionPolicy API operations" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/apimachinery/mutatingadmissionpolicy.go:184, /nvme/gopath/src/k8s.io/kubernetes/test/e2e/apimachinery/mutatingadmissionpolicy.go:606)
ERROR: apimachinery/mutatingadmissionpolicy.go:412: full test name is not unique: "[sig-api-machinery] MutatingAdmissionPolicy [Privileged:ClusterAdmin] [Feature:MutatingAdmissionPolicy] [FeatureGate:MutatingAdmissionPolicy] [Beta] [Feature:OffByDefault] should support MutatingAdmissionPolicyBinding API operations" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/apimachinery/mutatingadmissionpolicy.go:412, /nvme/gopath/src/k8s.io/kubernetes/test/e2e/apimachinery/mutatingadmissionpolicy.go:834)
ERROR: common/node/pod_level_resources.go:250: full test name is not unique: "[sig-node] Pod Level Resources [Serial] [Feature:PodLevelResources] [FeatureGate:PodLevelResources] [Beta] Guaranteed QoS pod with container resources" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/common/node/pod_level_resources.go:250 (2x))
ERROR: dra/dra.go:1899: full test name is not unique: "[sig-node] [DRA] kubelet [Feature:DynamicResourceAllocation] [FeatureGate:DRAConsumableCapacity] [Alpha] [Feature:OffByDefault] [FeatureGate:DynamicResourceAllocation] must allow multiple allocations and consume capacity [KubeletMinVersion:1.34]" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:1899 (2x))
ERROR: storage/testsuites/volume_group_snapshottable.go:173: full test name is not unique: "[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volume_group_snapshottable.go:173 (2x))
ERROR: storage/testsuites/volume_group_snapshottable.go:173: full test name is not unique: "[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io] [Serial] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volume_group_snapshottable.go:173 (2x))
And for test/e2e_node:
ERROR: cpu_manager_test.go:1622: full test name is not unique: "[sig-node] CPU Manager [Serial] [Feature:CPUManager] when checking the CFS quota management should disable for guaranteed pod with exclusive CPUs assigned" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e_node/cpu_manager_test.go:1622, /nvme/gopath/src/k8s.io/kubernetes/test/e2e_node/cpu_manager_test.go:1642)
ERROR: eviction_test.go:800: full test name is not unique: "[sig-node] LocalStorageCapacityIsolationFSQuotaMonitoring [Slow] [Serial] [Disruptive] [Feature:LocalStorageCapacityIsolationQuota] [Feature:LSCIQuotaMonitoring] [Feature:UserNamespacesSupport] when we run containers that should cause use quotas for LSCI monitoring (quotas enabled: true) should eventually evict all of the correct pods" (/nvme/gopath/src/k8s.io/kubernetes/test/e2e_node/eviction_test.go:800 (2x))
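For illustration, such a duplicate typically comes from two specs that end up
with the exact same full text, e.g.:

    // Both specs below produce the identical full name "Feature does the thing";
    // the suite initialization check reports this, and the fix is to make the
    // texts distinguishable.
    var _ = ginkgo.Describe("Feature", func() {
        ginkgo.It("does the thing", func(ctx context.Context) { /* variant A */ })
        ginkgo.It("does the thing", func(ctx context.Context) { /* variant B */ })
    })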
When building the test binary without race detection, the post-processing
of the JUnit file isn't needed because the file cannot contain data race
reports. Skipping it can be done via build tags.
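A minimal sketch of the build-tag split (file names and the constant are
illustrative; the "race" build tag itself is set automatically when building
with -race):

    // output_race.go (illustrative name)
    //go:build race

    package junit

    // Built with the race detector: the JUnit file may contain data race
    // reports, so post-processing stays enabled.
    const needJUnitPostProcessing = true

    // output_norace.go (illustrative name)
    //go:build !race

    package junit

    // Built without -race: there can be no data race reports, so the
    // post-processing can be skipped entirely.
    const needJUnitPostProcessing = false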
Promoting real tests turned out to be harder than expected (they would have to be
rewritten to be self-contained, need additional reviews, etc.).
They also would not achieve 100% endpoint+operation coverage because real tests only
use some of the operations. Therefore each API type has to be covered with
CRUD-style tests which only exercise the apiserver; additional
functional tests can perhaps be added later (depending on time and motivation).
The machinery for testing different API types is meant to be reusable, so it
gets added in the new e2e/framework/conformance helper package.
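A hedged, schematic example of what such a reusable CRUD check could look like
(names are assumptions, not the actual e2e/framework/conformance API), using the
dynamic client so that it works for arbitrary API types:

    package conformance

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
    )

    // testCRUD is a hypothetical helper: it exercises create/get/list/update/
    // delete for one API type purely against the apiserver.
    func testCRUD(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
        c := client.Resource(gvr).Namespace(obj.GetNamespace())
        created, err := c.Create(ctx, obj, metav1.CreateOptions{})
        if err != nil {
            return err
        }
        if _, err := c.Get(ctx, created.GetName(), metav1.GetOptions{}); err != nil {
            return err
        }
        if _, err := c.List(ctx, metav1.ListOptions{}); err != nil {
            return err
        }
        created.SetLabels(map[string]string{"crud-test": "true"})
        if _, err := c.Update(ctx, created, metav1.UpdateOptions{}); err != nil {
            return err
        }
        return c.Delete(ctx, created.GetName(), metav1.DeleteOptions{})
    }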
Ginkgo itself doesn't treat a captured data race warning as a test failure, in
which case prune-junit-xml drops the output and Spyglass wouldn't show the test
as failed. If a data race warning is captured, we now treat that as a failure of
the test, unless it has already failed for other reasons.
While at it, the entire report cleanup gets moved to our junit package.
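A schematic sketch of the idea (simplified, hypothetical types; the real junit
package structures differ):

    package junit

    import "strings"

    type failure struct{ Message string }
    type testCase struct {
        SystemOut string
        Failure   *failure
    }

    // markDataRaces turns captured data race warnings into test failures,
    // unless the case already failed for other reasons.
    func markDataRaces(cases []testCase) {
        for i := range cases {
            tc := &cases[i]
            if tc.Failure == nil && strings.Contains(tc.SystemOut, "WARNING: DATA RACE") {
                tc.Failure = &failure{Message: "data race detected"}
            }
        }
    }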
This doubles the termination timeout for the eviction test from 5min to
10min. The reason is that the eviction manager relies on pod stats
metrics, which may be inaccessible for a period of time when the
kubelet API is unreachable. This can be caused by hardware or
network pressure when multiple tests run in parallel.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>