The hard-coded verbosity in `make test-e2e-node` is 4
(17e2eda611/hack/make-rules/test-e2e-node.sh (L248)).
Prepending -v4 emulates that behavior, with the difference that an explicit
-v passed by the caller (typically kubetest2) can override it.
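The override behavior described above can be sketched with Go's standard flag package, where a later occurrence of a flag replaces an earlier one. This is only an illustration: withDefaultVerbosity and parseVerbosity are hypothetical helpers, not the code from this change, and the -v=4 spelling is an assumption about how the default is written.

```go
package main

import (
	"flag"
	"fmt"
)

// withDefaultVerbosity puts the default -v flag in front of the caller's
// arguments, so any -v supplied by the caller comes later on the command line.
func withDefaultVerbosity(args []string) []string {
	return append([]string{"-v=4"}, args...)
}

// parseVerbosity parses args with a private FlagSet and returns the final
// value of -v. When a flag is repeated, the last occurrence wins.
func parseVerbosity(args []string) (int, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	v := fs.Int("v", 0, "log verbosity")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *v, nil
}

func main() {
	def, _ := parseVerbosity(withDefaultVerbosity(nil))
	over, _ := parseVerbosity(withDefaultVerbosity([]string{"-v=2"}))
	fmt.Println(def, over) // prints "4 2"
}
```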
`make test-e2e-node` sets the -results-dir based on the ARTIFACTS Prow job env
variable. When e2e_node.test gets invoked directly, it should do the same,
otherwise JUnit and log files are not captured for the job.
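A minimal sketch of deriving -results-dir from ARTIFACTS, assuming the flag is simply prepended so that an explicit value from the caller would still take precedence. resultsDirArgs is a hypothetical helper and not the function used in the actual change.

```go
package main

import (
	"fmt"
	"os"
)

// resultsDirArgs prepends -results-dir=<artifacts> when the ARTIFACTS value is
// non-empty. The value is passed in as a parameter to keep the helper testable.
func resultsDirArgs(artifacts string, args []string) []string {
	if artifacts == "" {
		// Not running in a Prow job (or ARTIFACTS unset): leave args untouched.
		return args
	}
	return append([]string{"-results-dir=" + artifacts}, args...)
}

func main() {
	args := resultsDirArgs(os.Getenv("ARTIFACTS"), os.Args[1:])
	fmt.Println(args)
}
```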
This test verifies that pods with pre-allocated CPUs (from the checkpoint file)
are not rejected after kubelet restart when SMT alignment is enabled.
Regression test for the fix where the container presence check was moved
before the SMT alignment check.
The key is to request enough CPUs so that if pre-allocated CPUs are not
counted, the SMT alignment check would fail due to insufficient available
physical CPUs.
Calculate the maximum SMT-aligned CPUs we can request
We need to request most of the allocatable CPUs to trigger the bug.
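The calculation can be sketched as rounding the allocatable CPU count down to a whole number of physical cores. maxSMTAlignedCPUs and its parameters are hypothetical names chosen for illustration; the real test may derive these values differently.

```go
package main

import "fmt"

// maxSMTAlignedCPUs returns the largest CPU count that does not exceed
// allocatableCPUs and is a multiple of threadsPerCore, i.e. a request that
// occupies only full physical cores.
func maxSMTAlignedCPUs(allocatableCPUs, threadsPerCore int) int {
	if threadsPerCore <= 0 {
		return 0
	}
	return (allocatableCPUs / threadsPerCore) * threadsPerCore
}

func main() {
	// 15 allocatable CPUs with 2 threads per core leaves 7 full cores.
	fmt.Println(maxSMTAlignedCPUs(15, 2)) // prints 14
}
```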
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Update node-problem-detector from v1.34.0 to v1.35.2 and remove all
related addon manifests and install logic that is no longer needed:
- Update version in build/dependencies.yaml, test/e2e_node/image_list.go
and test/kubemark/resources/hollow-node_template.yaml.
- Remove cluster/addons/node-problem-detector/ entirely. No e2e tests
depend on these manifests: e2e_node tests create NPD pods inline and
GCE standalone mode runs NPD as a systemd service.
- Remove install-node-problem-detector function and DEFAULT_NPD_* vars
from cluster/gce/gci/configure.sh along with the conditional that
invoked it, since NPD is no longer installed as a standalone binary
via this script.
- Remove the setup-addon-manifests calls for node-problem-detector from
cluster/gce/gci/configure-helper.sh since the source directory no
longer exists.
- Remove stale refPaths in build/dependencies.yaml that pointed to the
deleted addon files.
Signed-off-by: Humble Devassy Chirammal <humble.devassy@gmail.com>
This is not usable through "make test-e2e-node", which (while feasible) would
be a bit pointless because the Kubernetes source code would still be needed
for the make rules.
Instead, "kubetest2 noop -test=node" gets extended to invoke `e2e_node.test
remote` with flags that tell e2e_node.test where to find the binaries and
flags that were provided by the caller of kubetest2.
The additional commands (mounter, gcp-credentials-provider) are needed for E2E
node testing. This change makes e2e_node.test entirely self-contained.
Copying the commands' code into separate packages is temporary and only done to
avoid touching them while it is still unclear whether this approach will work
out.
Besides avoiding changes to the build rules, bundling the functionality also has a
slight size advantage: the size of e2e_node.test increases by 10KB, whereas
the other two separate commands would add 10MB.
The caller does not need to enable or disable CGO explicitly; the build rules
do that automatically:
$ make WHAT="cmd/kubelet cluster/gce/gci/mounter"
+++ [0515 17:02:56] Building go targets for linux/amd64
k8s.io/kubernetes/cluster/gce/gci/mounter (static)
k8s.io/kubernetes/cmd/kubelet (non-static)
BuildGo builds the same targets as before. BuildTargets gets changed
to accept a list of targets from the caller, which is a more useful
package API.
The DisableCPUQuotaWithExclusiveCPUs FG is now locked to true,
so we can remove all the tests referring to it.
Some of them were backward compatibility tests - no longer
needed if the FG is locked;
some other tests explicitly set the FG to true - no longer
needed either as the default is true and can't be changed anymore.
Signed-off-by: Francesco Romani <fromani@redhat.com>
DisableCPUQuotaWithExclusiveCPUs is locked to its default (true) since v1.37, so any KubeletConfiguration that sets it to false is rejected and crash-loops the kubelet at startup. configureCPUManagerInKubelet wrote the gate unconditionally and the field defaults to false, so every CPU Manager test that reconfigured the kubelet hit it. Only set the gate when true, and skip the "CFS quota can be disabled" block that exercised the false path.
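The "only set the gate when true" rule can be sketched as follows. applyQuotaGate and the plain map are hypothetical stand-ins for the test helper and the KubeletConfiguration feature gate map; only the gate name comes from the description above.

```go
package main

import "fmt"

const gateName = "DisableCPUQuotaWithExclusiveCPUs"

// applyQuotaGate writes the gate only when the caller asks for true. Writing
// false would put a non-default value for a locked gate into the config,
// which the kubelet rejects at startup.
func applyQuotaGate(featureGates map[string]bool, disableCPUQuota bool) {
	if disableCPUQuota {
		featureGates[gateName] = true
	}
}

func main() {
	gates := map[string]bool{}
	applyQuotaGate(gates, false)
	_, present := gates[gateName]
	fmt.Println(present) // prints false: the gate is left unset
}
```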
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Pass a logger into ParseContainerID instead of creating a klog.TODO inside the helper. This lets kubelet, prober, and node e2e call sites use their available contextual logger when container ID parsing fails.
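The pattern looks roughly like the sketch below. It uses the standard library's log/slog as a stand-in for the klog logger type, and parseContainerID with its "<type>://<id>" handling and return values is a simplified, hypothetical version of the helper rather than its real signature.

```go
package main

import (
	"log/slog"
	"os"
	"strings"
)

// newStderrLogger builds the logger that a call site would already have.
func newStderrLogger() *slog.Logger {
	return slog.New(slog.NewTextHandler(os.Stderr, nil))
}

// parseContainerID splits "<type>://<id>". On failure it reports through the
// logger supplied by the caller instead of creating one internally, and
// returns two empty strings.
func parseContainerID(logger *slog.Logger, s string) (string, string) {
	typ, id, ok := strings.Cut(s, "://")
	if !ok || typ == "" || id == "" {
		logger.Error("Failed to parse container ID", "containerID", s)
		return "", ""
	}
	return typ, id
}

func main() {
	logger := newStderrLogger()
	typ, id := parseContainerID(logger, "containerd://abc123")
	logger.Info("Parsed container ID", "type", typ, "id", id)
	// The failure is logged with whatever context the caller's logger carries.
	parseContainerID(logger, "not-a-container-id")
}
```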
Verify container_memory_events_high_total and
container_memory_events_max_total are reported by cadvisor.
These counters were added in cadvisor v0.57.0 to expose
cgroup v2 memory.events for MemoryQoS observability.
KEP: https://github.com/kubernetes/enhancements/issues/2570
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
e2e_node.test depends on test/e2e_node/builder and test/e2e_node/remote because
test/e2e_node/services/ uses some small helper functions from those two
packages. But e2e_node.test itself never builds any Go binaries, nor does it
run remote testing - that functionality is provided by the separate
test/e2e_node/runner commands.
Therefore these two packages should not put their command line flags into
flag.CommandLine because then they show up in the command line of e2e_node.test
unnecessarily.
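One way to keep a package's flags out of flag.CommandLine is to register them in a private flag.FlagSet and let only the commands that need them opt in. The sketch below illustrates that idea under assumptions: builderFlags, addBuilderFlags, and newRunnerFlagSet are hypothetical names, not the API introduced by this change.

```go
package main

import (
	"flag"
	"fmt"
)

// builderFlags is private to the package, so importing the package no longer
// adds anything to the global flag.CommandLine.
var builderFlags = flag.NewFlagSet("builder", flag.ContinueOnError)

var buildOnly = builderFlags.Bool("build-only", false,
	"If true, build e2e_node_test.tar.gz and exit.")

// isGlobalFlag reports whether name is registered in flag.CommandLine.
func isGlobalFlag(name string) bool {
	return flag.CommandLine.Lookup(name) != nil
}

// addBuilderFlags registers the private flags in fs, sharing the same values.
func addBuilderFlags(fs *flag.FlagSet) {
	builderFlags.VisitAll(func(f *flag.Flag) {
		fs.Var(f.Value, f.Name, f.Usage)
	})
}

// newRunnerFlagSet models a runner command that does want the builder flags.
func newRunnerFlagSet() *flag.FlagSet {
	fs := flag.NewFlagSet("runner", flag.ContinueOnError)
	addBuilderFlags(fs)
	return fs
}

func main() {
	fmt.Println(isGlobalFlag("build-only")) // prints false
	runner := newRunnerFlagSet()
	_ = runner.Parse([]string{"--build-only"})
	fmt.Println(*buildOnly) // prints true
}
```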
This change removes the following flags from the e2e_node.test command line:
diff -r before/e2e_node after/e2e_node
7,8d6
< --build-only If true, build e2e_node_test.tar.gz and exit.
< --cleanup If true remove files from remote hosts and delete temporary instances (default true)
20d17
< --delete-instances If true, delete any instances created (default true)
42d38
< --ginkgo-flags string Passed to ginkgo to specify additional flags such as --skip=.
95d90
< --gubernator If true, output Gubernator link to view logs
97d91
< --hosts string hosts to test
99,100d92
< --image-config-dir string (optional) path to image config files
< --image-config-file string yaml file describing images to run
103d94
< --images string images to test
105,106d95
< --instance-name-prefix string prefix for instance names
< --k8s-bin-dir string Directory containing k8s kubelet binaries.
120d108
< --mode string Mode to operate in. One of gce|ssh. Defaults to gce (default "gce")
133d120
< --results-dir string Directory to scp test results to. (default "/tmp/")
142,145d128
< --ssh-env string Use predefined ssh options for environment. Options: gce
< --ssh-key string Path to ssh private key.
< --ssh-options string Commandline options passed to ssh.
< --ssh-user string Use predefined user for ssh.
160,161d142
< --target-build-arch string Target architecture for the test artifacts for dockerized build (default "linux/amd64")
< --test-timeout duration How long (in golang duration format) to wait for ginkgo tests to complete. (default 45m0s)
196d176
< --test_args string Space-separated list of arguments to pass to Ginkgo test runner.
198d177
< --use-dockerized-build Use dockerized build for test artifacts
Pass a logger into GetBootTime so the Linux fallback path no longer creates a local context.TODO() only to derive a logger.
This keeps boot time lookup behavior unchanged and updates the node startup latency tracker constructor to accept a logger instead of a context, matching contextual logging migration guidelines.
Address review feedback to use the standard updateKubeletConfig helper
instead of manual WriteKubeletConfigFile + restartKubelet + waitForKubeletToStart.
Overlayfs does not support cgroup v2 writeback accounting, so buffered
writes (even with conv=fsync) get attributed to the root cgroup instead
of the container's cgroup. This causes cadvisor to see an empty io.stat
for the container, making container_blkio_device_usage_total,
container_fs_reads_bytes_total, and container_fs_writes_bytes_total
permanently absent.
Switch to oflag=direct for writes and add iflag=direct reads to bypass
the page cache entirely. Direct I/O is always attributed to the issuing
process's cgroup regardless of filesystem type.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
After the command used to increase UsageNanoCores was replaced to fix a
previously flaky test, UsageNanoCores exceeds the 2e+09 limit in some test
environments. This commit attempts to fix that by increasing the
UsageNanoCores limit to 2e+10.
When MemoryQoS is disabled after being previously enabled, stale
memory.min and memory.low values persist on QoS-class cgroups because
systemd re-applies stored properties on every SetUnitProperties call.
Fix this by including memory.min=0 and memory.low=0 in the existing
startup dbus calls (enforceNodeAllocatableCgroups for the root cgroup,
qosContainerManager.Start for the burstable cgroup). This overwrites
systemd's stored stale values so subsequent realizations re-apply 0.
Fixes https://github.com/kubernetes/kubernetes/issues/138436
KEP: https://github.com/kubernetes/enhancements/issues/2570
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Replace the small echo write with a dd that uses conv=fsync to force
data through the block layer. Without fsync, the 11-byte echo writes
stay in page cache and never reach the block device within the
60-second test window. This leaves the cgroup io.stat empty, so
cadvisor does not emit container_blkio_device_usage_total,
container_fs_reads_bytes_total, or container_fs_writes_bytes_total
for the container.
The conv=fsync call guarantees block device I/O on every loop
iteration. Once io.stat has an entry for a device, all fields
(rbytes, wbytes, rios, wios) are present, even if zero, so all
cadvisor metrics pass their boundedSample(0, ...) checks.
Also increase the UsageCoreNanoSeconds upper bound from 1e11 to 1e12
for the container and pod-level CPU checks. The cumulative CPU time
can exceed 100s on slower architectures like ppc64le where the dd
CPU burner loop accumulates faster than expected.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
One podresources test was not waiting for Pod Resources V1 to be serving,
which can cause flaky failures in a later step of the test.
This change attempts to fix the flake by adding waitForPodResourcesV1Serving(ctx),
as the remaining tests already do. In addition, ExpectNoError was added to every
attempt to close a connection, to improve troubleshooting.