Commit graph

3221 commits

Author SHA1 Message Date
Kubernetes Prow Robot
63011fe547
Merge pull request #132277 from KevinTMtz/pod-level-resources-eviction-manager
[PodLevelResources] Pod Level Resources Eviction Manager
2025-07-24 16:44:34 -07:00
Kubernetes Prow Robot
ebbebe8be6
Merge pull request #133157 from haircommander/cgroup-driver-cri-ga
KEP 4033: Add metric for out of support CRI and bump feature to GA
2025-07-24 13:05:04 -07:00
Kubernetes Prow Robot
e4e13c1e80
Merge pull request #132818 from ffromani/e2e-node-cpumanager-cgroupv1-compat
e2e: node: cpumanager cgroup v1 compatibility
2025-07-24 13:04:41 -07:00
Kevin Torres
add7132a6d E2E tests for pod level resources Kubelet Preemption 2025-07-24 17:08:13 +00:00
Kevin Torres
976a617d05 E2E tests for pod level resources eviction manager 2025-07-24 17:07:09 +00:00
Peter Hunt
83a0d0c660 kubelet: add metric for version CRI implementation will lose support
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2025-07-24 11:42:59 -04:00
Kubernetes Prow Robot
d21da29c9e
Merge pull request #133170 from ffromani/e2e-node-podres-memmgr
e2e: podresources: disable memory manager integration
2025-07-24 07:56:48 -07:00
Francesco Romani
449763fb11 e2e: podresources: disable memory manager integration
As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as far as possible, that there are no regressions.

The issue and the fix become evident largely when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a:
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.

But we don't do this syncing neither for CPUs or for memory,
so when we report these we will get stale data as the issue #132020 demonstrates.

For CPU manager, we however have the reconcile loop which cleans the stale data periodically.
Turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)).
But it's actually timing. If in the reproducer we set the `cpuManagerReconcilePeriod` to a time
very high (>= 5 minutes), then the issue still reproduces against current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```

The missing actor here is the memory manager. The memory manager has
neither a reconcile loop (which implicitly fixes the stale data problem)
nor explicit synchronization, so it is the unlucky one which reported
stale data, leading to the eventual understanding of the problem.

For this reason it was (and still is) important to exercise it during
the test.
It turns out, however, that the test is wrong, likely because of a hidden
dependency between the test expectations and the lane configuration
(notably machine specs), so we disable the memory manager activation for
the time being, until we figure out a safe way to enable it.

Note this significantly weakens the signal for this specific test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-24 12:35:45 +02:00
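
The stale-data problem described in this commit message can be illustrated with a toy model. This is only a sketch and not kubelet code: the `allocations` type and the `removeStale` and `list` names are hypothetical, and the real resource managers keep far richer state. It shows why listing without first dropping entries of inactive pods reports assignments that no longer exist.

```go
package main

import "sort"

// allocations is a hypothetical stand-in for a resource manager's internal
// state: pod name -> exclusively assigned resource IDs (for example CPUs).
type allocations map[string][]int

// removeStale drops the entries of pods that are no longer active.
// In this toy model it plays the role of the explicit sync step
// (or of a periodic reconcile loop).
func (a allocations) removeStale(active map[string]bool) {
	for pod := range a {
		if !active[pod] {
			delete(a, pod)
		}
	}
}

// list returns the names of the pods that still hold an assignment, sorted.
// If removeStale was not called first, terminated pods are still reported.
func (a allocations) list() []string {
	pods := make([]string, 0, len(a))
	for pod := range a {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	return pods
}
```

In this model, calling `list` right after a pod terminates still returns that pod; calling `removeStale` with the set of active pods first gives the expected answer.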
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled, without v1, won't work.
2025-07-24 08:33:45 +02:00
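
A minimal sketch of the version selection described in this commit message. It is not the helper package's code: `pickVersion`, `controlPlaneSupported`, and the newest-first preference order are assumptions made for illustration. The commit message only states that the helper picks whatever version is enabled and that the control plane needs v1.

```go
package main

import "fmt"

// preferred lists the API versions named in the commit message.
// Trying the newest version first is an assumption of this sketch.
var preferred = []string{"v1", "v1beta2", "v1beta1"}

// pickVersion returns the first preferred version that is enabled,
// mimicking a driver helper that adapts to the cluster configuration.
func pickVersion(enabled map[string]bool) (string, error) {
	for _, v := range preferred {
		if enabled[v] {
			return v, nil
		}
	}
	return "", fmt.Errorf("none of %v is enabled", preferred)
}

// controlPlaneSupported reflects the stated constraint: the control plane
// depends on v1, so v1 must be enabled regardless of the beta versions.
func controlPlaneSupported(enabled map[string]bool) bool {
	return enabled["v1"]
}
```

With only `v1beta1` enabled, `pickVersion` still finds a usable version for the driver, while `controlPlaneSupported` reports false, which matches the unsupported configuration the commit message warns about.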
Kubernetes Prow Robot
dd6fa8bafd
Merge pull request #133129 from ffromani/podres-get-add-tests
node: podresources: improve test coverage for the `Get` endpoint
2025-07-23 19:56:40 -07:00
Kubernetes Prow Robot
aee92cd6c3
Merge pull request #132968 from wongchar/uncore-e2e-beta
cpumanager: expand test coverage for prefer-align-cpus-by-uncore-cache
2025-07-22 13:40:50 -07:00
Francesco Romani
303a7056ff e2e: node: podresources: enable multi-container tests
fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.

Add a consumer of the new utility to demonstrate the usage
and to initiate the basic coverage.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:58:29 +02:00
Francesco Romani
38a9a8a59d e2e: node: podresources: add tests for missing pod
add an e2e test to ensure that if the Get endpoint is asked
about a non-existing pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` does.

The expected usage pattern is to iterate over the list of
pods returned by `List`, but nevertheless the endpoint must
handle this case.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:55:09 +02:00
Charles Wong
545b36ba29 fix uncore e2e check 2025-07-22 09:50:18 -05:00
Kubernetes Prow Robot
4dfb8523fc
Merge pull request #128239 from HirazawaUi/fix-e2e-tests
Fix container lifecycle flaking e2e tests
2025-07-21 18:08:25 -07:00
Kubernetes Prow Robot
b8eda18fc9
Merge pull request #132198 from natasha41575/mirror-obs-gen
add generation / observedGeneration test for mirror pods
2025-07-21 16:30:25 -07:00
Natasha Sarkar
c659b41826 e2e test for mirror pod with pod generation 2025-07-21 22:27:13 +00:00
Kubernetes Prow Robot
47d9d86326
Merge pull request #133028 from saschagrunert/deviceplugin-proto
Convert `k8s.io/kubelet/pkg/apis/deviceplugin` from gogo to protoc
2025-07-21 14:14:55 -07:00
Kubernetes Prow Robot
7d758620bc
Merge pull request #132083 from carlory/cleanup-GAed-fg-DevicePluginCDIDevices
remove generally available feature-gate DevicePluginCDIDevices
2025-07-21 13:06:27 -07:00
Charles Wong
ccc82775f4 expand test coverage for uncore alignment
add feature compatibility

check uncore cpuset alignment

check shared uncores
2025-07-21 11:19:25 -05:00
Francesco Romani
ea326373ef e2e: node: cpumanager cgroup v1 compatibility
While we support cgroup v1, we want some test coverage.
This patch enables v1 coverage for most of the testcases.
We intentionally rule out the CFS quota tests because we
want to support this change only on cgroup v2.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-21 13:57:50 +02:00
Sascha Grunert
3026020b44
Convert k8s.io/kubelet/pkg/apis/deviceplugin from gogo to protoc
Use standard protoc for the device plugin API instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-21 10:04:01 +02:00
Kubernetes Prow Robot
5e83b9c2c2
Merge pull request #129942 from bart0sh/PR171-migrate-some-kubelet-components-to-contextual-logging
Migrate kubelet/{apis,kubeletconfig,nodeshutdown,pod,preemption} to contextual logging
2025-07-18 20:28:25 -07:00
Kubernetes Prow Robot
daee8efa4d
Merge pull request #132811 from ffromani/e2e-serial-cpumanager-tests-cleanup
e2e: node: cpumanager: fix cpu quota non-regression tests
2025-07-18 15:24:38 -07:00
Kubernetes Prow Robot
7fa6cdde88
Merge pull request #127630 from dshebib/e2eNode_UpdateToAgnhost
[e2e_node] containers_lifecycle update from busybox to agnhost
2025-07-18 15:24:25 -07:00
Kubernetes Prow Robot
9212246d78
Merge pull request #132827 from guptaNswati/e2e-podresourcesGet-featuregate
Add feature gate enable test for KubeletPodResourcesGet
2025-07-18 12:12:25 -07:00
Swati Gupta
14a5ef56a3 fix pipeline failure
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-17 23:21:26 +00:00
Sascha Grunert
532d48fe6a
Convert k8s.io/kubelet/pkg/apis/podresources from gogo to protoc
Use standard protoc for the pod resources API instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-17 14:56:44 +02:00
Ed Bartosh
75ccd69bab migrate pkg/kubelet/kubeletconfig to contextual logging 2025-07-17 10:16:03 +03:00
Kubernetes Prow Robot
8f312e6fbf
Merge pull request #132348 from iholder101/swap/add-container-swap-limit-metric
[KEP-2400] Add a container_swap_limit_bytes metric
2025-07-16 20:02:30 -07:00
Kubernetes Prow Robot
9f545c5b46
Merge pull request #130992 from dshebib/addRegularContainerImageChangeToE2E_reverted
E2E Node Tests: Remove failing test from reverted PR
2025-07-16 20:02:23 -07:00
Swati Gupta
8f4a624a59 Fix pipeline errors
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-16 22:56:59 +00:00
Ed Bartosh
e4320fe25c e2e_node: DRA: test handling fatal serving failures
Added an e2e_node test to verify that the DRA plugin and
registration services cancel the provided context when handling
fatal gRPC serving errors.
2025-07-16 15:49:41 +03:00
Ed Bartosh
ea05ad8887 e2e_node: DRA: add errorOnCloseListener
Introduce a mock net.Listener for tests that triggers a controlled
error on Close, enabling reliable simulation of gRPC server failures
in test scenarios.
2025-07-16 15:49:41 +03:00
Ed Bartosh
1981c985b1 e2e: DRA: support test and public options
Refactor StartPlugin and related test helpers to accept a variadic
list of options of any type, allowing both public and test-specific
options to be passed.
2025-07-16 15:49:41 +03:00
Ed Bartosh
169965350c e2e_node: Refactor DRA tests to use variadic options
Refactor the DRA e2e_node test helpers and test cases to accept
variadic kubeletplugin.Option arguments.

This change improves test flexibility and maintainability, allowing
new options to be passed in the future without requiring widespread
code changes.

There are no functional changes to test coverage or behavior.
2025-07-16 15:42:12 +03:00
Swati Gupta
d460611e77 Add more checks
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-15 21:51:36 +00:00
Kubernetes Prow Robot
20344f9aba
Merge pull request #132345 from ffromani/e2e-podresourcesapi-labels
e2e: node: fix podresources API feature label
2025-07-15 13:16:29 -07:00
Kubernetes Prow Robot
394f412767
Merge pull request #132617 from aramase/aramase/f/kep_4412_pod_cache_key_type
Add ServiceAccountTokenCacheType support to credential provider plugin
2025-07-15 10:56:45 -07:00
Francesco Romani
05e1c4b489 e2e: node: fix podresources API feature label
We want to fix and enhance the lanes which exercise
the podresources API tests. The first step is to clarify
the label and make it specific to the podresources API,
minimizing the clash and the ambiguity with the "PodLevelResources"
feature.

Note that we change the label name, but the new name is backward
compatible (filtering for "Feature:PodResources" will still
select the tests). This turns out not to be a problem because
these tests are no longer called out explicitly in the lane
definitions. We want to change this ASAP.

The new name is more specific and allows us to clearly
call out tests for this feature in the lane definitions.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-15 14:15:00 +02:00
carlory
bd30b0adef remove generally available feature-gate DevicePluginCDIDevices
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-07-15 16:55:12 +08:00
Kubernetes Prow Robot
bf0be9fb56
Merge pull request #132028 from ffromani/podresources-list-active-pods
podresources: list: use active pods
2025-07-14 12:06:24 -07:00
Charles Wong
98c4514eae add e2e_node tests for uncore alignment 2025-07-11 10:32:01 -05:00
Anish Ramasekar
4d2566eb5a
credentialprovider: wire in service account mode cache type
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
2025-07-10 14:50:54 -05:00
Swati Gupta
bb6bd52012 Add feature gate enable test for KubeletPodResourcesGet
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-08 23:49:34 +00:00
Francesco Romani
8f92a81787 node: e2e: podresources: add more e2e tests
add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 17:18:34 +02:00
Francesco Romani
380ed8d9b3 e2e: node: memory manager: build everywhere, run only on linux
Since KEP 4885
(https://github.com/kubernetes/enhancements/blob/master/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md)
the memory manager is also supported on Windows.

In addition, we want to add podresources e2e tests which configure
the memory manager. Both these facts suggest it's useful to build
the e2e memory manager tests on all OSes, not just on Linux.

However, since we are not sure we are ready to run these tests
everywhere, we tag them LinuxOnly to preserve most of the
old behavior.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 17:18:34 +02:00
Francesco Romani
006b2a3b52 e2e: node: cpumanager: fix cpu quota non-regression tests
The non-regression tests should check that the quota management
introduced in #127525 can be disabled, so we need to verify
the previous behaviour using the integer quotas.

It seems the problem was just a bad rebase that wrongly duplicated
the tests. We fix this by removing the incorrect duplicates.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 16:10:16 +02:00
Itamar Holder
25d9d8d9ba refactor: use getLocalNode() to avoid code duplication
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-08 15:48:35 +03:00
Itamar Holder
bc9e8e1a91 add a context argument to prePodCreationModificationFunc()
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-08 15:45:42 +03:00