Commit graph

7208 commits

Author SHA1 Message Date
Nour
58cbde2aff
Pass individual informers, move DRA controllers to resource.go, simplify retry logic and metric tests
Signed-off-by: Nour <nurmn3m@gmail.com>
2026-03-19 16:50:03 +02:00
Nour
8b9159baa4
Drop CSR analogy, mark ObjectMeta +required, reduce limits (maxItems=500, maxLength=128) for etcd safety, add Errors printer column
Signed-off-by: Nour <nurmn3m@gmail.com>
2026-03-19 16:50:03 +02:00
Nour
4dffbf5b2a
Add tests for ResourcePoolStatusRequest
Add unit tests for handwritten and declarative validation, controller
logic, metrics, table printer output, controller-manager registration,
etcd storage round-trip, and an integration test for the full RPSR
lifecycle. Also add an e2e test exercising the DRA test driver with
RPSR and the example manifest.
2026-03-19 16:50:03 +02:00
Nour
30fe79df21
Add ResourcePoolStatusRequest controller, registry, and RBAC
Implement the RPSR controller that watches ResourcePoolStatusRequest
objects and aggregates pool status from DRA drivers. Add the API server
registry (strategy, storage), handwritten validation, RBAC bootstrap
policy for the controller, kube-controller-manager wiring, table
printer columns, and storage factory registration.
2026-03-19 16:50:02 +02:00
Kubernetes Prow Robot
9d02f5f918
Merge pull request #137032 from helayoty/helayoty/5547-workload-job-integration
KEP-5547: Implement Workload APIs integration with Job controller
2026-03-19 17:10:31 +05:30
Kubernetes Prow Robot
98bb6823a8
Merge pull request #137862 from gnufied/pvc-unused-since-condition
Report PVC unused time via PVC condition
2026-03-19 07:08:49 +05:30
Kubernetes Prow Robot
b865748c1c
Merge pull request #135118 from johanneswuerbach/scaletozero
KEP-2021: HPA condition based scaling to zero
2026-03-19 03:36:30 +05:30
Roman Bednar
58f1520a03 update resize e2e tests to check only resize conditions 2026-03-18 17:08:11 -04:00
Roman Bednar
6c087b2724 add unused condition to persistent volume claims 2026-03-18 17:08:08 -04:00
helayoty
68e30095de
Implement Workload and PodGroup integration with Job controller
Signed-off-by: helayoty <heelayot@microsoft.com>
2026-03-18 20:32:37 +00:00
Kubernetes Prow Robot
1b5bcf309c
Merge pull request #137641 from helayoty/helayoty/protection-controller-podgroup
KEP-5832: Add protection controller for PodGroup
2026-03-19 01:03:00 +05:30
Kubernetes Prow Robot
92767f8e32
Merge pull request #137746 from dims/dsrinivas/issue-137740-nodeipam-resync
nodeipam: buffer TestNodeSyncResync report notifications
2026-03-18 23:21:21 +05:30
helayoty
1b90507cfa
move protectionutil pkg to controller/util
Signed-off-by: helayoty <heelayot@microsoft.com>
2026-03-18 15:27:56 +00:00
helayoty
0ef8d78d1d
Add new protection controller for PodGroup
Signed-off-by: helayoty <heelayot@microsoft.com>
2026-03-18 15:27:17 +00:00
Alay Patel
b9729e8197 kep-5304: add DeviceMetadata API 2026-03-18 08:29:42 -04:00
Johannes Würbach
6bebe8d3a2
KEP-2021: HPA condition based scaling to zero 2026-03-17 09:18:18 +01:00
Kubernetes Prow Robot
2d7979b985
Merge pull request #136367 from bhope/metrics-beta-job-controller
Promote job controller metrics to beta
2026-03-17 04:47:44 +05:30
Davanum Srinivas
5d3e8d9db6
nodeipam: buffer TestNodeSyncResync report notifications
TestNodeSyncResync closes opChan after observing the first resync and then
waits for the loop to exit. There is still a small window where the 1ms
resync timer fires again before the select notices the closed channel.
When that happens ReportResult sends a second notification on the
unbuffered reportChan, the loop blocks in the send, and the test waits
forever on doneChan.

Allow one queued notification so the loop can drain that race and reach
the closed opChan case. The test still validates that a resync happened;
it just stops depending on exact scheduling between two ready events.

Tested:
go test -race ./pkg/controller/nodeipam/ipam/sync -run TestNodeSyncResync -count=200
2026-03-14 11:58:31 -04:00
Kubernetes Prow Robot
4e2bbc78bf
Merge pull request #137170 from pohly/dra-device-taints-beta
DRA device taints: graduate to beta
2026-03-13 00:13:38 +05:30
Patrick Ohly
566dc7f3f3 DRA device taints: graduate to beta
The fields become beta, enabled by default. DeviceTaintRule gets
added to the v1beta2 API, but support for it must remain off by default
because that API group is also off by default.

The v1beta1 API is left unchanged. No one should be using it
anymore (deprecated in 1.33; it could be removed now if it weren't
needed for reading old objects and version emulation).

To achieve consistent validation, declarative validation must be enabled also
for v1alpha3 (was already enabled for other versions). Otherwise,
TestVersionedValidationByFuzzing fails:

    --- FAIL: TestVersionedValidationByFuzzing (0.09s)
        --- FAIL: TestVersionedValidationByFuzzing/resource.k8s.io/v1beta2,_Kind=DeviceTaintRule (0.00s)
            validation_test.go:109: different error count (0 vs. 1)
                resource.k8s.io/v1alpha3: <no errors>
                resource.k8s.io/v1beta2: "spec.taint.effect: Unsupported value: \"幤HxÒQP¹¬永唂ȳ垞ş]嘨鶊\": supported values: \"NoExecute\", \"NoSchedule\", \"None\""
            ...
2026-03-12 18:26:02 +01:00
Prathamesh Bhope
408c17e102 job/metrics: promote metrics to beta and add test 2026-03-12 04:16:29 -07:00
Kubernetes Prow Robot
6d92449054
Merge pull request #134290 from huww98/kcm-no-get-pv
Do not get PV for externally deleting volume
2026-03-12 05:13:35 +05:30
Kubernetes Prow Robot
e3c05bfa4e
Merge pull request #136700 from Jefftree/cra-fix
simplify cluster role aggregation and remove update path
2026-03-12 00:45:35 +05:30
Kubernetes Prow Robot
38940f0222
Merge pull request #135297 from michaelasp/svmUpdateCRD
Remove CRD stored versions from status upon SVM migration
2026-03-11 08:03:09 +05:30
Michael Aspinwall
d274e05cc9 Remove CRD stored versions from status upon SVM migration 2026-03-11 00:50:27 +00:00
Kubernetes Prow Robot
aa5abdd371
Merge pull request #136817 from kairosci/fix-gc-notfound-136525
Handle NotFound errors in garbage collector
2026-03-11 03:53:09 +05:30
Alessio Attilio
8ed40e7ae7 test: add unit tests for deleteObject NotFound handling in garbage collector
When deleteObject returns a NotFound error (the object was externally deleted
between the GET and the DELETE), attemptToDeleteItem should enqueue a virtual
delete event and return enqueuedVirtualDeleteEventErr.

Cover both code paths:
- default (background propagation): item with dangling owner
- waitingForDependentsDeletion: item whose owner is foreground-deleting
2026-03-10 20:43:30 +01:00
Kubernetes Prow Robot
21b427c299
Merge pull request #136827 from atombrella/feature/fix_nilness_controller
Fix cases of nilness under pkg/controller.
2026-03-10 15:15:11 +05:30
Kubernetes Prow Robot
3d6026d2fd
Merge pull request #136178 from omerap12/promote-hpa-metrics
promote HPA metrics to beta
2026-03-10 01:19:13 +05:30
Kubernetes Prow Robot
2bbb175707
Merge pull request #137461 from ahmedharabi/fix/statefulset-error-wrapping
statefulset: wrap errors with %w in StatefulPodControl
2026-03-07 00:08:25 +05:30
Jordan Liggitt
45900a1deb
Fix vet error 2026-03-05 18:11:02 -05:00
ahmedharabi
a0dee17c1d statefulset: wrap errors with %w in StatefulPodControl
Signed-off-by: ahmedharabi <harabiahmed88@gmail.com>
2026-03-05 23:02:16 +01:00
Kubernetes Prow Robot
c6f70e3a38
Merge pull request #136399 from tico88612/feat/storage-metric-beta
Rename metric `volume_operation_total_errors` to `volume_operation_errors_total`
2026-03-06 00:46:18 +05:30
Omer Aplatony
3799fc9942
Add unit tests for HPA metrics (#136670)
* Add unit tests for HPA metrics

* removed mock monitor

* fmt

* spelling

* lint

* lint

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
2026-03-05 19:10:26 +05:30
Kubernetes Prow Robot
8bd1505fc0
Merge pull request #137108 from pohly/logtools-update
golangci-lint: bump to logtools v0.10.1
2026-03-05 10:14:16 +05:30
Kubernetes Prow Robot
8275484dcf
Merge pull request #137297 from atombrella/feature/pkg_forvar_modernize
Remove redundant variable re-assignment in for-loops under pkg
2026-03-05 00:28:20 +05:30
xigang
9d10b1f799 refactor: remove unused desiredStateOfWorld parameter from DetermineVolumeAction
Signed-off-by: xigang <wangxigang2014@gmail.com>
2026-03-04 22:01:43 +08:00
Kubernetes Prow Robot
9d7dda7186
Merge pull request #137245 from atombrella/feature/slices_contains_pkg_controller
Update `pkg/controller` to use slices.Contains
2026-03-04 18:04:20 +05:30
Patrick Ohly
b895ce734f golangci-lint: bump to logtools v0.10.1
This fixes a bug that caused log calls involving `klog.Logger` to not be
checked.

As a result we have to fix some code that is now considered faulty:

    ERROR: pkg/controller/serviceaccount/tokens_controller.go:382:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (e *TokensController) generateTokenIfNeeded(ctx context.Context, logger klog.Logger, serviceAccount *v1.ServiceAccount, cachedSecret *v1.Secret) ( /* retry */ bool, error) {
    ERROR: ^
    ERROR: pkg/controller/storageversionmigrator/storageversionmigrator.go:299:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (svmc *SVMController) runMigration(ctx context.Context, logger klog.Logger, gvr schema.GroupVersionResource, resourceMonitor *garbagecollector.Monitor, toBeProcessedSVM *svmv1beta1.StorageVersionMigration, listResourceVersion string) (err error, failed bool) {
    ERROR: ^
    ERROR: pkg/proxy/node.go:121:3: logging function "Error" should not use format specifier "%q" (logcheck)
    ERROR: 		klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to exist", nodeName)
    ERROR: 		^
    ERROR: pkg/proxy/node.go:123:3: logging function "Error" should not use format specifier "%q" (logcheck)
    ERROR: 		klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to be assigned IPs", nodeName)
    ERROR: 		^
    ERROR: pkg/scheduler/backend/queue/scheduling_queue.go:610:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (p *PriorityQueue) runPreEnqueuePlugin(ctx context.Context, logger klog.Logger, pl fwk.PreEnqueuePlugin, pInfo *framework.QueuedPodInfo, shouldRecordMetric bool) *fwk.Status {
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:286:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) deleteClaim(ctx context.Context, claim *resourceapi.ResourceClaim, logger klog.Logger) error {
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:499:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) waitForExtendedClaimInAssumeCache(
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:528:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) createExtendedResourceClaimInAPI(
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:592:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) unreserveExtendedResourceClaim(ctx context.Context, logger klog.Logger, pod *v1.Pod, state *stateData) {
    ERROR: ^
    ERROR: pkg/scheduler/framework/runtime/batch.go:171:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (b *OpportunisticBatch) batchStateCompatible(ctx context.Context, logger klog.Logger, pod *v1.Pod, signature fwk.PodSignature, cycleCount int64, state fwk.CycleState, nodeInfos fwk.NodeInfoLister) bool {
    ERROR: ^
    ERROR: staging/src/k8s.io/component-base/featuregate/feature_gate.go:890:4: Additional arguments to Info should always be Key Value pairs. Please check if there is any key or value missing. (logcheck)
    ERROR: 			logger.Info("Warning: SetEmulationVersionAndMinCompatibilityVersion will change already queried feature", "featureGate", feature, "oldValue", oldVal, newVal)
    ERROR: 			^
    ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:108:2: logging function "Info" should not use format specifier "%s" (logcheck)
    ERROR: 	logger.Info("pluginSocksDir: %s", pluginSocksDir)
    ERROR: 	^
    ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:123:2: logging function "Info" should not use format specifier "%s" (logcheck)
    ERROR: 	logger.Info("CDI_ENABLED: %s", cdiEnabled)
    ERROR: 	^

While waiting for this to merge, another call was added which also doesn't
follow conventions:

    ERROR: pkg/kubelet/kubelet.go:2454:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (kl *Kubelet) deletePod(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
    ERROR: ^

Contextual logging has been beta and enabled by default for several releases
now. It's mostly just a matter of wrapping up and declaring it GA. Therefore
the calls which directly call WithName or WithValues (which always have an
effect) are left as-is instead of converting them to the klog wrappers (which
support disabling the effect). To allow that, the linter gets reconfigured to
no longer complain about this, anywhere.

The calls which would have to be fixed otherwise are:

    ERROR: pkg/kubelet/cm/dra/claiminfo.go:170:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-claiminfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:45:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:89:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:157:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/manager.go:175:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:239:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:593:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:781:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(context.Background()).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:898:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager_test.go:1638:15: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 				logger := klog.FromContext(streamCtx).WithName(st.Name())
    ERROR: 				          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:77:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:108:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:161:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker.go:695:14: function "WithValues" should be called through klogr.LoggerWithValues (logcheck)
    ERROR: 			logger := logger.WithValues("device", deviceID)
    ERROR: 			          ^
    ERROR: test/integration/apiserver/watchcache_test.go:42:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	etcd0URL, stopEtcd0, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd0"), "etcd_watchcache0", etcdArgs)
    ERROR: 	                                                    ^
    ERROR: test/integration/apiserver/watchcache_test.go:47:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	etcd1URL, stopEtcd1, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd1"), "etcd_watchcache1", etcdArgs)
    ERROR: 	                                                    ^
    ERROR: test/integration/scheduler_perf/scheduler_perf.go:1149:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 		logger = logger.WithName(tCtx.Name())
    ERROR: 		         ^
2026-03-04 12:08:18 +01:00
Kubernetes Prow Robot
5941fed3d6
Merge pull request #136912 from dfajmon/selinux-ga
Promote SELinuxChangePolicy & SELinuxMountReadWriteOncePod to GA
2026-03-03 22:07:29 +05:30
Kubernetes Prow Robot
11c10dc5a0
Merge pull request #136939 from pohly/dra-device-taints-unit-test-improvements
DRA device taints: update unit tests
2026-03-03 02:48:54 +05:30
Mads Jensen
f11bb48738 Remove redundant re-assignment in for-loops under pkg
This applies the forvar rule from modernize. The semantics of the for-loop
changed in Go 1.22, making this pattern obsolete.
2026-03-02 08:47:43 +01:00
ChengHao Yang
5c88906dca
Rename volume_operation_total_errors to volume_operation_errors_total
Rename this because of a lint error: counter metrics should have a
"_total" suffix. Add a test for `volume_operation_errors_total` and
mark `volume_operation_total_errors` as deprecated.

Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
2026-02-28 20:08:07 +08:00
Kubernetes Prow Robot
330950ca52
Merge pull request #137254 from michaelasp/statefulConsistency
Add the ability for the statefulset controller to read its own writes
2026-02-28 01:39:30 +05:30
Michael Aspinwall
c8e8bd5085 Add the ability for the statefulset controller to read its own writes 2026-02-27 18:21:30 +00:00
Daniel Fajmon
b0919d81a0 Promote SELinuxChangePolicy & SELinuxMountReadWriteOncePod to GA 2026-02-27 14:58:14 +01:00
Patrick Ohly
29e92367db DRA device taints: avoid unnecessary Pod lookup
When rapidly processing informer events it can happen that a pod gets queued
for eviction twice (seen only in the TestEviction/update unit test):

- Claim update observed, pod from informer cache with NodeName from update -> queue pod for eviction.
- Pod update observed, claim from informer cache -> queue pod again.

The effect is one additional Get call to the apiserver. We can avoid it by
maintaining an LRU cache with the UIDs of the pods which we have evicted and
thus don't need to do anything for.
2026-02-27 14:38:30 +01:00
Patrick Ohly
017a53a1a9 DRA device taints: simplify more tests with synctest
In these cases it's certain that no time needs to pass, so Wait can
replace polling with Eventually. This also means that locking is
not necessary to prevent data races.
2026-02-27 07:47:28 +01:00
Patrick Ohly
4521c34276 DRA device taints: remove usage of testify for unit test
In particular with the builtin tCtx.Assert/Expect, the assertions are just as
short when using gomega and often more readable (no more confusion in Equal
about which value is the expected one and which the actual one).
2026-02-27 07:47:28 +01:00
Patrick Ohly
fb94a99d2f DRA device taints: artificially delay pod deletion during test
We can observe the delay in the metric histogram. Because we run in a synctest
bubble, the delay is 100% predictable.

Unfortunately we cannot use the reactor mechanism of the fake client: that
delays while holding the fake's mutex. When some other goroutine (in this case,
the event recorder) calls the client, it gets blocked without being considered
durably blocked by synctest, so time does not advance and the test gets stuck.
2026-02-27 07:47:28 +01:00