DeviceTaintRule is off by default because the corresponding v1beta2 API group
is off. When the feature was enabled, the potentially still disabled v1alpha3
API version was used instead of the new v1beta2, causing the scheduler to fail
while setting up informers and then to not schedule any pods.
This commit introduces metrics and improves log output for
DRA Device Binding Conditions (KEP-5007):
- scheduler_dra_bindingconditions_allocations_total
  Counts the number of per-device scheduling attempts during PreBind
  where BindingConditions are in use.
- scheduler_dra_bindingconditions_wait_duration_seconds
  Observes the time spent waiting for BindingConditions to be satisfied
  during PreBind.
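For illustration, a minimal sketch of how such metrics could be registered
with component-base/metrics (variable names and buckets here are assumptions,
not the actual implementation):

    package dynamicresources

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    var (
        // Incremented once per device handled during PreBind when
        // BindingConditions are in use.
        bindingConditionsAllocations = metrics.NewCounter(&metrics.CounterOpts{
            Subsystem:      "scheduler",
            Name:           "dra_bindingconditions_allocations_total",
            Help:           "Number of per-device scheduling attempts during PreBind where BindingConditions are in use.",
            StabilityLevel: metrics.ALPHA,
        })

        // Observes how long PreBind waited for BindingConditions.
        bindingConditionsWaitDuration = metrics.NewHistogram(&metrics.HistogramOpts{
            Subsystem:      "scheduler",
            Name:           "dra_bindingconditions_wait_duration_seconds",
            Help:           "Time spent waiting for BindingConditions to be satisfied during PreBind.",
            Buckets:        metrics.ExponentialBuckets(0.1, 2, 12), // assumed buckets
            StabilityLevel: metrics.ALPHA,
        })
    )

    func init() {
        legacyregistry.MustRegister(bindingConditionsAllocations, bindingConditionsWaitDuration)
    }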
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
The code paths for adding AllocationTimestamp were not tested well. None of
the test cases verified that an AllocationTimestamp gets added at all because
go-cmp was instructed to ignore the unpredictable field.
We can do better than that and at least check for existence by normalizing all
non-nil timestamps to the zero time. This affects all tests where the binding
conditions and thus AllocationTimestamp support is enabled.
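A minimal sketch of such a normalization, assuming the AllocationTimestamp
field on the allocation result (helper name and import aliases hypothetical):

    import (
        resourceapi "k8s.io/api/resource/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // normalizeAllocationTimestamps replaces every non-nil AllocationTimestamp
    // with the zero time, so go-cmp verifies that a timestamp was set without
    // comparing its unpredictable value.
    func normalizeAllocationTimestamps(claims []*resourceapi.ResourceClaim) {
        for _, claim := range claims {
            if claim.Status.Allocation != nil && claim.Status.Allocation.AllocationTimestamp != nil {
                claim.Status.Allocation.AllocationTimestamp = &metav1.Time{}
            }
        }
    }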
The retry loop for status updates was untested. The fake client has to return a
conflict status error to trigger it. This enables writing a test case where a
concurrent deallocation would have caused the nil panic without the previous
fix.
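A sketch of how the fake client can be made to trigger the retry loop (helper
and resource names illustrative):

    import (
        "errors"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/kubernetes/fake"
        k8stesting "k8s.io/client-go/testing"
    )

    // withOneConflict makes the fake client return a Conflict error exactly
    // once so that the status-update retry loop runs.
    func withOneConflict(client *fake.Clientset) {
        conflictReturned := false
        client.PrependReactor("update", "resourceclaims", func(action k8stesting.Action) (bool, runtime.Object, error) {
            if conflictReturned {
                return false, nil, nil // fall through to the default reactor
            }
            conflictReturned = true
            gr := schema.GroupResource{Group: "resource.k8s.io", Resource: "resourceclaims"}
            return true, nil, apierrors.NewConflict(gr, "claim", errors.New("concurrent update"))
        })
    }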
For binding conditions, one test case gets added which runs through the full
flow of allocating a claim and trying to bind it. All other test cases seem to
have started with the claim already allocated.
Altogether this increases coverage from 82.4% to 83.7%.
When DRAExtendedResource is enabled, the dynamicresources test setup
registers an event handler for DeviceClasses but was not waiting for it
to sync. This can lead to flaky tests where the cache is not fully
populated when the test starts.
This change captures the event handler registration and includes its
DoneChecker in a WaitFor call.
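Sketched with the plain client-go equivalents (the test helper actually uses
a DoneChecker with WaitFor; informer and handler names are assumptions):

    // Capture the registration returned by AddEventHandler and include its
    // HasSynced when waiting for caches to sync.
    reg, err := deviceClassInformer.Informer().AddEventHandler(handler)
    if err != nil {
        return err
    }
    if !cache.WaitForCacheSync(ctx.Done(), reg.HasSynced) {
        return fmt.Errorf("timed out waiting for DeviceClass event handler to sync")
    }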
This fixes a bug that caused log calls involving `klog.Logger` to not be
checked.
As a result we have to fix some code that is now considered faulty:
ERROR: pkg/controller/serviceaccount/tokens_controller.go:382:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (e *TokensController) generateTokenIfNeeded(ctx context.Context, logger klog.Logger, serviceAccount *v1.ServiceAccount, cachedSecret *v1.Secret) ( /* retry */ bool, error) {
ERROR: ^
ERROR: pkg/controller/storageversionmigrator/storageversionmigrator.go:299:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (svmc *SVMController) runMigration(ctx context.Context, logger klog.Logger, gvr schema.GroupVersionResource, resourceMonitor *garbagecollector.Monitor, toBeProcessedSVM *svmv1beta1.StorageVersionMigration, listResourceVersion string) (err error, failed bool) {
ERROR: ^
ERROR: pkg/proxy/node.go:121:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to exist", nodeName)
ERROR: ^
ERROR: pkg/proxy/node.go:123:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to be assigned IPs", nodeName)
ERROR: ^
ERROR: pkg/scheduler/backend/queue/scheduling_queue.go:610:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (p *PriorityQueue) runPreEnqueuePlugin(ctx context.Context, logger klog.Logger, pl fwk.PreEnqueuePlugin, pInfo *framework.QueuedPodInfo, shouldRecordMetric bool) *fwk.Status {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:286:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) deleteClaim(ctx context.Context, claim *resourceapi.ResourceClaim, logger klog.Logger) error {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:499:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) waitForExtendedClaimInAssumeCache(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:528:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) createExtendedResourceClaimInAPI(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:592:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) unreserveExtendedResourceClaim(ctx context.Context, logger klog.Logger, pod *v1.Pod, state *stateData) {
ERROR: ^
ERROR: pkg/scheduler/framework/runtime/batch.go:171:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (b *OpportunisticBatch) batchStateCompatible(ctx context.Context, logger klog.Logger, pod *v1.Pod, signature fwk.PodSignature, cycleCount int64, state fwk.CycleState, nodeInfos fwk.NodeInfoLister) bool {
ERROR: ^
ERROR: staging/src/k8s.io/component-base/featuregate/feature_gate.go:890:4: Additional arguments to Info should always be Key Value pairs. Please check if there is any key or value missing. (logcheck)
ERROR: logger.Info("Warning: SetEmulationVersionAndMinCompatibilityVersion will change already queried feature", "featureGate", feature, "oldValue", oldVal, newVal)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:108:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("pluginSocksDir: %s", pluginSocksDir)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:123:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("CDI_ENABLED: %s", cdiEnabled)
ERROR: ^
While waiting for this to merge, another call was added which also doesn't
follow conventions:
ERROR: pkg/kubelet/kubelet.go:2454:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (kl *Kubelet) deletePod(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
ERROR: ^
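The fixes follow a mechanical pattern; a sketch of the two most common ones
(function and variable names illustrative):

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        "k8s.io/klog/v2"
    )

    // Before (rejected): both a context and a logger parameter, so callers
    // must know whether the context has to contain the logger.
    func deletePodOld(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
        logger.Info("Deleting pod", "pod", klog.KObj(pod))
        return nil
    }

    // After: accept only the context and derive the logger from it.
    func deletePodNew(ctx context.Context, pod *v1.Pod) error {
        logger := klog.FromContext(ctx)
        logger.Info("Deleting pod", "pod", klog.KObj(pod))
        return nil
    }

    // Format specifiers get replaced by structured key/value pairs:
    //     logger.Info("pluginSocksDir: %s", pluginSocksDir)        // rejected
    //     logger.Info("Found plugin socks dir", "path", pluginSocksDir)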
Contextual logging has been beta and enabled by default for several releases
now. It's mostly just a matter of wrapping up and declaring it GA. Therefore
the calls which directly call WithName or WithValues (which always have an
effect) are left as-is instead of converting them to the klog
wrappers (which support disabling the effect). To allow that, the linter gets
reconfigured to not complain about this anymore, anywhere.
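For reference, the difference between the two forms (klog/v2 provides the
wrappers):

    logger = logger.WithName("dra-manager")             // always adds the name
    logger = klog.LoggerWithName(logger, "dra-manager") // no-op while contextual logging is disabled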
The calls which would have to be fixed otherwise are:
ERROR: pkg/kubelet/cm/dra/claiminfo.go:170:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-claiminfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:45:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:89:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:157:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:175:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:239:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:593:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:781:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(context.Background()).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:898:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager_test.go:1638:15: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(streamCtx).WithName(st.Name())
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:77:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:108:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:161:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker.go:695:14: function "WithValues" should be called through klogr.LoggerWithValues (logcheck)
ERROR: logger := logger.WithValues("device", deviceID)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:42:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd0URL, stopEtcd0, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd0"), "etcd_watchcache0", etcdArgs)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:47:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd1URL, stopEtcd1, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd1"), "etcd_watchcache1", etcdArgs)
ERROR: ^
ERROR: test/integration/scheduler_perf/scheduler_perf.go:1149:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName(tCtx.Name())
ERROR: ^
The assume cache logs adding assumed claims at V(4), but there was nothing
about in-flight claims in the log for a scheduling failure where the same
device was allocated twice (https://github.com/kubernetes/kubernetes/issues/133602).
Debugging that issue depends on seeing all assume cache changes (not just the
single "Assumed object" message) and the in-flight claims. We could make them
all V(4) (= "debug level"), but they seem more appropriate for V(5) (=
"trace level"), so the assume cache verbosity gets toned down to that.
The previous incubating becomes stable, and experimental becomes the new
incubating. Now experimental and incubating are identical until we merge more
experimental changes again.
Specifically, these commands were used:
rm -rf stable
mv incubating stable
mv stable/allocator_incubating.go stable/allocator_stable.go
mv stable/pools_incubating.go stable/pools_stable.go
sed -i -e 's/package incubating/package stable/' stable/*.go
cp -a experimental incubating
mv incubating/allocator_experimental.go incubating/allocator_incubating.go
mv incubating/pools_experimental.go incubating/pools_incubating.go
sed -i -e 's/package experimental/package incubating/' incubating/*.go
Some other packages then need to be adapted, in particular the
TestAllocatorSelection test.
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).
The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume
cache (which feeds the allocated devices tracker) and of the in-flight claims
lands such that GatherAllocatedState doesn't see the device in that claim as
allocated:
dra_manager.go:263: I0115 15:11:04.407714 18778] scheduler: Starting GatherAllocatedState
...
allocateddevices.go:189: I0115 15:11:04.407945 18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
dynamicresources.go:1150: I0115 15:11:04.407981 89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
dra_manager.go:201: I0115 15:11:04.408008 89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
dynamicresources.go:1157: I0115 15:11:04.408044 89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
{
"devices": {
"results": [
{
"request": "req-1",
"driver": "testdra-all-usesallresources-hvs5d.driver",
"pool": "worker-5",
"device": "worker-5-device-094"
}
]
},
"nodeSelector": {
"nodeSelectorTerms": [
{
"matchFields": [
{
"key": "metadata.name",
"operator": "In",
"values": [
"worker-5"
]
}
]
}
]
},
"allocationTimestamp": "2026-01-15T14:11:04Z"
}
>
dra_manager.go:280: I0115 15:11:04.408085 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
dra_manager.go:280: I0115 15:11:04.408137 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
default_binder.go:69: I0115 15:11:04.408175 89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
dra_manager.go:265: I0115 15:11:04.408264 18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {
Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight
=> device never seen as allocated
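The same interleaving as a self-contained sketch (data structures heavily
simplified; the real code reads the assume cache and the in-flight claims):

    package main

    import (
        "fmt"
        "sync"
    )

    // Simplified stand-ins for the two sources of allocation state.
    type trackers struct {
        mu       sync.Mutex
        cache    map[string]bool // devices known via the assume cache
        inFlight map[string]bool // devices in claims still being written
    }

    func (t *trackers) snapshot(m map[string]bool) map[string]bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        out := make(map[string]bool, len(m))
        for k := range m {
            out[k] = true
        }
        return out
    }

    func main() {
        t := &trackers{
            cache:    map[string]bool{},
            inFlight: map[string]bool{"device-094": true},
        }

        allocated := t.snapshot(t.cache) // read #1: not in the cache yet

        // Concurrent writer: the claim lands in the cache, leaves in-flight.
        done := make(chan struct{})
        go func() {
            defer close(done)
            t.mu.Lock()
            t.cache["device-094"] = true
            delete(t.inFlight, "device-094")
            t.mu.Unlock()
        }()
        <-done

        inFlight := t.snapshot(t.inFlight) // read #2: no longer in-flight

        // Neither snapshot contains the device: never seen as allocated.
        fmt.Println(allocated["device-094"] || inFlight["device-094"]) // false
    }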
This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but often enough that regressions should show up as flakes.
Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).
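A minimal sketch of the bubble pattern with the standard library's
testing/synctest (Go >= 1.25; the unit tests use ktesting's tCtx.SyncTest
wrapper instead):

    import (
        "testing"
        "testing/synctest"
        "time"
    )

    func TestWithFakeTime(t *testing.T) {
        synctest.Test(t, func(t *testing.T) {
            // Inside the bubble time is virtual: this sleep completes as soon
            // as all goroutines in the bubble are durably blocked, so waiting
            // for the assume cache to catch up no longer depends on load.
            time.Sleep(10 * time.Second)
        })
    }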
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain timestamps and then having that test case run within a certain
real-world time window. It's surprising that this didn't flake more often,
given that test execution order is random. Now the timestamp gets set right
before the test case runs. Converting to a synctest would be nicer, but
synctests cannot have sub-tests, which are used here to track where log
output and failures come from within the larger test case.
Inside the plugin itself some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or error during Filter.
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602) it is necessary
to have information about pools, devices and in-flight claims. Log calls get
extended and the config for DRA CI jobs updated to enable higher verbosity for
relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin-specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
The test started without waiting for the ResourceSlice informer to have
synced. As a result, the "CEL-runtime-error-for-one-of-three-nodes" test case
failed randomly with a very low flake rate (less than 1% in local runs) because
CEL expressions never got evaluated due to not having the slices (yet).
Other tests were also less reliable, but not known to fail.
Support for DeviceTaintRules depends on a significant amount of
additional code:
- ResourceSlice tracker is a NOP without it.
- Additional informers and corresponding permissions in scheduler and controller.
- Controller code for handling status.
Not all users necessarily need DeviceTaintRules, so adding a second feature
gate for that code makes it possible to limit the blast radius of bugs in that
code without having to turn off device taints and tolerations entirely.
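A sketch of what the split could look like with component-base feature gates
(gate names and stages here are assumptions):

    import (
        "k8s.io/component-base/featuregate"
    )

    const (
        // DRADeviceTaints covers taints and tolerations themselves.
        DRADeviceTaints featuregate.Feature = "DRADeviceTaints"
        // DRADeviceTaintRules separately gates DeviceTaintRule support
        // (tracker, extra informers and permissions, status controller).
        DRADeviceTaintRules featuregate.Feature = "DRADeviceTaintRules"
    )

    func register(fg featuregate.MutableFeatureGate) error {
        return fg.Add(map[featuregate.Feature]featuregate.FeatureSpec{
            DRADeviceTaints:     {Default: false, PreRelease: featuregate.Alpha},
            DRADeviceTaintRules: {Default: false, PreRelease: featuregate.Alpha},
        })
    }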
Add a new `bindingTimeout` field to DynamicResources plugin args and wire it
into PreBind.
Changes:
- API: add `bindingTimeout` to DynamicResourcesArgs (staging + internal types).
- Defaults: default to 600 seconds when BOTH DRADeviceBindingConditions and
DRAResourceClaimDeviceStatus are enabled.
- Validation: require >= 1s; forbid when either feature gate is disabled.
- Plugin: plumb the args into `pl.bindingTimeout` and use it in
  `wait.PollUntilContextTimeout` for the binding-condition wait logic.
- Plugin: remove the legacy `BindingTimeoutDefaultSeconds`.
Tests:
- Add/adjust unit tests for validation and the PreBind timeout path.
- Ensure that <1s and negative values are rejected and that the field is
  forbidden when either gate is disabled.
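A hedged sketch of the args shape and the wait (field type and wiring are
assumptions; the condition helper is hypothetical):

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // Sketch of the args type in the scheduler configuration API.
    type DynamicResourcesArgs struct {
        metav1.TypeMeta

        // BindingTimeout caps how long PreBind waits for BindingConditions.
        // Defaulted to 600s when both gates are enabled; must be >= 1s.
        BindingTimeout *metav1.Duration
    }

    // Sketch of the PreBind wait loop.
    func waitForBindingConditions(ctx context.Context, timeout time.Duration, satisfied wait.ConditionWithContextFunc) error {
        return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, satisfied)
    }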
The cache and scheduler event handlers cannot be registered separately in the
informer because that leads to a race (the scheduler might schedule based on
an event before the cache is updated). Chaining the event handlers (cache
first, then scheduler) avoids this.
This also ensures that the cache is up-to-date before the scheduler
starts (HasSynced of the handler registration for the cache is checked).
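Sketched, the chaining looks like this (handler and variable names are
assumptions):

    // One registration; the cache handler runs before the scheduler handler
    // for each event, so scheduling never observes an event before the cache.
    chained := cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            cacheHandler.OnAdd(obj, false)
            schedulerHandler.OnAdd(obj, false)
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            cacheHandler.OnUpdate(oldObj, newObj)
            schedulerHandler.OnUpdate(oldObj, newObj)
        },
        DeleteFunc: func(obj interface{}) {
            cacheHandler.OnDelete(obj)
            schedulerHandler.OnDelete(obj)
        },
    }
    reg, err := informer.Informer().AddEventHandler(chained)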
Other changes:
- renamed package to avoid clash with other "cache" packages
- clarified nil handling
- feature gate check before instantiating the cache
- per-test logging
- utilruntime.HandleErrorWithLogger
- simpler cache.DeletedFinalStateUnknown
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Extended resource to device class mapping:
- Add a new interface "DeviceClassResolver" in the scheduler framework
- Add a global cache of mapping between the extended resource and the
device class
- The cache can be leveraged by the kube-apiserver and controller-manager
  along with the scheduler
- This change helps delegate requests to the dynamicresources plugin based on
  the mapping during node update events, thus avoiding an extra scheduling
  cycle (see the sketch below)
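A sketch of the interface shape (the actual definition may differ):

    import (
        v1 "k8s.io/api/core/v1"
    )

    // DeviceClassResolver maps an extended resource to the device class that
    // serves it, backed by the shared cache described above.
    type DeviceClassResolver interface {
        // DeviceClassForResource returns the device class name for the given
        // extended resource and whether a mapping exists.
        DeviceClassForResource(resource v1.ResourceName) (string, bool)
    }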
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>