Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).
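A minimal sketch of the pattern (not the Kubernetes test itself), assuming Go 1.25's testing/synctest: inside the bubble, time.Sleep advances a fake clock as soon as all goroutines block, so waiting for the assume cache no longer depends on how loaded the machine is.

```go
package example

import (
	"testing"
	"testing/synctest"
	"time"
)

// TestCacheCatchUp shows the idea only; the goroutine stands in for the
// assume cache processing an event after some delay.
func TestCacheCatchUp(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		updated := false
		go func() {
			time.Sleep(time.Second) // fake time inside the bubble
			updated = true
		}()
		// Advances the bubble's clock instantly once all goroutines block;
		// no real-world delay, no sensitivity to system load.
		time.Sleep(2 * time.Second)
		synctest.Wait() // makes the goroutine's write visible to the test
		if !updated {
			t.Error("cache did not catch up")
		}
	})
}
```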
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain timestamps and then having that test case run within a certain
real-world time window. It's surprising that this didn't flake more often,
because test execution order is random. Now the timestamp gets set right
before the test case runs. Conversion to a synctest would
be nicer, but synctests cannot have sub-tests, which are used here to track
where log output and failures come from within the larger test case.
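A purely illustrative sketch of that pattern (none of the names below come from the actual test): the time-sensitive value is computed from time.Now() right before the sub-test executes, not when the test table is built.

```go
package example

import (
	"testing"
	"time"
)

// TestBindingTimeoutSketch is hypothetical; it only demonstrates deriving the
// timestamp immediately before the case runs so the case no longer depends on
// being reached within a real-world time window.
func TestBindingTimeoutSketch(t *testing.T) {
	cases := map[string]struct {
		claimAge    time.Duration // how long ago binding supposedly started
		wantTimeout bool
	}{
		"recent-binding":  {claimAge: time.Second, wantTimeout: false},
		"expired-binding": {claimAge: 10 * time.Minute, wantTimeout: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			// Computed here, right before the case runs, not during setup.
			bindingStart := time.Now().Add(-tc.claimAge)
			timedOut := time.Since(bindingStart) > 5*time.Minute // illustrative timeout
			if timedOut != tc.wantTimeout {
				t.Errorf("timedOut = %v, want %v", timedOut, tc.wantTimeout)
			}
		})
	}
}
```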
Inside the plugin itself, some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or an error during Filter.
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602), it is necessary
to have information about pools, devices, and in-flight claims. Log calls get
extended and the config for DRA CI jobs gets updated to enable higher verbosity
for the relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
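A rough sketch of what such a verbosity-gated, structured log call can look like; the function, its parameters, and poolSummary are illustrative, only the message and key names mirror the sample output above.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// logPools is illustrative, not the allocator's code: the pool summary is only
// gathered and emitted when -v=6 (or higher) is in effect for this file.
func logPools(logger klog.Logger, pod *v1.Pod, nodeName string, poolSummary any) {
	if loggerV := logger.V(6); loggerV.Enabled() {
		loggerV.Info("Gathered pool information",
			"pod", klog.KObj(pod),
			"node", nodeName,
			"pools", poolSummary, // e.g. count, device IDs, per-pool metadata
		)
	}
}
```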
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2.
Also add some new high-level logging and simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin-specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
The test started without waiting for the ResourceSlice informer to have
synced. As a result, the "CEL-runtime-error-for-one-of-three-nodes" test case
failed randomly with a very low flake rate (less than 1% in local runs) because
CEL expressions never got evaluated when the slices were not available yet.
Other tests were also less reliable, but not known to fail.
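A minimal sketch of the missing step, assuming the slices come from a started client-go SharedIndexInformer (helper name and parameters are illustrative):

```go
package example

import (
	"context"
	"testing"

	"k8s.io/client-go/tools/cache"
)

// waitForSlices blocks until the ResourceSlice informer reports HasSynced, so
// CEL expressions get evaluated against the actual slices instead of an empty
// store.
func waitForSlices(ctx context.Context, t *testing.T, informer cache.SharedIndexInformer) {
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		t.Fatal("timed out waiting for ResourceSlice informer to sync")
	}
}
```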
Support for DeviceTaintRules depends on a significant amount of
additional code:
- ResourceSlice tracker is a NOP without it.
- Additional informers and corresponding permissions in scheduler and controller.
- Controller code for handling status.
Not all users necessarily need DeviceTaintRules, so adding a second feature
gate makes it possible to limit the blast radius of bugs in that code without
having to turn off device taints and tolerations entirely.
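A hedged sketch of the split; the gate name and function below are placeholders, not necessarily the real identifiers:

```go
package example

import (
	"k8s.io/component-base/featuregate"
)

// Placeholder gate name; the point is the separation, not the exact identifier.
const DeviceTaintRulesGate featuregate.Feature = "DRADeviceTaintRules"

// startTaintSupport sketches how the extra machinery stays off unless the
// second gate is enabled, while plain device taints/tolerations keep working.
func startTaintSupport(fg featuregate.FeatureGate) {
	if !fg.Enabled(DeviceTaintRulesGate) {
		// ResourceSlice tracker remains a NOP; no DeviceTaintRule informer,
		// no extra RBAC, no status-handling controller.
		return
	}
	// ... wire up the tracker, the additional informers, and the controller
	// code for handling status ...
}
```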
Add a new `bindingTimeout` field to DynamicResources plugin args and wire it
into PreBind.
Changes:
- API: add `bindingTimeout` to DynamicResourcesArgs (staging + internal types).
- Defaults: default to 600 seconds when BOTH DRADeviceBindingConditions and
DRAResourceClaimDeviceStatus are enabled.
- Validation: require >= 1s; forbid when either feature gate is disabled.
- Plugin: plumb the args into `pl.bindingTimeout` and use it in
`wait.PollUntilContextTimeout` for the binding-condition wait logic (see the
sketch after this list).
- Plugin: remove the legacy `BindingTimeoutDefaultSeconds`.
Tests:
- Add/adjust unit tests for validation and PreBind timeout path.
- Ensure <1s and negative values are rejected and that the field is forbidden
when the gates are disabled.
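Rough sketch of the PreBind wait, assuming the plugin's condition check is wrapped in a wait.ConditionWithContextFunc (names are illustrative):

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForBinding bounds the binding-condition wait with the configurable
// bindingTimeout (600s by default per the args above). claimBound stands in
// for the plugin's actual check of the claim's binding conditions.
func waitForBinding(ctx context.Context, bindingTimeout time.Duration, claimBound wait.ConditionWithContextFunc) error {
	// Poll every second, check immediately on entry, give up after bindingTimeout.
	return wait.PollUntilContextTimeout(ctx, time.Second, bindingTimeout, true, claimBound)
}
```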
The cache and scheduler event handlers cannot be registered separately in the
informer, because that leads to a race (the scheduler might schedule based on
an event before the cache is updated). Chaining the event handlers (cache
first, then scheduler) avoids this.
This also ensures that the cache is up-to-date before the scheduler
starts (HasSynced of the handler registration for the cache is checked).
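A minimal sketch of the chaining idea, assuming both existing handlers implement client-go's cache.ResourceEventHandler (type and field names are illustrative):

```go
package example

import (
	"k8s.io/client-go/tools/cache"
)

// chainedHandler forwards every event to the cache handler first and only
// then to the scheduler handler, so the scheduler never reacts to an event
// before the cache reflects it.
type chainedHandler struct {
	cacheHandler     cache.ResourceEventHandler
	schedulerHandler cache.ResourceEventHandler
}

func (h chainedHandler) OnAdd(obj interface{}, isInInitialList bool) {
	h.cacheHandler.OnAdd(obj, isInInitialList)
	h.schedulerHandler.OnAdd(obj, isInInitialList)
}

func (h chainedHandler) OnUpdate(oldObj, newObj interface{}) {
	h.cacheHandler.OnUpdate(oldObj, newObj)
	h.schedulerHandler.OnUpdate(oldObj, newObj)
}

func (h chainedHandler) OnDelete(obj interface{}) {
	h.cacheHandler.OnDelete(obj)
	h.schedulerHandler.OnDelete(obj)
}
```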
Other changes:
- renamed package to avoid clash with other "cache" packages
- clarified nil handling
- feature gate check before instantiating the cache
- per-test logging
- utilruntime.HandleErrorWithLogger
- simpler cache.DeletedFinalStateUnknown
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Extended resource to device class mapping
- Add a new interface "DeviceClassResolver" in the scheduler framework (see
the sketch after this list)
- Add a global cache of the mapping between an extended resource and its
device class
- The cache can be leveraged by the k8s API server and controller-manager,
along with the scheduler
- This change helps delegate requests to the dynamicresources plugin based on
the mapping during node update events, thus avoiding an extra scheduling cycle
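A hedged sketch of what such an interface and cache could look like; the method set, the returned value, and the backing map are assumptions, not the actual scheduler framework API:

```go
package example

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// DeviceClassResolver is an illustrative shape for the interface: given an
// extended resource name, return the name of the DRA device class that backs
// it, or "" if the resource is not mapped.
type DeviceClassResolver interface {
	DeviceClassForResource(name v1.ResourceName) string
}

// resolverCache is an illustrative map-backed implementation of the shared
// mapping (extended resource -> device class name).
type resolverCache struct {
	mu      sync.RWMutex
	mapping map[v1.ResourceName]string
}

func (c *resolverCache) DeviceClassForResource(name v1.ResourceName) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.mapping[name]
}
```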
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Previously, the scheduler assumed an extended resource was maintained
by a device plugin if its name was present in the node's Allocatable
map, even if its value was zero. This blocked scheduling when a device
plugin was disconnected or uninstalled, because Kubelet still reported
the resource with Allocatable=0.
This change adds a check for the actual allocatable value in addition
to a key presence check, allowing nodes with uninstalled device
plugins to be considered for scheduling.
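The tightened check boils down to something like the following sketch (the function name is illustrative):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// managedByDevicePlugin treats an extended resource as device-plugin-managed
// only if it is present in the node's Allocatable map AND its value is
// non-zero; a disconnected or uninstalled plugin leaves the key behind with a
// zero quantity.
func managedByDevicePlugin(node *v1.Node, resourceName v1.ResourceName) bool {
	quantity, ok := node.Status.Allocatable[resourceName]
	return ok && !quantity.IsZero()
}
```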
This prevents the mistake from 1.34 where the default-on
DRAResourceClaimDeviceStatus feature caused the use of the experimental
allocator implementation. The test fails without a fix for that.
In 1.34, the default feature gate selection picked the "experimental" allocator
implementation when it should have used the "incubating" allocator. No harm
came from that because the experimental allocator has all the necessary if
checks to disable the extra code and no bugs were introduced when implementing
it, but it means that our safety net wasn't there when we expected it to be.
The reason is that the "DRAResourceClaimDeviceStatus" feature gate is on by
default and was only listed as supported by the experimental implementation.
This could be fixed by also listing it as supported by the other
implementation, but that would be a bit odd because there is nothing to support
for it (which is why this was missed in 1.34!). Instead, the allocator
features are now only indirectly related to feature gates, with a single
boolean controlling the implementation of binding conditions.
Copying from feature.Features to new fields in the plugin got a bit silly with
the long list of features that we have now. Embedding feature.Features is
simpler.
Two fields in feature.Features weren't named after their feature gates; now
they are named consistently and the fields are sorted.
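An illustrative sketch of the embedding (the field names and the boolean are stand-ins, not the plugin's actual declarations):

```go
package example

// Features stands in for the plugin's feature.Features struct: one boolean
// per feature gate, named exactly after the gate and sorted.
type Features struct {
	EnableDRAAdminAccess               bool
	EnableDRADeviceBindingConditions   bool
	EnableDRAResourceClaimDeviceStatus bool
	// ... remaining gates ...
}

// DynamicResources embeds Features instead of copying each flag into its own
// field, so pl.EnableDRAAdminAccess etc. keep working without the copy code.
type DynamicResources struct {
	Features

	// A single boolean, derived from the gates, now selects the allocator
	// implementation that supports binding conditions.
	useBindingConditionsAllocator bool
}
```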
TestPlugin/multi-claims-binding-conditions-all-success/PreEnqueue
flakes because the assume cache has not been synced with the initial
store. The test now waits until the registered handler used by the
assume cache has synced before proceeding.
Moving Scheduler interfaces to staging: Move the PodInfo and NodeInfo interfaces (together with related types) to the staging repo, leaving the internal implementation in kubernetes/kubernetes/pkg/scheduler.