In kubelet admission:
- Remove extended resources from the pod's requirements if they are either
backed by DRA or not present in the node's allocatable resources
In scheduler (fit.go):
- Remove fallback logic that delegated all resources to DRA when
draManager is nil
These changes ensure that:
- DRA-backed extended resources are properly handled during pod admission
- DevicePlugin-backed extended resources still follow standard admission rules
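A minimal sketch of that admission-time filtering, assuming hypothetical
helper names and that the input holds only the pod's extended resources (the
real kubelet code is structured differently):

package admission

import (
    v1 "k8s.io/api/core/v1"
)

// filterExtendedResources drops an extended resource from the pod's
// requirements when it is DRA-backed (the DRA plugin handles it) or when
// the node does not advertise it in allocatable, so standard admission only
// checks device-plugin-backed resources. draBacked is a hypothetical
// predicate standing in for the real DRA lookup.
func filterExtendedResources(requests, allocatable v1.ResourceList, draBacked func(v1.ResourceName) bool) v1.ResourceList {
    filtered := v1.ResourceList{}
    for name, quantity := range requests {
        if draBacked(name) {
            continue // handled by the DRA scheduler plugin, not admission
        }
        if _, ok := allocatable[name]; !ok {
            continue // not advertised by the node
        }
        filtered[name] = quantity
    }
    return filtered
}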
Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain timestamps and then having that test case run within a certain
real-world time window. It's surprising that this didn't flake more often,
because test execution order is random. Now the timestamp gets set right
before the test case runs. Conversion to synctest would be nicer, but
synctests cannot have sub-tests, which are used here to track where log
output and failures come from within the larger test case.
Inside the plugin itself, some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or error during Filter.
The tests assumed that instantiating a DRAManager followed by
informerFactory.WaitForCacheSync would be enough to have the manager
up-to-date, but that's not correct: the test only waits for informer *caches*
to be synced, but syncing *event handlers* like the one in the manager may
still be going on. The flake rate is low, though:
$ GOPATH/bin/stress -p 256 ./noderesources.test
5s: 0 runs so far, 0 failures, 256 active
10s: 256 runs so far, 0 failures, 256 active
15s: 256 runs so far, 0 failures, 256 active
20s: 512 runs so far, 0 failures, 256 active
25s: 567 runs so far, 0 failures, 256 active
30s: 771 runs so far, 0 failures, 256 active
/tmp/go-stress-20251226T181044-974980161
--- FAIL: TestCalculateResourceAllocatableRequest (0.81s)
--- FAIL: TestCalculateResourceAllocatableRequest/DRA-backed-resource-with-shared-device-allocation (0.00s)
extendedresourcecache.go:197: I1226 18:11:14.431337] Updated extended resource cache for explicit mapping extendedResource="extended.resource.dra.io/something" deviceClass="device-class-name"
extendedresourcecache.go:204: I1226 18:11:14.431380] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/device-class-name" deviceClass="device-class-name"
extendedresourcecache.go:220: I1226 18:11:14.431394] Updated device class mapping deviceClass="device-class-name" extendedResource="extended.resource.dra.io/something"
resource_allocation_test.go:595: Expected requested=2, but got requested=1
FAIL
It becomes higher when changing WaitForCacheSync such that it doesn't poll and
therefore returns more promptly, which is where this flake was first observed.
The fix is to run the test in a synctest bubble where Wait can be used to wait
for all background activity, including event handling, to finish before
proceeding with the test.
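A minimal sketch of that pattern with Go's testing/synctest:

package example

import (
    "testing"
    "testing/synctest"
)

func TestManagerUpToDate(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        // Hypothetical setup: start informers and the DRAManager here; all
        // goroutines they spawn become part of the bubble.

        // Wait returns once every other goroutine in the bubble is durably
        // blocked, which covers event handler delivery, not just the
        // informer caches.
        synctest.Wait()

        // Now it is safe to assert on the manager's state.
    })
}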
synctest is less forgiving about lingering goroutines. A synctest bubble must
wait for goroutines to stop, which in this case means that there has to be
a way to wait for the metric recorder to shut down. Event handlers have to be
removed.
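A sketch of that cleanup, assuming a client-go SharedInformer:
AddEventHandler returns a registration handle which RemoveEventHandler
accepts later, so no handler goroutine lingers when the bubble waits for
everything to stop:

package example

import (
    "testing"

    "k8s.io/client-go/tools/cache"
)

func registerHandler(t *testing.T, informer cache.SharedInformer) {
    handle, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) { /* react to adds */ },
    })
    if err != nil {
        t.Fatalf("unexpected error adding event handler: %v", err)
    }
    t.Cleanup(func() {
        if err := informer.RemoveEventHandler(handle); err != nil {
            t.Errorf("unexpected error removing event handler: %v", err)
        }
    })
}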
This could be done with plain Go, but here test/utils/ktesting is used instead
because it offers some advantages:
- less boilerplate code
- automatic cancellation of the context (i.e. less manual context.WithCancel)
- tCtx.SyncTest is a direct substitute for t.Run, which avoids re-indenting
sub-tests. synctest itself needs another anonymous function, which makes
the lines too long and forces re-indentation (the flat ktesting form is
sketched right after this list):
t.Run(... func(...) {
    synctest.Test(... func() {
    })
})
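For comparison, the flat ktesting form looks roughly like this
(ktesting.Init is real; the exact SyncTest signature shown here is an
assumption based on the description above):

package example

import (
    "testing"

    "k8s.io/kubernetes/test/utils/ktesting"
)

func TestWithBubble(t *testing.T) {
    tCtx := ktesting.Init(t)
    // SyncTest substitutes for t.Run and wraps the sub-test in a synctest
    // bubble; the body stays at the same indentation level as with t.Run.
    tCtx.SyncTest("some-sub-test", func(tCtx ktesting.TContext) {
        // The context in tCtx gets canceled automatically at the end.
    })
}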
For the sake of consistency all tests get updated.
While at it, some code gets improved:
- t.Fatal(err) is not a good way to report an error because
there is no additional markup in the test output that indicates
that there was an unexpected error. It just logs err.Error(),
which might not be very informative and/or obvious.
- newTestDRAManager aborts in case of a failure instead of
returning an error.
The old `// +build` tags have been replaced by `//go:build ...` for a long time now.
Removal of the old build tag was automated with:
for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
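For reference, a file carrying both forms (gofmt has kept them in sync since
Go 1.17):

//go:build linux && amd64
// +build linux,amd64

package example

After the removal, only the //go:build line remains:

//go:build linux && amd64

package example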
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602) it is necessary
to have information about pools, devices, and in-flight claims. Log calls get
extended and the config for the DRA CI jobs gets updated to enable higher
verbosity for the relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
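A sketch of the kind of structured, verbosity-gated log call involved; the
function name and keys here are illustrative, the real calls live in the
allocator:

package example

import (
    "k8s.io/klog/v2"
)

func logPoolInfo(logger klog.Logger, pod string, devices []string) {
    // Only emitted when verbosity for this file is >= 6, e.g. raised per
    // file via klog's -vmodule flag in the CI job configuration.
    logger.V(6).Info("Gathered pool information",
        "pod", pod,
        "devices", devices,
    )
}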
Add a counter metric to track pods that schedule immediately after
being flushed from unschedulablePods due to timeout. It uses a boolean
flag that is cleared when a pod returns to the queue or moves via events.
This metric tracks pods that successfully schedule after being
flushed from unschedulablePods due to timeout. High values may
indicate missing queue hint optimizations or event handling issues.
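A sketch of that bookkeeping; all names are hypothetical and a plain int
stands in for the real counter metric:

package example

type queuedPodInfo struct {
    // flushedByTimeout is set when the pod leaves unschedulablePods
    // because of the periodic flush rather than a cluster event.
    flushedByTimeout bool
}

var schedulingAfterTimeoutFlush int

func onTimeoutFlush(p *queuedPodInfo)  { p.flushedByTimeout = true }
func onMoveByEvent(p *queuedPodInfo)   { p.flushedByTimeout = false }
func onReturnToQueue(p *queuedPodInfo) { p.flushedByTimeout = false }

func onSuccessfulSchedule(p *queuedPodInfo) {
    if p.flushedByTimeout {
        // Persistently high values hint at missing queue hints or event
        // handling issues.
        schedulingAfterTimeoutFlush++
    }
}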
The `removeSlice` function was leaving behind references to the
removed element, preventing it from being garbage-collected.
This commit ensures that removed entries are fully cleared,
eliminating the memory leak.
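A minimal generic sketch of the fix (the real removeSlice operates on
concrete types): after shifting elements down, the vacated tail slot must be
zeroed so the backing array no longer pins the removed element in memory.

package example

func removeAt[T any](s []*T, i int) []*T {
    copy(s[i:], s[i+1:])
    s[len(s)-1] = nil // clear the dangling reference so the GC can reclaim it
    return s[:len(s)-1]
}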
Co-authored-by: ravisastryk <ravisastryk@gmail.com>
Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin-specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
The test started without waiting for the ResourceSlice informer to have
synced. As a result, the "CEL-runtime-error-for-one-of-three-nodes" test case
failed randomly with a very low flake rate (less than 1% in local runs): the
slices were not available yet, so the CEL expressions never got evaluated.
Other tests were also less reliable, but not known to fail.
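A sketch of the missing synchronization, assuming a standard client-go
informer factory: the test has to block until all caches, including the
ResourceSlice one, have synced before exercising CEL evaluation.

package example

import (
    "context"

    "k8s.io/client-go/informers"
)

func startAndSync(ctx context.Context, informerFactory informers.SharedInformerFactory) {
    informerFactory.Start(ctx.Done())
    informerFactory.WaitForCacheSync(ctx.Done())
}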
Extend the Fit and BalancedAllocation PreScore state with the
allocated state, the list of ResourceSlices, and the device class
mapping. Gather these once during PreScore and pass them through
the scoring path instead of re-fetching for every scoring call.
This should speed up scoring of DRA extended resources, lowering
scheduling overhead.
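A sketch of the underlying pattern (package paths and names vary between
Kubernetes versions; the field here is an illustrative stand-in for the
allocated state, ResourceSlice list, and device class mapping):

package example

import (
    "fmt"

    "k8s.io/kubernetes/pkg/scheduler/framework"
)

const preScoreStateKey = "ExamplePreScoreState" // hypothetical key

type preScoreState struct {
    deviceClassMapping map[string]string
}

func (s *preScoreState) Clone() framework.StateData { return s }

// writePreScoreState gathers the expensive data once during PreScore.
func writePreScoreState(state *framework.CycleState) {
    state.Write(preScoreStateKey, &preScoreState{
        deviceClassMapping: map[string]string{},
    })
}

// readPreScoreState retrieves it in each Score call instead of re-fetching.
func readPreScoreState(state *framework.CycleState) (*preScoreState, error) {
    data, err := state.Read(preScoreStateKey)
    if err != nil {
        return nil, err
    }
    s, ok := data.(*preScoreState)
    if !ok {
        return nil, fmt.Errorf("unexpected state type %T", data)
    }
    return s, nil
}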
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Maciej Skoczeń <mskoczen@google.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>