Commit graph

5860 commits

Author SHA1 Message Date
Kubernetes Prow Robot
8de4a11252
Merge pull request #136156 from pohly/dra-upgrade-downgrade-refactor-2
DRA: upgrade/downgrade refactor, II
2026-01-16 23:31:15 +05:30
Kubernetes Prow Robot
08764697f4
Merge pull request #135381 from kannon92/mutable-pod-replacement-policy
[KEP-5440]: Add integration test for MutablePodResourcesForSuspendedJobs with Pod Replacement Policy = Failed
2026-01-16 19:29:16 +05:30
Patrick Ohly
1847d5b1a2 DRA e2e+integration: test ResourceSlice controller
The "create 100 slices" E2E sometimes flaked with timeouts (e.g. 95 out of 100
slices created). It created too much load for an E2E test.

The same test now uses ktesting as API, which makes it possible to run it as
integration test with the original 100 slices and with more moderate 10 slices
as E2E test.

(cherry picked from commit c47ad64820)
2026-01-16 08:10:37 +01:00
Kubernetes Prow Robot
0ba578f91f
Merge pull request #135393 from tosi3k/parallel-prebind
Run PreBind plugins in parallel
2026-01-15 12:39:34 +05:30
Ed Bartosh
d966d9b89d scheduler_perf: use -benchtime=1x in the test examples
Update scheduler performance test examples to use `-benchtime=1x`
instead of `-benchtime=1ns` for explicitly running each benchmark
exactly once. This makes the intent clearer and aligns the examples
with recommended Go benchmark usage.

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2026-01-14 11:07:32 +02:00
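The effect of `-benchtime=1x` described in d966d9b89d can be checked outside of `go test` with a minimal sketch. It assumes that `testing.Benchmark` honors the `test.benchtime` flag once `testing.Init` has registered it; the helper name `iterationsWithBenchtime` is made up for this illustration and is not part of scheduler_perf.

```go
package main

import (
	"flag"
	"fmt"
	"testing"
)

// iterationsWithBenchtime is a hypothetical helper. It runs a trivial
// benchmark body through testing.Benchmark with the given benchtime value
// and returns how many loop iterations were executed in total.
func iterationsWithBenchtime(value string) (int, error) {
	testing.Init() // registers the -test.* flags; has no effect if already called
	if err := flag.Set("test.benchtime", value); err != nil {
		return 0, err
	}
	iterations := 0
	testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			iterations++
		}
	})
	return iterations, nil
}

func main() {
	n, err := iterationsWithBenchtime("1x")
	if err != nil {
		fmt.Println("setting benchtime failed:", err)
		return
	}
	fmt.Println("iterations with -benchtime=1x:", n)
}
```

With `go test`, the same request is written as `-benchtime=1x` on the command line.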
Kubernetes Prow Robot
d68d48073f
Merge pull request #136112 from danwinship/network-1.36-cleanup
Drop TopologyAwareHints and ServiceTrafficDistribution feature gates
2026-01-13 07:43:36 +05:30
Karthik Bhat
8962f08815 Remove deprecated test methods 2026-01-12 16:15:04 +05:30
Antoni Zawodny
833b7205fc Run PreBind plugins in parallel if feasible 2026-01-11 14:19:18 +01:00
Patrick Ohly
e999d595b1 testing: partial revert of E2E + DRA upgrade/downgrade
Refactoring the DRA upgrade/downgrade testing such that it runs as Go test
depended on supporting ktesting in the E2E framework. That change worked during
presubmit testing, but broke some periodic jobs. Therefore the relevant commits
from https://github.com/kubernetes/kubernetes/pull/135664/commits get reverted:

c47ad64820 DRA e2e+integration: test ResourceSlice controller
047682908d ktesting: replace Begin/End with TContext.Step
de47714879 DRA upgrade/downgrade: rewrite as Go unit test
7c7b1e1018 DRA e2e: make driver deployment possible in Go unit tests
65ef31973c DRA upgrade/downgrade: split out individual test steps
47b613eded e2e framework: support creating TContext

The last one is what must have caused the problem, but the other commits depend
on it.
2026-01-11 09:55:17 +01:00
Dan Winship
f278b47ecd Drop TopologyAwareHints and ServiceTrafficDistribution feature gates 2026-01-09 12:42:34 -05:00
Kubernetes Prow Robot
407b1de3bf
Merge pull request #136076 from kannon92/fix-flake-mutable-job
[flake] wait for job suspended condition for JobMutable test cases
2026-01-09 16:39:39 +05:30
Kevin Hannon
2a9c44b329 wait for job suspended condition 2026-01-08 15:57:00 -05:00
Kubernetes Prow Robot
e551ea5ea5
Merge pull request #133678 from mortent/AllocatorPerfImprovements
DRA: Avoid unnecessary work in allocator
2026-01-09 01:19:41 +05:30
Morten Torkildsen
9562aa8ba5 DRA: Avoid unnecessary work in allocator 2026-01-08 16:52:44 +00:00
Patrick Ohly
c47ad64820 DRA e2e+integration: test ResourceSlice controller
The "create 100 slices" E2E sometimes flaked with timeouts (e.g. 95 out of 100
slices created). It created too much load for an E2E test.

The same test now uses ktesting as API, which makes it possible to run it as
integration test with the original 100 slices and with more moderate 10 slices
as E2E test.
2026-01-07 14:11:33 +01:00
Patrick Ohly
551cf6f171 ktesting: reimplement without interface
The original implementation was inspired by how context.Context is handled via
wrapping a parent context. That approach had several issues:

- It is useful to let users call methods (e.g. tCtx.ExpectNoError)
  instead of ktesting functions with a tCtx parameter, but that only
  worked if all implementations of the interface implemented that
  set of methods. This made extending those methods cumbersome (see
  the commit which added Require+Assert) and could potentially break
  implementations of the interface elsewhere, defeating part of the
  motivation for having the interface in the first place.

- It was hard to see how the different TContext wrappers cooperated
  with each other.

- Layering injection of "ERROR" and "FATAL ERROR" on top of prefixing
  with the klog header caused post-processing of a failed unit test to
  remove that line because it looked like log output. Other log output
  lines were kept because they were not indented.

- In Go <=1.25, the `go vet sprintf` check only works for functions and
  methods if they get called directly and themselves directly pass their
  parameters on to fmt.Sprint. The check does not work when calling
  methods through an interface. Support for that is coming in Go 1.26,
  but will depend on bumping the Go version also in go.mod and thus
  may not be immediately possible in Kubernetes.

- Interface documentation in
  https://pkg.go.dev/k8s.io/kubernetes@v1.34.2/test/utils/ktesting#TContext
  is a monolithic text block. Documentation for methods is more readable and allows
  referencing those methods with [] (e.g. [TC.Errorf] works, [TContext.Errorf]
  didn't).

The revised implementation is a single struct with (almost) no exported
fields. The two exceptions are the embedded context.Context and TB: embedding
context.Context avoids having to write wrappers for several functions, and
embedding TB is necessary because Helper cannot be wrapped. Like a
logr.LogSink, With* methods can make a shallow copy and then change some fields
in the cloned instance.

The former `ktesting.TContext` interface is now a type alias for
`*ktesting.TC`. This ensures that existing code using ktesting doesn't need to
be updated, and it keeps that code a bit more compact (`tCtx
ktesting.TContext` instead of `tCtx *ktesting.TContext` when not using such an
alias). Hiding that it is a pointer might also discourage accessing the
exported fields because the type looks like an interface.

Output gets fixed and improved such that:
- "FATAL ERROR" and "ERROR" are at the start of the line, followed by the klog header.
- The failure message follows in the next line.
- Continuation lines are always indented.

The set of methods exposed via TB is now a bit more complete (Attr, Chdir).

All former stand-alone With* functions are now also available as methods, which
should be used instead. The stand-alone functions will be removed.

Linting of log calls now works and found some issues.
2026-01-05 13:45:03 +01:00
Kubernetes Prow Robot
8d1296caf2
Merge pull request #135912 from pohly/scheduler-plugin-test-data-race
scheduler: plugin test DATA RACE fix
2025-12-29 14:46:35 +05:30
Kubernetes Prow Robot
ed4b5ee317
Merge pull request #134350 from macsko/add_scheduling_duration_collector
Add scheduling duration collector to scheduler_perf
2025-12-28 05:50:33 +05:30
Patrick Ohly
f758d0850b scheduler: plugin test DATA RACE fix
Reading numPreFilterCalled races with writing it in the scheduler, at least as
far as the data race detector is concerned. That the test waits for pod
scheduling is too indirect. enqueuePlugin.called has the same problem,
but hasn't triggered the race detector (yet).

We need to protect against concurrent access. The easiest way to enforce that
is via atomic.Int64. In contrast to a mutex it is impossible to use it wrong.

Shutting down the scheduler first was also tried, but didn't work out because
"teardown" does more than just stopping the scheduler, it also cancels a
context that is needed during test shutdown.
2025-12-23 19:13:53 +01:00
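The approach chosen in f758d0850b can be illustrated with a minimal sketch. Only the use of `atomic.Int64` and the counter name `numPreFilterCalled` come from the commit message; the `countingPlugin` type, its `PreFilter` method, and `runConcurrently` are hypothetical and much simpler than the scheduler test code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingPlugin is a hypothetical test plugin that counts how often one
// of its extension points was invoked.
type countingPlugin struct {
	// atomic.Int64 only exposes synchronized operations, so concurrent
	// Add and Load calls are accepted by the race detector.
	numPreFilterCalled atomic.Int64
}

// PreFilter simulates the scheduler invoking the plugin from its own goroutine.
func (p *countingPlugin) PreFilter() {
	p.numPreFilterCalled.Add(1)
}

// runConcurrently calls PreFilter n times from separate goroutines and
// returns the counter value after all of them have finished.
func runConcurrently(n int) int64 {
	p := &countingPlugin{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.PreFilter()
		}()
	}
	wg.Wait()
	return p.numPreFilterCalled.Load()
}

func main() {
	fmt.Println("PreFilter calls:", runConcurrently(100))
}
```

With a plain `int64` field the same code would need a mutex around every access, and forgetting it in a single place would reintroduce the race.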
Kubernetes Prow Robot
b9d491f56e
Merge pull request #134556 from carlory/fix-133160
lock the feature-gate VolumeAttributesClass to default (true)
2025-12-18 15:13:17 -08:00
Patrick Ohly
ad79e479c2 build: remove deprecated '// +build' tag
This has been replaced by `//go:build ...` for a long time now.

Removal of the old build tag was automated with:

    for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
2025-12-18 12:16:21 +01:00
carlory
f8e8e55f1d
locked the feature-gate VolumeAttributesClass to default (true) and switch storage version from v1beta1 to v1
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-12-18 15:59:33 +08:00
Kubernetes Prow Robot
d9c281159a
Merge pull request #135494 from Argh4k/readme-fix
Fix example with profiling in README
2025-12-17 22:36:21 -08:00
Kubernetes Prow Robot
43cfcac7cc
Merge pull request #135434 from yliaog/quota_abuse
Fixes the loophole that allows users to work around the resource quota set by the system admin
2025-12-17 22:35:28 -08:00
Kubernetes Prow Robot
a2a97119bb
Merge pull request #135361 from Karthik-K-N/cel-test-imporvements
CEL test improvements to use test context across tests instead of generic context
2025-12-17 21:41:45 -08:00
Kubernetes Prow Robot
fefd7ddc37
Merge pull request #135348 from brejman/issue-134393-perf
Add perf test for scheduling pods matching existing pods antiaffinity
2025-12-17 21:41:29 -08:00
Kubernetes Prow Robot
285eb9fdba
Merge pull request #135325 from brejman/issue-134393
Fix queue hint for inter-pod anti-affinity
2025-12-17 20:01:02 -08:00
Kubernetes Prow Robot
f9761d1319
Merge pull request #135301 from bwsalmon/bsalmon-batch-after
Fix a bug in scheduler_perf integration test
2025-12-17 20:00:39 -08:00
yliao
3e34de29c4 Fix the loophole that allows users to get around the resource quota set by the system admin 2025-12-18 00:56:20 +00:00
Bartosz
49035d1404
Add perf test for scheduling pods matching existing pods antiaffinity 2025-12-16 13:02:11 +00:00
Bartosz
d6d8639349
Fix queue hint for interpod antiaffinity 2025-12-16 13:01:15 +00:00
Maciej Skoczeń
bfc44a42d5 Allow to change scheduler_perf threshold data bucket 2025-12-15 14:39:56 +00:00
Antonio Ojea
51f614a156 ipallocator: handle errors correctly
The ipallocator blindly assumed that all errors are retryable, which
caused the allocator to try to exhaust all possibilities for
allocating an IP address.

If the error is not retryable, the allocator generates as
many API calls as there are available IPs in the allocator, causing
CPU exhaustion since these requests come from inside the apiserver.

In addition to handling the error correctly, this patch also interprets the
error to return the right status code depending on the error type.

Co-authored-by: carlory <baofa.fan@daocloud.io>
2025-12-03 10:39:57 +00:00
Maciej Skoczeń
e22a30a13e Add scheduling duration collector to scheduler_perf 2025-12-02 14:48:22 +00:00
Maciej Wyrzuc
9a8c2a4001 Fix example with profiling in README 2025-12-01 10:44:15 +00:00
Morten Torkildsen
c33c0464db DRA: Fix flaky integration test 2025-11-25 18:13:00 +00:00
Kevin Hannon
ba7637c194 Add integration test for MutablePodResourcesForSuspendedJobs with PodReplacementPolicy=Failed
This test verifies that when a job with PodReplacementPolicy=Failed is
suspended and pods are terminating, resource updates can be made to the
job but new pods are only created after terminating pods are removed.
The new pods should have the updated resources.
2025-11-20 22:43:53 -05:00
Kubernetes Prow Robot
5bcb759973
Merge pull request #135304 from macsko/fix_failing_sched_perf_tests_on_featuregates
Fix failing scheduler_perf test cases that don't set any feature gate
2025-11-20 10:26:40 -08:00
Kubernetes Prow Robot
0f093c9f49
Merge pull request #134921 from Karthik-K-N/cel-test
Improve CEL Policy Admission test
2025-11-20 10:26:32 -08:00
Karthik Bhat
0ffcac7fc7 Use test context instead of generic context 2025-11-20 12:06:03 +05:30
Karthik Bhat
3e19cc5160 Address review comments 2025-11-19 21:07:28 +05:30
Maciej Skoczeń
04eb121d32 Fix failing scheduler_perf test cases that don't set any feature gate 2025-11-19 10:48:51 +00:00
bwsalmon
a48b189025 Fix a bug in scheduler_perf. 2025-11-14 04:03:03 +00:00
bwsalmon
854e67bb51
KEP 5598: Opportunistic Batching (#135231)
* First version of batching w/out signatures.

* First version of pod signatures.

* Integrate batching with signatures.

* Fix merge conflicts.

* Fixes from self-review.

* Test fixes.

* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.

* Re-enable batching on perf tests for now.

* fwk.NewStatus(fwk.Success)

* Review feedback.

* Review feedback.

* Comment fix.

* Two plugin-specific unit tests.

* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.

* Review feedback.

* Switch to distinct stats for hint and store calls.

* Switch signature from string to []byte

* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.

* hack/update-vendor.sh

* Disable signatures when extenders are configured.

* Update pkg/scheduler/framework/runtime/batch.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update staging/src/k8s.io/kube-scheduler/framework/interface.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Review feedback.

* Disable node resource signatures when extended DRA enabled.

* Review feedback.

* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/interface.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/runtime/batch.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Review feedback.

* Fixes for review suggestions.

* Add integration tests.

* Linter fixes, test fix.

* Whitespace fix.

* Remove broken test.

* Unschedulable test.

* Remove go.mod changes.

---------

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
2025-11-12 21:51:37 -08:00
Heba
aceb89debc
KEP-5471: Extend tolerations operators (#134665)
* Add numeric operations to tolerations

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

* code review feedback

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

* add default feature gate

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

* Add integration tests

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

* Add toleration value validation

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

* Add validate options for new operators

Signed-off-by: helayoty <heelayot@microsoft.com>

* Remove log

Signed-off-by: helayoty <heelayot@microsoft.com>

* Update feature gate check

Signed-off-by: helayoty <heelayot@microsoft.com>

* Remove IsValidNumericString func

Signed-off-by: helayoty <heelayot@microsoft.com>

* Implement IsDecimalInteger

Signed-off-by: helayoty <heelayot@microsoft.com>

* code review feedback

Signed-off-by: helayoty <heelayot@microsoft.com>

* Add logs to v1/toleration

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
Signed-off-by: helayoty <heelayot@microsoft.com>

* Update integration tests and address code review feedback

Signed-off-by: helayoty <heelayot@microsoft.com>

* Add feature gate to the scheduler framework

Signed-off-by: helayoty <heelayot@microsoft.com>

* Remove extra test

Signed-off-by: helayoty <heelayot@microsoft.com>

* Fix integration test

Signed-off-by: helayoty <heelayot@microsoft.com>

* pass feature gate via TolerationsTolerateTaint

Signed-off-by: helayoty <heelayot@microsoft.com>

---------

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
Signed-off-by: helayoty <heelayot@microsoft.com>
2025-11-10 12:42:54 -08:00
Kubernetes Prow Robot
183892b2c9
Merge pull request #134870 from pmengelbert/pmengelbert/kuberc/4
Add client-go credential plugin to kuberc
2025-11-09 17:26:53 -08:00
Peter Engelbert
fab280950d
Add client-go credential plugin to kuberc
Remove reference to internal types in kuberc types

* Remove unserialized types from public APIs

Also remove defaulting

* Don't do conversion gen for plugin policy types

Because the plugin policy types are explicitly allowed to be empty, they
should not affect conversion. The autogenerated conversion functions for
the `Preference` type will leave those fields empty.

* Remove defaulting tests

Comments and simplifications (h/t jordan liggitt)

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
2025-11-09 14:24:53 -05:00
Kubernetes Prow Robot
c3aee79946
Merge pull request #134942 from ttsuuubasa/dra-bc-integration-tests
scheduler: KEP-5007 add integration tests on DeviceBindingConditions
2025-11-07 07:40:53 -08:00
Kubernetes Prow Robot
55ac11aad0
Merge pull request #135149 from ania-borowiec/nnn_test
KEP-5278 Add integration tests for setting and clearing NominatedNodeName
2025-11-07 06:06:53 -08:00
Tsubasa Watanabe
1225ce509e scheduler: KEP-5007 add integration tests
- Rescheduling on binding failure using taints or device removal
- Triggering binding timeout when BindingConditions remain unmet
- Recovery to a device without BindingConditions after binding timeout

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
2025-11-07 21:15:13 +09:00