69 commits

Author SHA1 Message Date
Anson Qian
a816a7b1d8
Make ConcurrentResourceClaimSyncs configurable (#134701)
* DRA resource claim controller: configurable number of workers

It might never be necessary to change the default, but it is hard to be sure.
It's better to have the option, just in case.

* generate files

* resourceclaimcontroller: normalize validation error message

* Update cmd/kube-controller-manager/app/options/resourceclaimcontroller.go

Co-authored-by: Jordan Liggitt <jordan@liggitt.net>

---------

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
2026-01-08 19:31:39 +05:30
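A minimal sketch of what a configurable worker count (commit a816a7b1d8) can look like, assuming a plain channel in place of the controller's real work queue. The names `validateWorkers` and `runWorkers` and the error text are hypothetical and are not taken from the upstream code.

```go
package main

import (
	"fmt"
	"sync"
)

// validateWorkers rejects worker counts that would leave the queue undrained.
// Hypothetical helper: the upstream option validation may be worded differently.
func validateWorkers(n int) error {
	if n < 1 {
		return fmt.Errorf("number of concurrent syncs must be greater than 0, got %d", n)
	}
	return nil
}

// runWorkers starts n goroutines that drain the given keys and returns the
// number of keys processed once every worker has finished.
func runWorkers(n int, keys []string) (int, error) {
	if err := validateWorkers(n); err != nil {
		return 0, err
	}
	queue := make(chan string)
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		processed int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // a real worker would sync the claim for this key
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	for _, k := range keys {
		queue <- k
	}
	close(queue)
	wg.Wait()
	return processed, nil
}

func main() {
	n, err := runWorkers(4, []string{"ns/claim-a", "ns/claim-b", "ns/claim-c"})
	fmt.Println(n, err)
}
```

Raising the count only helps when individual syncs block on API calls, which is presumably why the default rarely needs changing.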
Goend
13e46ffc45 Fix the issue of slow creation of ResourceClaim in specific scenarios 2026-01-06 19:18:58 +08:00
Ayato Tokubi
320987ead3 Addressed comments 2025-11-05 10:44:50 +00:00
Ayato Tokubi
5102591a6b Refactor resource claim metrics to use structured labels and add "source" dimension.
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2025-11-05 09:52:47 +00:00
Kubernetes Prow Robot
41673c7198
Merge pull request #134910 from tchap/kcm-controllers-thread-mgmt
pkg/controller: Improve goroutine management
2025-11-03 17:58:03 -08:00
yliao
4f647b3f3d removed BlockOwnerDeletion 2025-10-29 22:41:10 +00:00
Ondra Kupka
63c15cbe83 controller/resourceclaim: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:07:10 +01:00
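The property named in commit 63c15cbe83 (no goroutine outlives Run) can be sketched with a sync.WaitGroup. This is an illustration under assumptions: the `controller` type, its `running` counter, and `leakedAfterRun` are hypothetical and only exist to make the guarantee observable.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// controller is a hypothetical stand-in for a real controller type.
type controller struct {
	running atomic.Int32 // number of worker goroutines currently alive
}

// Run starts the workers and blocks until ctx is cancelled and every
// goroutine it started has returned.
func (c *controller) Run(ctx context.Context, workers int) {
	var wg sync.WaitGroup
	defer wg.Wait() // do not return while any worker is still running

	for i := 0; i < workers; i++ {
		wg.Add(1)
		c.running.Add(1)
		go func() {
			defer wg.Done()
			defer c.running.Add(-1) // runs before wg.Done
			<-ctx.Done()            // a real worker would process queue items here
		}()
	}
	<-ctx.Done()
}

// leakedAfterRun runs the controller briefly and reports how many worker
// goroutines are still alive once Run has returned.
func leakedAfterRun(workers int) int32 {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	c := &controller{}
	c.Run(ctx, workers)
	return c.running.Load()
}

func main() {
	fmt.Println("goroutines alive after Run:", leakedAfterRun(3))
}
```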
Mayank Agrawal
5e216ae34d Replace HandleCrash and HandleError calls to use context-aware alternatives 2025-10-07 22:40:10 -07:00
Alay Patel
8a03067211 fix resource claims deallocation for extended resource when pod is completed
Signed-off-by: Alay Patel <alayp@nvidia.com>
2025-09-29 15:15:40 -04:00
Kubernetes Prow Robot
d7bd2b0343
Merge pull request #134030 from richabanker/update-metrics-docs
Update metrics docs list for v1.34
2025-09-18 08:04:15 -07:00
Aditi Gupta
af231d2153 Replace WaitForNamedCacheSync with WaitForNamedCacheSyncWithContext in pkg/controller/ 2025-09-16 14:51:34 -07:00
Richa Banker
c51a8734b1 Update documented metrics list 2025-09-16 11:52:14 -07:00
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled without v1 won't work.
2025-07-24 08:33:45 +02:00
Rita Zhang
d42a1d58d0
DRAAdminAccess: add metrics
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2025-07-18 07:15:41 -07:00
Jon Huhn
f1845218e2 fixup! DRA: fix deleting orphaned ResourceClaim on startup 2025-06-26 23:21:18 -05:00
Jon Huhn
ef117edf35 DRA: fix deleting orphaned ResourceClaim on startup 2025-06-25 11:11:43 -05:00
Rita Zhang
0301e5a9f8
DRA: AdminAccess validate based on namespace label
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2025-03-18 22:56:54 -07:00
Morten Torkildsen
36d8a44b9c DRA: Update controller for Prioritized Alternatives in Device Requests 2025-02-28 19:32:59 +00:00
Patrick Ohly
4638ba9716 client-go/tools/cache: add APIs with context parameter
The context is used for cancellation and to support contextual logging.

In most cases, alternative *WithContext APIs get added, except for
NewIntegerResourceVersionMutationCache where code searches indicate that the
API is not used downstream.

An API break around SharedInformer couldn't be avoided because the
alternative (keeping the interface unchanged and adding a second one with
the new method) would have been worse. controller-runtime needs to be updated
because it implements that interface in a test package. Downstream consumers of
controller-runtime will work unless they use that test package.

Converting Kubernetes to use the other new alternatives will follow. In the
meantime, usage of the new alternatives cannot be enforced via logcheck
yet (see https://github.com/kubernetes/kubernetes/issues/126379 for the
process).

Passing context through and checking it for cancellation is tricky for event
handlers. A better approach is to map the context cancellation to the normal
removal of an event handler via a helper goroutine. Thanks to the new
HandleErrorWithLogr and HandleCrashWithLogr, remembering the logger is
sufficient for handling problems at runtime.
2024-12-18 18:45:02 +01:00
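The approach described at the end of commit 4638ba9716 (mapping context cancellation to normal handler removal through a helper goroutine) can be sketched as follows. The `registry` type is a hypothetical stand-in for an informer's handler list and is not the client-go API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for an informer's list of event handlers.
type registry struct {
	mu       sync.Mutex
	nextID   int
	handlers map[int]func(string)
}

func newRegistry() *registry {
	return &registry{handlers: make(map[int]func(string))}
}

func (r *registry) add(h func(string)) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	id := r.nextID
	r.nextID++
	r.handlers[id] = h
	return id
}

func (r *registry) remove(id int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.handlers, id)
}

func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.handlers)
}

// addWithContext registers h and removes it again once ctx is cancelled.
// The handler itself never has to check the context. The returned channel is
// closed after the removal so that callers can wait for it.
func (r *registry) addWithContext(ctx context.Context, h func(string)) <-chan struct{} {
	id := r.add(h)
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		r.remove(id)
	}()
	return done
}

// handlersAfterCancel returns the handler count before and after cancellation.
func handlersAfterCancel() (before, after int) {
	r := newRegistry()
	ctx, cancel := context.WithCancel(context.Background())
	done := r.addWithContext(ctx, func(string) {})
	before = r.count()
	cancel()
	<-done
	return before, r.count()
}

func main() {
	fmt.Println(handlersAfterCancel())
}
```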
Patrick Ohly
33ea278c51 DRA: use v1beta1 API
No code is left which depends on the v1alpha3, except of course the code
implementing that version.
2024-11-06 13:03:19 +01:00
Davanum Srinivas
2b0592ee77
Use k8s.io/utils/lru instead of github.com/golang/groupcache/lru
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-11-04 10:51:13 -05:00
Kubernetes Prow Robot
daef8c2419
Merge pull request #127266 from pohly/dra-admin-access-in-status
DRA API: AdminAccess in DeviceRequestAllocationResult + DRAAdminAccess feature gate
2024-10-30 03:41:25 +00:00
Kubernetes Prow Robot
c5ccf59974
Merge pull request #128379 from pohly/dra-owners-wg-label
DRA: add wg/device-management label automatically
2024-10-29 15:24:57 +00:00
Patrick Ohly
4419568259 DRA: treat AdminAccess as a new feature gated field
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.

There is one (entirely theoretical!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretical because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
2024-10-29 10:22:31 +01:00
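The "unset = off" rule from commit 4419568259 amounts to treating a nil pointer like false. A minimal sketch, assuming a simplified `allocationResult` type with only the one field (the real API type has more):

```go
package main

import "fmt"

// allocationResult is a simplified, hypothetical stand-in for the allocation
// result in a claim's status.
type allocationResult struct {
	AdminAccess *bool // nil when the field was never set
}

// adminAccessGranted applies "unset = off": only an explicit true grants access.
func adminAccessGranted(r allocationResult) bool {
	return r.AdminAccess != nil && *r.AdminAccess
}

func main() {
	on := true
	fmt.Println(adminAccessGranted(allocationResult{}))                 // field unset
	fmt.Println(adminAccessGranted(allocationResult{AdminAccess: &on})) // field set to true
}
```

A claim allocated before the field existed falls into the first case, which is the upgrade problem the commit message describes.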
Patrick Ohly
9a7e4ccab2 DRA admin access: add feature gate
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
  field gets cleared. Same in the status. In both cases the scenario
  that it was already set and a claim or claim template get updated
  is special: in those cases, the field is not cleared.

  Also, allocating a claim with admin access is allowed regardless of the
  feature gate and the field is not cleared. In practice, the scheduler
  will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
  with the field set gets rejected. This prevents running workloads
  which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
  allocated. The effect is the same.

The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
2024-10-29 09:50:11 +01:00
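The controller-side effect listed in commit 9a7e4ccab2 (rejecting ResourceClaim creation instead of silently ignoring the field) can be sketched like this. The `deviceRequest` type, `checkAdminAccess`, and the error text are hypothetical, and the sketch assumes that "set" means set to true.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceRequest is a simplified, hypothetical stand-in for one entry of
// spec.devices.requests.
type deviceRequest struct {
	Name        string
	AdminAccess *bool
}

var errAdminAccessDisabled = errors.New("admin access is requested, but the feature is disabled")

// checkAdminAccess refuses to proceed when the gate is off and any request
// asks for admin access, so the workload never starts with reduced privileges.
func checkAdminAccess(gateEnabled bool, requests []deviceRequest) error {
	if gateEnabled {
		return nil
	}
	for _, r := range requests {
		if r.AdminAccess != nil && *r.AdminAccess {
			return fmt.Errorf("request %q: %w", r.Name, errAdminAccessDisabled)
		}
	}
	return nil
}

func main() {
	on := true
	reqs := []deviceRequest{{Name: "monitor", AdminAccess: &on}}
	fmt.Println(checkAdminAccess(false, reqs))
	fmt.Println(checkAdminAccess(true, reqs))
}
```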
Patrick Ohly
9d1b0654e0 DRA: add wg/device-management label automatically
This makes PRs show up automatically in the WG's project
board (https://github.com/orgs/kubernetes/projects/95/views/1).
2024-10-28 16:36:04 +01:00
Patrick Ohly
c2524cbf9b DRA resourceclaims: maintain metric of total and allocated claims
These metrics can provide insights into ResourceClaim usage. The total count is
redundant because the apiserver also provides count of resources, but having it
in the same sub-system next to the count of allocated claims might be more
discoverable and helps monitor the controller itself.
2024-10-18 09:13:42 +02:00
Kubernetes Prow Robot
b1b4e5d397
Merge pull request #128003 from pohly/dra-classic-dra-removal
DRA: remove "classic DRA"
2024-10-18 00:55:17 +01:00
Patrick Ohly
d572df2493 DRA resource claim controller: improve log messages
Some code paths didn't log anything. One log message about "claim got deleted"
was incorrect.
2024-10-17 18:28:55 +02:00
Patrick Ohly
f84eb5ecf8 DRA: remove "classic DRA"
This removes the DRAControlPlaneController feature gate, the fields controlled
by it (claim.spec.controller, claim.status.deallocationRequested,
claim.status.allocation.controller, class.spec.suitableNodes), the
PodSchedulingContext type, and all code related to the feature.

The feature gets removed because there is no path towards beta and GA and DRA
with "structured parameters" should be able to replace it.
2024-10-16 23:09:50 +02:00
Kevin Hannon
03da672159 remove 1.27 deterministic support for resource claims 2024-09-18 08:25:06 -04:00
Patrick Ohly
0fc78b9bcc DRA resource claim controller: update test
The resource claim controller is completely agnostic to the claim spec. It
doesn't care about classes or devices, therefore it needs no changes in 1.31
besides the v1alpha2 -> v1alpha3 renaming from a previous commit.
2024-07-22 18:09:34 +02:00
Patrick Ohly
8a629b9f15 DRA: remove "sharable" from claim allocation result
Now all claims are shareable up to the limit imposed by the size of the
"reservedFor" array.

This is one of the agreed simplifications for 1.31.
2024-07-21 17:28:14 +02:00
Patrick Ohly
de5742ae83 DRA: remove immediate allocation
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
2024-07-21 17:28:14 +02:00
Patrick Ohly
b51d68bb87 DRA: bump API v1alpha2 -> v1alpha3
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.

Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.

Only source code where the version really matters (like API registration)
retains the versioned import.
2024-07-21 17:28:13 +02:00
Kubernetes Prow Robot
ac9aec9f9b
Merge pull request #125116 from pohly/dra-one-of-source
DRA: remove "source" indirection from v1 Pod API
2024-06-28 12:46:45 -07:00
Patrick Ohly
bde9b64cdf DRA: remove "source" indirection from v1 Pod API
This makes the API nicer:

    resourceClaims:
    - name: with-template
      resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      resourceClaimName: test-shared-claim

Previously, this was:

    resourceClaims:
    - name: with-template
      source:
        resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      source:
        resourceClaimName: test-shared-claim

A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.

This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
2024-06-27 17:53:24 +02:00
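In Go terms, the flattened shape from commit bde9b64cdf puts both optional names directly on the pod's claim entry. The sketch below uses a simplified local type, `podResourceClaim`, rather than importing the real API package, and only shows how the new layout serializes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podResourceClaim is a simplified local copy of the flattened entry: the two
// alternatives sit next to Name instead of under a nested "source" struct.
type podResourceClaim struct {
	Name                      string  `json:"name"`
	ResourceClaimName         *string `json:"resourceClaimName,omitempty"`
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}

// encode returns the JSON form of one entry, or an empty string on error.
func encode(c podResourceClaim) string {
	b, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	tmpl := "test-inline-claim-template"
	claim := "test-shared-claim"
	fmt.Println(encode(podResourceClaim{Name: "with-template", ResourceClaimTemplateName: &tmpl}))
	fmt.Println(encode(podResourceClaim{Name: "with-claim", ResourceClaimName: &claim}))
}
```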
Kubernetes Prow Robot
92e0db2bbf
Merge pull request #125640 from googs1025/resourceclaim_controller_log_fix1
added resourceclaim_controller log info
2024-06-27 03:20:10 -07:00
googs1025
5f8fb17652 added resourceclaim_controller log info
Signed-off-by: googs1025 <googs1025@gmail.com>
2024-06-26 18:38:11 +08:00
Patrick Ohly
2da9e660e3 resourceclaim controller: add missing log output
The logging was fairly complete about *not* doing something, but the actual
ResourceClaim creation was not logged.
2024-06-25 16:12:31 +02:00
liyuerich
8e97c0ff7d drop deprecated pointer package in controller
Signed-off-by: liyuerich <yue.li@daocloud.io>

Update job_controller.go

Signed-off-by: liyuerich <yue.li@daocloud.io>
2024-05-09 11:34:25 +08:00
Kubernetes Prow Robot
1dc30bf90f
Merge pull request #124600 from alvaroaleman/typed-wq
Use the generic/typed workqueue throughout
2024-05-06 16:18:31 -07:00
carlory
76aa289608 bugfix: resourceclaim forgot to wait for podSchedulingSynced and templatesSynced 2024-05-06 16:56:16 +08:00
Alvaro Aleman
6d0ac8c561 Use the generic/typed workqueue throughout
This change makes us use the generic workqueue throughout the project in
order to improve type safety and readability of the code.
2024-05-04 14:33:12 -04:00
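The type-safety gain from commit 6d0ac8c561 can be illustrated with a small generic queue. This is a standalone sketch, not the client-go workqueue: `typedQueue` is a hypothetical type that only mimics deduplication of pending items.

```go
package main

import (
	"fmt"
	"sync"
)

// typedQueue is a hypothetical, minimal FIFO queue that deduplicates pending
// items. The type parameter removes the need for type assertions on Get.
type typedQueue[T comparable] struct {
	mu     sync.Mutex
	items  []T
	queued map[T]struct{}
}

func newTypedQueue[T comparable]() *typedQueue[T] {
	return &typedQueue[T]{queued: make(map[T]struct{})}
}

// Add enqueues item unless it is already waiting in the queue.
func (q *typedQueue[T]) Add(item T) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.queued[item]; ok {
		return
	}
	q.queued[item] = struct{}{}
	q.items = append(q.items, item)
}

// Get removes and returns the oldest item; ok is false when the queue is empty.
func (q *typedQueue[T]) Get() (item T, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return item, false
	}
	item = q.items[0]
	q.items = q.items[1:]
	delete(q.queued, item)
	return item, true
}

// Len returns the number of pending items.
func (q *typedQueue[T]) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

func main() {
	q := newTypedQueue[string]()
	q.Add("ns/claim-a")
	q.Add("ns/claim-a") // duplicate, ignored while still pending
	q.Add("ns/claim-b")
	key, _ := q.Get() // key is a string, no cast from an untyped value needed
	fmt.Println(key, q.Len())
}
```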
Kubernetes Prow Robot
eb2a59e8d8
Merge pull request #124214 from Monokaix/dev
fix wrong comments of dra
2024-04-18 03:24:28 -07:00
Xuzheng Chang
3e08030d53 fix wrong comments of dra
Signed-off-by: Xuzheng Chang <changxuzheng@huawei.com>
2024-04-09 09:41:25 +08:00
Patrick Ohly
4126e37f08 dra controller: unit tests 2024-03-22 10:03:22 +01:00
Patrick Ohly
3de376ecf6 dra controller: support structured parameters
When allocation was done by the scheduler, the controller needs to do the
deallocation because there is no control-plane controller which could react to
"DeallocationRequested".
2024-03-07 22:22:13 +01:00
Mengjiao Liu
b584b87a94 kube-controller-manager: readjust log verbosity
- Increase the global level for the broadcaster's logging to 3 so that users can suppress event messages by lowering the logging level. This reduces noise.
- Make sure the context is properly injected into the broadcaster. This allows the -v flag value to also be used in that broadcaster, rather than the above global value.
- test: use cancellation from ktesting
- golangci-hints: checked error return value
2024-02-26 14:51:56 +08:00
Patrick Ohly
3c2cfd9a4f resource claim controller: separate generated suffix from base
When the resource claim name inside the pod had some suffix like "1a" in
"resource-1a", the generated name suffix got added directly after that, leading
to "my-pod-resource-1ax6zgt".

Adding another hyphen makes the result more readable: "my-pod-resource-1a-x6zgt".
2023-09-04 09:45:25 +02:00
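The naming change in commit 3c2cfd9a4f comes down to ending the generated-name prefix with a hyphen. A minimal sketch with hypothetical helper names; the random suffix is passed in explicitly so the result is deterministic, and any length truncation is left out:

```go
package main

import "fmt"

// oldPrefix joins pod and claim name without a trailing separator.
func oldPrefix(podName, claimName string) string {
	return podName + "-" + claimName
}

// newPrefix adds a trailing hyphen so the generated suffix stays visually separate.
func newPrefix(podName, claimName string) string {
	return podName + "-" + claimName + "-"
}

func main() {
	suffix := "x6zgt" // stands in for the randomly generated part
	fmt.Println(oldPrefix("my-pod", "resource-1a") + suffix)
	fmt.Println(newPrefix("my-pod", "resource-1a") + suffix)
}
```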