When enabling DynamicResourceAllocation the dynamicresource plugin may
error during scheduling with:
```
E0212 08:57:53.817268 1 framework.go:1323] "Plugin failed" err="podschedulingcontexts.resource.k8s.io \"pod\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>" plugin="DynamicResources" pod="gpu-test2/pod"
```
NFTables proxy will now drop traffic directed towards unallocated
ClusterIPs and reject traffic directed towards invalid ports of
Cluster IPs.
Signed-off-by: Daman Arora <aroradaman@gmail.com>
This commit defines the ClusterTrustBundlePEM projected volume types.
These types have been renamed from the KEP (PEMTrustAnchors) in order to
leave open the possibility of a similar projection drawing from a
yet-to-exist namespaced-scoped TrustBundle object, which came up during
KEP discussion.
* Add the projection field to internal and v1 APIs.
* Add validation to ensure that usages of the project must specify a
name and path.
* Add TODO covering admission control to forbid mirror pods from using
the projection.
Part of KEP-3257.
Controls the lifecycle of the ServiceCIDRs adding finalizers and
setting the Ready condition in status when they are created, and
removing the finalizers once it is safe to remove (no orphan IPAddresses)
An IPAddress is orphan if there are no ServiceCIDR containing it.
Change-Id: Icbe31e1ed8525fa04df3b741c8a817e5f2a49e80
PVC and containers shared the same ResourceRequirements struct to define their
API. When resource claims were added, that struct got extended, which
accidentally also changed the PVC API. To avoid such a mistake from happening
again, PVC now uses its own VolumeResourceRequirements struct.
The `Claims` field gets removed because risk of breaking someone is low:
theoretically, YAML files which have a claims field for volumes now
get rejected when validating against the OpenAPI. Such files
have never made sense and should be fixed.
Code that uses the struct definitions needs to be updated.
* Add slash ended urls for service-account-issuer-discovery to match API in swagger
* update the comment for adding slash-ended URLs
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
---------
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
* [API REVIEW] ValidatingAdmissionPolicyStatucController config.
worker count.
* ValidatingAdmissionPolicyStatus controller.
* remove CEL typechecking from API server.
* fix initializer tests.
* remove type checking integration tests
from API server integration tests.
* validatingadmissionpolicy-status options.
* grant access to VAP controller.
* add defaulting unit test.
* generated: ./hack/update-codegen.sh
* add OWNERS for VAP status controller.
* type checking test case.
When someone decides that a Pod should definitely run on a specific node, they
can create the Pod with spec.nodeName already set. Some custom scheduler might
do that. Then kubelet starts to check the pod and (if DRA is enabled) will
refuse to run it, either because the claims are still waiting for the first
consumer or the pod wasn't added to reservedFor. Both are things the scheduler
normally does.
Also, if a pod got scheduled while the DRA feature was off in the
kube-scheduler, a pod can reach the same state.
The resource claim controller can handle these two cases by taking over for the
kube-scheduler when nodeName is set. Triggering an allocation is simpler than
in the scheduler because all it takes is creating the right
PodSchedulingContext with spec.selectedNode set. There's no need to list nodes
because that choice was already made, permanently. Adding the pod to
reservedFor also isn't hard.
What's currently missing is triggering de-allocation of claims to re-allocate
them for the desired node. This is not important for claims that get created
for the pod from a template and then only get used once, but it might be
worthwhile to add de-allocation in the future.
The status determines which claims kubelet is allowed to access when claims get
created from a template. Therefore kubelet must not be allowed to modify that
part of the status, because otherwise it could add an entry and then gain
access to a claim it should have access to.
As discussed during the KEP and code review of dynamic resource allocation for
Kubernetes 1.26, to increase security kubelet should only get read access to
those ResourceClaim objects that are needed by a pod on the node.
This can be enforced easily by the node authorizer:
- the names of the objects are defined by the pod status
- there is no indirection as with volumes (pod -> pvc -> pv)
Normally the graph only gets updated when a pod is not scheduled yet.
Resource claim status changes can still happen after that, so they
also must trigger an update.
Generating the name avoids all potential name collisions. It's not clear how
much of a problem that was because users can avoid them and the deterministic
names for generic ephemeral volumes have not led to reports from users. But
using generated names is not too hard either.
What makes it relatively easy is that the new pod.status.resourceClaimStatus
map stores the generated name for kubelet and node authorizer, i.e. the
information in the pod is sufficient to determine the name of the
ResourceClaim.
The resource claim controller becomes a bit more complex and now needs
permission to modify the pod status. The new failure scenario of "ResourceClaim
created, updating pod status fails" is handled with the help of a new special
"resource.kubernetes.io/pod-claim-name" annotation that together with the owner
reference identifies exactly for what a ResourceClaim was generated, so
updating the pod status can be retried for existing ResourceClaims.
The transition from deterministic names is handled with a special case for that
recovery code path: a ResourceClaim with no annotation and a name that follows
the Kubernetes <= 1.27 naming pattern is assumed to be generated for that pod
claim and gets added to the pod status.
There's no immediate need for it, but just in case that it may become relevant,
the name of the generated ResourceClaim may also be left unset to record that
no claim was needed. Components processing such a pod can skip whatever they
normally would do for the claim. To ensure that they do and also cover other
cases properly ("no known field is set", "must check ownership"),
resourceclaim.Name gets extended.
When a pod is done, but not getting removed yet for while, then a claim that
got generated for that pod can be deleted already. This then also triggers
deallocation.
This was making my eyes bleed as I read over code.
I used the following in vim. I made them up on the fly, but they seemed
to pass manual inspection.
:g/},\n\s*{$/s//}, {/
:w
:g/{$\n\s*{$/s//{{/
:w
:g/^\(\s*\)},\n\1},$/s//}},/
:w
:g/^\(\s*\)},$\n\1}$/s//}}/
:w
This commit is the main API piece of KEP-3257 (ClusterTrustBundles).
This commit:
* Adds the certificates.k8s.io/v1alpha1 API group
* Adds the ClusterTrustBundle type.
* Registers the new type in kube-apiserver.
* Implements the type-specfic validation specified for
ClusterTrustBundles:
- spec.pemTrustAnchors must always be non-empty.
- spec.signerName must be either empty or a valid signer name.
- Changing spec.signerName is disallowed.
* Implements the "attest" admission check to restrict actions on
ClusterTrustBundles that include a signer name.
Because it wasn't specified in the KEP, I chose to make attempts to
update the signer name be validation errors, rather than silently
ignored.
I have tested this out by launching these changes in kind and
manipulating ClusterTrustBundle objects in the resulting cluster using
kubectl.
The name "PodScheduling" was unusual because in contrast to most other names,
it was impossible to put an article in front of it. Now PodSchedulingContext is
used instead.
1. Define ContainerResizePolicy and add it to Container struct.
2. Add ResourcesAllocated and Resources fields to ContainerStatus struct.
3. Define ResourcesResizeStatus and add it to PodStatus struct.
4. Add InPlacePodVerticalScaling feature gate and drop disabled fields.
5. ResizePolicy validation & defaulting and Resources mutability for CPU/Memory.
6. Various fixes from code review feedback (originally committed on Apr 12, 2022)
KEP: /enhancements/keps/sig-node/1287-in-place-update-pod-resources
* Add tracker types and tests
* Modify ResourceEventHandler interface's OnAdd member
* Add additional ResourceEventHandlerDetailedFuncs struct
* Fix SharedInformer to let users track HasSynced for their handlers
* Fix in-tree controllers which weren't computing HasSynced correctly
* Deprecate the cache.Pop function
When StatefulSetAutoDeletePVC feature gate is enabled, StatefulSet
controller updates ownerReferences on managed PVCs. To be able to pass
OwnerReferencesPermissionEnforcement admission, it must have permissions to
delete PVCs.
Dependencies need to be updated to use
github.com/container-orchestrated-devices/container-device-interface.
It's not decided yet whether we will implement Topology support
for DRA or not. Not having any toppology-related code
will help to avoid wrong impression that DRA is used as a hint
provider for the Topology Manager.
The controller uses the exact same logic as the generic ephemeral inline volume
controller, just for inline ResourceClaimTemplate -> ResourceClaim.
In addition, it supports removal of pods from the ReservedFor field when those
pods are known to not need the claim anymore. At the moment, only this special
case is supported. Removal of arbitrary objects would imply granting full read
access to all types to determine whether a) an object is gone and b) if the
current incarnation is the one which is listed in ReservedFor. This may get
added later.
cd54bd94e9 removes the
handlers for /spec from the kubelet server.
Cleanup the RBAC rules as well.
Change-Id: Id6befbcacec27ad383e336b7189289f55c1c0a68
Improve error logging from timed workers which are used for pod eviction
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Introduce networking/v1alpha1 api group.
Add `ClusterCIDR` type to networking/v1alpha1 api group, this type
will enable the NodeIPAM controller to support multiple ClusterCIDRs.
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod deleted by taint manager due to NoExecute taint)
- EvictionByEvictionAPI (Pod evicted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by PodGC)PreemptedByScheduler (Pod preempted by kube-scheduler)
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
* Introduce networking/v1alpha1 api, ClusterCIDRConfig type
Introduce networking/v1alpha1 api group.
Add `ClusterCIDRConfig` type to networking/v1alpha1 api group, this type
will enable the NodeIPAM controller to support multiple ClusterCIDRs.
* Change ClusterCIDRConfig.NodeSelector type in api
* Fix review comments for API
* Update ClusterCIDRConfig API Spec
Introduce PerNodeHostBits field, remove PerNodeMaskSize
Some of these changes are cosmetic (repeatedly calling klog.V instead of
reusing the result), others address real issues:
- Logging a message only above a certain verbosity threshold without
recording that verbosity level (if klog.V().Enabled() { klog.Info... }):
this matters when using a logging backend which records the verbosity
level.
- Passing a format string with parameters to a logging function that
doesn't do string formatting.
All of these locations where found by the enhanced logcheck tool from
https://github.com/kubernetes/klog/pull/297.
In some cases it reports false positives, but those can be suppressed with
source code comments.
This change updates the generic webhook logic to use a rest.Config
as its input instead of a kubeconfig file. This exposes all of the
rest.Config knobs to the caller instead of the more limited set
available through the kubeconfig format. This is useful when this
code is being used as a library outside of core Kubernetes. For
example, a downstream consumer may want to override the webhook's
internals such as its TLS configuration.
Signed-off-by: Monis Khan <mok@vmware.com>
This patch aims to simplify decoupling "pkg/scheduler/framework/plugins"
from internal "k8s.io/kubernetes" packages. More described in
issue #89930 and PR #102953.
Some helpers from "k8s.io/kubernetes/pkg/controller/volume/persistentvolume"
package moved to "k8s.io/component-helpers/storage/volume" package:
- IsDelayBindingMode
- GetBindVolumeToClaim
- IsVolumeBoundToClaim
- FindMatchingVolume
- CheckVolumeModeMismatches
- CheckAccessModes
- GetVolumeNodeAffinity
Also "CheckNodeAffinity" from "k8s.io/kubernetes/pkg/volume/util"
package moved to "k8s.io/component-helpers/storage/volume" package
to prevent diamond dependency conflict.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
This feature has graduated to GA in v1.11 and will always be
enabled. So no longe need to check if enabled.
Signed-off-by: Konstantin Misyutin <konstantin.misyutin@huawei.com>
The feature gate gets locked to "true", with the goal to remove it in two
releases.
All code now can assume that the feature is enabled. Tests for "feature
disabled" are no longer needed and get removed.
Some code wasn't using the new helper functions yet. That gets changed while
touching those lines.
The name concatenation and ownership check were originally considered small
enough to not warrant dedicated functions, but the intent of the code is more
readable with them.
Since external metrics were added, we weren't running the HPA with
metrics REST clients by default, so we had no bootstrap policy to enable
the HPA controller to talk to the external metrics API.
This change adds permissions for the HPA controller to list and get
external.metrics.k8s.io by default as already done for the
custom.metrics.k8s.io API.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
Through Job.status.uncountedPodUIDs and a Pod finalizer
An annotation marks if a job should be tracked with new behavior
A separate work queue is used to remove finalizers from orphan pods.
Change-Id: I1862e930257a9d1f7f1b2b0a526ed15bc8c248ad
This is the result of
UPDATE_BOOTSTRAP_POLICY_FIXTURE_DATA=true go test k8s.io/kubernetes/plugin/pkg/auth/authorizer/rbac/bootstrappolicy
Apparently enabling the GenericEphemeralVolume feature by default
affect this test. The policy that it now tests against is indeed
the one needed for the controller.
This is the result of
UPDATE_BOOTSTRAP_POLICY_FIXTURE_DATA=true go test k8s.io/kubernetes/plugin/pkg/auth/authorizer/rbac/bootstrappolicy
after enabling the CSIStorageCapacity feature. This enables
additional RBAC entries for reading CSIDriver and
CSIStorageCapacity.