Description of the problem: When the user creates a clusterclaim, they can specify a lifetime, which is converted into a golang time object by using https://pkg.go.dev/time#ParseDuration. The cluserclaim is reconciled by our clusterclaim-controler and hive. If the user specifies an invalid Lifetime, our clusterclaim-controller continuously loops on this error: ``` E0203 07:10:38.266841 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224: Failed to watch *v1.ClusterClaim: failed to list *v1.ClusterClaim: v1.ClusterClaimList.Items: []v1.ClusterClaim: v1.ClusterClaim.v1.ClusterClaim.Spec: v1.ClusterClaimSpec.Lifetime: unmarshalerDecoder: time: unknown unit "w" in duration "1w", error found in #10 byte of ...|time":"1w"}},{"apiVe|..., bigger context ...|clusterPoolName":"policy-aas-hubs","lifetime":"1w"}},{"apiVersion":"hive.openshift.io/v1","kind":"Cl|... ``` And fails to reconcile ANY clusterclaim, not just the affected clusterclaim. We hit and patched a similar issue with hive a while ago - and the fix involved changing the allowed types/formats in the CRD - https://github.com/openshift/hive/pull/1457/files, also relevant is the Hive Jirs: https://issues.redhat.com/browse/HIVE-1587 - but it appears that clusterclaim-controller is still trying to query these clusterclaim objects and failing. Release version: All versions with clusterclaim-controller (suspected 2.4 and 2.3?) Operator snapshot version: GA-level OCP version: All supported Browser Info: CLI Steps to reproduce: 1. Create a clusterclaim with a lifetime set to an invalid entry that the unmarshaller can't unmarshal, for example "7d" 2. Watch clusterclaim-controller logs for errors and clusterclaims stop reconciling on the cluster globally Actual results: All clusterclaims stop working Expected results: Object is either rejected or not reconciled, preferably has status set to indicate its invalid state, but that may not be possible without hitting the unmarshaller issue. Additional info: I'm curious why the hive admission is allowing this malformed ClusterClaim CR in - did they regress their checks or was it never a hard check?
We will target a backport to 2.3.7
@efried @jdiaz I use the latest hive in community (1.2.3481-ad1892a) and create a invalid clusterclaim, and there are also several error in hive-controllers, and all claim can not work. Invalid ClusterClaim: apiVersion: hive.openshift.io/v1 kind: ClusterClaim metadata: name: ldpliu-1018 namespace: default spec: clusterPoolName: aws-pool lifetime: "7d" Error: W0208 09:42:28.764939 1 reflector.go:324] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: failed to list *v1.ClusterClaim: time: unknown unit "d" in duration "7d" E0208 09:42:28.764962 1 reflector.go:138] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: Failed to watch *v1.ClusterClaim: failed to list *v1.ClusterClaim: time: unknown unit "d" in duration "7d" It looks like in https://issues.redhat.com/browse/HIVE-1587, hive checked the invalid field like: "177777777777777777777777777777777777777777777777777777777777777777777777777777h0m0s", but did not check the value like "7d". Could you help to take a look?
This appears to be an upstream bug. TL;DR: Despite validation of type=duration being documented [1] as parsed by time.ParseDuration, which is what the unmarshaller does [2], it appears it is actually parsed by strfmt.ParseDuration [3], which is more lenient. The longer story: - Hive defines this field as metav1.Duration and configures kubebuilder validation accordingly [4]. - When you submit a CR, the apiserver validates it against strfmt.ParseDuration [3], which accepts suffixes like `d` and `w` [5]. Since this passes, the object gets stored. - When hive tries to List ClusterClaim objects, the object (along with any others, which may be valid) is retrieved and unmarshalled. When the unmarshaller hits the metav1.Duration field, it uses time.ParseDuration [5], which does *not* accept `d` or `w` [6]. So it errors *the whole list* and this is a show-stopper for the controllers. It should be noted that `d` and `w` are the only suffixes subject to this problem. If you try a value like "1q" the apiserver will reject it as expected: ``` The ClusterClaim "the-claim" is invalid: spec.lifetime: Invalid value: "1q": spec.lifetime in body must be of type duration: "1q" ``` From hive's side, I only see a couple of ways around this: - Add a webhook to do the extra validation. This seems like a pretty heavy answer. - Stop using the convenience of go typing and kubebuilder validation for metav1.Duration. Make this (and the other affected API fields -- six total by my count) just a `string` and do the validation in the controller. Invalid values will still cause the CR to be unusable, but wouldn't affect other, valid CRs. However, you would need to probe the logs to figure out why your CR isn't working... unless we added a status condition for same. Again, a very heavy solution. Or we could just open an upstream bug and document this limitation. Thoughts? [1] https://github.com/kubernetes/apiextensions-apiserver/blob/29400d7010f44f7d6ad2c6762604185143382598/pkg/apis/apiextensions/v1/types_jsonschema.go#L51 [2] https://github.com/kubernetes/apimachinery/blob/df993592a122931b8aac4db57689e09458a2332c/pkg/apis/meta/v1/duration.go#L39 [3] https://github.com/kubernetes/apiextensions-apiserver/blob/29400d7010f44f7d6ad2c6762604185143382598/pkg/apiserver/schema/cel/values.go#L135 [4] https://github.com/openshift/hive/blob/9c572408b5299c520160e7a317c52e80b0415b9f/apis/hive/v1/clusterclaim_types.go#L28-L30 [5] https://github.com/kubernetes/kube-openapi/blob/424119656bbfd8b633f4b9f9ef5f93cd1e01266a/pkg/validation/strfmt/duration.go#L51-L52 [6] https://go.dev/src/time/format.go#L1493
Another option comes to mind: I may be able to override metav1.Duration, essentially copying it into the hive API but changing this line [1] to use `strfmt.ParseDuration` instead. To be clear: I still don't think this is The Right Thing™, but (if it works) it's an elegant workaround, as it will cause hive to support rather than reject `d` and `w` suffixes. Trying... [1] https://github.com/kubernetes/apimachinery/blob/df993592a122931b8aac4db57689e09458a2332c/pkg/apis/meta/v1/duration.go#L39
> Trying... I knocked my head against this for a while. Turns out to be pretty tricky due to the way code-generator wants to handle my cloned Duration object when it's in my project (but not, for some reason, when it's up in apimachinery). @jspeed suggested a simpler workaround: ditch `format=duration` entirely and use an explicit regex [1]. This is somewhat brittle in that it (still!) may not precisely match what the unmarshaller accepts. But at least we can be sure it'll reject `d` and `w`. In the meantime, I have opened issues against both apimachinery [2] and apiextensions-apiserver [3], which I'll note next to the regex in case they ever get addressed. [1] https://github.com/openshift/api/blob/b632c5fc10c0ea86cb18577db9388109a43d465f/machine/v1beta1/types_machinehealthcheck.go#L111-L112 [2] https://github.com/kubernetes/apimachinery/issues/131 [3] https://github.com/kubernetes/apiextensions-apiserver/issues/56
Thanks @efried Follow comments https://bugzilla.redhat.com/show_bug.cgi?id=2050332#c1 You may need to backport it to ocm-2.3 and ocm-2.4
@efried I suggest only fix this in ocm-2.4 or later version in ACM part.[still need response]https://github.com/stolostron/backlog/issues/19686#issuecomment-1039804156. So please cherry-pick it to ocm-2.4 first. And for ocm-2.3, I will give the response later.
Verified with the following. The issue has been fixed correctly. Test Case: https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-48684 Hive version: v1.1.16-323-g7e0b248
Backport to ocm-2.4 via https://issues.redhat.com/browse/HIVE-1760 and https://github.com/openshift/hive/pull/1690.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069