Bug 2050332 - Malformed ClusterClaim lifetimes cause the clusterclaims-controller to silently fail to reconcile all clusterclaims
Summary: Malformed ClusterClaim lifetimes cause the clusterclaims-controller to silent...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hive
Version: 4.11
Hardware: All
OS: All
unspecified
high
Target Milestone: ---
: 4.11.0
Assignee: Eric Fried
QA Contact: Jianping SHu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-02-03 17:32 UTC by Gurney Buchanan
Modified: 2022-08-10 10:47 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:47:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift hive pull 1684 0 None open Bug 2050332: explicit Pattern to validate Duration fields 2022-02-10 17:08:24 UTC
Github stolostron backlog issues 19686 0 None None None 2022-02-03 21:12:29 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:47:34 UTC

Description Gurney Buchanan 2022-02-03 17:32:40 UTC
Description of the problem:

When the user creates a clusterclaim, they can specify a lifetime, which is converted into a golang time object by using https://pkg.go.dev/time#ParseDuration.  The cluserclaim is reconciled by our clusterclaim-controler and hive.  If the user specifies an invalid Lifetime, our clusterclaim-controller continuously loops on this error:
```
E0203 07:10:38.266841       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224: Failed to watch *v1.ClusterClaim: failed to list *v1.ClusterClaim: v1.ClusterClaimList.Items: []v1.ClusterClaim: v1.ClusterClaim.v1.ClusterClaim.Spec: v1.ClusterClaimSpec.Lifetime: unmarshalerDecoder: time: unknown unit "w" in duration "1w", error found in #10 byte of ...|time":"1w"}},{"apiVe|..., bigger context ...|clusterPoolName":"policy-aas-hubs","lifetime":"1w"}},{"apiVersion":"hive.openshift.io/v1","kind":"Cl|...
```

And fails to reconcile ANY clusterclaim, not just the affected clusterclaim.  

We hit and patched a similar issue with hive a while ago - and the fix involved changing the allowed types/formats in the CRD - https://github.com/openshift/hive/pull/1457/files, also relevant is the Hive Jirs: https://issues.redhat.com/browse/HIVE-1587 - but it appears that clusterclaim-controller is still trying to query these clusterclaim objects and failing.  

Release version:
All versions with clusterclaim-controller (suspected 2.4 and 2.3?)

Operator snapshot version: GA-level

OCP version: All supported

Browser Info: CLI

Steps to reproduce:
1. Create a clusterclaim with a lifetime set to an invalid entry that the unmarshaller can't unmarshal, for example "7d"
2. Watch clusterclaim-controller logs for errors and clusterclaims stop reconciling on the cluster globally

Actual results:
All clusterclaims stop working


Expected results:
Object is either rejected or not reconciled, preferably has status set to indicate its invalid state, but that may not be possible without hitting the unmarshaller issue.  

Additional info:
I'm curious why the hive admission is allowing this malformed ClusterClaim CR in - did they regress their checks or was it never a hard check?

Comment 1 Chuck Brant 2022-02-03 18:58:15 UTC
We will target a backport to 2.3.7

Comment 2 daliu 2022-02-08 09:49:41 UTC
@efried @jdiaz 
I use the latest hive in community (1.2.3481-ad1892a) and create a invalid clusterclaim, and there are also several error in hive-controllers, and all claim can not work.

Invalid ClusterClaim:

apiVersion: hive.openshift.io/v1
kind: ClusterClaim
metadata:
  name: ldpliu-1018
  namespace: default
spec:
  clusterPoolName: aws-pool
  lifetime: "7d"


Error: 
W0208 09:42:28.764939       1 reflector.go:324] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: failed to list *v1.ClusterClaim: time: unknown unit "d" in duration "7d"
E0208 09:42:28.764962       1 reflector.go:138] k8s.io/client-go.0+incompatible/tools/cache/reflector.go:167: Failed to watch *v1.ClusterClaim: failed to list *v1.ClusterClaim: time: unknown unit "d" in duration "7d"


It looks like in https://issues.redhat.com/browse/HIVE-1587, hive checked the invalid field like: "177777777777777777777777777777777777777777777777777777777777777777777777777777h0m0s", but did not check the value like "7d".

Could you help to take a look?

Comment 3 Eric Fried 2022-02-08 16:55:13 UTC
This appears to be an upstream bug. TL;DR: Despite validation of type=duration being documented [1] as parsed by time.ParseDuration, which is what the unmarshaller does [2], it appears it is actually parsed by strfmt.ParseDuration [3], which is more lenient.

The longer story:
- Hive defines this field as metav1.Duration and configures kubebuilder validation accordingly [4].
- When you submit a CR, the apiserver validates it against strfmt.ParseDuration [3], which accepts suffixes like `d` and `w` [5]. Since this passes, the object gets stored.
- When hive tries to List ClusterClaim objects, the object (along with any others, which may be valid) is retrieved and unmarshalled. When the unmarshaller hits the metav1.Duration field, it uses time.ParseDuration [5], which does *not* accept `d` or `w` [6]. So it errors *the whole list* and this is a show-stopper for the controllers.

It should be noted that `d` and `w` are the only suffixes subject to this problem. If you try a value like "1q" the apiserver will reject it as expected:

```
The ClusterClaim "the-claim" is invalid: spec.lifetime: Invalid value: "1q": spec.lifetime in body must be of type duration: "1q"
```

From hive's side, I only see a couple of ways around this:
- Add a webhook to do the extra validation. This seems like a pretty heavy answer.
- Stop using the convenience of go typing and kubebuilder validation for metav1.Duration. Make this (and the other affected API fields -- six total by my count) just a `string` and do the validation in the controller. Invalid values will still cause the CR to be unusable, but wouldn't affect other, valid CRs. However, you would need to probe the logs to figure out why your CR isn't working... unless we added a status condition for same. Again, a very heavy solution.

Or we could just open an upstream bug and document this limitation. Thoughts?

[1] https://github.com/kubernetes/apiextensions-apiserver/blob/29400d7010f44f7d6ad2c6762604185143382598/pkg/apis/apiextensions/v1/types_jsonschema.go#L51
[2] https://github.com/kubernetes/apimachinery/blob/df993592a122931b8aac4db57689e09458a2332c/pkg/apis/meta/v1/duration.go#L39
[3] https://github.com/kubernetes/apiextensions-apiserver/blob/29400d7010f44f7d6ad2c6762604185143382598/pkg/apiserver/schema/cel/values.go#L135
[4] https://github.com/openshift/hive/blob/9c572408b5299c520160e7a317c52e80b0415b9f/apis/hive/v1/clusterclaim_types.go#L28-L30
[5] https://github.com/kubernetes/kube-openapi/blob/424119656bbfd8b633f4b9f9ef5f93cd1e01266a/pkg/validation/strfmt/duration.go#L51-L52
[6] https://go.dev/src/time/format.go#L1493

Comment 4 Eric Fried 2022-02-08 22:07:43 UTC
Another option comes to mind: I may be able to override metav1.Duration, essentially copying it into the hive API but changing this line [1] to use `strfmt.ParseDuration` instead.

To be clear: I still don't think this is The Right Thing™, but (if it works) it's an elegant workaround, as it will cause hive to support rather than reject `d` and `w` suffixes.

Trying...

[1] https://github.com/kubernetes/apimachinery/blob/df993592a122931b8aac4db57689e09458a2332c/pkg/apis/meta/v1/duration.go#L39

Comment 6 Eric Fried 2022-02-10 15:48:26 UTC
> Trying...

I knocked my head against this for a while. Turns out to be pretty tricky due to the way code-generator wants to handle my cloned Duration object when it's in my project (but not, for some reason, when it's up in apimachinery).

@jspeed suggested a simpler workaround: ditch `format=duration` entirely and use an explicit regex [1]. This is somewhat brittle in that it (still!) may not precisely match what the unmarshaller accepts. But at least we can be sure it'll reject `d` and `w`.

In the meantime, I have opened issues against both apimachinery [2] and apiextensions-apiserver [3], which I'll note next to the regex in case they ever get addressed.

[1] https://github.com/openshift/api/blob/b632c5fc10c0ea86cb18577db9388109a43d465f/machine/v1beta1/types_machinehealthcheck.go#L111-L112
[2] https://github.com/kubernetes/apimachinery/issues/131
[3] https://github.com/kubernetes/apiextensions-apiserver/issues/56

Comment 9 daliu 2022-02-14 01:04:54 UTC
Thanks @efried 
Follow comments https://bugzilla.redhat.com/show_bug.cgi?id=2050332#c1
You may need to backport it to ocm-2.3 and ocm-2.4

Comment 10 daliu 2022-02-15 03:13:06 UTC
@efried I suggest only fix this in ocm-2.4 or later version in ACM part.[still need response]https://github.com/stolostron/backlog/issues/19686#issuecomment-1039804156.
So please cherry-pick it to ocm-2.4 first. And for ocm-2.3, I will give the response later.

Comment 11 Jianping SHu 2022-02-15 08:07:02 UTC
Verified with the following. The issue has been fixed correctly.

Test Case:
https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-48684

Hive version: 
v1.1.16-323-g7e0b248

Comment 14 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.