Skip to content

CoucbaseLite-IOS, _changes?longpoll with SG channels management #2078

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Crapulax opened this issue Sep 12, 2016 · 18 comments
Closed

CoucbaseLite-IOS, _changes?longpoll with SG channels management #2078

Crapulax opened this issue Sep 12, 2016 · 18 comments

Comments

@Crapulax
Copy link

Crapulax commented Sep 12, 2016

Hello,

Origin of the work, around issues #1865 and #2074, was to replicate an issue occuring time to time on our alpha environment. Today with fix for #1865, all logs activated + additional traces we had again the issue. Looking to the logs I realised that couchbaselite is using longpoll, meaning that a since value is provided in the requests which close itself very quickly (two docs let say).
Before digging further in the logs and providing whatever would be necessary for your investigation, I have the strange feeling that below case is not handled, am I right?

longpoll since = 25000, user x
25001: doc1 grants channel access but is not in a channel visible by user x (i.e. not sent)
25002: doc2 sent to user x
longpoll close
longpoll since = 25002, user x  
documents in channel granted by 25001 with a seq id below 25002 will not be sent because since is 25002? Is this what is happening? At least this is what I seems to see in logs.

And here some logs extract:

Grant access doc #279255
Line 1049:  Line 9135: 2016-09-12T12:18:05.222Z Access: Doc "6ffbed60e5a62578235788e252485783" grants channel access: map[be3b68a6-5dec-40ea-a264-db02abe0ef5e:TB=-UQgHTJrIn2bHPi1nFB9E7O:279255,TM=-UQgHTJrIn2bHPi1nFB9E7O:279255,TU=3af379ab-ab24-4e20-b422-db2e9bf24dca:279255]
Line 1050:  Line 9139: 2016-09-12T12:18:05.247Z Access: Rev "6ffbed60e5a62578235788e252485783"/"1-51f43490f90658ff55bdf03b2ca7daff" invalidates channels of [be3b68a6-5dec-40ea-a264-db02abe0ef5e]
Line 1052:  Line 9141: 2016-09-12T12:18:05.247Z Access: Invalidate access of "be3b68a6-5dec-40ea-a264-db02abe0ef5e"

Channel TB=-UQgHTJrIn2bHPi1nFB9E7O:279255,TM=-UQgHTJrIn2bHPi1nFB9E7O:279255 not seen by reload user
Line 1058:  Line 9178: 2016-09-12T12:18:05.957Z Changes+: MultiChangesFeed waiter has changed...   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)
Line 1059:  Line 9179: 2016-09-12T12:18:05.957Z Changes+: MultiChangesFeed reloading user &{roleImpl:{Name_:be3b68a6-5dec-40ea-a264-db02abe0ef5e ExplicitChannels_: Channels_:!:1,TB=-hq9v2xkUD5Wfg1PIsa70Bn:279248,TM=-hq9v2xkUD5Wfg1PIsa70Bn:279248,TU=cbe2bd39-79de-4486-a3e0-9970f6145e54:279248,U=400:279244 Sequence_:279243 PreviousChannels_:} userImplBody:{Email_:[email protected] Disabled_:false PasswordHash_:[36 50 97 36 49 48 36 106 50 100 71 80 120 73 104 53 82 105 98 66 66 78 80 75 119 101 81 73 117 90 102 112 46 70 113 66 89 87 68 48 114 111 102 111 47 73 49 121 48 49 73 99 65 119 89 73 89 67 122 50] OldPasswordHash_:<nil> ExplicitRoles_:public:279243 RolesSince_:public:279243 OldExplicitRoles_:[]} auth:0xc823a0c300 roles:[]}, no change

Send doc #279256 which is also in channel U=400
Line 1079:  Line 9201: 2016-09-12T12:18:05.957Z Changes+: Sending seq:279256 from channel U=400 user   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)

Line 1100:  Line 9352: 2016-09-12T12:18:06.193Z Access: Computed channels for "be3b68a6-5dec-40ea-a264-db02abe0ef5e": !:1,TB=-UQgHTJrIn2bHPi1nFB9E7O:279255,TB=-hq9v2xkUD5Wfg1PIsa70Bn:279248,TM=-UQgHTJrIn2bHPi1nFB9E7O:279255,TM=-hq9v2xkUD5Wfg1PIsa70Bn:279248,TU=3af379ab-ab24-4e20-b422-db2e9bf24dca:279255,TU=cbe2bd39-79de-4486-a3e0-9970f6145e54:279248,U=400:279244
Line 1101:  Line 9353: 2016-09-12T12:18:06.193Z Bucket: Update("_sync:user:be3b68a6-5dec-40ea-a264-db02abe0ef5e", 0, ...) --> <nil> [89.232761ms]

Line 1102:  Line 9354: 2016-09-12T12:18:06.193Z HTTP:  #73129: GET /xxxx/202d0c0f0e49ea3b35adc56688ff004c?rev=1-e9b88282f42781197b150f0195d9c71e&revs=true&attachments=true  (as be3b68a6-5dec-40ea-a264-db02abe0ef5e)
Line 1103:  Line 9357: 2016-09-12T12:18:06.220Z Bucket: Update("_sync:user:be3b68a6-5dec-40ea-a264-db02abe0ef5e", 0, ...) --> CAS quit [228.21µs]

Longpoll reset with since = 279256
Line 1104:  Line 9358: 2016-09-12T12:18:06.220Z HTTP:  #73130: POST /xxxx/_changes?feed=longpoll&heartbeat=300000&style=all_docs&since=279256  (as be3b68a6-5dec-40ea-a264-db02abe0ef5e)

Now channels are found
Line 1107:  Line 9363: 2016-09-12T12:18:06.221Z Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"TB=xxxxWorld":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442cb}, "TM=-hq9v2xkUD5Wfg1PIsa70Bn":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d0}, "U=400":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442cc}, "TB=-UQgHTJrIn2bHPi1nFB9E7O":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d7}, "TB=-hq9v2xkUD5Wfg1PIsa70Bn":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d0}, "TU=cbe2bd39-79de-4486-a3e0-9970f6145e54":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d0}, "!":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x1}, "TU=3af379ab-ab24-4e20-b422-db2e9bf24dca":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d7}, "TM=-UQgHTJrIn2bHPi1nFB9E7O":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x442d7}} ...   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)

But they are not considered as new.... 
Line 1114:  Line 9373: 2016-09-12T12:18:06.222Z Changes+: MultiChangesFeed processing channel TM=-UQgHTJrIn2bHPi1nFB9E7O seqat {VbNo:<nil> Sequence:279255} user   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)
Line 1115:  Line 9374: 2016-09-12T12:18:06.222Z Changes+: MultiChangesFeed no backfill channel TM=-UQgHTJrIn2bHPi1nFB9E7O chanOpts {Since:279256 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc821042c00 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:true} user   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)
Line 1116:  Line 9375: 2016-09-12T12:18:06.222Z Changes: [changesFeed] Found 0 changes for channel TM=-UQgHTJrIn2bHPi1nFB9E7O and user   (to be3b68a6-5dec-40ea-a264-db02abe0ef5e)
@Crapulax
Copy link
Author

Crapulax commented Sep 12, 2016

I build test scripts of above and I can now reproduce the issue. I need to dig further to verify that I am not abused by scripts and that there is no mistake but I get something like 2 errors out of 20 requests.

@Crapulax
Copy link
Author

Crapulax commented Sep 12, 2016

Using sync function from #1865
Here are the scripts to reproduce the issue: (set DB= in ./testSGlongpoll to your db name)
https://gist.github.com/Crapulax/6d8ff8b079474b388908255640f38814

You start everything doing ./testSGlongpoll
With last seq id close to max sequence
And user id start after last user id created or higher

@adamcfraser
Copy link
Collaborator

adamcfraser commented Sep 12, 2016

I think this highlights a gap in the solution that was provided for #1865. Currently a longpoll (or continuous) changes feed is doing the following, when coming out of a waiting state:

  1. Wait ends
  2. Reload the user (when needed)
  3. Identify current cached sequence
  4. Send changes up to the current cached sequence, for each channel

The gap here is for changes that are indexed/cached between steps 2 and 3 - they can result in missed channel backfills in the scenario you've outlined. It may be sufficient to reverse the order of 2 and 3. I need to do some work to verify there aren't any problematic scenarios with those two steps reversed - and whether that's a better solution to the problem described in #2068.

@Crapulax
Copy link
Author

Crapulax commented Sep 12, 2016

Thanks Adam, for your feedback!
Happy to make some tests on whatever you can provide even if you know it is not perfect. I have test scripts ready and running them takes a few minutes...
We need this issue to be solved and we will invest what is required (till we can t ;) )
Steve.

@adamcfraser
Copy link
Collaborator

@Crapulax I've pushed an initial version of the fix to feature/issue_1865, that ensures the cached sequence check happens before the user is reloaded. I think this should be sufficient to ensure that the user context is always up to date with what's been cached, and we don't have a situation where we're returning data that's raced ahead of our user context.

The reverse situation is now possible - the user context may be ahead of what we've cached - but I don't think that's problematic based on an additional update to the existing backfill handling to avoid duplicate backfill.

@Crapulax
Copy link
Author

Hellor Adam,

I will check the result tomorrow and let you know.

Steve

@Crapulax
Copy link
Author

Crapulax commented Sep 13, 2016

Hello Adam,

There is no concrete change with the latest commit. I have also, in some cases, more documents than expected. I do not recall such behavior with previous version.

Just to make sure what I do is correct. When using long poll and i get an answer of the form Seq:lowseq ("last_seq":"289367:289338") what should me my next longpoll since? 289367 or 289338? Well if I use the lowseq (289338) I enter into an infinite loop so I guess it is correct to use Seq(289367), isn't it?

Let me know, how we can progress further.

Thanks,
Steve.

@adamcfraser
Copy link
Collaborator

You should be using the full last_seq as the since value (e.g. since="289367:289338"). That sequence value represents a changes entry that's midway through a backfill.

@Crapulax
Copy link
Author

oh. ok I have made the change on my test scripts and unfortunately results are similar, many errors.

@adamcfraser
Copy link
Collaborator

adamcfraser commented Sep 13, 2016

I think there's a gap in the notification handling that I didn't catch yesterday, because your grant docs aren't visible to the user. When the changes feed is prevented from running ahead of the cached sequence post-userdoc notification, there's no trigger to wake it up from the subsequent waiting state. Looking into possibilities now.

As a side note - I appreciate that you've identified these issues, and am working to get them closed out - they are definitely valid issues. However, I suspect that for your particular use case you could avoid these race conditions with minor changes to the way you're granting channel access. For example, if you did your channel grant based on a user update to the invite document, I think you'd avoid:

  • concurrent access update race between invite and grant docs
  • race between userdoc update and non-visible grant doc

I'm thinking something like this in your config:

if doc_type = "invite" {
     channel(TM=doc.channel)
     channel(U=doc.userid)
     if doc.invite_status = "accepted" {
       access(user, TB=doc.channel)
       access(user, TU=doc.userid)
    }
}

@Crapulax
Copy link
Author

We are on the same page. What you suggest looks fully feasible. I have left the office, but will make some tests with such change first things tomorrow.
At the end of the day we need to get it working whatever is the solution.

Thanks, I keep you updated.
Steve.

@adamcfraser
Copy link
Collaborator

I've pushed one more update to feature/issue_1865, to handle the notification handling scenario I mentioned above. The current approach feels like a bit of a temporary workaround, but it might be all that I can do in the short term - the longer term solution may need to wait until 2.0.

@Crapulax
Copy link
Author

Crapulax commented Sep 14, 2016

First results looks promising! On tests without your last fix and adding grant doc to a user channel I have no document losts (but some duplicates).

  • Without your last fix and adding the granting doc to a channel visible by the user.
    (limit=0 or limit=2 on _changes?longpoll)
    I have nearly no lost doc but duplicates in 80% of cases. Compare to before occurence is low but is still there.
  • With or without your last fix but without the granting doc seen by user:
    I still have missing docs (probably less than without the fix) but still get some.
    I do have also the duplicate doc issue.

I am going to do two things now (with granting doc added to user's channel and use SG without your latest fix):
1- get tests running for a longer time
2- see in alpha environment if original issue we were facing occurs again

I guess that getting some duplicates documents is not an issue for couchbaselite IOS/Android.

Brgds,
Steve.

@Crapulax
Copy link
Author

Crapulax commented Sep 14, 2016

With additional testing I had a few occurence of missing docs. I will analyze later logs to see in which case.

@Crapulax
Copy link
Author

Hello,

We had again twice the issue today in Alpha environment with new invite mechanism. I will spend time this weekend to fix it.

Brgds,
Steve

@Crapulax
Copy link
Author

Crapulax commented Sep 18, 2016

I guess I have a correction. I need to make some massage on it (you are not going to like it as it is:)), but now I am not losing docs during my tests.
Basically what I am doing is to keep track, in a map, of invalidating sequences by user and prevents sending doc seq ahead of smallest pending invalidating seq for a given user. I prevent also issue #2085 by reentering wait loop when necessary.

I have somehow validated the effect of the fix by:

  • instead of preventing the potential error, log a warning msg when the fix would applied
  • in 100% of warning occurence I indeed lose docs on that user. And for sure no loss without the warning.

I had like to remove duplicates now.

@Crapulax
Copy link
Author

I do not see how to avoid duplicates in a long poll case because the low sequence is not linked to a channel. And by the way all docs higher than this sequence value needs to be sent. Am I missing something?

@Crapulax
Copy link
Author

Crapulax commented Oct 5, 2016

See #2117

@Crapulax Crapulax closed this as completed Oct 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants