SOLVED

Re: Can I avoid race conditions when using static lists as a task queue?

Jon_Wu
Level 4

Sanford Whiteman and others gave me some great advice on how I could use lists as a work queue to work around Munchkin v2 limitations on what actions could be done when a lead was associated, in the thread "When a lead is merged / associated how can I make previously anonymous activity send an email?"

I'm looking for ways to avoid a race condition where people in a list may get processed more than once.

My goal is to put people into a list for specific campaigns, and use the list as a queue that will be flushed by "cron job" / scheduled campaigns that run once an hour, and also by triggered campaigns that hopefully happen more quickly. The triggers may not always happen, so the cron is the fallback to flush the queues regularly.

Imagine I have a Smart Campaign that finds people in a specific list, then in the flow, removes them from the list, and sends an email.

Could this SC be invoked twice at the exact same time e.g. once from a scheduled campaign, and another time from a triggered campaign? If that was the case, maybe the user gets 2 emails.

What if the user is removed from the list in between when the SC criteria makes them eligible and when the flow actually runs? Should I be re-checking list membership before sending the email? Are there any other gotchas here? I can't say only send the email once, b/c sometimes this is for transactional reasons so people may be queued up in the same list multiple times (but no more than 1 time at once).

Thanks!

1 ACCEPTED SOLUTION
SanfordWhiteman
Level 10 - Community Moderator

Unfortunately, you're still going to have race conditions in this scenario. There simply isn't enough atomicity or transaction awareness (not in the transactional email sense but in the SQL transaction sense).

What you need is an Atomic Compare-and-Swap (CAS) or true stack to make this work. You could definitely accomplish this using a webhook and a key-value store with CAS abilities.
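A minimal sketch of the compare-and-swap idea, assuming a hypothetical in-memory store standing in for whatever key-value service would sit behind the webhook. The `KVStore` class, the `campaign:lead_id` key format, and the `queued` / `claimed` states are all illustrative assumptions, not part of Marketo or any specific product.

```python
import threading


class KVStore:
    """Hypothetical in-memory key-value store with an atomic compare-and-swap."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def compare_and_swap(self, key, expected, new):
        """Set key to `new` only if it currently equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def claim(store, campaign, lead_id):
    # Only one caller can move the entry from "queued" to "claimed".
    return store.compare_and_swap(f"{campaign}:{lead_id}", "queued", "claimed")


store = KVStore()
store.put("welcome-email:1001", "queued")

results = []
threads = [
    threading.Thread(
        target=lambda: results.append(claim(store, "welcome-email", 1001))
    )
    for _ in range(2)  # e.g. one scheduled run and one triggered run
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [False, True]: exactly one run wins the claim
```

In this sketch the email would only be sent by the run whose claim returned True; the comparison and the write happen under one lock, so there is no gap between them for another run to slip into.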

(Also, just FYI, I'm primarily Sanford Whiteman -- you @'d one of my secondary accounts!)


12 REPLIES
Denise_Greenb12
Level 7

Hi Jon,

Re: "Could this SC be invoked twice at the exact same time e.g. once from a scheduled campaign, and another time from a triggered campaign? If that was the case, maybe the user gets 2 emails."

Sanford is right. However, it seems like you could minimize this possibility by adding a choice to the flow of each campaign: "Member of Smart Campaign is Not <The Other Campaign>".

Re: "What if the user is removed from the list in between when the SC criteria makes them eligible and when the flow actually runs? Should I be re-checking list membership before sending the email?" Are you worried the user who has been removed from the list will get the email when she shouldn't because she is no longer eligible by the time the flow runs? If so, re-checking list membership before sending the email is a reasonable approach.

Denise

Jon_Wu
Level 4

Hi Denise,

Yes. In addition to the lack of "locking read" support for exclusive access to the list within a window of a few seconds (much like Sanford is talking about, and how I'm thinking about it as an engineer), I'm also concerned about the gap of time between member list creation and campaign execution (from the Under the Hood II: Batch Campaigns recording and How Campaign Processing Works).

Since it seems like there could be a big delay between member list creation / initial eligibility via the Smart List and the actual Flow steps, it seems like the secondary check would help. Hopefully that extra check doesn't slow down flows too much, but it seems like that's a fast (< 50 ms) operation so it's probably a good idea.

Upon thinking about this race condition more, it seems like I want to remove the user from the list ASAP to reduce the time window between checking eligibility and list removal. Since flow steps like email and webhook sending might take a while, I'm now thinking I'd use a choice step as the first step that does Remove From Flow if the user isn't in the list. Then the flow would remove them from the list, and only after that run the slower steps like webhooks and emails.

Here's the template I think we'll use for each campaign where we use a queue.

[Screenshot: pastedImage_3.png]

[Screenshot: pastedImage_5.png]

Denise_Greenb12
Level 7

Hi Jon,

I think you mean to say Member of List "Not in" rather than "in" in Flow step 1. Otherwise, looks good.

Denise

Jon_Wu
Level 4

Thanks for taking a look Denise.

It reads in a confusing way, but I think it's doing what I want while avoiding "not in" / negation for the best list querying performance, although I haven't had a chance to test yet. The intent is that we don't remove you from the flow if you're still in the list; we then continue on to remove you from the list and send emails or fire webhooks. However, if you aren't in the list, I'd expect the default choice to kick in and remove you from the campaign, so the steps don't fire in case you were removed from the list by another concurrent campaign between member list creation and flow execution.

Does this seem right?

Denise_Greenbe9
Level 1

Hi Jon,

Ah, I see what you mean now. The logic works but I think it's unnecessarily complicated. I think querying a static list for membership is a pretty light load and I would opt for the easier-to-understand option: Remove from Flow if not member of list, otherwise do nothing.

Denise

Jon_Wu
Level 4

Thanks for the advice! While normally, I think readability is king, I was opting for the positive lookup to go along with the Smart List Best Practices #3:

Use positive over negative operators - While "not" filters are available, they have to search the entire data set in your instance, which can be extremely time-consuming. Positive "is" filters are able to leverage more effective search algorithms.

As long as we had good docs on how to set things up correctly, I'm hoping the confusion won't be too bad and that the potential performance will be worth it.

Thinking about the backend, they probably make this recommendation because it's common that databases can't use an index for a negation query, but they can for instantaneous lookup for an equals query. https://stackoverflow.com/questions/1759476/database-index-not-used-if-the-where-criteria-is discusses this for SQL. I'm not sure what database Marketo Static Lists are stored in and if the same issue applies, but it seems like better safe than sorry when working with potentially millions of things.

SanfordWhiteman
Level 10 - Community Moderator

Hey Jon, I definitely think this is overkill now, and causing more harm than good. The more steps you have, the more unsynchronized operations there are, and the greater capacity for race conditions. (Because again, the steps of a single flow are not wrapped in a transaction or lock.)

If you're willing to have a Smart Campaign that's this obscure, but with dubious (maybe even negative) effects, why not go for an actual Compare-and-Swap via a webhook? That'll guarantee you will not have a race condition.

Re: "databases can't use an index for a negation query, but they can for instantaneous lookup for an equals query"

Even if it's called Not in List in the UI, it doesn't have to be materialized as an inequality query. It can just as easily be SELECT ID FROM LISTS WHERE ID = ? AND LIST = 22, checking for no result set. That's a standard equality query and will use an index.
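To illustrate the point, here is a small sketch using SQLite. Marketo's actual storage is not known here, so the `list_members` table, its columns, and the `not_in_list` helper are assumptions made purely for the example. The "not in list" check is answered by an equality lookup that returns no row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (lead, list) membership.
conn.execute(
    "CREATE TABLE list_members ("
    "lead_id INTEGER, list_id INTEGER, PRIMARY KEY (lead_id, list_id))"
)
conn.executemany(
    "INSERT INTO list_members VALUES (?, ?)", [(1, 22), (2, 22), (3, 7)]
)

QUERY = "SELECT 1 FROM list_members WHERE lead_id = ? AND list_id = ?"


def not_in_list(conn, lead_id, list_id):
    # Equality lookup; "not in list" simply means no row came back.
    return conn.execute(QUERY, (lead_id, list_id)).fetchone() is None


print(not_in_list(conn, 1, 22))  # False: lead 1 is a member of list 22
print(not_in_list(conn, 3, 22))  # True: lead 3 is only in list 7

# Inspect how SQLite plans the lookup (exact wording varies by version).
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1, 22)):
    print(row)
```

Whether a given product actually implements its negative filters this way is a separate question; the sketch only shows that a negation in the UI does not force a negation in the query.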

Jon_Wu
Level 4

Hi Sanford. It seems like what I have down gives some balance between adding a little safety and not increasing complexity too much, adding external dependencies, or slowing down performance by more than a couple hundred ms, but I'd love to hear more about your thoughts on compare-and-swap via webhook.

First, I think what I have now in the screenshot reduces the race condition window to a reasonable size. There's probably a 50-100 ms window between where we check to remove them from the flow and when we actually get them removed from the list. Not ideal, but I don't expect too much concurrency except when a cron / scheduled Smart Campaign overlaps with triggers. As I'm typing this, another way I could reduce the likelihood of a race condition is to add a wait step before putting people into a cron list, so that the cron is truly more of a fallback and won't compete with triggered campaigns that execute immediately, but that would add yet another step.

When new leads are created OR when leads are associated, I try to kick off actions to send emails that were triggered by activity that was logged while the lead was anonymous. The ideal scenario is that I can use a trigger on a server-side form post that I use to associate leads, but if that fails or if I'm associating via REST API, I depend on 24 scheduled batches (running once per hour) to try and flush the queues out as a fallback. If any of those scheduled campaigns run at the same time as the lead association, that's where the race is most likely to happen.

The super ugly part with all of this is that I'd have to have our marketing team implement all of these things every time they want to set up a triggered email, which seems tedious. If I were using a webhook to do a lock externally, I'd need to set up another service, have a key/value store with tracking for every campaign that's running, and then I think I'd be writing back to a Marketo field with the result by campaign, so I'd need a new lead field for every campaign that I run. Is this kind of what you're thinking with a compare-and-swap via webhook? If so, I think that's probably too much. Worst case, if the race condition happens, somebody gets 2 emails. If that happens 1 in 100,000 times, that's probably not a huge deal.

I agree the smart thing to do would be to never use an inequality query, but I was imagining that if that were the case, the Smart List Best Practices doc wouldn't need to recommend using positive over negative operators. This is just speculation, of course. Either way, I think these lists will probably be small, so Denise's advice to go with the more readable solution may be the most practical if I keep this strategy.

SanfordWhiteman
Level 10 - Community Moderator

Re: "If so, re-checking list membership before sending the email is a reasonable approach."

But the lookup and the subsequent send are not interlocked, and there isn't a unified view of the database that's guaranteed to persist across the steps of a flow. (Think about how many flows would be broken if this kind of isolation were in place!)

This race condition is more like the classical examples from programming. It's a fine-grained example of how checking for a condition, then proceeding as if the condition is still true despite the surrounding system not making that guarantee, is ultimately unreliable -- however unlikely it seems that you'll run into the bad case.
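The classical check-then-act problem can be shown with a deliberately interleaved toy example. Nothing here is Marketo code: `queue_list`, `check`, and `act` are hypothetical stand-ins for the static list, the membership test, and the "remove from list, then send" flow steps.

```python
queue_list = {1001}  # stand-in for the static list used as a queue
sent = []            # stand-in for emails actually sent


def check(lead_id):
    # Step 1: is the lead still in the list?
    return lead_id in queue_list


def act(lead_id):
    # Step 2: remove from the list and send, assuming step 1 is still true.
    queue_list.discard(lead_id)
    sent.append(lead_id)


# Two runs interleave: both check before either one acts.
scheduled_ok = check(1001)
triggered_ok = check(1001)

if scheduled_ok:
    act(1001)
if triggered_ok:
    act(1001)

print(sent)  # [1001, 1001]: the lead was emailed twice
```

Re-checking membership later only moves the gap; as long as the check and the act are separate steps, some interleaving like the one above remains possible.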

One way to avoid this is to use a system that can deliberately invalidate the condition at exactly the same time (interlocked) it evaluates the condition. That guarantees that any later attempt to read the same condition will fail, even if it's only one clock tick later. Or you can use a system that uses at-most-once to pop something off a stack and guarantee to never pop it again.
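As a rough illustration of the stack variant, the sketch below uses Python's thread-safe `queue.LifoQueue` as the stack. The consumer names and the lead ID are made up for the example, and an in-process queue is only a stand-in for an external service that a webhook would call.

```python
import queue
import threading

stack = queue.LifoQueue()
stack.put(1001)  # one queued lead

results = []


def consumer(name):
    try:
        # get_nowait() removes and returns the item in one atomic operation.
        lead_id = stack.get_nowait()
        results.append((name, lead_id))
    except queue.Empty:
        # Someone else already popped it; there is nothing left to process.
        results.append((name, None))


threads = [
    threading.Thread(target=consumer, args=(name,))
    for name in ("scheduled", "triggered")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

delivered = [r for r in results if r[1] is not None]
print(len(delivered))  # 1: only one consumer ever receives the lead
```

Which consumer wins is not deterministic, but the item can never be handed out twice, because reading it and removing it are the same operation.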

Jon_Wu
Level 4

Seems like without the ability to lock, you basically can't ever implement at-most-once. Outside of Marketo we use Pub/Sub, which has at-least-once delivery as is common with distributed systems, so we have to track each message ID centrally in MySQL with locking to avoid duplicate processing.
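A simplified sketch of that kind of central deduplication, using SQLite in place of MySQL so the example is self-contained, and relying on a primary-key constraint rather than explicit row locks. The `processed` table and the `handle_once` helper are hypothetical names for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per message ID that has already been claimed for processing.
conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")


def handle_once(conn, message_id, handler):
    """Run handler only for the first delivery of a given message ID."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the ID was already recorded
    handler(message_id)
    return True


handled = []
# At-least-once delivery: "m1" arrives twice.
for message_id in ["m1", "m2", "m1"]:
    handle_once(conn, message_id, handled.append)

print(handled)  # ['m1', 'm2']
```

This sketch records the ID before running the handler, so a handler failure means the message is never processed. Recording it afterwards would flip that trade-off and allow duplicates instead.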

It seems like something similar to the list / flow I have in the screenshots in my other post is as close as I'm going to get in Marketo. Thanks for verifying, just wanted to make sure I wasn't missing some expert strategy. It would be kind of nice if Marketo couldn't run a specific SC for the same user more than once in parallel, but that's probably way too complicated in a distributed system.

SanfordWhiteman
Level 10 - Community Moderator

Re: "Seems like without the ability to lock, you basically can't ever implement at-most-once."

Only via a webhook that uses a back end which supports it. And in case of network or local processing errors, you still have to consider the zero case (when the service gives you the result but you mishandle/can't handle it, and it will never give it to you again).
