When retrieving result sets from the /rest/v1/activities.json end-point using a Paging Token, the result set returned by the API appears to be less data than what is in the Marketo UI.
In this scenario, we're going to explore how and why Paging Tokens work when utilizing them in REST API calls.
SCENARIO
Say we wanted to grab all 'Send Email' activities (activityType = 6) that have occurred since 2024-04-12T12:00:00Z, and we're planning to pull these activities every two hours, and increment our paging token every two hours prior to making the /activities.json call to obtain the latest data.
Current Time
Apr 12, 2024, 8:00:00 AM (MST) == 2024-04-12T14:00:00Z
What We Want
1. Retrieve all Send Email activities SINCE Apr 12, 2024, 6:00:00 AM (MST) == 2024-04-12T12:00:00Z
We would start by retrieving our Paging Token with the sinceDateTime parameter = 2024-04-12T12:00:00Z:
/rest/v1/activities/pagingtoken.json?sinceDatetime=2024-04-12T12:00:00Z
{ "requestId": "1607c#14884f3e74e", "success": true, "nextPageToken": "RW6ZK46LKV36BMJ2QARX3BFZHQNHRVPPEW4IZVEELI45V2OH6RCA====" }
/rest/v1/activities.json?nextPageToken=RW6ZK46LKV36BMJ2QARX3BFZHQNHRVPPEW4IZVEELI45V2OH6RCA====&activityTypeIds=6
PROBLEM
We know that, from our Marketing team, that we were expecting ~20,000 email sends since 2024-04-12T12:00:00Z. So why does the result set appear to be missing 5,000 records?
Let's examine this further.
'Send Email' activities aren't committed to the Marketo database until the message has successfully been sent from our mail servers. Prior to that, the email is queued, awaiting it's launch to it's receiving mail server. This can present a delay from the time the email is sent, to when the activity is recorded to the Marketo database.
For example, say there were two campaigns sending 10K emails each:
First Campaign Send: 2024-04-12T12:30:00Z (30 minutes after our sinceDateTime token)
Second Campaign Send: 2024-04-12T13:45:00Z (110 minutes after our sinceDateTime token, 10 minutes prior to when the call was made)
While the first campaign send is most likely accounted for in the database (Send Email activities have been committed), Campaign #2 is still processing and sending out all 10K emails in the background. For this scenario, Marketo has only committed 5K out of 10K email sends to the database at the time the call was made.
Without noticing the missing data, you repeat this process two hours from now, incrementing your sinceDateTime token by two hours as well, assuming you'll get all the new Send Email activities for the past two hours.
This would now lead to you missing the 5K Send Email activities that were still being processed, but not yet committed to the database.
SOLUTION
The nextPagingToken returned in the /activities.json end-point is position based, and can only pull data that has been committed to the Marketo database, which inherently changes as records are being processed.
If, for instance, you made the original /activities.json call at (Apr 12, 2024, 8:00:00 AM (MST) == 2024-04-12T14:00:00Z), and used the same sinceDateTime token 5-10 minutes later (Apr 12, 2024, 8:10:00 AM (MST)), you would now notice that campaign #2 has fully finished it's processing AND committed the 'Send Email' activities to the Marketo database with dateTime stamps less than Apr 12, 2024, 8:00:00 AM (MST).
So, what's the best way to ensure you're retrieving all the data?
Solution #1
Increment your sinceDateTime token based on the MAX(created_at) dateTime returned in the latest result set, rather than incrementing it based on an interval (every two hours). This would help ensure you're not missing records that were processing, but not yet committed to the Marketo database.
Solution #2
If the data you're retrieving does not need to be live, we would recommend using the Bulk Extract API to pull this data at a lower frequency (such as twice per day 12AM & 12PM). This would greatly decrease the chances of missing records that weren't committed to the database.