Data Warehouse ETL

Anonymous
Not applicable

Data Warehouse ETL

Has anyone successful built a data warehouse integration where they are able to replicate Marketo Activities to their warehouse? I am looking to replicate as often as I want but I am wondering the best way to do it. My solution I was thinking was something like:

1) Replicate the "Activity Types" (for us its like 40ish records, assume 50 total) and assign each to a bucket of 10

2) Then for each bucket call the getLeadActivities. If I did it every 5 minutes it would be:

  • Every 5 Minutes
  • 24 * 60 / 5 = 288 times per day
  • 50 Activities in buckets of 10 = 50/10= 5
  • min 2 calls per API call (1 to get token, minimum 1 to get data, depending on number of pages)

Total Api Calls = 5 * 288 * (50/10) * 2 = 14,400 per 24 hours period

To do an incremental refresh, I am thinking of the latest activityDate for each bucket as the starting point for each API bucket. This ensures that I don't miss any events in the replication.

I am hoping to understand how others are doing this? My goals are obviously as fast as possible, with as few api calls, and most efficient.  I have included my python library I am using to call the API. Its not ready to be open sourced but its a start. It uses a generator to help with paging when multiple pages need to be for a single time.

Wrapper Class for Marketo API · GitHub

Thanks,

Brad

2 REPLIES 2
Level 10 - Community Moderator

Re: Data Warehouse ETL

Yep, we're building another large integration right now.  Beyond checkpointing the last activity date, which is the common practice, I think your plan is a bit off. As a general guideline, you want to minimize short-tail batches (huge difference between 1 result and 300 results), and tactics like forcibly splitting your activity types (even if you do not keep all activity records) and always polling every 5m (instead of using an adaptive backoff) can both work against that goal.  Also unsure why you're getting an access token before each API call, as you only need to do this once per hour and/or upon error.  (As we've discussed elsewhere, an access token with 1s left can expire before it's used, so you have to either rotate users on the hour or handle errors; reauth'ing before each call doesn't give an advantage.)

Perhaps the reality is you aren't building a DW but trying to replicate the Marketo database in near-real-time? If so, I don't think that's feasible. Same with SFDC or just about any SaaS database (unless they have an API specifically advertised for replication).  A DW doesn't need to be 5-minute-fresh.  Our SFDC DW refreshes every 6 hours, and even that's more frequent than we'd like, but for compliance reasons we have to grab deleted records before they get purged (not a concern with Mkto).

Anonymous
Not applicable

Re: Data Warehouse ETL

Thanks Sanford for your insights here. The suggestion around minimizing the number of API calls by reusing the token is a good one, and I have implemented that logic to check if the token is expired. On the other hand can you expand on what you mean by

Sanford Whiteman wrote:

As a general guideline, you want to minimize short-tail batches (huge difference between 1 result and 300 results), and tactics like forcibly splitting your activity types (even if you do not keep all activity records) and always polling every 5m (instead of using an adaptive backoff) can both work against that goal.

Would you recommend not batching activity types and instead calling one at a time? The reason I batch them is to minimize the number of API calls. What is the logic you would recommend for determining adaptive backoff?

Thanks!