Marketo Activity Ingestion - Understanding Behaviour of bulk extract

DJ_Erraballi · ‎06-17-2020

Hi there,

Got a multi-part question:

Question #1

I am currently debugging some issues with our Marketo activity feeds. I noticed recently that we received an activity from Marketo that looked like this:

result = {OrderedDict: 8}  
 'marketoGUID' = {str} '149877905'
 'leadId' = {str} '3173901'
 'activityDate' = {str} '2020-06-05T23:08:07Z'
 'activityTypeId' = {str} '1'
 'campaignId' = {NoneType} None
 'primaryAttributeValueId' = {str} '31707'
 'primaryAttributeValue' = {str} 'www.multicare.org/photos/'
 'attributes' = {NoneType} None

This landing page activity didn't have attributes set at all. (This is where I usually obtain the web page URL, the referrer URL if present, etc.) Is this expected to occur? We have been ingesting landing page activities since 2019 for multiple clients and this is the first time we have come across this, so I am wondering if it is something that is likely to occur again.

Currently, not only do we depend on attributes being set, we also depend on 'Webpage URL' being present in the attributes field in order to report properly on data in Marketo.

Question #2

It appears that we are actually missing some data in our extract from Marketo, and we may need to tweak our strategy.

Currently we execute a bulk extract with a start date a couple of minutes before our most recent stored activity. If we look at each extract as a slice, we are guaranteed to cover every possible time slice across our extracts.

Where I am worried: if I query something like this (dummy values):

startAt: 3:00pm

endAt: 4:00pm

at 4:01pm, I would get a set of activities that is DIFFERENT than if I made the same query at 4:16pm. (This is my current working theory for why activities appear to be missing.)

Is it possible that Marketo can add out-of-order activities (with activity dates in a past time range)? If so, is there a recommended buffer to add to our start_at to ensure we don't miss any activities? Also, how long after a time range has elapsed could activities still be added to that time range?

DJ_Erraballi · ‎06-18-2020

Also, it looks like we have some data coming back from the bulk activity export with null activityDates. This is a little strange, since not only had this never happened before June 5, we are only experiencing this data issue with one of the 8 Marketo instances we work with.

DJ_Erraballi · ‎06-18-2020

Here is the raw activity we saw:

'marketoGUID' = {str} 'Education/IntroductionToCorporateCompliance_NonLMS/story.html'
'leadId' = {str} '{"Client IP Address":"REDACTED","Search Engine":"Gmail","Query Parameters":"","Referrer URL":null,"User Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36","Webpage URL":"/
'activityDate' = {NoneType} None
'activityTypeId' = {NoneType} None
'campaignId' = {NoneType} None
'primaryAttributeValueId' = {NoneType} None
'primaryAttributeValue' = {NoneType} None
'attributes' = {NoneType} None

It looks almost as if the activity export generated a completely faulty row: all the fields are null except marketoGUID, which appears to hold the primary attribute value, and leadId, which appears to hold the attributes.
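A row shaped like this is what Python's csv.DictReader produces when a line has fewer fields than the header: the missing columns are filled with None. Below is a minimal sketch of a guard that separates such rows before they reach reporting. It assumes the export is read with csv.DictReader and that the file starts with a header row; `split_valid_and_short` is a hypothetical helper name, not part of any Marketo client.

```python
import csv
import io


def split_valid_and_short(lines):
    """Split DictReader rows into (valid, short) lists.

    `lines` is any iterable of CSV lines whose first line is the header.
    """
    valid, short = [], []
    for row in csv.DictReader(lines):
        # DictReader fills columns missing from a short line with None
        # (its default restval), so a None value marks a truncated row.
        if any(value is None for value in row.values()):
            short.append(row)
        else:
            valid.append(row)
    return valid, short


# Hypothetical three-column sample: the second data line is too short.
sample = "marketoGUID,leadId,activityDate\n1,2,2020-06-05T23:08:07Z\nfoo,bar\n"
valid_rows, short_rows = split_valid_and_short(io.StringIO(sample, newline=""))
```

A literal null in the export (as in the campaignId column) arrives as the string 'null', so it is not mistaken for a missing column by this check.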

DJ_Erraballi · ‎06-18-2020

OK, so both of those separate activities were actually the same row in the file.

I tracked this down to the record containing a Control-K character in the URL, which breaks the CSV parsing:

149877905,3173901,2020-06-05T23:08:07Z,1,null,31707,www.multicare.org/photos/^KEducation/IntroductionToCorporateCompliance_NonLMS/story.html,"{""Client IP Address"":""REDACTED"",""Search Engine"":""Gmail"",""Query Parameters"":"""",""Referrer URL"":null,""User Agent"":""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"",""Webpage URL"":""/photos/\u000BEducation/IntroductionToCorporateCompliance_NonLMS/story.html""}"

149878023,2928924,2020-06-05T23:09:25Z,10,1951,2598,Puget Sound-ES-202005-Essential and Financial Email.Email,"{""Choice Number"":""0"",""Campaign Run ID"":""670"",""Platform"":""Win7"",""Device"":""PC"",""Step ID"":""2530"",""User Agent"":""Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; BRI/1; BRI/2; Zoom 3.6.0; Microsoft Outlook 14.0.7248; ms-office; MSOffice 14)"",""Is Mobile Device"":false}"

DJ_Erraballi · ‎06-18-2020

Sigh. For anyone looking into this, I am still in the process of debugging and working around it, but the gist is that there was a control character present in the primary attribute value.

Python has a string method, splitlines, which is commonly used with Python CSV readers to parse files. splitlines will actually split on the ^K character (vertical tab) as well, which is a fairly under-documented feature. This is likely what caused the parsing of the activity to fail in the way I experienced.

Since this issue is client-side I will close it. (I am assuming there are no guarantees we won't get future files with control characters like this, and that receiving control characters is expected behaviour.)
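The splitlines behaviour described above can be reproduced in a few lines. This is a minimal sketch using a made-up two-row export (the column names and values are illustrative, not real data); it compares feeding csv.reader the output of splitlines against feeding it a text stream opened with newline="", which is what the csv module documentation recommends.

```python
import csv
import io

# Hypothetical two-row export; \x0b is the vertical tab (Ctrl-K).
raw = (
    "marketoGUID,primaryAttributeValue,attributes\n"
    '1,www.example.org/photos/\x0bstory.html,"{""Webpage URL"":""/photos/""}"\n'
    '2,Some Email,"{""Device"":""PC""}"\n'
)

# str.splitlines() treats \x0b as a line boundary, so row 1 is cut in two.
broken = list(csv.reader(raw.splitlines()))

# With newline="" the text layer splits only on \n, \r and \r\n,
# so csv.reader receives each record whole.
intact = list(csv.reader(io.StringIO(raw, newline="")))

print(len(broken))  # header plus three fragments
print(len(intact))  # header plus two rows
```

For a file on disk the equivalent would be passing the object returned by open(path, newline="") straight to csv.reader, rather than reading the whole file and calling splitlines on it.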

SanfordWhiteman · ‎06-18-2020

Yeah, that's pretty weird (and contra the CSV RFC, for what that's worth). The main reason you'd use a VTAB in the first place would be to indicate a new line semantically without using an actual CRLF.

SanfordWhiteman · ‎06-18-2020

By "Control K" do you mean ASCII %0B? Not sure why that would have any effect on your CSV parser.

SanfordWhiteman · ‎06-17-2020

Activities are continually merged from the Anonymous side into the Known side of the database (and keep their original timestamp).

DJ_Erraballi · ‎06-17-2020

Thanks for the quick reply. That does answer question #2.

If data is added in an ongoing fashion, it does create some challenges for ensuring that exported activities and the activities in Marketo match up, especially if reprocessing time ranges eats away at our daily export quotas. But it does make sense, so I think what I will do is increase the start_at buffer to an hour and hopefully that will suck up enough of a percentage of the difference, without having too large of an impact on the quota.
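A larger buffer means each extract re-downloads rows that are already stored, so the overlap needs de-duplication. Here is a minimal sketch of that idea, assuming marketoGUID uniquely identifies an activity; `next_window` and `merge_new` are hypothetical helper names and the one-hour default is just the value mentioned above. It only narrows the gap for activities that arrive shortly after the window closes.

```python
from datetime import datetime, timedelta, timezone


def next_window(last_activity_date, now, lookback=timedelta(hours=1)):
    """Return (start_at, end_at) so the extract overlaps stored data by `lookback`."""
    return last_activity_date - lookback, now


def merge_new(seen_guids, activities):
    """Return activities whose marketoGUID is not yet stored, updating seen_guids."""
    fresh = []
    for activity in activities:
        guid = activity["marketoGUID"]
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append(activity)
    return fresh


# Illustrative values only.
last_seen = datetime(2020, 6, 5, 23, 8, 7, tzinfo=timezone.utc)
run_time = datetime(2020, 6, 6, 0, 0, 0, tzinfo=timezone.utc)
start_at, end_at = next_window(last_seen, run_time)

stored = {"149877905"}
fresh = merge_new(stored, [{"marketoGUID": "149877905"}, {"marketoGUID": "149878023"}])
```

The trade-off is the one described above: a longer lookback catches more late-arriving activities but spends more of the export quota on rows that are then discarded.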

Any ideas on question #1?

Thanks for the help!

SanfordWhiteman · ‎06-17-2020

"... start_at buffer to an hour and hopefully that will suck up enough of a percentage of the difference, without having too large of an impact on the quota."

... except activities can be updated months later.

As for #1, no idea yet, but I'll look into it when I can.

DJ_Erraballi · ‎06-17-2020

Yep, understood. Seems like there isn't going to be an easy way to regularly recapture missed ones without hitting our export quota. One option is to query historic 28-day ranges once per day, but I'm too scared of hitting export quotas for the clients.

Do a best effort, and hope those late-add activities weren't form submissions :/.

SanfordWhiteman · ‎06-17-2020

Yeah, it's almost impossible to solve this. You do have to be aware of it, though: any dashboard is necessarily frozen in time and under some circumstances might be showing you a minority of the activities that would be shown if you re-downloaded later. For example, if you accidentally no-tracked a link, it wouldn't associate people's Munchkin sessions. Then in the future, a tracked link would associate the session and replay all their old activities.

DJ_Erraballi · ‎06-17-2020

Regarding question #1, I actually think this behaviour did change with this release, either intentionally or unintentionally:

https://docs.marketo.com/display/public/DOCS/Release+Notes%3A+June+%2720