Ashburn Data Center Incident: June 4

Document created by AnneMarie Covelli (Employee) on Jun 7, 2019. Last modified by AnneMarie Covelli (Employee) on Jun 14, 2019.
Version 8

This document provides an overview of the service issue that impacted customers in our Ashburn data center on June 4, 2019. If you are unsure whether your instance was impacted, please contact Marketo Support at https://support.marketo.com.

 

When:

June 4, 2019

  • Impact to interactive logins: 12:33 PM PDT - 2:30 PM PDT
  • Impact to remaining services listed below: 12:33 PM PDT - 5:45 PM PDT

 

Duration:

5 hours, 12 minutes
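
The durations follow directly from the times listed under "When". A minimal sketch of the arithmetic, treating the times as plain wall-clock values on June 4, 2019 (all on the same day and in PDT, so no timezone conversion is involved); the variable and function names are illustrative only:

```python
from datetime import datetime

# Times taken from the "When" section (PDT wall-clock values).
start = datetime(2019, 6, 4, 12, 33)            # 12:33 PM, issue begins
logins_restored = datetime(2019, 6, 4, 14, 30)  # 2:30 PM, interactive logins restored
all_restored = datetime(2019, 6, 4, 17, 45)     # 5:45 PM, remaining services restored


def fmt(delta):
    """Format a timedelta as 'H h M min'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60} h {minutes % 60} min"


print(fmt(logins_restored - start))  # 1 h 57 min
print(fmt(all_restored - start))     # 5 h 12 min
```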

 

Service affected:

Interactive logins may have been intermittently disrupted between 12:33 PM PDT and 2:30 PM PDT. The services listed below may have been intermittently or completely impacted for the duration of the issue, 12:33 PM PDT - 5:45 PM PDT.

 

  • The Marketo Sky user interface may have been inaccessible.
  • The REST API may have returned error code 611. A full list of REST API error messages can be found here.
  • For a subset of customers, activities that occurred during the impacted timeframe may not have been fully indexed. This could cause smart lists to show inaccurate data, which may cause campaigns to qualify, or not qualify, records inaccurately.
  • Some triggers were not evaluated even if a lead qualified for a trigger campaign.
  • Batch and trigger campaigns could have failed. Any campaign failures should be visible in your instance's notifications center. Documentation on how to find these notifications can be found here.
  • List imports were queued for a small subset of customers. These queued list imports had to be canceled, and the lists will need to be re-imported. To identify lists that failed to import, right-click the list in the navigation tree and select "Show Import Status". This will bring up the import status dialog box.
  • Bulk exports (leads and activities) were queued due to this incident. Our team resolved this on June 5, 2019, allowing most exports to complete at that time. Some exports may still have failed; any failed export will need to be re-attempted.
  • Intermittent bandaid errors may have appeared in the Marketo user interface.
  • Email tracking links may not have resolved. This would have resulted in:
    • Landing pages being inaccessible when clicking a tracked link.
    • "Click link in email" activities not persisting.
  • Account Based Marketing (ABM) may have returned stale data.
  • Real-Time Personalization (RTP) and Search Engine Optimization (SEO) tiles in the Marketo user interface may have been inaccessible.
  • Facebook and LinkedIn LeadGen may have been inaccessible.
  • Some web / Munchkin activities may not have been recorded.

 

While these services were impacted, the serving of forms and landing pages, as well as the SOAP API, continued to function normally.
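
For integrations that saw REST API error 611 during the window, a retry with backoff is one common way to ride out a transient failure. The sketch below is illustrative only and is not an official Marketo client: `send_request` is a hypothetical caller-supplied function that performs the HTTP call and returns the parsed JSON body, and the code assumes that body carries a `success` flag and an `errors` list whose entries have a `code` field. Treating 611 as retryable is also an assumption made for this example.

```python
import time

# Assumption for this sketch: error code 611 is treated as transient.
RETRYABLE_CODES = {"611"}


def call_with_retry(send_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send_request() until it succeeds or returns a non-retryable error.

    send_request is a hypothetical zero-argument callable that returns the
    parsed JSON response body as a dict.
    """
    body = {}
    for attempt in range(1, max_attempts + 1):
        body = send_request()
        codes = {str(err.get("code")) for err in body.get("errors", [])}
        # Stop on success, or on an error this sketch does not consider transient.
        if body.get("success") or not (codes & RETRYABLE_CODES):
            return body
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # All attempts returned a retryable error; hand back the last response.
    return body
```

A caller would wrap its own request function, for example `call_with_retry(lambda: fetch_leads(page_token))`, where `fetch_leads` is again a placeholder name. Retrying only helps with failed requests; it does not recover activities that were never recorded.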

 

What happened:

The source of the disruption has been identified as an IP address conflict. On June 4, 2019, a new network hardware device was initialized and incorrectly acquired an IP address that was already in use by the load balancer, another network hardware device. This address conflict caused the load balancer to be intermittently unavailable, stopping network traffic. As a result, network requests to internal services such as Locator Service, Metadata Service, and Activity Service could time out, causing the affected Marketo services to be intermittently or completely impacted. Due to the complex nature of the issue, it took longer than normal to identify the source of the problem. Although we were able to identify the network address that was generating errors, the errors appeared and disappeared each time the new device was reinitialized, which produced the intermittent symptoms.
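
To illustrate what an IP address conflict looks like in general terms, the sketch below flags any IP address that is claimed by more than one hardware (MAC) address in a list of observations. It is a generic illustration, not a description of the tooling used during this incident; the function name, the sample addresses (drawn from the 192.0.2.0/24 documentation range), and the MAC values are all hypothetical.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MAC list} for every IP claimed by more than one MAC.

    observations is an iterable of (ip, mac) pairs, for example collected
    from ARP tables. The input format is an assumption of this sketch.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: two devices answer for the same address.
observations = [
    ("192.0.2.10", "02:00:00:00:00:01"),  # existing device (e.g. a load balancer)
    ("192.0.2.11", "02:00:00:00:00:02"),  # unrelated host, no conflict
    ("192.0.2.10", "02:00:00:00:00:99"),  # newly initialized device, same IP
]

conflicts = find_ip_conflicts(observations)
print(conflicts)
# {'192.0.2.10': ['02:00:00:00:00:01', '02:00:00:00:00:99']}
```

When two devices answer for the same address, traffic intended for one can be delivered to the other depending on which answer was seen last, which is consistent with the intermittent behavior described above.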

 

Remediation:

Once the issue was identified, our team took immediate action to resolve it. To restore interactive logins, we implemented a workaround that bypassed the impacted load balancer device. We also began migrating the remaining impacted services to an alternative load balancer. During this secondary process, we discovered the IP address conflict. Once we had identified the definitive root cause, we disabled the network devices and full service was immediately restored.

 

To correct the activity data that was not fully indexed during the impacted timeframe, our team has developed a correction process and has begun running it. Please note that activities that occurred outside of the impacted timeframe were not affected. Due to the large volume of activities, we expect this process to take up to 30 days to complete, with an anticipated end date of July 8, 2019.

 

Facebook and LinkedIn LeadGen data may not have been recorded in Marketo during the impacted timeframe. To correct this, our team completed a data fix process on June 12, 2019, to replay these requests and ensure no data was lost.

 

A small number of Munchkin activities from the impacted timeframe may not have been recorded. Our team cannot reprocess these activities because doing so would create duplicate events for the activities that were recorded.

 

We will continue to update this article, including data fix timelines and processes, as soon as that information is available. Please check this article frequently for updates.

 

For additional questions, please contact Marketo Support.
