How Much Storage is Required to Extract My Entire Marketo Engage Database?

Summary

Estimating the external storage requirements for your Marketo Engage database

Issue

I want to extract all of my data from Marketo Engage and store it. How much storage space will I need?

Solution

Summary

There is no repeatable method to accurately estimate the amount of storage you will need to extract and store your Marketo Engage database. Three things stand in the way of a good estimate: data availability, field selection, and storage method. Any accurate estimate must take into account the potential size of each type of data and the quantity of each (known to data scientists as "facts and dimensions"). Determining ranges for these values takes a lot of preparation and may require a high level of skill.

IMPORTANT NOTE: Estimating database size is hard, so any estimate used to make business decisions should be made in cooperation with a database architect, application architect, or other qualified professional.

Scope

Some information cannot be extracted; information about anonymous leads is one example. Some of the data that can be extracted may not be needed at all. Selecting only the data you need is the best practice, as it reduces the required storage and leads to a more efficient extraction process.

Field Definitions

How fields are defined in the target system affects the size of the stored data. Depending on your storage format, padding may play a role in the size of your extracted database. As an example, the "Country" field in Marketo is a string of up to 255 characters. You could choose to store 255 characters for every country value, or you could choose a format that uses a variable amount of space. You might also know that the longest country name is "the United Kingdom of Great Britain and Northern Ireland" (56 characters), meaning that at least 199 of those 255 characters will always be unused, so you truncate the value from Marketo and store only the first 56 characters. Each choice has an impact on the size of your extracted database. Reserving 199 unnecessary characters per lead, and making similar decisions for other fields, adds up to increased storage requirements and slower extraction times.
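
The Country example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample country values are made up, fixed-width storage is assumed to cost one byte per character, and the variable-width format is assumed to add a 2-byte length prefix per value. Real storage systems differ in all three respects.

```python
# Illustrative comparison of fixed-width and variable-width storage for one field.
MARKETO_COUNTRY_MAX = 255  # maximum length of the Country field, per the article
LONGEST_COUNTRY = "the United Kingdom of Great Britain and Northern Ireland"


def fixed_width_bytes(num_values, width):
    """Every value occupies `width` bytes (assumes 1 byte per character)."""
    return num_values * width


def variable_width_bytes(values, length_prefix=2):
    """Each value occupies its UTF-8 length plus an assumed 2-byte length prefix."""
    return sum(len(v.encode("utf-8")) + length_prefix for v in values)


# Made-up sample data standing in for the Country values of four leads.
countries = ["Canada", "Japan", "Brazil", LONGEST_COUNTRY]

padded_255 = fixed_width_bytes(len(countries), MARKETO_COUNTRY_MAX)
padded_56 = fixed_width_bytes(len(countries), len(LONGEST_COUNTRY))
variable = variable_width_bytes(countries)

print(f"Padded to 255 characters: {padded_255} bytes")
print(f"Padded to 56 characters:  {padded_56} bytes")
print(f"Variable width:           {variable} bytes")
```

Even with four values, the spread between the three choices is large; multiplied across millions of leads and dozens of fields, the field definitions can dominate the estimate.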

Format

Once the desired data is identified, the next step is to extract, transform, and load (ETL) the data from Marketo Engage into your storage system. The data returned by Marketo's API is plain text, usually formatted as JSON or CSV. For the information to be useful, you will transform it from JSON or CSV into the format required by your storage system. That format could be an Excel spreadsheet, a Microsoft SQL database, or a schema-agnostic database like Azure Cosmos DB. How the data is formatted and encoded will make a big difference in the amount of storage needed. Take this simple example: a Microsoft Excel spreadsheet with "Marketo Engage" in cell A1. I saved that same file in four different formats, which resulted in files ranging from 1 KB to 25 KB. The format you store your information in may have a bigger impact on your final storage requirement than the data itself.
Screenshot of Excel files in various formats ranging in size from 1 KB to 25 KB
 
To help illustrate the impact of the storage system, take a look at this guide for Microsoft SQL Database size estimation: https://docs.microsoft.com/en-us/sql/relational-databases/databases/estimate-the-size-of-a-database 
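
The same effect shows up with plain-text formats. The following minimal sketch serializes identical records as JSON and as CSV and compares the byte counts. The records and field names are made up for illustration and are not a Marketo schema; only the Python standard library is used.

```python
import csv
import io
import json

# Made-up lead records; the field names are illustrative only.
leads = [
    {"id": 1, "email": "ana@example.com", "country": "Brazil"},
    {"id": 2, "email": "ken@example.com", "country": "Japan"},
    {"id": 3, "email": "liz@example.com", "country": "Canada"},
]


def as_json_bytes(records):
    """Serialize the records as a JSON array; field names repeat in every record."""
    return json.dumps(records).encode("utf-8")


def as_csv_bytes(records):
    """Serialize the records as CSV; field names appear once, in the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue().encode("utf-8")


json_size = len(as_json_bytes(leads))
csv_size = len(as_csv_bytes(leads))
print(f"JSON: {json_size} bytes, CSV: {csv_size} bytes")
```

JSON repeats every field name in every record, while CSV states them once in the header, so the JSON output is larger for the same data. A database adds its own structures, such as indexes and page overhead, on top of either.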

Functionalization

Once you've extracted your data, what are you going to do with it? Archiving your data (simply storing it) is easiest and comes with the fewest constraints. A compressed archive (Zip file) can save a substantial amount of storage space at the cost of functionality and ease of use. Functionalizing your data (using it in an application) requires more: at a minimum, better speed and searchability, which typically means a relational database. An application will often require additional data, and that will need to be accounted for too.
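
To get a feel for the archiving trade-off, here is a minimal sketch that compresses a made-up CSV export with Python's `zlib` module, which implements DEFLATE, the compression method Zip files commonly use. The data is synthetic and highly repetitive, so the ratio it prints is not a prediction for a real export.

```python
import zlib

# Synthetic CSV export: 1,000 made-up rows with very repetitive content.
header = "id,email,country\n"
rows = "".join(f"{i},user{i}@example.com,Canada\n" for i in range(1000))
raw = (header + rows).encode("utf-8")

# Compress at the highest compression level (9).
compressed = zlib.compress(raw, 9)
ratio = len(compressed) / len(raw)

# The data must be decompressed in full before it can be read or searched.
restored = zlib.decompress(compressed)

print(f"Raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"Compressed size is {ratio:.1%} of the original")
```

The decompression step is the cost mentioned above: a compressed archive cannot be queried in place, so it suits archiving better than application use.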

Facts and Dimensions: Do the Math

It's a lot of work to get to this stage. Once you've determined how your extracted data will be stored, you can set upper and lower bounds on the size of each object type extracted (lead, email, activity, etc.). These are your facts. Then multiply those values by the number of records of each type. These are your dimensions. Finally, add the overhead of your target storage system and its functional requirements to produce your final estimates.
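
The arithmetic described above can be sketched as follows. Every number here is a placeholder: the per-record size bounds, the record counts, and the 1.5 overhead factor are assumptions chosen only to show the calculation, and `ObjectEstimate` and `estimate_range` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class ObjectEstimate:
    name: str
    min_bytes: int  # lower bound on the stored size of one record
    max_bytes: int  # upper bound on the stored size of one record
    count: int      # number of records of this type


def estimate_range(objects, overhead_factor):
    """Return (low, high) total bytes: size bounds times counts, times overhead."""
    low = sum(o.min_bytes * o.count for o in objects)
    high = sum(o.max_bytes * o.count for o in objects)
    return low * overhead_factor, high * overhead_factor


# Placeholder values; substitute bounds and counts from your own analysis.
objects = [
    ObjectEstimate("lead", min_bytes=500, max_bytes=4_000, count=1_000_000),
    ObjectEstimate("activity", min_bytes=100, max_bytes=1_000, count=50_000_000),
]

low, high = estimate_range(objects, overhead_factor=1.5)
print(f"Estimated storage: {low / 1e9:.2f} GB to {high / 1e9:.2f} GB")
```

With these placeholder inputs the range spans roughly a factor of ten, which is typical of what happens when per-record bounds are loose. Narrowing the bounds for the most numerous object type (here, activities) tightens the estimate the most.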
 
 

Environment

Marketo Engage and External Systems
