I have been trying to wrap my head around data quality for a while now. Most instances I have worked on have a data management program with steps to normalize, clean, enrich, etc. Are there any techniques/methods to assess other data quality dimensions, like accuracy, integrity, etc., for the data we have in a Marketo instance?
For example,
We check the validity of email addresses via tools like NeverBounce, check completeness based on values in certain mandatory fields, and so on. How do we check for accuracy, integrity, etc.?
I recommend a working-backwards approach -- start by outlining the ideal state of the data in your instance. Consider the ideal, "in a perfect world" outcomes of the campaigns and objectives your instance supports; talk to your stakeholder teams and/or your CRM team if your instance syncs with a CRM; and set clear standards for what your data should look like.
Once you've done that, you have a reference point to work off of to identify current gaps in your database through analysis. I suggest taking a tiered approach and prioritizing analysis of your most valuable/impactful fields first, because not all fields are necessarily equal and doing this level of analysis across a multitude of fields can be quite a cumbersome effort.
Once you've figured out the priority fields, run some Smart Lists with records that have those fields populated, and do a manual analysis of 5-10 records for each field. Look at the activity logs for those fields within those records. This will help you familiarize yourself with the different ingress points funneling data in -- APIs, CRM, list imports, forms, data staging campaigns. It will also help you highlight any patterns (e.g. why does this data show up consistently in a certain way across records?) and allow you to draw comparisons between these ingress points (e.g. are they overlapping and/or cancelling out one another?).
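The manual spot-check above can be made reproducible by drawing a fixed random sample from each Smart List export instead of eyeballing the first few rows. This is a sketch under the assumption that each export has been loaded as a list of record dicts; the record structure shown is hypothetical.

```python
import random

# Hypothetical records exported from a Smart List for one priority field.
records = [{"id": i, "Company": f"Co {i}"} for i in range(200)]

random.seed(42)  # fixed seed so the same sample can be re-reviewed later
sample = random.sample(records, k=10)  # 5-10 records for manual activity-log review

print([r["id"] for r in sample])
```

Sampling rather than taking the top of the list avoids biasing the review toward whichever source happened to create the newest records.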
As you begin getting clarity around your data sources, that will help you start to evaluate (or at least point you in the right direction) with understanding the reliability of the data. If the source is unrecognized or unknown or old/outdated, there's a likelihood that the data is unreliable. I'd also advise creating some kind of data dictionary or repository in a spreadsheet to store all of this information so that you can go back to cross-reference any outliers/behaviors easily as you're doing your analysis without needing to re-pull your Smart Lists.
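The data dictionary mentioned above can be as simple as a CSV you append to during analysis. This is a minimal sketch; the columns and the example rows are hypothetical suggestions, not a Marketo-prescribed schema.

```python
import csv
import io

# Hypothetical data-dictionary entries captured during manual analysis.
entries = [
    {"field": "Company", "source": "CRM sync", "last_verified": "2024-01-10",
     "notes": "Sometimes overwritten by list imports; check import templates"},
    {"field": "Industry", "source": "form fill", "last_verified": "2023-06-02",
     "notes": "Free-text values; candidate for a normalization campaign"},
]

# Write to an in-memory buffer here; in practice, write to a shared file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["field", "source", "last_verified", "notes"])
writer.writeheader()
writer.writerows(entries)

print(buf.getvalue())
```

Keeping the source and a "last verified" date per field makes it easy to spot the old/outdated sources called out above without re-pulling Smart Lists.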
Once you're clear on the data sources and any shortcomings/concerns around them, it should be more straightforward to know which fields/data sources are secure and reliable, and which warrant triage to improve accuracy and reliability through a one-time cleanup and/or recurring data normalization campaigns.
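The recurring normalization campaigns mentioned above typically map common value variants to one canonical form. This sketch mirrors that idea outside Marketo for a hypothetical country field; the variant map is an assumed example, not an exhaustive list.

```python
# Hypothetical variant map -- extend with the spellings seen in your instance.
COUNTRY_MAP = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
}

def normalize_country(raw):
    """Map common variants to a canonical value; pass unknowns through."""
    key = raw.strip().lower()
    return COUNTRY_MAP.get(key, raw.strip())

print(normalize_country("USA"))     # United States
print(normalize_country("France"))  # France (unknown values are left as-is)
```

In Marketo itself the equivalent would be a smart campaign with Change Data Value flow steps keyed on the same variant list, so the mapping stays documented in one place.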
Hope this helps!
Data sourcing is another dimension to consider and that plays a major role in data maintenance. It's imperative to monitor various sources data is coming from and entering into the database.
For existing data, we can consider data completeness, activity/inactivity, opt-in/opt-out trends, duplication, blocklisting, etc.
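Of the dimensions listed above, duplication is one of the easiest to measure from an export: normalize the email address and count repeats. A minimal sketch, with a hypothetical list of emails standing in for your exported records:

```python
from collections import Counter

# Hypothetical email column pulled from a database export.
emails = ["a@example.com", "B@Example.com", "b@example.com", "c@example.com"]

# Normalize case/whitespace so variants of the same address count together.
counts = Counter(e.strip().lower() for e in emails)
duplicates = {e: n for e, n in counts.items() if n > 1}

print(duplicates)  # {'b@example.com': 2}
```

The same pattern extends to other duplicate keys (e.g. name + company) when email alone isn't a reliable match key.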
Thanks @Ashleigh-Ange - This is helpful!