As a RevOps professional, you may be struggling with SPAM form submissions, keyword matching on job titles to determine personas, or messy open-text fields that make it hard to extract insights from your data. These data categorization challenges hinder segmentation, personalization, and reporting, preventing your team from leveraging your data and making it difficult to send tailored content to your audience.
This blog will show you how to use OpenAI's fine-tuning feature to train models on your custom data to boost the accuracy of SPAM filtering, automate persona classification, and intelligently categorize open-text fields.
Check out the video below to see each of these use cases in action.
We have probably all seen cases where our Marketo forms have been filled out by bots or by people submitting SPAM content. The Google CAPTCHA integration with Marketo is one way to solve this problem, but I've had issues in the past (see images below) where genuine leads were classified as Suspicious and bots were classified as Trusted.
This means that, with the trigger constraint shown below, genuine leads were not making it to the sales team (leading to lost revenue) and bot/SPAM leads were making it through (wasting the team's time).
To compound these issues, CAPTCHA is a black box: there is no insight into why someone was classified as "Trusted" or "Suspicious", and there is no way to improve its performance or train the CAPTCHA model on your own data.
Enter OpenAI's fine-tuning feature, which allows you to train one of OpenAI's models on your own form fill data. Now, when you call a webhook from Marketo that references this fine-tuned model and the result comes back as bot/SPAM, you can remove the lead from the flow of your smart campaign so it does not waste processing and does not end up in front of your sales team.
What's more, as I show in the example below, you can get OpenAI to give you the reason for the classification so you know why a lead was classified a certain way. Then, if you ever have leads being misclassified, you can add them as examples to the training dataset with the correct classification so that the next time similar examples come in they will be classified correctly.
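As a rough sketch (the model id, prompt, and lead tokens here are illustrative placeholders, not my exact setup), the SPAM webhook payload can ask for the verdict followed by a short reason, with enough max_completion_tokens to leave room for the explanation:

{
  "model": "ft:gpt-4o-mini-2024-07-18:your-org:marketo-spam:XXXXXXXX",
  "temperature": 0.2,
  "max_completion_tokens": 60,
  "messages": [
    { "role": "system", "content": "Classify this form fill as SPAM or Genuine, followed by a one-sentence reason." },
    { "role": "user", "content": "Name: {{lead.First Name}} {{lead.Last Name}}, Email: {{lead.Email Address}}, Company: {{company.Company Name}}" }
  ]
}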
I'm sure some (or a lot) of you out there have tried to build lead scoring or persona matching using "Job title contains keyword" logic in Marketo. As you might have seen, this can be fraught with errors: if you are trying to find Chief Operating Officers and you match on job titles containing "coo", you will inadvertently pull in "cook" and "coordinator", which are nowhere close to a COO.
Now, you could use smart lists to help with job title matching so that you can say the title contains "coo" but does not contain "cook" or "coordinator", but this becomes cumbersome and it's hard to predict and exclude every unintended match.
This is why using AI for persona classification is so powerful: the model is smart enough to understand what a job title actually means rather than just which keywords it contains.
The power of building your fine-tuned model on top of OpenAI's broader base model means that even though your training dataset might only contain examples in English and no misspelled examples, the fine-tuned model will still be able to understand titles in different languages or misspelled titles and map them to the correct persona.
P.S. I can't take credit for the idea of using fine-tuning for persona classification; it actually comes from my friend Lucas Machado, who presented on persona matching with AI with me at Summit in 2024.
Having an open text field asking a lead "How did you hear about us?" is a powerful way to gain new insights into how people are finding out about you. You already know about all the channels where you are actively trying to get people's attention, so it is no surprise when these channels show up in the field.
The real value of this open text field is that it shows you all the places people are hearing about you that you are not responsible for and might not even know existed, e.g. YouTube influencers mentioning your product, blogs linking out to your site, or apps listing you as an integration partner.
You would not get all these new insights if you were to offer people a dropdown field of all the channels you know about already.
The challenge with this open text field, where there are an infinite number of possible values, is how you can analyze the data and extract insights. At a small scale, looking at each response individually, categorizing it, and then searching for trends and patterns is the best way to go. However, at a large scale it is not feasible to do this for each response.
This is the type of use case that AI is perfect for. Once you train the model on how you want the open text field to be categorized, e.g. into a "Hear Source" field, you can use this categorized field in your reporting visualizations to see trends and patterns for the leads created in your database over time.
All the same advantages mentioned above for persona matching apply here: the AI model is smart enough to handle misspellings and values given in any language and still map them to the correct "Hear Source" category.
For example, you can now build a stacked bar chart of "Hear Source" with month on the x-axis and lead count on the y-axis. If you had tried to build the same graph using the raw "How did you hear about us" field, there would be so many distinct values in the stacked bars that you wouldn't be able to visually see groups to extract insights from, e.g. "goole", "goggle", and "google" would all be separate segments in the graph rather than being grouped together.
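As a sketch (the category list and the field token here are illustrative, not my exact setup), the messages in the payload for this use case would mirror the persona example shown below:

{ "role": "system", "content": "Categorize the given answer to 'How did you hear about us?' into one of the following Hear Source values: Google, LinkedIn, YouTube, Word of Mouth, Event, Other" },
{ "role": "user", "content": "{{lead.How Did You Hear About Us}}" }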
If you are unfamiliar with using webhooks in Marketo, you can check out the webhook quick start guide to see how to set them up, use different tokens in the payload, and get inspired by some real-world RevOps use cases.
In the webhook for each use case, we make an API request to OpenAI's create chat completion endpoint, referencing the fine-tuned model to achieve the AI categorization we desire. We will be sending the webhook to this URL: https://api.openai.com/v1/chat/completions
The payload of the webhook is where we configure all the settings for the OpenAI chat. We set the: model (the id of our fine-tuned model), temperature (kept low so the answers are consistent), max_completion_tokens (capping the length of the answer), and messages (the system prompt plus the lead's field value as the user message).
Pasting each of the 4 options for the persona value (Csuite, Engineer, Manager, Other) into the OpenAI tokenizer, we can see that each one only uses 2 tokens, which is why the max_completion_tokens parameter is set to 2. This is important to prevent the model from being too verbose and returning extraneous information, especially since we want to send this value into the mapped Marketo field right away without any text formatting or parsing (which is hard to do in Marketo).
{
  "model": "ft:gpt-4o-mini-2024-07-18:telnyx-official:marketo-spam-20250401:BHgA3TLC",
  "temperature": 0.2,
  "max_completion_tokens": 2,
  "messages": [
    { "role": "system", "content": "Your job is to categorize the given job title into one of the following personas: Csuite, Engineer, Manager, Other" },
    { "role": "user", "content": "{{lead.Job Title}}" }
  ]
}
Follow this link to get an OpenAI API key for your account; then, in the headers section of the webhook, we will set the Authorization and Content-Type headers.
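Concretely, the two headers required by the OpenAI API look like this (substituting your own key):

Authorization: Bearer <your OpenAI API key>
Content-Type: application/json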
A typical response from the Chat endpoint is pasted below. In order for us to get the answer from the model we have to dig down to the content parameter using this notation in Marketo: choices[0].message.content. We then map this answer to our desired Marketo field so that we can use it in our smart campaigns or our data analysis.
{
  "id": "chatcmpl-BZaLlt2YOBRXARMzoUpxHGDjS5eng",
  "object": "chat.completion",
  "created": 1747820909,
  "model": "ft:gpt-4o-mini-2024-07-18:telnyx-official:marketo-spam-20250401:BHgA3TLC",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Csuite",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 74,
    "completion_tokens": 2,
    "total_tokens": 76,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  }
}
Preparing the dataset is the most labor-intensive part of the fine-tuning process because it requires you to manually review all the "user" (input) values and write out what the desired "assistant" (output) answer should be. Being as consistent as possible with the "assistant" answers is important because it makes the output of your fine-tuned model more consistent.
The OpenAI fine-tuning feature requires the training data to be in a JSONL file, i.e. a file where each line is a JSON object containing the system, user, and assistant values. You can see what this looks like for each of the 3 use cases above in this Google sheet in the "…Export JSONL" tabs.
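For reference, a single training example sits on one line of the JSONL file in OpenAI's chat fine-tuning format (the job title below is illustrative):

{"messages": [{"role": "system", "content": "Your job is to categorize the given job title into one of the following personas: Csuite, Engineer, Manager, Other"}, {"role": "user", "content": "Chief Executive Officer"}, {"role": "assistant", "content": "Csuite"}]}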
As mentioned in the sections below, you can get the data from Marketo into this JSONL format either by using Google Sheets formulas, as shown in the "…Export JSONL" tabs, or by using a Python script with the "…Export for Python" tabs downloaded as CSV files.
OpenAI recommends having at least 50 examples in your training dataset, with a minimum requirement of 10. If you have well over 50 examples, you can improve performance by splitting your data 80/20 into a "training" JSONL file and a "validation" JSONL file. I will show you how to upload these to the OpenAI fine-tuning GUI later.
The first step in preparing your training dataset is to create a smart list in Marketo of the people who have the relevant fields populated. Then, when viewing the list of people, edit the view so that only the fields you want in the training dataset are included. Lastly, export this list view and import it into a Google Sheet.
In each sheet you will see that there are columns for the system (E), user (F), and assistant (G) values, helper columns (A-D) holding the static JSON fragments that wrap them, and a final "json" column that joins everything together.
You will see that this formula creates the JSON body by interleaving the helper columns A-D with the system, user, and assistant columns:
=CONCATENATE(A2,TOJSONSAFE(E2),B2,TOJSONSAFE(F2),C2,TOJSONSAFE(G2),D2)
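Working backwards from the JSONL format, the helper columns A-D presumably hold the static JSON fragments that wrap the three values, along these lines:

A2: {"messages": [{"role": "system", "content": "
B2: "}, {"role": "user", "content": "
C2: "}, {"role": "assistant", "content": "
D2: "}]}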
The TOJSONSAFE function is used to ensure there are no illegal characters in the system, user, and assistant values that would break the JSON syntax and cause the downstream fine-tuning attempt to fail. This function, which you can copy into "Extensions > Apps Script", removes things like newline characters, double quotes, and backslashes. This removal is especially important for open text fields, where people can type in these illegal characters.
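The original function isn't reproduced here, but a minimal Apps Script sketch matching the described behavior could look like this:

/**
 * Sketch of a TOJSONSAFE custom function: makes a cell value safe to embed
 * inside a hand-built JSON string by stripping the characters that would break it.
 */
function TOJSONSAFE(value) {
  if (value === null || value === undefined) return "";
  return String(value)
    .replace(/[\r\n]+/g, " ") // newlines would break the one-example-per-line JSONL format
    .replace(/["\\]/g, "");   // double quotes and backslashes would break the JSON syntax
}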
In order to export the "json" column as a JSONL file we will: 1) copy the values in the "json" column, 2) paste them into a plain-text editor, and 3) save the file with a .jsonl extension.
If you have enough data and you want to create your fine-tuned model with a training dataset and a validation dataset, you can do steps 1-3 above twice: once with 80% of the rows from the "json" column and again with the remaining 20%.
So long as you know how to run a Python script, it is much easier to create the JSONL file by downloading the "…Export for Python" sheet as a CSV and using this Python script.
This script should be easy to use; the only changes you need to make are to point it at your downloaded CSV file and set the names of the output JSONL file(s).
The Python script will then read each row of the CSV, build the messages JSON for each example, and write out the training (and optionally validation) JSONL file(s), handling all character escaping automatically.
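The linked script is the one to use; as a minimal sketch of the same idea (the CSV file name and column names here are assumptions), it might look like this:

import csv
import json
import random

INPUT_CSV = "persona_export_for_python.csv"  # assumed name of the downloaded sheet
TRAIN_FILE = "training.jsonl"
VALIDATION_FILE = "validation.jsonl"
VALIDATION_SPLIT = 0.2  # set to 0 to put every example in the training file

# Read the exported rows; assumes "system", "user", and "assistant" columns
with open(INPUT_CSV, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Shuffle, then carve off the validation portion
random.shuffle(rows)
split = int(len(rows) * VALIDATION_SPLIT)
validation, training = rows[:split], rows[split:]

def write_jsonl(path, examples):
    with open(path, "w", encoding="utf-8") as f:
        for row in examples:
            example = {"messages": [
                {"role": "system", "content": row["system"]},
                {"role": "user", "content": row["user"]},
                {"role": "assistant", "content": row["assistant"]},
            ]}
            f.write(json.dumps(example) + "\n")

write_jsonl(TRAIN_FILE, training)
if validation:
    write_jsonl(VALIDATION_FILE, validation)

Note that json.dumps takes care of escaping quotes, backslashes, and newlines, which is why no TOJSONSAFE equivalent is needed on this path.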
Now that you have your training JSONL file (and maybe your validation JSONL file), you are ready to go to the OpenAI fine-tuning interface and create your fine-tuned model.
Here in the interface you will select the base model to fine-tune, upload your training JSONL file (and your validation JSONL file if you created one), give the model a suffix so you can identify it later, and review the hyperparameters before clicking "Create".
I recommend sticking to the default settings unless you know what each of the parameters does and you are confident your custom settings would outperform the defaults. If you want to learn more, you can check out the guide, or upload your JSONL files to ChatGPT and ask it to explain the fine-tuning settings and advise on what it would use for your data.
Once you click "Create", the model will take a while to generate, so it is best to go and do something else while you await the email confirming that the fine-tuned model was created successfully.
Then, when you get the email, click on the link to view the fine-tuning results:
Now that you have the "Output Model" id, you can put it in your Marketo webhook to start using the fine-tuned model within Marketo.
Webhooks are a nice, easy way to start using fine-tuned models within Marketo. However, self-service flow steps (SSFS) generally offer a number of advantages over webhooks that make it easier to make requests to OpenAI and improve performance.
One of the performance benefits of SSFS is that they do not have the 30-second timeout limit of webhooks, which can be problematic if OpenAI takes a long time to return the result to Marketo. You can compare webhooks to SSFS using this blog post.
The most important reason you might want to change to an SSFS is if you are trying to populate multiple fields using the webhook. Unfortunately, even if you force OpenAI to return its answer in JSON format with a value for each field, it will return that JSON answer inside the "content" parameter, and we cannot drill any deeper than that in Marketo.
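To illustrate (with hypothetical field values), the multi-field answer comes back as a JSON string trapped inside "content", and Marketo's response mappings cannot reach inside it:

"message": {
  "role": "assistant",
  "content": "{\"persona\": \"Csuite\", \"seniority\": \"Executive\"}"
}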
Unlike tools like Zapier, which provide text formatting/parsing tools and code steps to extract multiple field values, Marketo makes this difficult or impossible. So if you need to populate multiple fields from your fine-tuned model, I recommend taking a look at this blog post and the "GPT Completion" example to see how you can create an SSFS that references your fine-tuned model.
Self-service flow steps do require a lot more effort to set up, so if you only need to populate a single Marketo field and your webhook is performing well (i.e. not timing out), you can stick with webhooks.
If you are comfortable with programming then you might want to check out: