Bulk Processing Using OpenAI Batch API

Tyron_Pretorius
Level 9 - Champion Level 9 - Champion
Level 9 - Champion

This blog post will use the example of bulk formatting numbers in your Marketo instance to E164 format to show how you can use Python code to:

 

  • Transform your data exported from Marketo into the required format
  • Upload it to the OpenAI Batch API with instructions on how to process the data
  • Download the processed data
  • Upload the processed data to Marketo

 

 

The main advantage of using the OpenAI Batch API is that it offers a 50% cost reduction compared to using the synchronous i.e. regular API, which can offer significant cost savings if you have hundreds of thousands of records in your Marketo instance that need to be processed.

 

Another nice feature of the batch API is that it’s consumption is not taken into account in the rate limits of the synchronous API, which means that it can be used to increase the number of requests you are sending to OpenAI if you are running into rate limit issues with the synchronous API.

 

The main idea here is that instead of looping through each row of your CSV of input data and using the synchronous API to make a request for each row, you will:

 

  • Create a JSON object from each row of your CSV containing the system prompt and user message
  • Store all these JSON objects in a JSON Lines (JSONL) file
  • Create a batch job from this JSONL file so that each line will be processed
  • Download the output file containing the result for each line

 

The one downside to using the batch API is that there is a 24hr completion window, you cannot make it any shorter than this and you have no way to speed up the processing of your file. So if you need results quickly then the best approach is to loop through your CSV and make a request using the synchronous API for each row.

 

 

E164 format is an international standard for phone numbers starting with a “+” symbol, the international country code, the area code, and then the number without any non-digit characters present e.g. +18583269202.

 

Most importantly this is the standard required when using Telecoms APIs e.g. the SMS APIs provided by the likes of Telnyx and Twilio, meaning that you will need your numbers in this format if you want to send SMS messages using automation from your Marketo instance.

 

Additionally, having phone numbers in E164 format also allows you to use number lookup APIs , which can be used to verify that a phone number entered in a form is valid and it can tell you information about the phone number such as:

 

  • Geographic information
  • Mobile versus landline
  • Information about the carrier

 

Aside from being able to use these numbers with telecoms APIs it is always good to have consistent formatting for data integrity and for your sales team especially, they will be able to call these numbers without needing to look up what country the person is in and find the extension for that country. If you use a number lookup API then you can also prioritize leads for a sales team if the number entered is a mobile number rather than a landline.

 

 

Now that you know the benefits of having your phone numbers in E164 format you should first of all break your phone number field into a drop down selection for the country e.g. Phone Number - Ext, and then an open text field for the remainder of the number e.g. Phone Number - Base.

 

Tyron_Pretorius_0-1751761436822.png

 

Marketo form set up with hidden “Phone” field

 

Then ask your front end team to concatenate these 2 fields together with Javascript and then store the value in the hidden Phone Number field. Additionally, ask them to implement a validation rule on the Phone Number - Base field pre-submission that will show an error message if someone enters non-digit characters in the field. Alternatively, you can let people enter what they want in the Phone Number - Base field and then clean up the field post submission using Javascript to strip out non-digit characters.

 

Tyron_Pretorius_1-1751761436932.png

 

E164 warning message on a form

 

Once you have worked with your front end team to lock down your Marketo forms you will know that every time a phone number is submitted to your instance it will be in the correct format, which means that it is time to clean up those historic phone numbers!

 

 

The first part of the cleanup process is to find the phone numbers that need to be put in E164 format. If you have never had any formatting rules applied to a form, either to prevent non-E164 number submissions or to clean the number up after submission, then you will most likely need to export every phone number.

 

So the “phone number is not empty” filter will likely be all you need but if you want to start with a smaller list of numbers that will most definitely need E164 formatting then you can search for numbers not starting with a “+” symbol or those containing characters like “(” or “-“.

 

In the list view we will need to export the Id, Email, Phone, and Country fields.

 

Tyron_Pretorius_2-1751761436851.png

 

Finding numbers to be formatted

Tyron_Pretorius_3-1751761436965.png

 

Fields to export for E164 formatting

 

Once you have exported your data as a CSV from Marketo, the next step is to convert this CSV to the JSONL file format accepted by the OpenAI batch API. While you can create this JSONL file manually using a Google sheet, it is tedious, and since you will need to be able to run Python code to create, poll, and retrieve a batch job I recommend using the csv_to_jsonl.py script.

 

This script iterates through every row of the CSV and joins the system prompt and user message together in the correct JSON syntax. If you want a visual representation of what the script is doing to transform the input data to JSONL format then you can check out the 2. JSONL File tab in this Google sheet.

 

You will see that the Phone and Country from each row will be used in the user message. Additionally, the id and email from every row will be concatenated together in the custom_id parameter. As we will see later on when we retrieve the processed batch, this allows us to know which output row corresponds to which person in Marketo and the email address allows us to do a list import to update the phone numbers on each person.

 

Tyron_Pretorius_4-1751761437004.png

 

Creating the JSON Object

 

In order to prevent OpenAI from being too verbose and returning more than the E164 number, we also set the max_tokens parameter for each JSON line to be 12 tokens. We can do this safely because we know the longest E164 number possible is 15 digits and even if we want to keep extensions in the output the OpenAI tokenizer shows we are still well under 12 tokens.

 

Tyron_Pretorius_5-1751761436830.png

 

Max tokens used by an E164 number

Tyron_Pretorius_6-1751761436881.png

 

Storing the Marketo id and email in custom_id

 

An important consideration when running this script is the model you chose and the “Batch Queue Limits” applied to your account for that model. If your OpenAI account is relatively new and you have not made a lot of API calls then this limit can be quite low e.g. 90k tokens per day (TPD). You can find your OpenAI account limits here.

 

Tyron_Pretorius_7-1751761436979.png

 

OpenAI Account API Limits

 

Another constraint is that there cannot be more than 50k requests i.e. lines, in the JSONL file. If you want to read more about the rate limits for the batch API then you can check out the “Rate Limits” section in the docs.

 

To account for this the script will make sure that each JSONL file it creates contains less tokens than the MAX_TOKENS_PER_BATCH value and it will make sure that there are less than 50k lines in each file.

 

 

The first step in using the following 3 scripts is to obtain your OpenAI API key. You will need to insert this in all three of the following scripts.

 

Then to create your OpenAI batch you will need to modify the script to reference the location of the JSONL file you created in the previous step. If you had to create multiple files because the number of tokens or requests required by your input data exceeded your TPD account limit or the 50k request per batch limit then you will have to run this script over successive days to create batches for your multiple JSONL files.

 

You can use the openai_create_batch.py script to create a batch.

 

An example response returned from OpenAI when a batch is created is shown below. You will see that a file id is returned because your JSONL file is now stored in the cloud in your OpenAI account and you will be able to reference this file in future using this id in other API requests. A batch id is also returned and you will use this to poll the status of the batch to see its progression and when it is complete.

 

FileObject(id='file-MUN9BbrWp4f3xFtSftxthb', bytes=8252, created_at=1751335735, filename='batch_0.jsonl', object='file', purpose='batch', status='processed', expires_at=None, status_details=None)

Batch(id='batch_686343384ad08190bba092b41084b50e', completion_window='24h', created_at=1751335736, endpoint='/v1/chat/completions', input_file_id='file-MUN9BbrWp4f3xFtSftxthb', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1751422136, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Blog Post Test'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

 

The openai_poll_batch.py is only 4 lines of code since using the OpenAI library makes this very easy. You will use the batch_id returned in the response when you created the batch in order to poll the status of the batch. The first response pasted below shows what it looks like when the batch is processing, you can see the “status” value is “in_progress” and the “completed” value near the end indicates how many of the lines in the JSONL file have been processed. Note that the output_file_id parameter is empty because the batch has not finished processing.

 

Batch(id='batch_686341e5c6ec81908e93932629afb250', completion_window='24h', created_at=1751335397, endpoint='/v1/chat/completions', input_file_id='file-RDdWicvkgB7x16M87jvb4C', object='batch', status='in_progress', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1751421797, failed_at=None, finalizing_at=None, in_progress_at=1751335399, metadata={'description': 'Blog Post Test'}, output_file_id=None, request_counts=BatchRequestCounts(completed=4, failed=0, total=9))

 

The second response shows what it looks like when a batch has finished processing. You will see that the “status” parameter is now equal to “completed” and the “completed” number matches the “total” number. The output_file_id is now populated, you will use this id in the next step to retrieve your processed batch.

 

Batch(id='batch_686341e5c6ec81908e93932629afb250', completion_window='24h', created_at=1751335397, endpoint='/v1/chat/completions', input_file_id='file-RDdWicvkgB7x16M87jvb4C', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1751335444, error_file_id=None, errors=None, expired_at=None, expires_at=1751421797, failed_at=None, finalizing_at=1751335438, in_progress_at=1751335399, metadata={'description': 'Blog Post Test'}, output_file_id='file-Vwiv9sUhPUDBD9KTNEmKT6', request_counts=BatchRequestCounts(completed=9, failed=0, total=9))

 

The openai_get_batch.py script uses the output_file_id from the poll batch response to download the processed data from OpenAI. It then stores the raw JSONL output file on your local machine and then goes through each line in the file to extract the Id, Email, Phone Number, Input Tokens, and Output Tokens so that they can be stored in a more readable CSV format. The “Total Cost” for each line is also calculated and stored in the output CSV.

 

Tyron_Pretorius_8-1751761436890.png

 

Output columns stored in CSV format

 

You can see what the raw JSONL output from OpenAI looks like in the 3. OpenAI Raw JSONL Output tab of this Google sheet and what it gets transformed to by the script in the 4. OpenAI Formatted Output tab. Then if you want to do a comparison between the original phone number and the E164 number returned by OpenAI then you can use a vlookup formula as shown in the 5. Comparison tab.

 

The email address and the phone number columns are the most important here because we will isolate these and use them to do a list import to update the phone numbers for all these people in Marketo as shown in the 6. Marketo Export tab.

 

 

Now this is the exciting part where you can actually use these E164 formatted numbers with an SMS API to start sending notifications and marketing messages to your Marketo database. Then if you want to implement phone number validation, enrich your database with geographic information, and enhance your lead scoring then you can start using a number lookup API with Marketo webhooks.

1176
0