Clustering products for marketing budget optimization

An ML clustering + budget optimization pipeline to maximize contribution margin

This project is a result of my work with rebuy, a recommerce startup in Berlin. Their marketing and acquisitions team needed a data-driven way to allocate budgets for their marketing campaigns.

As is typical in the ecommerce industry, companies handle thousands of unique products with widely varying margins, stock levels, and price competitiveness. To efficiently allocate their ad budgets across their inventory, I developed a budget allocation tool - a machine learning pipeline that predicts product-level performance, clusters similar items into campaigns, and optimizes marketing spend at the campaign level.

Each campaign would group a set of products in a way as to maximise the profit (contribution margin). The pipeline involved data extraction from their data warehouse Snowflake, EDA, feature engineering, predicting modelling, response curve generation, clustering and constrained optimization.

What I used

  • Snowflake for data warehousing
  • SQL for data extraction
  • HEX notebook - the company tool for data science projects and publishing, works just like Jupyter
  • numpy, sklearn, scipy, plotly, matplotlib - Packages relevant for data science, ML prediction, constrained optimization and plotting

Data source and extraction

The data was stored in the company’s central data warehouse using Snowflake. Using SQL I extracted weekly level data comprising a number or product features including:

  • Google ads performance data: ad costs, impressions, clicks, conversions
  • Product data: sale price, competitor price, contribution margin (CM1), stock availability.
  • Product category data: Manufacturer, Category name etc.
  • Some engineered features like PCI (Price Competitiveness Index): product price vs competitor price.

The result was a nice usable dataframe that was essentially a weekly time series of all products going back 18 months like below (brand names masked for anonymization and only select features shown).

Data frame for ML marketing budget optimization


Preprocessing and exploratory data analysis

Some essential preprocessing and EDA included:

  • Removal of nulls, duplicates and outliers and converting snapshot date to datetime format.
  • Exploring distributions of the different features to get a sense of the range of values and which ones still have outliers.
  • Correlation analysis to find out what features are most correlated to CM1 (and which pairs are correlated among themselves to remove redundant ones and prevent multicollinearity before modelling).


Correlation matrix for ML marketing campaign budget optimization


It became clear that while no feature is highly correlated with CM1 or ROAS

  • CLICKS is highly correlated with AD_COSTS (0.89), IMPRESSIONS (0.8) - which makes sense since more spending means more eyeballs and more clicks, and somewhat highl correlated with CONVERSIONS (0.69), which also makes sense along similar lines.

This means we could only include AD_COSTS among these, which also is necessary since we want to know how much impact AD_COSTS has on CM1.

  • Defining feature and target sets for modelling - with those select features and the target as CM1.

Predictive modelling

The overall logic flow for this was as follows:

1. Engineer new features that represent ad cost-CM1 behaviour of each product

In clustering the products into different buckets, we also want to include features that represent the product’s ad-cost CM1 behaviour - i.e. how each product’s CM1 changes with ad cost. This is done by a “response curve” creation - by fitting a log curve to the ad costs and CM1 value for each product, which will give us 2 “similarity features” - a and b - that come from the log function that represents the response curve for each product.

Why log function? This is a common function used in marketing mix modelling how performance of ads responds to ad costs. Log functions capture the effect of increasing performance with ad costs and then a subsequent “diminishing returns effect” - where initial spend gives big returns but spending more and more does not boost performance as much compared to the initial spend.

For this step, I also needed to generate artificial data of ad costs for each product and predict CM1 for those ad costs, because there was not enough data points for all products.

For the prediction, a RandomForest model was chosen, trained on the available data. RandomForest works well in this case because it can handle non-linear relationships between features (like price, margins, and competition) without requiring strong assumptions. It also performs well on tabular data and gives stable predictions, which is important since these predictions are later used to fit spend–response curves.

2. Preparing the data for clustering

Next up was the preparation of the dataframe that would go into the clustering algorithm. This involved a few steps:

  • Associating the similarity features for each product: After the feature engineering step to get the similarity features a and b, I merge these with the other features selected from the original dataframe.

  • Latest week’s data snapshot: Then I took a “latest snapshot” of the latest week numbers of the products in the cleaned dataframe. This is because we wanted the clustering for the next month’s campaigns to be influenced by the latest week’s data.

  • Include core products:There was also a business requirement to check for some “core products” which was a list of products provided separately. I checked if any of these core products were dropped in the data cleaning stage (due to any missing value in any of the columns mainly) and merged them back into the latest week’s snapshot. This means that there would be some products that go into the clustering algorithm that had some features missing. This is fine because we do want to cluster all the products for the next month’s campaign.

  • Impute missing a and b values in the latest snapshot. : However, we do want to impute missing a and b values in this data set since the clustering should also consider how products behave in their ad-costs-CM1 response curve. And this is available for all products since the a and b values were predicted from historical data and not just the latest week’s data. The imputation is doing using K-Nearest Neighbours- i.e filling in missing a and b values by looking at similar products. The idea is that products with similar characteristics are likely to have similar response curves, so this method suffices in this case.

3. Clustering using KMeans

For the actual clustering, I normalized the features to ensure that no individual feature was influencing the clustering disproportionately, and used the elbow method to find the approximate number of clusters. The elbow method yields a plot of the inertia against the no. of clusters tested. This gives an idea of how closely packed the products are within a cluster (so that different clusters are appropriately separated and distinguishable) for various cluster number choices.


Elbow method for finding number of clusters


The no. of clusters to be chosen is the point at which there is an “elbow” in the graph - a point of somewhat sharp bent, which indicates that the products within each cluster for that many clusters are more closely packed and any more clusters don’t give significant improvements in distinguishing the product clusters. This usually came around to 8 to 10 in the graph but the decision also depended on business needs - how many campaigns the acquisition team wanted to set up in the coming month.

With the number of clusters chosen, a KMeans algorithm was used to cluster the products. KMeans again is a commonly used clustering algorithm - it is simple, fast, and works well on numerical data.

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def cluster_similar_products(df, n_clusters=10):
        feature_df = df.drop(['PRODUCT_ID','CM1','AD_COSTS'], axis=1)

        scaler = StandardScaler()
        normalized_features = scaler.fit_transform(feature_df)
        
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        clusters = kmeans.fit_predict(normalized_features)
        
        df['Cluster'] = clusters
        return df 


Clustered dataframe with products tagged to clusters


Although it doesn’t show the product clusters as distinct in 2 dimensions (since the features used for clustering are many), it helps to see a pairwise plot of 2 metrics for each cluster.


Pairwise plot of products by cluster


Once the clustering is done, I merged the cluster assignments of each product to the weekly snapshot data from the original data frame and aggregated the values of key metrics per cluster for each SNAPSHOT_DATE, to be fit these values to a log curve again.


    weekly_cluster_df = pd.merge(df[['SNAPSHOT_DATE','PRODUCT_ID','CM1','AD_COSTS']], clustered_df[['PRODUCT_ID','Cluster']], on=['PRODUCT_ID'], how='inner')
    weekly_cluster_grouped = weekly_cluster_df.groupby(['SNAPSHOT_DATE','Cluster'])[['CM1','AD_COSTS']].sum().reset_index()


4. Fitting the weekly aggregate data to a log curve

Next, for each cluster, I fit the aggregated AD_COSTS and CM1 values to a log curve. This is so that we can get the “a” and “b” similarity values for each cluster.

    from scipy.optimize import curve_fit

    def log_func(x, a, b):
        return a * np.log(x) + b

    df_cluster_function = pd.DataFrame(columns=['Cluster', 'a', 'b', 'cost'])

    for cluster in range(10):
        weekly_cluster_select = weekly_cluster_grouped[weekly_cluster_grouped['Cluster'] == cluster]
        
        # Prepare data for curve fitting
        x = weekly_cluster_select['AD_COSTS'].values
        y = weekly_cluster_select['CM1'].values
        
        # Perform logarithmic curve fitting
        popt, _ = curve_fit(log_func, x, y)
        a, b = popt
        mean_ad_cost = x.mean()
        
        new_row = pd.DataFrame([{'Cluster': cluster, 'a': a, 'b': b, 'cost': mean_ad_cost}])
        df_cluster_function = pd.concat([df_cluster_function, new_row], ignore_index=True)


And using these a and b values for each cluster, I optimized a set budget across the clusters.

5. Budget allocation across clusters

The most important step in the project - the actual budget allocation - was essentially an optimization problem. The goal was to distribute a fixed total budget across clusters to maximize expected CM1 (minimize the negative of expected CM1) based on the fitted response curves. The constraints were

  • each cluster has minimum and maximum spend limits based on past data.
  • the total of minimum budgets should not exceed total budget and
  • sum of maximum budgets should be at least the total budget


    from scipy.optimize import minimize

    def objective(budgets, a_values, b_values):
        return -np.sum([log_func(budget, a, b) for budget, a, b in zip(budgets, a_values, b_values)])

    def constraint(budgets):
        return np.sum(budgets) - total_budget

    # Example data
    np.random.seed(42)  # for reproducibility
    num_campaigns = 10
    total_budget = 20000

    a_values = df_cluster_function['a'].values
    b_values = df_cluster_function['b'].values

    # # Define min and max boundaries for each campaign
    min_budgets = df_cluster_function['cost'].values * 0.5
    max_budgets = df_cluster_function['cost'].values * 3


    # Ensure that the sum of min budgets doesn't exceed total budget
    # and the sum of max budgets is at least the total budget
    while np.sum(min_budgets) > total_budget or np.sum(max_budgets) < total_budget:
        min_budgets = np.random.uniform(100, 500, num_campaigns)
        max_budgets = np.random.uniform(1000, 2000, num_campaigns)

    # Initial guess: equal distribution, but respecting min/max bounds
    initial_budgets = np.clip(np.ones(num_campaigns) * (total_budget / num_campaigns), min_budgets, max_budgets)

    # Adjust initial guess to meet total budget constraint
    initial_budgets *= total_budget / np.sum(initial_budgets)

    # Optimization
    constraints = {'type': 'eq', 'fun': constraint}
    bounds = [(min_budget, max_budget) for min_budget, max_budget in zip(min_budgets, max_budgets)]

    result = minimize(objective, initial_budgets, args=(a_values, b_values), 
                    method='SLSQP', bounds=bounds, constraints=constraints)

    optimized_budgets = result.x


The SLSQP (Sequential Least Squares Programming) method was chosen for the optimization because SLSQP, a method that can handle non-linear relationships and constraints. The result is an optimized budget allocation for the different clusters that together maximizes the CM1.


CM1 Budget Response Curves per cluster


6. Merging small impact clusters

There may be some clusters which are small in their impact - small budget allocated and small resulting CM1 compared to the others. They may also have fewer products. From a business perspective, it makes sense to merge these together into a single cluster.

With these new clusters, the optimization is done again to get the optimum budgets allocated to the clusters.


ML Marketing Campagin Budget Optimization Results


ML Marketing Campagin Budget Optimization Results Final Output


As a last step, products which have no data are bucketed in a “Cluster 100” so that the acquisition team can filter them out as needed.

Results & Findings

Through this work, I was able to help rebuy take a data-driven approach to allocating their marketing budgets for paid acquisition campaigns by clustering products into different buckets. The clustering was done in a way that took into account the products’ historical performance data and their behaviour in how CM1 varies with ad costs. The actual allocation across clusters was done through constrained optimization with the objective of maximising the collective CM1.

In terms of impact, I wasn’t able to measure a clean causal uplift (since there was no controlled experiment and implementation wasn’t fully observable). However, looking at the data before and after this approach was introduced, performance became noticeably more stable, with a large drop in week-to-week ROAS volatility. This helped the team have more control over how budget was being distributed, even during periods with lower overall spend and without major promotional events.

Another useful takeaway is that this approach helps shift the focus from chasing short-term spikes (e.g. Black Friday peaks) to maintaining consistent, controlled performance over time. In that sense, the main value of the project was less about one-off gains and more about creating a reliable system for ongoing budget decisions.

Limitations & Assumptions

There are a few important caveats to keep in mind when interpreting the results.

First, this is not a controlled experiment. The evaluation is based on a before–after comparison, and the two periods are not directly comparable. The pre-period includes major seasonal peaks (like Black Friday and year-end demand), while the post-period does not, which naturally affects metrics like ROAS and CM1.

Second, the clusters themselves change over time. Since products are re-clustered regularly, it doesn’t make sense to compare cluster-level performance across months. Because of that, all evaluation is done at an aggregated level rather than trying to track individual clusters.

Third, I didn’t have visibility into how closely the recommended budgets were followed in practice. The results reflect what actually happened in the data, not a perfect execution of the optimized allocation.

Finally, the spend–response curves are based on modeled relationships. They assume that the relationship between spend and CM1 is relatively stable in the short term, which is a simplification. In reality, things like competitor pricing, stock changes, and demand shifts can all affect performance.

For a simpler example of audience clustering applied to web visitors — using kMeans to segment visitors into behavioural personas — see the audience segmentation project.