Customer Value Model (K-means)

Jan 7, 2025

6 min read

Click here to use the template

Keywords: #pandas data processing, #K-means clustering, #LRFCM model

In the information age, enterprises have shifted their marketing focus from products to customers, aiming to maximize corporate profits. Accurately classifying customers and developing personalized service plans based on classification results helps optimize the allocation of marketing resources.

In the airline marketing industry✈️, various airlines attract more customers by offering favorable marketing strategies. Establishing a customer value assessment model to segment customers, analyze and compare the value of different groups, and develop corresponding marketing strategies to provide personalized services to different customer groups is both important and effective. This experiment, based on membership files and flight record data from an airline company from 2012 to 2014, aims to achieve the following.

Objectives:

◼︎ Classify airline customers using customer data.

◼︎ Analyze characteristics of different customer types and compare customer value.

◼︎ Develop personalized services and marketing strategies based on customer value.

The goal of this experiment is to identify customer value,

using the RFM model (Recency, Frequency, Monetary) to segment customers and identify high-value customers. However, due to fluctuating ticket prices, the value of customers spending the same amount may differ for the airline.

For example, customers who purchase short-haul, high-class tickets are more valuable than those who buy long-haul, low-class tickets. Therefore, this experiment replaces monetary value with Indicator 1 "Accumulated Flight Mileage (M)" and Indicator 2 "Average Discount Coefficient (C)". Additionally, Indicator 3 "Length of Membership (L)" also affects customer value, so this indicator is included in the model.

We construct the LRFMC model with the five indicators L, R, F, M, and C. Traditional classification methods cannot effectively reduce classified customer groups, which increases marketing costs. We choose a clustering algorithm to identify customer value and use K-means clustering to process the five indicators of the LRFMC model to identify the most valuable customers.

Data Analysis Process:

◼︎ Selectively sample data to form historical data and incremental data.

◼︎ Perform exploratory analysis and preprocessing on the sampled data, including data cleaning, attribute reduction, and transformation.

◼︎ Use the processed data for customer segmentation based on the LRFMC model, perform characteristic analysis, and identify valuable customers.

◼︎ Provide customized services and marketing strategies to customers of different values based on the model results.

Step 1: Initial Data Exploration

(1) Step Overview: Create data tables based on the demonstration and then analyze them as follows.

(2) Business Overview:

⭕️Data Extraction

For data extracted from the airline up to March 31, 2014, include customer basic information, flight information, and loyalty points details. A total of 62,988 records are extracted, containing 44 fields such as membership number, membership date, gender, age, membership level, city of workplace, province of workplace, country of workplace, observation window end time, observation window flight points, flight kilometers, flight times, flight duration, points interval, and average discount rate.

⭕️Data Analysis

For the raw data, we perform exploratory analysis to identify missing values and anomalies, and analyze data patterns and outliers.

In our data:

◼︎ Some ticket price attributes are empty.

◼︎ The minimum ticket price is 0, the minimum discount rate is 0, and the total flight kilometers are greater than 0.

Empty ticket prices might be due to no flight records for some customers, while other zero values could be due to zero-discount tickets or point redemptions. How to handle these depends on their proportion in the total data.

▼ First, after upload the file , then find in each attribute: the number of missing values, maximum values, and minimum values.

prompt:

Read all fields from / En_air_data.xlsxEn_air_data.xlsx

Conduct descriptive statistical analysis on the data table.

Display the results in a table format

Transpose the output for better readability

Save the resulting table locally, named "explore"

Step 2: Data Cleaning and Attribute Reduction

【Data Cleaning】 involves handling abnormal data. Here we discard non-compliant data since anomalies account for only a minor proportion.

【Attribute Reduction】 involves selecting appropriate fields for customer value modeling.

⭕️Data Cleaning:

The case mainly uses data cleaning, attribute reduction, and data transformation for preprocessing. Based on the LRMFC model for customer segmentation, we need to filter and transform 44 attributes to form our required indicator list. Data with missing values need to be cleaned.From the initial analysis, missing values and anomalous records with the minimum ticket price and discount rate of 0, yet with non-zero flight kilometers, were found. Since these anomalies are minor, they are discarded.

Specific handling methods:

Discard records with empty ticket prices.

Discard records with a minimum ticket price of 0, minimum discount rate of 0, and total flight kilometers greater than 0.

Delete rows meeting the cleaning criteria.

⭕️Attribute:

From the original data, select 6 attributes related to the LRFMC model: FFP DATE, LOAD TIME, FLIGHT COUNT, AVG DISCOUNT, SEG KM SUM, LAST TO END. These correspond to membership date, observation window end time, flight count in the observation window, average discount rate, total flight kilometers in the observation window, and the time from the last flight to the end of the observation window.

Delete unrelated, weakly related, or redundant attributes.

Ensure to create a blank data table named "data_cleaned" before starting.

prompt:

Read the / En_air_data.xlsx file and perform data cleaning with the following steps:

Discard records where ticket prices are empty (two ticket price fields: "SUM_YR_1" and "SUM_YR_2").

Discard records where the minimum ticket price is 0, or the minimum discount rate "avg_discount" is 0, while the total flight kilometers "SEG_KM_SUM" is greater than 0.

Discard all rows that meet the cleaning criteria, i.e., delete those rows.

Then, perform field selection by retaining only the following fields:

"FFP_DATE", "LOAD_TIME", "avg_discount", "FLIGHT_COUNT", "SEG_KM_SUM", "LAST_TO_END".

Save the resulting table as "data_cleaned.csv".

Step 3: Data Transformation

Data transformation means converting data into an appropriate format. Based on the LRFMC model, extract data as follows:

(1) L = LOAD_TIME - FFP_DATE (Months from membership to the observation window end)

(2) R = LAST_TO_END (Months from the last flight to the observation window end)

(3) F = FLIGHT_COUNT (Number of flights within the observation window)

(4) M = SEG_KM_SUM (Total flight kilometers within the observation window)

(5) C = AVG DISCOUNT (Average discount rate within the observation window)

After extracting the 5 indicators, analyze the data distribution for each indicator. Due to large differences in the value ranges of the 5 indicators, standardize the data to eliminate scale effects using the standard deviation method, resulting in ZL, ZR, ZF, ZM, ZC attributes.

prompt:

Read data_cleaned.csv and standardize the fields "LOAD_TIME-FFP_DATE," "LAST_TO_END," "FLIGHT_COUNT," "SEG_KM_SUM," and "AVG DISCOUNT" using standard deviation. After standardization, generate the five attributes "ZL, ZR, ZF, ZM, ZC."

The specific calculations are as follows:

(1) L = LOAD_TIME - FFP_DATE: The number of months from membership enrollment to the end of the observation window = Observation window end time - Membership enrollment time [Unit: months]

(2) R = LAST_TO_END: The number of months from the customer's last flight to the end of the observation window = Time from the last flight to the end of the observation window [Unit: months]

(3) F = FLIGHT_COUNT: The number of times the customer flew with the company during the observation window = Number of flights during the observation window [Unit: times]

(4) M = SEG_KM_SUM: The total flight kilometers accumulated by the customer within the observation time = Total flight kilometers during the observation window [Unit: kilometers]

(5) C = AVG DISCOUNT: The average discount factor corresponding to the cabin class flown by the customer during the observation time = Average discount rate [Unit: none]

After the standardization process, save the file as data_stand.

Step 4: K-means Clustering Analysis

⭕️ Model Building

Construct the customer value analysis model in two parts:

Cluster customers based on the 5 LRFMC model indicators.

Analyze and rank customer value for each cluster based on business context.

⭕️Customer Clustering

Use the K-means clustering algorithm, with the number of clusters determined by specific business needs. In this experiment, cluster customers into 5 groups. Implement the K-means algorithm using the sklearn library.

prompt:

Read data_stand.csv and use the fields "ZL, ZR, ZF, ZM, ZC" to apply the K-Means clustering algorithm, dividing the samples into 5 clusters (i.e., using n_clusters = 5). You can use the sklearn library. Finally, save the clustering results as a table file, stored as K-means locally (clear the original table before writing).

Step 5: Plotting a Radar Chart

prompt:

Read K-means.csv and draw an aesthetically pleasing radar chart. Ensure the final visualization has appropriate and visually appealing proportions, and it should be one radar chart.

Create a corresponding cluster feature table where the "Cluster" values are 0-4, representing the five clusters. "ZL, ZR, ZF, ZM, ZC" are the cluster feature attributes. Process as follows:

Integrate the original data with cluster labels.

Calculate the sample size and sample proportion (to two decimal places) for each cluster category.

Calculate the most significant feature values within each cluster category:

For numerical data, extract the mean value, rounded to two decimal places.

For string data, extract the most frequently occurring value (mode).

Combine the above feature values to represent the typical characteristics of each cluster category.

Output the complete cluster feature information by combining the sample size, sample proportion, and significant feature information for each cluster. Display the main features of each cluster category in a table format.

Ensure the output includes the significant feature values for each cluster, not the original data. The fields should be "Cluster, Sample Size, Proportion, ZL, ZR, ZF, ZM, ZC."

Save the resulting file as Cluster_Analysis_Feature_Table (clear the original table before writing).

Radar chart formatting:

Ensure the text does not overlap, and numerical labels on the chart can be omitted.

Use different colors to fill different clusters, with 50% transparency.

prompt:

Read Cluster_Analysis_Feature_Table.csv and process the data row by row according to the "Cluster" column. For example, if Cluster=1 has ZL, ZR, ZF, ZM, ZC values of -0.72, -0.36, -0.28, -0.27, -0.18, the values from smallest to largest map to "L, R, F, M, C" as "Lowest, Lower, Average, Higher, Highest." Process all clusters in this manner.

Form a table with fields "Cluster, Sample Size, Proportion, ZL, ZR, ZF, ZM, ZC, L, R, F, M, C."

Save the processed data to Customer_Value_Table locally

(clear the original table before writing).

prompt:

Read Customer_Value_Table.csv and classify the customer value based on the fields "L, R, F, M, C" corresponding to the situation:

Data Explanation

L: Months of membership

F: Number of flights taken by the customer

M: Total flight kilometers by the customer

C: Average discount rate for the customer's travel class

R: Time since the customer's last flight

Cluster: Represents the customer's classification

Classify customers as:

General Customers: Cluster = 1

Important Developing Customers: Cluster = 2

Important Retained Customers: Cluster = 4

Low-Value Customers: Cluster = 0

Important Retained Customers: Cluster = 3

Analyze each Cluster, and after classification, update the original table based on the data explanation. Remove the fields "ZL, ZR, ZF, ZM, ZC" and add a new column to fill in the customer type corresponding to the above Clusters. Save the updated data to Customer_Value_Types (clear the original table before writing).

About Bayeslab

Bayeslab: Website

The AI First Data Workbench

X: @BayeslabAI

Documents:

https://bayeslab.gitbook.io/docs

Blogs:

https://bayeslab.ai/blog

Bayeslab makes data analysis as easy as note-taking!

Bayeslab makes data analysis as easy
as note-taking!

Start Free