Analysis objective
Based on user data and consumption behavior data, use Bayeslab to build a logistic regression classification model that predicts which customer groups are more likely to use coupons.
Data overview analysis
Data Preview
Indicator explanation
ID: record code
age: customer age
job: occupation
marital: marital status
default: whether the customer has ever defaulted on Huabei
returned: whether the customer has ever returned goods
loan: whether Huabei has been used for payment
coupon_used_in_last6_month: number of coupons used in the past six months
coupon_used_in_last_month: number of coupons used in the past month
coupon_ind: whether a coupon was used in this event
prompt:
Check and display basic information about the data table
prompt:
Check if the data has any missing values
Checked the general information of the data and confirmed that there are no missing values.
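A minimal sketch of this check in pandas, assuming the table is loaded into a DataFrame named df. The rows below are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-in for the coupon table (hypothetical values, not the real data).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "age": [25, 41, 58],
    "job": ["management", "technician", "blue-collar"],
    "coupon_ind": [0, 1, 0],
})

# Column names, dtypes and non-null counts.
df.info()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True if any column contains at least one missing value.
has_missing = bool(missing_per_column.any())
print("any missing values:", has_missing)
```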
Data cleaning
Categorical variable
Categorical variables are converted into numerical variables to make later analysis easier.
In this case, only the default, returned, and loan variables are converted; job and marital are kept as they are.
The default, returned, and loan variables are extracted and one-hot encoded with get_dummies().
prompt:
Extract the variables default, returned, and loan separately for dummy variable processing using get_dummies(), and concatenate the result with the original table, then save it in the table.
prompt:
Delete the columns 'ID', 'default', 'default_no', 'returned', 'returned_no', 'loan', 'loan_no', rename coupon_ind to flag, and save the final result in the table.
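The two prompts above could translate into pandas roughly as follows. This is a sketch under assumptions: the three variables are assumed to hold the strings "yes" and "no" (which is what the dummy column names default_no, returned_no, and loan_no suggest), and the rows are made up.

```python
import pandas as pd

# Toy stand-in rows (hypothetical); yes/no string values are an assumption.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "age": [25, 41, 58, 33],
    "default": ["no", "yes", "no", "no"],
    "returned": ["yes", "no", "no", "yes"],
    "loan": ["no", "no", "yes", "no"],
    "coupon_ind": [0, 1, 0, 1],
})

# One-hot encode the three binary variables and append the result to the table.
binary_cols = ["default", "returned", "loan"]
dummies = pd.get_dummies(df[binary_cols], dtype=int)
df = pd.concat([df, dummies], axis=1)

# Drop the ID, the original columns and the redundant *_no dummies,
# then rename the target column to flag.
drop_cols = ["ID", "default", "default_no", "returned", "returned_no", "loan", "loan_no"]
df = df.drop(columns=drop_cols).rename(columns={"coupon_ind": "flag"})

print(df.columns.tolist())
```

Keeping only the *_yes columns avoids carrying two perfectly collinear columns per variable into the regression.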
prompt:
display the data information
Univariate analysis
Observe the balance between classes 0 and 1
In binary classification problems, we typically have two classes, represented here by 0 and 1. The ideal class distribution is balanced, meaning the number of samples in both classes is roughly equal. If one class has significantly more samples than the other, this leads to class imbalance, which can affect the model's generalization ability and prediction accuracy.
In binary classification problems, even for the minority class, its proportion in the total sample should not be less than 5%. This is an empirical rule to ensure that there are enough samples for the minority class to train the model, allowing it to learn features about the minority class.
prompt:
For the sample flag, show the proportion of two categories.

In practice, the proportions of 0 and 1 should be reasonably balanced, and the minority class should make up at least 5% (0.05) of the samples; otherwise the model's predictions will be affected.
In this dataset the proportions of 0 and 1 are both above 0.05, so the distribution is reasonable.
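A small sketch of the proportion check, assuming flag is a pandas Series of 0/1 values. The values below are hypothetical, and the 0.05 threshold is the rule of thumb stated in the text.

```python
import pandas as pd

# Toy flag column (hypothetical): seven 0s and three 1s.
flag = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="flag")

# Share of each class in the sample.
proportions = flag.value_counts(normalize=True)
print(proportions)

# Rule-of-thumb check: the minority class should be at least 5% of the sample.
minority_share = float(proportions.min())
is_balanced_enough = minority_share >= 0.05
print("minority share:", minority_share, "ok:", is_balanced_enough)
```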
Observe the magnitude of the mean value
prompt:
Group by flag and aggregate, calculate the mean of each other field. Ensure that all other indicators are numerical.
For 0/1 variables, the group mean equals the share of 1s in that group, so comparing the means shows how the variable is distributed across flag:
The mean of coupon_used_in_last_month is 0.26 when flag is 0 and 0.53 when flag is 1, indicating that customers who used coupons last month are more likely to use them again
The means of default_yes and loan_yes are both greater when flag is 0 than when flag is 1, suggesting that customers who defaulted on Huabei or paid bills using Huabei have a lower probability of using coupons in the following period
The mean age is 40.8 when flag is 0 and 41.8 when flag is 1; the difference is small, so age does little to distinguish the two groups
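The grouped means can be computed along these lines. This is a sketch on made-up rows; numeric_only=True is used here so that text columns such as job are left out of the aggregation.

```python
import pandas as pd

# Toy stand-in rows (hypothetical values).
df = pd.DataFrame({
    "flag": [0, 0, 0, 1, 1],
    "age": [40, 42, 41, 43, 41],
    "coupon_used_in_last_month": [0, 1, 0, 1, 0],
    "job": ["management", "technician", "blue-collar", "management", "technician"],
})

# Mean of every numeric column within each flag group.
group_means = df.groupby("flag").mean(numeric_only=True)
print(group_means)
```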
Visualization
prompt:
Draw a chart to observe the distribution of returned_yes on the flag

Customers who have returned items are less likely to use coupons than those who have not. One possible reason is that they forget to use the coupon.
prompt:
Draw a chart to observe the distribution of marital on flag
Distinguish the flag.

The probability of married customers using coupons is slightly higher than that of unmarried and divorced customers using coupons.
The probability of married people not using coupons is also higher than that of unmarried people not using coupons.
However, the probability of all three groups not using coupons is much higher than that of using coupons.
prompt:
Draw a graph to observe the job distribution on the flag

Customers whose job is management, technician, or blue-collar were found to be more likely to use coupons
prompt:
Draw a graph to observe the age distribution on the flag

A small number of extreme values were found for ages over 60, and they distort the overall distribution. These records are presumed to be erroneous, so they are excluded from the analysis
prompt:
Excluding age > 60, group age into bins (<20, <40, <60) to explore the influence of each age group on flag
drop >60

The data shows that customers younger than 20 are more likely to use coupons
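One way to sketch the age grouping is with pd.cut. The rows are hypothetical, and the exact bin edges are an assumption: the prompt only gives <20, <40, <60, so the last bin here is closed at 60 to keep customers aged exactly 60.

```python
import pandas as pd

# Toy stand-in rows (hypothetical values).
df = pd.DataFrame({
    "age": [18, 19, 25, 35, 45, 55, 60, 61, 70],
    "flag": [1, 0, 0, 1, 0, 0, 1, 1, 0],
})

# Drop the records with age above 60 before grouping.
df = df[df["age"] <= 60].copy()

# Left-closed bins: [0, 20), [20, 40), [40, 61). Edges and labels are assumptions.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 20, 40, 61],
    right=False,
    labels=["<20", "20-39", "40-60"],
)

# Share of coupon users (flag == 1) in each age group.
usage_rate = df.groupby("age_group", observed=False)["flag"].mean()
print(usage_rate)
```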
Correlation and visualization
prompt:
Draw the correlation heat maps of all fields except job and marital (excluding rowid). Use blue for the image color

flag is strongly positively correlated with coupon_used_in_last_month and age
flag is strongly negatively correlated with coupon_used_in_last6_month and returned_yes
The correlation between the other variables and flag is not obvious, so to keep the analysis accurate they are not interpreted further
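The numbers behind such a heat map come from a correlation matrix. A minimal sketch, assuming Pearson correlation (the pandas default) and made-up rows; drawing the heat map itself is left to a plotting library.

```python
import pandas as pd

# Toy stand-in rows (hypothetical values).
df = pd.DataFrame({
    "flag": [0, 0, 1, 1, 0, 1],
    "coupon_used_in_last_month": [0, 0, 1, 1, 0, 1],
    "returned_yes": [1, 1, 0, 0, 1, 0],
    "age": [30, 45, 28, 52, 39, 41],
    "job": ["management", "technician", "blue-collar", "management", "technician", "blue-collar"],
    "marital": ["married", "single", "married", "divorced", "single", "married"],
})

# Correlation matrix of all fields except job and marital.
corr = df.drop(columns=["job", "marital"]).corr()

# Correlation of every other field with flag, strongest positive first.
flag_corr = corr["flag"].drop("flag").sort_values(ascending=False)
print(flag_corr)
```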
Establishment and evaluation of logistic regression model
Model establishment
prompt:
Set the independent variables as ['coupon_used_in_last_month', 'returned_yes', 'loan_yes'], and the dependent variable as ['flag']. Call the sklearn module to randomly split the training set and test set (in a 7/3 ratio). Then fit using logistic regression and display the model coefficient results.
prompt:
use auc to evaluate the model and give me the model score roc_auc
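The two prompts above correspond roughly to the following scikit-learn steps. This is a sketch on synthetic data: the generating coefficients, sample size, and random seeds are made up for illustration, so the printed coefficients and AUC will not match the values discussed in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); the real table is not reproduced here.
rng = np.random.default_rng(0)
n = 1000
features = ["coupon_used_in_last_month", "returned_yes", "loan_yes"]
X = pd.DataFrame(rng.integers(0, 2, size=(n, 3)), columns=features)

# Made-up generating coefficients, used only to give the toy labels some signal.
logit = (-1.0 + 0.8 * X["coupon_used_in_last_month"]
         - 0.9 * X["returned_yes"] - 0.5 * X["loan_yes"]).to_numpy()
prob = 1 / (1 + np.exp(-logit))
y = pd.Series((rng.random(n) < prob).astype(int), name="flag")

# Random 7/3 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the logistic regression and show one coefficient per feature.
model = LogisticRegression()
model.fit(X_train, y_train)
coefficients = pd.Series(model.coef_[0], index=features)
print(coefficients)

# ROC AUC on the test set, computed from the predicted probability of flag == 1.
test_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_scores)
print("roc_auc:", round(auc, 3))
```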
When coupon_used_in_last_month changes from 0 to 1, the odds of using a coupon (the probability of using one divided by the probability of not using one) are multiplied by e^0.41, which is about 1.5 times the odds of other customers.
For returned_yes, the odds of using a coupon for customers who have returned goods are 0.41 times those of customers who have not.
For loan_yes, the odds of using a coupon for customers who paid with Huabei are 0.63 times those of customers who did not.
Therefore, from a probabilistic perspective, customers who used coupons last month, customers who have not returned goods, and customers who did not pay with Huabei are more likely to use coupons again.
However, the model's ROC AUC is 0.67. A good model generally scores between 0.7 and 0.8, so this model should be adjusted.
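The e^0.41 figure quoted above can be checked directly. The sketch below also shows how an odds ratio maps to a change in probability; the 20% baseline probability is a hypothetical value chosen only for illustration.

```python
import math

# Coefficient for coupon_used_in_last_month, as quoted in the text.
coef = 0.41

# exp(coefficient) is the odds ratio for a one-unit change in the variable.
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))

# Hypothetical baseline: a customer with a 20% probability of using a coupon.
p0 = 0.20
odds0 = p0 / (1 - p0)

# Multiply the odds by the odds ratio, then convert back to a probability.
odds1 = odds0 * odds_ratio
p1 = odds1 / (1 + odds1)
print(round(p1, 3))
```

The odds ratio multiplies the odds, not the probability itself, so the change in probability depends on the baseline.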
Model optimization
prompt:
Set the independent variables as ['coupon_used_in_last_month', 'returned_yes', 'loan_yes', 'coupon_used_in_last6_month', 'default_yes', 'age'], and the dependent variable as flag. Call the sklearn module, randomly split the training set and test set (7/3), then fit using logistic regression, and display the model coefficient results.
prompt:
use auc to evaluate the model and give me the model score roc_auc

Only coupon_used_in_last_month and age are positively correlated with flag; the other variables are negatively correlated with flag.
The AUC score changed little after this iteration, which suggests that the overall coupon usage rate is low.
Business Suggestions
User Analysis
The probability of using a coupon is highest among customers aged 20-40.
Within the three age groups, the average ages of customers with a higher probability of using coupons are 18, 32, and 48.
Analysis of improving coupon usage rate - high-value users
The mean of coupon_used_in_last_month is 0.26 when flag is 0 and 0.53 when flag is 1, indicating that customers who used coupons last month are more likely to use coupons again.
The means of default_yes and loan_yes are both greater when flag is 0 than when flag is 1, suggesting that customers who defaulted on Huabei or paid with Huabei have a lower probability of using coupons in the following period.
Compared to customers who did not return goods, those who returned goods have a lower probability of using coupons.
Married customers have a slightly higher probability of using coupons compared to unmarried and divorced customers.
Customers with job titles of management, technician, and blue-collar are more likely to use coupons.
The flag has a strong positive correlation with coupon_used_in_last_month and age, and a strong negative correlation with coupon_used_in_last6_month and returned_yes. The correlations between other variables and the flag are not significant.
Conclusion
The usage rate of coupons is low.
Pay special attention to the retention of customers aged 20-60. For customers with purchasing potential or those who purchase a relatively single type of product, develop an upselling or cross-selling model to enhance the value of existing customers.
Encourage customers who used coupons last month to use them again, and develop corresponding product response models or event response models to maximize benefits.
For customers who have not returned products, married customers, those without Huabei (Ant Credit Pay) defaults, and those who have not paid via Huabei, develop customer churn warning models or customer win-back models. Focus on managers, technicians, and blue-collar workers to try to retain these customers as much as possible.
Increase the promotion of coupons within the APP, strengthen marketing measures such as banners and advertising pushes; conduct additional pushes outside the APP, including third-party coupon pushes, so that customers can learn about and increase their likelihood of using coupons.
*Data is simulated.