Implementing Customer Segmentation: Practical Data Analysis Process — Based on Customer Data of A Department Store

slug

example-4

type

Post

status

Published

date

Aug 15, 2025

summary

Based on a public department store dataset, this project systematically practices the entire data analysis process for customer segmentation. The goal is to gain in-depth insights into the customer composition of the department store, evaluate the impact of customers at different levels on sales performance, and thereby improve customer retention rate and consumption conversion rate to optimise the overall business performance of the store. In the project, I fully implemented data cleaning, K-Means clustering analysis based on the improved LRFMPSX model, and proposed targeted marketing strategies according to the characteristics of different customer groups. This project fully covers the entire process from data preprocessing and model construction to business interpretation.

📝 1. Project Background and Objectives

This project conducts learning by reproducing a classic case. Its core objective is to use the given dataset to design a standardised analysis process, aiming to build customer portraits and segmentation models, explore the characteristics of customer groups, and ultimately enhance customer attraction through personalised marketing strategies. My personal objective is to gain an in-depth understanding of the complete workflow of data science and master the methods of using analytical tools to solve core business problems through this project.

The complete code of this project has been published on GitHub: https://github.com/IvyXiaZhou/Customer-segmentation/blob/main/Department%20store.py

2. Data Understanding and Preparation

The project data is sourced from a department store dataset on GitHub, consisting of two tables:

The member information table (nearly 190,000 records with 4 columns: member ID, date of birth, gender, and registration time).

The sales transaction table (nearly 1.89 million records with 15 columns, recording sales transactions from January 1, 2015, to January 3, 2018, including features such as member ID, consumption time, consumption amount, and member points generated from the consumption).

The main steps of data cleaning are as follows:

2.1 Processing of the Member Information Table

Remove 6 records with duplicate member IDs.

Since the registration time is a key field for calculating the active duration of members and has a low missing rate, delete records with missing registration time.

For the gender field with a low missing rate, use the mode for imputation.

For the "date of birth" column with a high missing rate, extract its decade information (e.g., 1995 → 1990), filter within a reasonable range (1920-2020), and then delete the original "date of birth" column. Finally, retain the three columns: "member ID", "gender", and "registration time".

2.2 Processing of the Sales Transaction Table

Delete records with missing "member ID" (accounting for more than half of the total records).

Remove columns irrelevant to the analysis, such as "cash register number", "counter code", and "counter name".

2.3 Data Merging and Filtering

Merge the two cleaned tables using "member ID" as the key. To eliminate the interference of invalid transactions, filter out records that simultaneously meet the following conditions as valid data: consumption amount > 0, member points > 0, and sales quantity > 0.

Eventually, a dataset with 11 features and 738,462 samples is obtained, which can be used for preliminary analysis.

3. Exploratory Data Analysis

This phase aims to explore data distribution and feature relationships through visualization to reveal potential business insights. The analysis first distinguishes between members and non-members, then focuses on the member group.

3.1 Data Preprocessing and Feature Engineering

To accurately distinguish members, a "member" field is created: records that are successfully matched with the member information table via member ID are marked as 1 (member), and the rest are marked as 0 (non-member).

To improve data readability, convert the gender code (0/1) to "Female"/"Male".

Classify customers into three groups by birth decade: young (≥1990), middle-aged (1951-1989), and elderly (≤1950).

3.2 Analysis of Basic Member Information

Key findings through visual analysis are as follows:

3.2.1 Distribution of Member Birth Decades

Members born in the 1980s (47,142 people) and 1970s (43,407 people) are the main groups. Notably, the number of members born in the 1990s (11,811 people) is only 1/4 of that in the 1980s, indicating that the department store has great potential in attracting young customer groups.

3.2.2 Member Gender Ratio

Most members are female, accounting for 78.66%, while male members only account for 21.34%.

3.2.3 Member Age Distribution

In terms of age levels, middle-aged people are the core consumer group, accounting for 85.16%, while young and elderly people account for 8.95% and 5.89% respectively. Based on the previous birth decade judgment, the ages of members are mainly concentrated in the 30-50 years old range.

3.3 Analysis of Member Consumption Behavior

3.3.1 Proportion of Orders and Consumption Amount between Members and Non-Members

The proportion of member consumption in both order quantity (53.18%) and total amount (53.99%) are slightly higher than that of non-members, confirming the value of the membership system to the business.

3.3.2 Consumption Preferences by Quarter and Day Interval

To ensure data quality, null records are removed, resulting in a subset of 393,482 member consumption records. The following operations are carried out on this subset:

3.3.2.1 Time Data Processing: Extract year, month, day, and hour from the "consumption time" field, and add four columns to mark time information for analyzing consumption behavior from the time dimension.

3.3.2.2 Order Quantity Calculation: Use a custom function to calculate the average number of orders under different time dimensions (such as quarter and day interval) between 2015 and 2018. Using the mean can better eliminate the interference of outliers and reflect the actual consumption preferences of members.

By plotting the consumption trend charts of the two time dimensions (quarter and day interval), it is found that the overall time preference shows obvious patterns: the second quarter is the consumption peak, followed by the fourth quarter, and the third quarter is the off-season. The average consumption interval of members is concentrated in 25-26 days.

3.3.3 Differences in Quarterly and Daily Consumption across Years

To eliminate the interference of incomplete data in 2018, an in-depth analysis of the data from 2015 to 2017 is conducted, revealing that the store's sales model has undergone a process of evolution and stabilisation.

Quarterly Dimension: In 2015, the number of member orders showed a straight downward trend. In 2016 and 2017, the number of orders gradually increased with the progress of the quarter, reaching the first peak in the second quarter, declining slightly in the third quarter, and continuing to rise to the highest point in the fourth quarter.

Daily Interval Dimension: In 2015, the distribution of the number of member orders was scattered, showing no obvious pattern. In 2016 and 2017, the number of orders showed a significant peak at an interval of 25-26 days, indicating a stable consumption cycle.

The chart trends are basically consistent with the previous analysis. It should be noted that although the average number of orders in the fourth quarter is lower than that in the second quarter, the total number of orders in the fourth quarter is actually the highest in the whole year, which indicates that the end of the year is the most critical period for the store's sales.

3.3.4 Differences in Monthly Consumption Orders and Amounts across Years

From the monthly consumption situation in the three years, the monthly consumption trend is basically consistent with the quarterly consumption trend. In 2015, the number of member orders and the amount reached a peak in February, then declined all the way, and dropped to zero from September to December. This indicates that the store may have experienced business suspension and rectification during this period, resulting in abnormal data. In 2016 and 2017, the number of member orders and the amount reached small peaks in April and May respectively, then declined in fluctuations to the bottom in August and September, and finally surged to the annual peak in November.

The analysis confirms that in normal operation years, the store's sales peak seasons stably occur in the second and fourth quarters. Among them, April, May, November, and December are the key months, while the third quarter is a typical off-season, providing an ideal window for the store to carry out store upgrades, equipment transformation, or plan customer acquisition activities.

3.3.5 Number of Sales Orders by Hour of the Day

The intraday sales trend shows that the number of orders rises steadily from 9:00, reaches a peak at 16:00, then gradually decreases. Although there is a slight rebound at 20:00, it fails to reverse the downward trend. The most active period of the day is 14:00-17:00, with the number of orders exceeding 20,000 per hour during this period.

After that, the processed member consumption data is saved as vip_info.csv for modeling.

4. Modeling and Analysis

4.1 Construction of the LRFMPSX Model

After loading vip_info.csv, abnormal records with registration or consumption time later than the cut-off date (January 3, 2018) are filtered out, resulting in a dataset with 393,006 valid records of 42,548 members.

The modeling objective is to calculate the corresponding LRFMPSX feature values for each member in preparation for customer segmentation.

In terms of feature selection, this project expands on the basis of the classic RFM model and introduces four dimensions: L, P, S, and X, which helps to build a more comprehensive customer portrait. Finally, the 7 key features selected by the model are as follows:

L (Length): Member active duration

R (Recency): Recent consumption time

F (Frequency): Consumption frequency

M (Monetary): Total consumption amount

P (Point): Total points

S (Session): Consumption time preference

X (Sex): Gender

The specific calculation process of feature engineering is as follows:

L and R: Define the time_minus() function to calculate the time difference (in months). Among them, L is the number of months from the member's registration date to the cut-off date (January 3, 2018); R is the number of months from the member's last consumption date to the cut-off date.

F, M, P: Aggregate by member ID, and calculate the total number of consumption times (F), total consumption amount (M), and total points (P) for each member.

S: Divide the consumption time into 5 periods: early morning (0-5 o'clock), morning (6-10 o'clock), noon (11-13 o'clock), afternoon (14-17 o'clock), and evening (18-23 o'clock); count the most frequent consumption period (mode) of each member as their consumption time preference.

X: Convert the gender code to "Male/Female" labels.

After completing the above calculations, the 7 features are merged into a new DataFrame, with the corresponding letters as column names, and saved as the LRFMPSX.csv file for subsequent analysis.

4.2 Construction of Customer Portraits

To improve business readability, data binning is performed on the LRFMPSX features, and intuitive business labels are assigned to them. The label definitions and binning rules are as follows:

L (Active Duration): Based on the number of active months, it is divided into new users (≤3 months), medium-term users (4-12 months), and long-term users (≥13 months).

R (Recent Activity): Convert the number of months since the last consumption to days.

F (Consumption Frequency): Based on the total number of consumption times, it is divided into low-frequency consumption (≤5 times), medium-frequency consumption (6-19 times), and high-frequency consumption (≥20 times).

M/P (Consumption/Point Capacity): Uniformly divide the consumption amount (M) and points (P) into three levels: low level (<10,000), medium level (10,000-100,000), and high level (>100,000).

S/X (Consumption Period Preference/Gender): Retain the original labels.

After completing the label conversion, the data is saved as consumers_profile.csv. This file does not contain complex values, making it easy for the business team to read and use directly.

To further enhance the intuitiveness of the portraits, a word cloud generation tool is developed, which can randomly select or specify a member ID and generate a word cloud diagram from all its portrait labels (such as ‘long-term user’ and ‘high-frequency consumption’). This function allows account managers to quickly grasp customer characteristics within seconds, realising the efficient transmission of data insights.

4.3 Customer Clustering Analysis (K-Means)

4.3.1 Calculation of Silhouette Coefficient

Clustering analysis usually uses ‘distance’ to measure the similarity between samples. However, the dimensions and magnitudes of the LRFMP features vary greatly. For example, the consumption amount M can reach tens of thousands, while the consumption frequency F is usually dozens. To prevent the dimension from dominating the distance calculation, data standardisation must be performed before clustering.

For clustering analysis, first, 5 core numerical features (L, R, F, M, P) are selected from the LRFMPSX features. Then, the Z-Score standardization is performed on the data using the StandardScaler in the scikit-learn library. Subsequently, the silhouette coefficient is calculated by traversing K=2-6, and the silhouette coefficient curve is plotted to evaluate the clustering effect under different K values. The results show that when K=2 (the K value corresponding to the red dashed line in the figure), the silhouette coefficient is the highest, indicating the best clustering effect under this setting.

4.3.2 Comparison of Preprocessing Methods

To verify robustness, the impact of Z-Score standardisation and Min-Max normalisation on the clustering results is additionally compared. Clustering (K=2-5) is performed on the data processed by the two methods respectively, and radar charts are used to visualise the clustering centers of different clusters to identify the key characteristics of each customer group.

4.3.2.1 Radar Chart after Z-Score Standardisation

The Z-Score standardisation process converts the data into a distribution with a mean of 0 and a standard deviation of 1, i.e., a standard normal distribution. The advantage of this method is that the relative relationships between the sample points in the original data are still maintained after the data points are converted, and it is not easily affected by extreme outliers.

The following four radar charts are based on the standardised data, where customers are divided into 2, 3, 4, and 5 categories respectively, and each color represents a cluster, i.e., a group.

When K=2: The two clusters have their own distinct characteristics and clear patterns. Cluster 1 and Cluster 2 are basically the same in the L dimension, and show protrusion and depression respectively in the R dimension. It is worth noting that the shape of Cluster 2 is very spread out in the three dimensions of F, M, and P, with very obvious protrusions. It can be seen from the figure that the registration times of the two types of customers are similar, but the recent consumption time of Cluster 1 is longer, indicating that its consumption behavior is inactive; the consumption frequency, amount, and points of Cluster 2 are far higher than those of Cluster 1, and the recent consumption time is also shorter. It can be seen that the segmentation result conforms to business common sense.

When K=3: The shape and size of Cluster 3 are significantly different from the other two clusters, indicating that this customer group has frequent consumption and very high amount and points. The shapes of Cluster 1 and Cluster 2 in the F, M, and P dimensions almost overlap. The difference is that Cluster 2 has a larger protrusion degree than Cluster 1 in the R and L dimensions, indicating that compared with Cluster 1, Cluster 2 has a longer registration time and less active consumption. However, both belong to the low-frequency and low-consumption groups, and their impact on the store's business is almost the same, so it is not very necessary to subdivide them into two clusters.

When K=4: Cluster 1 and Cluster 2 have similar shapes and belong to the combination with higher consumption; Cluster 3 and Cluster 4 have similar shapes and belong to the combination with lower consumption. The main difference between the two in each group is the size. It can be seen that the clustering effect is not good, and the distinction between customer groups is not high, which is not as good as directly dividing into 2 clusters.

When K=5: The two clusters of the higher consumption combination are almost the same as those when K=4. Cluster 1, Cluster 3, and Cluster 5, which all belong to the lower consumption combination, overlap in the F, M, and P dimensions, and only differ slightly in the L and R dimensions. This is equivalent to subdividing the low-consumption group into a lost customer group with the longest consumption interval, an old customer group with the longest registration time, and a new customer group with recent consumption behavior.

With the increase of K value, the clustering result evolves into: on the basis of having divided a distinct high-consumption group, the low-consumption group is gradually subdivided, which gradually deviates from the actual situation that the business department should focus on high-value customers. At the same time, the shape distinction of the clusters is also decreasing, and the business interpretability is getting worse. Therefore, there is no need to continue to increase the K value, and K=2 is the most reasonable choice.

4.3.2.2 Radar Chart after Min-Max Normalisation

The Min-Max normalisation process compresses all data into a fixed interval, the most common interval being [0, 1], i.e., the minimum value of the original data is 0 and the maximum value is 1.

The radar chart shows that no matter how many clusters the customers are divided into, the three dimensions of F, M, and P are severely compressed, and the values of all clusters in these three dimensions are squeezed in the interval near 0 in the center. The model relies more on the two dimensions of L and R for clustering, that is, the F, M, and P dimensions are invalid, and the customer groups are basically distinguished by new/old customers and activity level.

This is because normalisation does not change the distribution shape of the original data, but is very sensitive to outliers. However, outliers are very easy to appear in customer data. When a very small number of customers have huge F, M, and P values, using normalisation to process the data will make the differences between most ordinary customers in these dimensions disappear, resulting in the clustering model being unable to effectively distinguish different customer groups. In addition, the range of indicators such as R may continue to change, and the maximum value will increase over time. After normalization, its value becomes particularly prominent in this dimension, and the obtained clustering effect is more unstable.

By comparing the two preprocessing methods, Z-Score standardisation performs better on this dataset. It is not sensitive to outliers in the data, such as customers with ultra-high consumption, and can better retain the feature differences of most data, allowing the three dimensions of F, M, and P to play their distinguishing roles normally, thus obtaining clear and robust clustering results.

4.4 LRFMP Mean Analysis of the Two Customer Groups

Comprehensive consideration of the silhouette coefficient, the visualization effect of radar charts, and business interpretability led to the determination of the optimal number of clusters (K=2), dividing all customers into two groups. Subsequently, the mean values of the five features (L, R, F, M, P) for each group were calculated, and the differences were visually compared by plotting mean bar charts. Meanwhile, group labels "0" and "1" were assigned to the two customer groups respectively.

Statistical results show that Group "0" includes 41,570 customers, accounting for 97.7% of the total; Group "1" has only 978 customers, representing 2.3% of the total. The customer structure presents a typical long-tail distribution.

Further comparison of the LRFMP mean values between the two customer groups reveals distinctly different characteristics:

Group ‘1’ (High-Value Active Customers, hereinafter referred to as High-Value Customers): Their consumption frequency (F), total consumption amount (M), and total points (P) are far higher than those of Group ‘0’, and their recent consumption time (R) is extremely short. Although the member duration (L) is similar to that of Group ‘0’, their high activity and high contribution mark them as the core assets of the department store.

Group ‘0’ (General-Value Customers): This group exhibits the ‘double-high’ characteristic in terms of member duration (L) and recent consumption time (R), meaning they registered early but have not consumed for a long time. Meanwhile, their F, M, and P values are all at a low level, making them customers at risk of churn with low activity and low value contribution.

5. Business Insights and Recommendations

Based on the clustering analysis results, two customer groups with distinct characteristics were identified, confirming that the department store's performance is closely related to a small number of high-value customers. Therefore, it is recommended to develop differentiated operation strategies to optimise resource allocation and enhance the overall customer value.

5.1 High-Value Customers: Refined Maintenance

Although the high-value customer group accounts for only 2.3% of the total, they are the core contributors to the store's revenue. They should be regarded as key maintenance targets, and special resources should be invested to consolidate their loyalty.

5.1.1 Establish an Exclusive Service System

Create personal files for the top 10% of customers and assign exclusive account managers to provide VIP services, including personal shopping guides, new product previews, exclusive hotlines, and door-to-door delivery.

During the service process, account managers should make personalised recommendations based on customer portraits, focusing on demand matching and value transmission, and avoiding rigid sales promotion to improve customer experience and long-term satisfaction.

5.1.2 Design Differentiated Membership Benefits

Provide value-added services such as fashion styling consultation, personal image design, lifelong maintenance of valuable items, and free return/exchange to create an unexpected service experience.

Regularly organise high-end exclusive events, such as new product launches, designer meetings, and niche appreciation sessions, to enhance customers' sense of belonging and brand recognition.

5.1.3 Build a Digital Interaction Mechanism

Use data analysis to predict customer needs and push personalised new product information and exclusive discounts in a timely manner.

Establish an exclusive membership community and a digital interaction platform to promote social connections among customers at the same level.

The department store needs to continuously optimise service content and innovate experiences in the long run to build a competitive barrier. This not only consolidates the relationship with existing high-value customers but also helps attract similar high-quality customers.

5.2 General-Value Customers: Low-Cost and Efficient Maintenance

The general-value customer group accounts for more than 97% of the total. Although the individual value is low, the group as a whole has considerable potential and contributes to word-of-mouth promotion. The marketing focus should be on preventing churn, promoting activity, and identifying users with conversion potential.

5.2.1 Automated and Precise Outreach

Rely on low-cost digital tools (such as text messages and App notifications) to automatically trigger preferential incentives in specific scenarios. For example, send "reunion coupons" to customers who have not consumed for a long time, "birthday exclusive coupons" to customers whose birthdays are approaching, or recommend related products based on customers' historical behaviors.

5.2.2 Design Incentive Activities

Activate inactive customers by pushing no-threshold coupons and privilege experience coupons (such as "10-yuan cash coupons" and "free coffee coupons").

Strengthen the operation of the point system, regularly carry out point redemption activities or send point expiration reminders to stimulate customers' point consumption and repurchase behavior.

5.2.3 Plan Lightweight Promotions

Introduce activities that combine fun and cost-effectiveness during the shopping process, such as lucky draws and bundle promotions (e.g., "buy three get one free"), to attract customers' attention and stimulate immediate consumption.

5.3 Non-Member Group: Guided Transformation

For non-member customers, efforts should be made to convert them into members to improve user stickiness, improve user data, and lay a foundation for subsequent analysis and outreach.

In offline scenarios, attract registration through checkout counter guidance, member-exclusive prices, gifts for scanning codes, and membership card recharge discounts.

In online scenarios, launch benefits such as member-exclusive products and new member gift packages to lower the registration threshold and enhance the willingness to convert.

6. Conclusion and Reflection

The value of this project lies in clearly identifying high-value customers and general-value customers through customer segmentation. By implementing the differentiated strategy of ‘focusing on maintaining high-value customers, handling general customers efficiently, and actively converting non-members’, the department store can effectively improve customer retention rate, optimise resource utilisation efficiency, and ultimately enhance customer lifetime value and revenue stability.

In this project, the Python data analysis technology stack was fully applied, including the use of Pandas and NumPy for data cleaning and processing, Matplotlib/Seaborn for data visualisation, and Scikit-learn for building and evaluating clustering models. This experience has improved the ability to transform business problems into data solutions.

From the challenges of data cleaning, it was deeply realised that data quality is the cornerstone of all analysis conclusions. The process of determining the optimal number of clusters (K=2) through silhouette coefficients and radar charts also helped recognise the importance of making model decisions by combining statistical indicators and business interpretability.

Finally, the clustering results were transformed into differentiated marketing strategies. This process reinforced the belief that the ultimate value of data analysis lies in producing clear and implementable business recommendations, thereby truly driving decision-making and creating business benefits.

📎 References

https://github.com/zhengkangle-555/ConsumersProfile