This blog post is part of the Udacity Data Scientist Nanodegree Program. The detailed analysis, with all required code, is posted at https://github.com/axerocks/udacity_data_scientist_nanodegree/tree/main/Project1
The dataset, available on Kaggle, was originally scraped from the Wish e-commerce platform. It contains the product listings, product ratings, sales performance, and merchant/supplier information returned when you type “Summer” in the platform’s search field.
Key Business Objectives:
1. What are the top selling categories, product size and colors for Summer?
2. Which key variables/factors successfully predict the number of units sold?
3. Can you build a machine learning model to predict the number of units sold?
The data was sourced from Kaggle. Here’s the link:
Sales of summer clothes in E-commerce Wish
Top products with ratings and sales performance, unlike many other datasets
Ensuring that the data is transformed, clean, and easy to use for analysis is a key step in any machine learning project.
Below are some important/key steps:
a. Drop irrelevant/unnecessary variables.
b. Check and impute missing values wherever appropriate
c. Create new variables from existing features
d. Clean categorical variables
e. Transform categorical variables to numeric variables
a. Drop Irrelevant/Unnecessary Variables:
Some variables turned out to be irrelevant. They contained no vital information, or too little variability, to help predict the number of units sold.
E.g. currency_buyer, shipping_option_name, merchant_profile_picture, urgency_text, crawl_month, theme, and product_url.
Hence, these variables were removed from the dataset.
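One way to spot variables with no variability is to count the distinct values in each column. The sketch below is only an illustration, not the code from the analysis: it uses plain Python dictionaries instead of a DataFrame, and the helper names `constant_columns` and `drop_columns`, as well as the sample rows, are made up for this example.

```python
def constant_columns(rows):
    """Return the columns that hold a single distinct value across all rows."""
    return [col for col in rows[0] if len({row[col] for row in rows}) <= 1]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the listed columns."""
    drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Made-up sample rows (values are illustrative, not taken from the dataset).
rows = [
    {"currency_buyer": "EUR", "theme": "summer", "price": 8.0},
    {"currency_buyer": "EUR", "theme": "summer", "price": 11.5},
]
no_variability = constant_columns(rows)
cleaned = drop_columns(rows, no_variability)
print(no_variability)
print(cleaned)
```

Columns such as product_url vary on every row, so a check like this would not flag them; those still have to be dropped by judgment.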
b. Impute Missing Values:
Descriptive statistics on the numeric variables showed that only the rating data had missing values.
All missing rating counts for a product were imputed with zero. If no rating data exists for a product, it is reasonable to assume that no users rated it, so imputing with zero makes sense in this case.
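As a minimal sketch of this step, assuming missing values show up as `None` and that the rating columns are named as in the example (both are assumptions for illustration), zero-imputation could look like this. With pandas, `DataFrame.fillna(0)` on the rating columns would do the same job.

```python
def impute_zero(rows, columns):
    """Replace missing (None) values in the given columns with 0."""
    return [
        {key: (0 if key in columns and value is None else value)
         for key, value in row.items()}
        for row in rows
    ]


# Hypothetical column names and values, for illustration only.
rating_columns = {"rating_five_count", "rating_one_count"}
rows = [
    {"title": "dress", "rating_five_count": 12, "rating_one_count": 3},
    {"title": "shorts", "rating_five_count": None, "rating_one_count": None},
]
print(impute_zero(rows, rating_columns))
```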
c. Create New Variables from Existing Features:
Some new variables were created from existing features. Here’s a list below:
a. Discount rate which is % change in price vs retail price
b. Distribution of user ratings for each product
c. Distribution of badges (local product, product quality and fast shipping etc.) awarded to each product
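The first two derived variables can be sketched as small functions. The exact formulas used in the analysis are not spelled out here, so the snippet assumes that the discount rate is the percentage by which the price sits below the retail price, and that the rating distribution is each star level’s share of all ratings. The function names are hypothetical.

```python
def discount_rate(price, retail_price):
    """Percentage by which the listing price is below the retail price."""
    if not retail_price:
        return 0.0  # avoid dividing by zero when no retail price is known
    return (retail_price - price) / retail_price * 100


def rating_shares(star_counts):
    """Turn raw counts per star level into shares of all ratings."""
    total = sum(star_counts.values())
    if total == 0:
        return {star: 0.0 for star in star_counts}  # product was never rated
    return {star: count / total for star, count in star_counts.items()}


# Illustrative numbers only.
print(discount_rate(price=8.0, retail_price=10.0))
print(rating_shares({5: 6, 4: 2, 3: 1, 2: 1, 1: 0}))
```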
d. Clean Categorical Variables:
A quick frequency distribution on categorical variables such as ‘Product Color’, ‘Product Variation Size Id’ and ‘Origin_Country’ indicates that the data is not clean.
These categorical variables must be cleaned and transformed before they can be used as predictors in a good model.
- Product Color
There are a lot of distinct values for ‘Product Color’. Many of them are either redundant or irrelevant due to low frequency counts. It’s best to group like colors together, either manually or by running a string-to-string fuzzy match algorithm, to eliminate redundancy in the data. In addition, I chose to create an ‘other’ category for colors that appear very rarely.
The goal is to identify the top 10–15 distinct values and create dummy variables to offer them as predictors in the model building process.
One simple approach I took was the following:
1. Created a list of unique colors with the highest frequency in the raw, untransformed column. This is our “Reference List”.
2. Created a list of all unique values (colors) in the original column. This is our “Input List”.
3. Ran a fuzzy string-to-string matching algorithm between the “Reference List” and the “Input List” using the fuzz module of the fuzzywuzzy package. More details can be found here:
4. The algorithm compares each string in the “Input List” with every string in the “Reference List” and generates a similarity score for every pair.
5. Finally, I kept only those pairs with a very high similarity score (>90), which indicates a strong match to a value (color) in the “Reference List”.
Note: Duplicates are possible, for example when a value such as ‘blackwhite’ is assigned to both the ‘black’ and ‘white’ categories. The duplicate values were removed.
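The matching steps above can be sketched as follows. The snippet does not use fuzzywuzzy itself: as a stand-in it relies on `difflib.SequenceMatcher` from the standard library, with a simplified partial-match score (the best score of the shorter string against any same-length slice of the longer one). Which fuzzywuzzy scorer the analysis used is not stated, so this choice is an assumption, picked because a partial match reproduces the ‘blackwhite’ behaviour described in the note. The reference list and raw values are made up.

```python
from difflib import SequenceMatcher


def partial_score(a, b):
    """Return a 0-100 score: best match of the shorter string inside the longer."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


def match_to_reference(input_values, reference, threshold=90):
    """Map each raw value to the reference values scoring above the threshold."""
    return {
        value: [ref for ref in reference if partial_score(ref, value) > threshold]
        for value in input_values
    }


reference = ["black", "white", "blue", "green", "red"]  # hypothetical reference list
raw_colors = ["blackwhite", "lightblue", "Army green", "rosegold"]  # hypothetical input list
matches = match_to_reference(raw_colors, reference)
print(matches)
```

Values that match more than one reference color need a rule for which match to keep, and values with no match are candidates for manual review or the ‘other’ category.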
The fuzzy match approach worked pretty well for the most part. However, there’s still some opportunity to do additional cleaning and remove redundancy.
Finally, I manually classified the remaining colors into the best-matching values in our “Reference List”.
I then applied the same logic to clean other categorical variables such as Country of Origin, Shipping Price Option, and Product Size.
- Product Categories
The Product Categories were not clearly defined in the dataset.
I matched each product description against a reference list to assign a category to every product.
- Created a list of unique categories based on general business knowledge about the clothing industry. This is our “Reference List”.
- Created a list of all unique product descriptions from the ‘title_orig’ column. This is our “Input List”.
- Ran a fuzzy string-to-string matching algorithm between the “Reference List” and the “Input List” using the fuzz module of the fuzzywuzzy package to extract the top categories of interest.
- The algorithm compares each string in the “Input List” with every string in the “Reference List” and generates a similarity score for every pair.
- Finally, I kept only those pairs with a high similarity score (>=90), which indicates a strong match to a value (category) in the “Reference List”.
Note: Duplicate rows were removed based on similarity scores. In case of ties, the first row was kept as the best match. This may not be the best approach, but it works reasonably well in this case.
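The tie-breaking rule in the note can be sketched like this: for each title, keep the highest-scoring category, and when scores are equal keep the one seen first. The function name, the tuple layout, and the sample candidates are assumptions for illustration.

```python
def best_match_per_title(candidates):
    """Keep one category per title: highest score wins, first row wins ties.

    `candidates` is a list of (title, category, score) tuples in row order.
    """
    best = {}
    for title, category, score in candidates:
        # Strict '>' means a later row with an equal score never replaces
        # the earlier one, which implements "first row wins".
        if title not in best or score > best[title][1]:
            best[title] = (category, score)
    return {title: category for title, (category, _) in best.items()}


# Hypothetical scored matches.
candidates = [
    ("Summer Dress Shirt", "dress", 100),
    ("Summer Dress Shirt", "shirt", 100),
    ("Beach Shorts", "short", 92),
    ("Beach Shorts", "shorts", 100),
]
print(best_match_per_title(candidates))
```

Because ties are resolved by position, the order of the reference list quietly decides the category for titles such as the first one here.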
e. Transform Categorical Variables into Numeric Variables:
I created dummy variables for all of the cleaned categorical variables for use as predictors in modeling.
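For readers new to dummy variables, here is a small standard-library sketch of the idea: each category becomes its own 0/1 column. In a pandas workflow, `pandas.get_dummies` is the usual tool. The helper name `one_hot` and the sample rows are made up for illustration.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical cleaned rows.
rows = [
    {"units_sold": 100, "color": "black"},
    {"units_sold": 50, "color": "white"},
    {"units_sold": 20, "color": "other"},
]
for encoded_row in one_hot(rows, "color"):
    print(encoded_row)
```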
Now that data processing is finished, let’s try to answer the first key business question:
1. What are the top selling categories, colors, and sizes in Summer?
The top 5 categories selling on the Wish e-commerce platform in Summer are Dress, Shirt, Top, Pants, and Shorts.
The top 5 colors selling on the Wish e-commerce platform in Summer are Black, White, Blue, Green, and Red.
The top 3 product sizes for Summer clothes on the Wish e-commerce platform are S, XS, and M.
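A ranking like the ones above can be produced by totalling units sold per group. The sketch below assumes “top selling” means the highest total units sold (rather than the number of listings), which is my reading and not something stated explicitly. The rows are toy values, not figures from the dataset.

```python
from collections import defaultdict


def top_by_units_sold(rows, key, n=5):
    """Rank the values of `key` by their total units sold, highest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["units_sold"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy rows for illustration only.
rows = [
    {"category": "dress", "units_sold": 1000},
    {"category": "shirt", "units_sold": 400},
    {"category": "dress", "units_sold": 500},
    {"category": "shorts", "units_sold": 100},
]
print(top_by_units_sold(rows, "category", n=2))
```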
- Created two arrays, X and Y, containing the independent variables and the target variable (units sold) respectively.
- Split the data into training and validation datasets.
- Standardized all features in the training dataset using a standard scaler (subtract the mean and divide by the standard deviation) and applied the same fitted transformation to the validation dataset.
- Used the feature importance attribute of the Random Forest algorithm to identify the top 15 variables for predicting the number of units sold. It’s important to get rid of redundant features to minimize multicollinearity (highly correlated features) in the model. Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of a regression model.
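The split and standardization steps can be sketched with the standard library alone. The key point is that the mean and standard deviation are computed on the training rows only and then reused for the validation rows. Function names, the 80/20 split, and the seed are assumptions for this example; in scikit-learn the equivalent tools are `train_test_split` and `StandardScaler`.

```python
import random
from statistics import mean, pstdev


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of rows for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    held_out = set(indices[:int(round(len(rows) * validation_fraction))])
    train = [row for i, row in enumerate(rows) if i not in held_out]
    validation = [row for i, row in enumerate(rows) if i in held_out]
    return train, validation


def fit_scaler(train_rows):
    """Compute per-column mean and (population) standard deviation."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column: divide by 1
    return means, stds


def transform(rows, means, stds):
    """Standardize rows using statistics that were fitted on the training data."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


features = [[float(i), float(i % 3)] for i in range(10)]  # toy feature matrix
train, validation = train_validation_split(features)
means, stds = fit_scaler(train)
print(transform(validation, means, stds))
```

Fitting the scaler on the full dataset instead would leak information from the validation rows into training.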
Now, let us answer the second question:
2. Which key variables/features help us successfully predict the number of units sold?
I ran a Random Forest regression and used its feature importance attribute to extract the top 15 features that help predict the number of units sold. Below is a quick summary:
The numbers of 2-star, 4-star, 1-star and 3-star ratings awarded to the product have the most impact on the number of units sold. This makes sense, as customers who rated the product must have purchased it in the first place. It’s a self-fulfilling prophecy in a way.
In addition, other variables were of moderate importance: whether the product has some kind of urgency banner, whether the product is marketed on the Wish marketplace, the number of badges awarded to the product or its seller, a flag indicating that a discount was offered, and whether the merchant has a profile picture.
Now, let us answer the third question:
3. Can you build a machine learning model to predict the number of units sold?
I built several models on the training dataset (Linear Regression, Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression, and LASSO Regression) to predict the number of units sold, and evaluated each on the validation dataset using R squared, Mean Absolute Error, and Mean Squared Error.
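For reference, the three metrics follow directly from their textbook definitions. The sketch below uses plain Python and toy numbers (not results from this project); scikit-learn offers the same metrics as `r2_score`, `mean_absolute_error`, and `mean_squared_error`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_residual / ss_total


# Toy values for illustration only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))
print(r_squared(actual, predicted))
```

Comparing each metric on the training and validation sets side by side is what reveals overfitting: a large gap between the two is the warning sign.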
The LASSO model outperformed the other models, with the highest R squared (~81%) on both the training and validation datasets. It also has the lowest Mean Squared Error and Mean Absolute Error, which indicates that it is the most accurate of the models compared.
However, it was interesting to see that the Linear Regression model came relatively close to the LASSO model, unlike the tree-based approaches.
The tree-based approaches (Decision Tree, Random Forest, and Gradient Boosting Regressor) seem to overfit the data: they have a high R squared on the training dataset but a significantly lower R squared on the validation dataset. These models have low bias and high variance, which makes them unsuitable for prediction here.
On the other hand, the Lasso (Least Absolute Shrinkage and Selection Operator) model performs L1 regularization, which adds a penalty proportional to the sum of the absolute values of the coefficients. This type of regularization can produce sparse models with few coefficients; some coefficients can become exactly zero and are effectively eliminated from the model. This makes the model simpler, which is why it was chosen as the best model.
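To see why an L1 penalty can push coefficients to exactly zero, here is a bare-bones coordinate descent sketch. It is not the implementation used in the analysis: it assumes centered data with no intercept, minimizes (1 / 2n) * sum of squared errors + alpha * sum of absolute coefficients, and runs on a tiny made-up dataset.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Fit Lasso coefficients one feature at a time (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual computed without feature j's current contribution.
            partial = [
                y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0:
                continue  # all-zero column carries no information
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Made-up data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [3.1, -2.9, 2.9, -3.1]
print(lasso_coordinate_descent(X, y, alpha=0.0))  # no penalty: both features kept
print(lasso_coordinate_descent(X, y, alpha=0.5))  # weak feature shrunk to exactly 0.0
```

The soft-threshold step is where the sparsity comes from: a feature whose correlation with the residual is smaller than alpha gets a coefficient of exactly zero rather than a merely small one.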
However, it would be worthwhile to tune the model hyperparameters using a grid search to get a better fit.
In this article, we walked through the steps involved in solving a business problem. This included processing both numeric and textual data (string-to-string matching using fuzzywuzzy). In addition, the key business questions were answered using both descriptive and predictive analytics.