Forecast Sales Using Machine Learning in Python
Imagine being able to see into the future of your business. That’s what sales forecasting does, and it’s a game-changer for organizations. In this machine learning project, we’re taking on a real-world challenge: predicting Walmart’s weekly sales.
We’ll be diving into historical data from 45 stores, wrestling with holiday promotions, and putting various predictive models through their paces.
Let’s get started.
1. Set Up the Environment and Understand the Data
Set Up the Environment
The full problem statement and datasets are available from Kaggle:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
Let’s set up the environment and create DataFrames:
Load the Datasets
Let’s read the datasets in the CSV files into DataFrames and print them out to understand the type of data in the files.
Output:
The following dataset descriptions are provided on the Data tab on Kaggle:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data.
“Features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week
For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13”
Notice there are several NaN values in the MarkDown, CPI, and Unemployment columns. NaN stands for “Not a Number” and represents missing values. We’ll need to address these gaps to maintain the integrity of the analysis.
“stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.”
“train.csv: This is the historical training data, which covers to 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week”
“test.csv: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.”
Combining datasets will allow access to all of the training data in a single DataFrame, so let’s combine features_df with stores_df, and then with train_df and test_df.
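The two-step merge described above can be sketched as follows. This is a minimal illustration on tiny made-up frames, not the article's original code: the name feature_store_df and the choice of inner joins are assumptions, and the join keys are inferred from the field lists quoted earlier.

```python
import pandas as pd

# Tiny synthetic stand-ins for the Kaggle files (all values are made up).
features_df = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "Temperature": [42.3, 38.5, 40.2, 39.0],
    "IsHoliday": [False, True, False, True],
})
stores_df = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [151315, 202307]})
train_df = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Weekly_Sales": [24924.50, 46039.49, 35034.06],
    "IsHoliday": [False, True, False],
})

# Step 1: attach store attributes (Type, Size) to every feature row.
# feature_store_df is a hypothetical name used only in this sketch.
feature_store_df = features_df.merge(stores_df, how="inner", on="Store")

# Step 2: attach the combined features to each sales row.
# IsHoliday exists in both frames, so it is part of the key to avoid
# duplicated IsHoliday_x / IsHoliday_y columns.
train_val_df = train_df.merge(
    feature_store_df, how="inner", on=["Store", "Date", "IsHoliday"]
)

print(train_val_df)
```

The same second step would apply to test_df, which has no Weekly_Sales column.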
Merge features_df and stores_df:
Output:
To help us understand the type of data in the datasets, we will check the data type objects (dtypes) in the train_val_df DataFrame and the test_df DataFrame.
Let’s change the Date column dtype from object to datetime and create new columns for Week, Month, Year values:
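A minimal sketch of that conversion, assuming the dates are stored as ISO-formatted strings and that Week means the ISO calendar week (the article does not say which week convention it uses):

```python
import pandas as pd

# Three made-up dates standing in for the Date column.
df = pd.DataFrame({"Date": ["2010-02-05", "2011-11-25", "2012-09-07"]})

# Convert from object (string) dtype to datetime64.
df["Date"] = pd.to_datetime(df["Date"])

# Derive calendar features from the datetime column.
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO week number
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

print(df.dtypes)
print(df)
```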
2. Exploratory Data Analysis
Let’s analyze and investigate the datasets and summarize their main characteristics. This will help to identify any obvious errors and potentially detect relationships between the variables.
The info() method will help to understand the number of columns, column labels, data types, memory usage, range index, and the number of non-null values in each column.
The describe() method will return a statistical description of the data in the DataFrames.
Output:
Output:
Let’s confirm the unique values in Store and Type in both DataFrames:
Output:
There are 45 different stores present, and three unique types of stores.
Let’s count the unique values in the Year column using the value_counts() method.
Output:
We will use visual tools to explore and analyze the datasets, including the relationships between different subsets of variables.
Create chart showing Weekly_Sales by Year and by Month in the training dataset:
Output:
Create pie chart showing % of stores by Type in the training dataset:
Output:
In the training dataset, Type A stores account for 51.1% of the records, followed by Type B stores at 38.8% and Type C stores at 10.1%.
Let’s understand the size of each store type and the weekly sales by store type.
We’ll use a box plot to show the size of the stores by store type. Box plots display the distribution of data and can indicate whether the distribution is symmetrical or skewed. The ends of the box represent the first (lower) quartile and third (upper) quartile. The line inside the box represents the median which is the second quartile. The whiskers extend from the box to represent minimum and maximum values, excluding any outliers.
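To make the box plot anatomy concrete, here is a small sketch that computes the same statistics by hand. It assumes the common 1.5 x IQR rule for deciding what counts as an outlier, which is the default in matplotlib and seaborn; the sample values are made up.

```python
import numpy as np

# Made-up store sizes (in thousands of square feet) with one extreme value.
sizes = np.array([35, 40, 42, 45, 50, 55, 60, 65, 70, 200], dtype=float)

# The box: first quartile, median (second quartile), third quartile.
q1, median, q3 = np.percentile(sizes, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = (sizes >= lower_fence) & (sizes <= upper_fence)

# The whiskers end at the most extreme points that are not outliers.
whisker_low, whisker_high = sizes[inside].min(), sizes[inside].max()
outliers = sizes[~inside]

print(f"box: {q1} / {median} / {q3}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```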
Output:
The size of stores is skewed by store type. Type A stores are the largest and Type C are the smallest. Also, there is a significant difference between store sizes with no overlap in size between each type.
Let’s create a bar chart showing Weekly_Sales by Type:
Output:
Directionally, larger stores should generate higher sales since they have more space for more items. Type A stores appear to have higher weekly sales than Type B stores, which appear to have higher weekly sales than Type C stores. There are more Type A stores, followed by Type B and Type C stores. Store size may be an important feature in predicting weekly sales.
Let’s create a scatter plot to visually break down weekly sales by store size and type:
Output:
While the bar chart shows that larger stores generate higher weekly sales in the aggregate, the scatter plot shows that there is overlap in weekly sales across store types. We also note some weekly sales outliers.
Let’s create a box plot to visually check the relationship between weekly sales, store, and store type.
Output:
The plot shows that there is some symmetry between some stores within each store type, but there’s also quite a variance. For example, Type A (green) stores 2, 6, 13, 14, and 27 exhibit symmetry in their distributions of weekly sales, and Type A stores 28 and 31 also exhibit similar distributions.
Let’s create a bar chart showing the relationship between weekly sales by holidays and store type:
Output:
IsHoliday appears to generate higher weekly sales for Type A and B stores, but not necessarily Type C stores. Since roughly 90% of stores are either Type A or B, and there’s significant difference in weekly sales with holidays for these stores, the weeks with holidays may be an important feature in the models.
Let’s create a pie chart to show the proportion of holidays:
Holiday weeks appear to generate higher weekly sales while accounting for only 7% of the weeks in the data.
Let’s create a box plot showing the relationship between weekly sales for each store and holiday.
Output:
The box plot shows that for most stores, the weeks with holidays exhibit greater weekly sales. Certain groups of stores appear to have similar distributions, like the earlier box plot. For example, stores 6, 13, 14, and 27 have similar distributions, as do stores 28 and 31.
Let’s generate a box plot to visualize the relationship between weekly sales and department by type of store:
Output:
Department 72 shows a significant increase in weekly sales during holidays. Outside of a few departments, there is symmetry in the distribution of data across several departments.
Let’s check the relationship between the mark downs (1-5) and weekly sales and store type by creating scatter plots:
There’s no obvious relationship here, to me.
MarkDown2:
Output:
MarkDown3:
Output:
MarkDown4:
Output:
MarkDown5:
Output:
The mark downs may not be significant features for predicting weekly sales in the models.
Like box plots, histograms are graphical representations of the distribution of numeric data; they show how frequently values fall within each range. Histograms are useful in determining the underlying probability distribution of a dataset. In contrast, box plots are less detailed than histograms, but are useful when comparing multiple datasets.
Let’s create a histogram to see how unemployment rates are distributed across different types of stores. We’ll add the Kernel Density Estimation curve which provides a smoothed estimate of the probability density for each group:
Output:
The peaks in the histogram bars and KDE curves identify the most common unemployment rates for each store type. The distribution is unimodal, meaning it has a single peak, and is roughly bell-shaped with a slight skew toward higher unemployment rates.
Let’s do the same for CPI:
Output:
The histogram indicates an unusual, polarized distribution of CPI values. This is called a bimodal distribution: the histogram has two distinct peaks, one near each end of the range, with few observations between the lower and higher clusters.
Output:
The temperature histogram peaks around 70 for Type A and B stores, and closer to 90 for Type C stores.
I don’t expect unemployment, CPI, or temperature to be important features in the models.
Let’s create a correlation heatmap for numerical data in the training dataset, which can help to understand relationships between the variables:
Output:
There are only a few areas of correlation. MarkDown 1 correlates relatively well with MarkDown 4, Month correlates with Week, and Year correlates with Fuel_Price.
3. Data Preprocessing
To prepare the dataset for modeling, let’s split the dataset into training, validation, and test datasets.
Output:
We’ll split the data based on time. We’ll use data prior to 2012 as training data and data from 2012 onward as the validation data. We’ll also define weekly sales as the target variable:
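A minimal sketch of the time-based split on a made-up frame. The names input_cols and target_col, and the decision to drop the raw Date column from the inputs, are assumptions for illustration; train_inputs and val_inputs follow the names used later in the article.

```python
import pandas as pd

# Made-up rows spanning the 2011/2012 boundary.
train_val_df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2011-12-30", "2012-01-06", "2012-10-26"]),
    "Store": [1, 1, 2, 2],
    "Weekly_Sales": [100.0, 150.0, 120.0, 130.0],
})
train_val_df["Year"] = train_val_df["Date"].dt.year

# Split on time rather than at random, so validation mimics forecasting the future.
train_df = train_val_df[train_val_df["Year"] < 2012]
val_df = train_val_df[train_val_df["Year"] >= 2012]

# Weekly_Sales is the target; everything else (except the raw date) is an input.
target_col = "Weekly_Sales"
input_cols = [c for c in train_val_df.columns if c not in (target_col, "Date")]

train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()

print(train_inputs.shape, val_inputs.shape)
```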
Output:
The datasets are mixed since they include both numerical and categorical values, and there are missing numerical values in a few columns. These need to be handled before building and training the models.
We’ll handle missing numerical values first. For the missing values in the ‘MarkDown’ columns, we will replace ‘NaN’ with zeroes which is effectively equivalent to no markdowns:
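As a small illustration with made-up values, filling only the MarkDown columns with zeroes might look like this. Selecting the columns by name prefix is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in both MarkDown and non-MarkDown columns.
df = pd.DataFrame({
    "MarkDown1": [np.nan, 500.0],
    "MarkDown2": [20.0, np.nan],
    "CPI": [211.1, np.nan],
})

# Select MarkDown1..MarkDown5 by prefix and treat a missing markdown as "no markdown".
markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
df[markdown_cols] = df[markdown_cols].fillna(0)

print(df)
```

The missing CPI value is left untouched here, since it is handled separately by imputation.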
Output:
Preprocessing Numerical Data
Let’s handle missing numerical values. We’ll use a simple imputation strategy to replace any missing values with a measure of central tendency for each variable, or feature.
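One way to sketch this is with scikit-learn's SimpleImputer. The article does not say which measure of central tendency it uses, so the mean here is an assumption, and the data is made up. The imputer is fitted on the training inputs only and then applied to the other datasets, so no validation information leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up inputs with missing CPI and Unemployment values.
train_inputs = pd.DataFrame({"CPI": [210.0, 212.0, np.nan], "Unemployment": [8.0, np.nan, 7.0]})
val_inputs = pd.DataFrame({"CPI": [np.nan], "Unemployment": [7.5]})
numeric_cols = ["CPI", "Unemployment"]

# Learn the column means from the training data only (strategy is an assumption).
imputer = SimpleImputer(strategy="mean").fit(train_inputs[numeric_cols])

# Replace missing values in each dataset with the training means.
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])

print(train_inputs)
print(val_inputs)
```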
Output:
Output:
Output:
Next, to improve the performance and convergence potential of the models, let’s ensure that all numeric features in the datasets are on a similar scale, from 0 to 1:
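A minimal sketch of this scaling step with scikit-learn's MinMaxScaler on made-up values. The scaler learns each column's minimum and maximum from the training inputs, so validation or test values that fall outside the training range can land slightly outside 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up numeric inputs on very different scales.
train_inputs = pd.DataFrame({"Size": [50000.0, 100000.0, 200000.0],
                             "Temperature": [30.0, 60.0, 90.0]})
val_inputs = pd.DataFrame({"Size": [125000.0], "Temperature": [45.0]})
numeric_cols = ["Size", "Temperature"]

# Fit on training data only: scaled = (x - train_min) / (train_max - train_min).
scaler = MinMaxScaler().fit(train_inputs[numeric_cols])

train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])

print(train_inputs)
print(val_inputs)
```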
Output:
Encoding Categorical Data
Let’s handle the categorical values now by using one-hot encoding, a technique used to represent categorical variables as numerical values.
Set up and fit a OneHotEncoder for the categorical variables:
Output:
The MinMaxScaler transforms the values in each numeric column to the range 0 to 1. Since the one-hot encoded columns and IsHoliday already contain only values of 0 and 1, they are not affected by the scaling.
Now that the encoder is prepared, let’s perform one-hot encoding on categorical columns in the three different datasets, train_inputs, val_inputs, and test_inputs:
Output:
Since we now have encoded numerical columns, we can remove the categorical columns:
Output:
4. Model Building
We have finished preprocessing the datasets! We’re ready to start training ML models!
Baseline Model
A baseline model is a simple model that can serve as a benchmark to compare other, more complex models against.
We’ll train a linear regression model as our baseline model. Linear regression algorithms try to express the target as a weighted sum of the inputs.
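The "weighted sum" idea can be seen directly in a fitted model. In this small sketch the data is made up so that the target is exactly 3 times the first input plus 2 times the second plus 5, and scikit-learn's LinearRegression recovers those weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up inputs; the target is an exact weighted sum plus a constant.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

# A prediction is the inputs times the learned weights, plus the intercept.
manual_predictions = X @ model.coef_ + model.intercept_

print("weights:", model.coef_, "intercept:", model.intercept_)
```

On real sales data the fit will not be exact, which is why this model serves only as a benchmark.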
Output:
We need to calculate the weighted mean absolute error (WMAE) for each model. Mean absolute error (MAE) is the average absolute difference between predicted values and actual values. In this project, WMAE weights the errors in weeks with holidays 5x more than the errors in weeks without holidays.
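The metric can be sketched as a small function: the sum of weighted absolute errors divided by the sum of the weights. The function name wmae and its arguments are hypothetical, and the numbers are made up.

```python
import numpy as np

def wmae(targets, predictions, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count holiday_weight times."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Weight is 5 for holiday weeks and 1 for all other weeks.
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(targets - predictions)) / np.sum(weights))

# Absolute errors are 10, 10, and 30; only the last week is a holiday week.
# (1*10 + 1*10 + 5*30) / (1 + 1 + 5) = 170 / 7
score = wmae([100, 200, 300], [110, 190, 330], [False, False, True])
print(round(score, 2))
```

Because the holiday-week error carries five times the weight, the score is pulled well above the plain MAE of about 16.67 for the same predictions.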
In addition to WMAE, we’ll also calculate the accuracy of each model using MSE, RMSE, and R-squared, which are commonly used for regression problems:
Mean Squared Error (MSE) – is the average squared difference between the predicted values and observed values. It is never negative, and values close to zero are better.
Root Mean Squared Error (RMSE) – is the square root of MSE. It indicates how far apart the predicted values are from the observed values, on average, in the same units as the target.
R-squared (R²) score – is the coefficient of determination. It represents how well the observed results are reproduced by the model. A score of 1 indicates that the target variable is perfectly explained by the predictor variables. A score of 0 indicates that the model explains none of the variation in the target, meaning it does no better than always predicting the mean. The score usually falls between 0 and 1, but it can be negative when the model performs worse than predicting the mean.
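These three metrics can be computed by hand in a few lines. The sketch below uses made-up values and a hypothetical helper, regression_metrics, to show a perfect model, a model that always predicts the mean, and a model that is worse than the mean (for which R² falls below 0):

```python
import numpy as np

def regression_metrics(targets, predictions):
    """Return (MSE, RMSE, R-squared) for two equal-length sequences."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mse = float(np.mean((targets - predictions) ** 2))
    rmse = mse ** 0.5
    ss_res = np.sum((targets - predictions) ** 2)     # unexplained variation
    ss_tot = np.sum((targets - targets.mean()) ** 2)  # total variation
    r2 = float(1.0 - ss_res / ss_tot)
    return mse, rmse, r2

targets = [1.0, 2.0, 3.0, 4.0]
perfect = regression_metrics(targets, [1.0, 2.0, 3.0, 4.0])
mean_only = regression_metrics(targets, [2.5, 2.5, 2.5, 2.5])
reversed_order = regression_metrics(targets, [4.0, 3.0, 2.0, 1.0])

print("perfect:", perfect)
print("mean only:", mean_only)
print("reversed:", reversed_order)
```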
Output:
The WMAE value for the baseline model is quite large. MSE and RMSE are also quite large, and R-squared is quite low.
Let’s create a scatter plot to visualize the relationship between observed and predicted values:
Output:
Train and Evaluate Different Models
Let’s dive further into the world of machine learning models! We’re about to unleash powerful algorithms to tackle our sales prediction challenge. Here are the contenders:
Ridge Regression: The smooth operator. Imagine linear regression with a twist. This algorithm adds a special penalty to keep our model in check, making sure we don't go overboard with less important variables. It's like having a savvy financial advisor for your data!
Decision Tree: The choose-your-own-adventure of ML. Picture a flowchart that comes to life. Starting from the root node (our main decision point), we'll branch out into a maze of choices. Each fork in the road brings us closer to our prediction, creating a fascinating tree of possibilities. It's data storytelling at its finest!
Random Forest: The wisdom of the crowd. Why settle for one tree when you can have a whole forest? This ensemble method creates a bustling community of decision trees, each with its own unique perspective. For classification, they vote on the outcome. For regression, they put their heads together to give us an average prediction. It's democracy in action!
Gradient Boosting: The tag team champion. Last but not least, we have the powerhouse of the ML world. Gradient Boosting is like a relay race of algorithms, each one learning from the mistakes of the last. These "weak learners" team up to form an unstoppable prediction machine. It's the ultimate example of "stronger together."
Each of these models brings something special to the table. We'll be putting them through their paces, seeing how they handle our Walmart sales data. Get ready for some friendly competition as we uncover which algorithm reigns supreme in our forecasting challenge!
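To make the "learning from the mistakes of the last" idea concrete, here is a hand-rolled sketch of gradient boosting for squared error. It is a teaching illustration on synthetic data, not the library implementation used in this project: each shallow tree is fitted to the residuals left by the trees before it, and the learning rate, tree depth, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 10 * np.sin(X[:, 0]) + X[:, 0]

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = [float(np.mean((y - prediction) ** 2))]
trees = []

for _ in range(20):
    # Each weak learner is trained on what the ensemble still gets wrong.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped version of its correction to the running prediction.
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    errors.append(float(np.mean((y - prediction) ** 2)))

print(f"training MSE: {errors[0]:.2f} -> {errors[-1]:.2f}")
```

Each round can only lower the training error here, which is also why boosting needs validation data to decide when to stop.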
Let's start crunching some numbers!
Ridge Regression Model
Let’s build the Ridge model:
Output:
Let’s calculate the accuracy of the Ridge Model:
Output:
The Ridge model does not appear to perform any better than the baseline linear regression model.
Let’s create a scatter plot of observed and predicted values:
Output:
Decision Tree Model
Let’s build the decision tree model and calculate its accuracy:
Output:
The performance of the decision tree is better than the Ridge model.
Let’s create a scatter plot for the decision tree model:
Output:
Let’s determine the number of levels and the number of leaf nodes in the decision tree model:
Output:
Since we did not explicitly set the max_depth and max_leaf_nodes parameters of the decision tree, we allowed the tree to grow to its full depth, which includes 47 levels and 292865 leaf nodes.
Not constraining max_depth or max_leaf_nodes can either help the tree capture complex patterns in the training data, or it can lead to the model fitting the training data to such an extent that it fails to make accurate predictions outside of the training data, which is called overfitting.
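The effect of leaving the tree unconstrained can be sketched on synthetic noisy data. The unconstrained tree grows many more leaves and fits the training data far more closely than a depth-limited tree, while its validation error shows whether that extra complexity actually generalizes. The data, split, and depth limit below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a simple linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 2))
y = 5 * X[:, 0] + rng.normal(0, 1.0, size=300)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

def mean_abs_error(model, inputs, targets):
    return float(np.mean(np.abs(targets - model.predict(inputs))))

# No max_depth or max_leaf_nodes: the tree grows until its leaves are pure.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
# A constrained tree for comparison.
small_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("max_depth=3", small_tree)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| train MAE:", round(mean_abs_error(tree, X_train, y_train), 3),
          "| val MAE:", round(mean_abs_error(tree, X_val, y_val), 3))
```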
Let’s build a graphic visualization of the decision tree but limit the number of levels to three:
Output:
Next, let’s evaluate which columns, or features, are the most important for the decision tree model. To do this, we’ll review the weights that the model assigned to each feature:
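A minimal sketch of reading feature importance from a fitted tree, using synthetic data in which a made-up Dept column drives most of the target. In scikit-learn, the feature_importances_ attribute holds impurity-based scores that sum to 1; the name importance_df is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic inputs: Dept dominates the target, Size matters a little, Noise not at all.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Dept": rng.integers(1, 10, 500),
    "Size": rng.uniform(0, 1, 500),
    "Noise": rng.uniform(0, 1, 500),
})
y = 1000 * X["Dept"] + 100 * X["Size"]

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Pair each column with its importance score and sort from most to least important.
importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance": tree.feature_importances_,
}).sort_values("importance", ascending=False)

print(importance_df)
```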
Output:
Let’s visualize feature importance in the model using a bar chart:
Output:
In this model, the most important feature is Dept, followed by Size, Week, and Store.
Random Forest Model
Let’s build and train a random forest model and calculate its accuracy:
Output:
This model appears to perform better than our prior models.
Let’s create the scatter plot:
Output:
Let’s review feature importance in the random forest model:
Output:
Let’s visualize feature importance with a bar chart:
Output:
The variables Dept, Size, Week, and Store are the most important features in this model.
Gradient Boosting Model
Let’s train a gradient boosting model and calculate its accuracy:
Output:
The gradient boost model does not perform as well as the decision tree or random forest models.
Let’s generate the scatter plot for the model:
Output:
XGBoost provides a feature importance score for each column, or feature, of the input, like decision trees and random forests:
Output:
Let’s visualize the weights assigned to each feature:
Output:
Hyperparameter Tuning
Imagine parameters as the building blocks of the model's brain. As the model learns, it tweaks these blocks to create the perfect map from input to output.
Now, hyperparameters are like the secret sauce in the model's recipe. They guide the learning process, but once the dish is done, they vanish - leaving only the perfectly seasoned parameters behind.
As ML chefs, we get to play with this secret sauce before cooking begins. Our mission? To find the perfect blend that makes our model's predictions as spot-on as possible. It's like a culinary experiment - we keep tweaking the recipe, tasting the results, until we discover that mouthwatering combination of hyperparameters that makes our model shine.
Decision Tree Regressor Model
Let’s dive into the world of decision trees. We’ll build a regressor that learns from its mistakes!
Picture this: we're about to embark on a treasure hunt, searching for the perfect parameters to make our model shine. Our map? A series of charts that'll guide us through the jungle of hyperparameters.
We'll set up two companions for our journey: two inputs named 'param_name' and 'param_values'. Think of them as our compass and binoculars, helping us navigate the terrain of possibilities.
Our quest will take us through the lands of two decision tree parameters, 'max_depth' and 'max_leaf_nodes'. We'll be on the lookout for two troublemakers: overfitting and underfitting. If we spot the training and validation error lines drifting apart like old friends, we've stumbled into overfitting territory. On the flip side, if both lines stick together like glue at high values, we might be in the underfitting zone.
The ultimate prize? The sweet spot where our validation error hits rock bottom. That's where our model finds the perfect balance - not too simple, not too complex, but just right.
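The search described above can be sketched as a small helper that trains one tree per candidate value and records both errors. The function name sweep_param is hypothetical, the data is synthetic, and plain mean absolute error stands in for WMAE to keep the sketch short; 'param_name' and 'param_values' follow the names mentioned earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sweep_param(param_name, param_values, X_train, y_train, X_val, y_val):
    """Train one tree per value of param_name; return (train_errors, val_errors)."""
    train_errors, val_errors = [], []
    for value in param_values:
        # Pass the hyperparameter by name, e.g. max_depth=4.
        model = DecisionTreeRegressor(random_state=42, **{param_name: value})
        model.fit(X_train, y_train)
        train_errors.append(float(np.mean(np.abs(y_train - model.predict(X_train)))))
        val_errors.append(float(np.mean(np.abs(y_val - model.predict(X_val)))))
    return train_errors, val_errors

# Synthetic data: two informative features, one irrelevant feature, plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1.0, size=400)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

depths = [1, 2, 4, 8, 16]
train_errors, val_errors = sweep_param("max_depth", depths, X_train, y_train, X_val, y_val)

# The sweet spot is the value with the lowest validation error.
best_depth = depths[int(np.argmin(val_errors))]
for depth, tr, va in zip(depths, train_errors, val_errors):
    print(f"max_depth={depth:>2}  train MAE={tr:.3f}  val MAE={va:.3f}")
print("best max_depth:", best_depth)
```

Plotting train_errors and val_errors against the candidate values gives the kind of chart discussed in this section.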
Let’s chart these waters and discover the hidden treasures of optimal parameters:
Output:
max_depth:
The trend shows errors decreasing significantly beyond max_depth of 10, indicating that 10 may be insufficient for the data. The flat lines after max_depth of 10, however, suggest that increasing max_depth further does not significantly improve the model’s performance on unseen (validation) data.
max_leaf_nodes:
The error lines do not significantly diverge, nor are they close together. The lowest point on the validation error line is at a max_leaf_nodes value of 4000, but the line is still slightly declining there.
Decision Tree with Custom Parameters
Based on the decision tree regressor model results, we’ll set max_depth to 18 and leave max_leaf_nodes unspecified:
Calculate Accuracy
Output:
Let’s plot the model:
Output:
Let’s confirm the depth of the tree and number of leaf nodes:
Output:
Let’s determine feature importance in the model:
Output:
Plot the Feature Importance:
Output:
The variables Dept, Size, Week, and Store are the most important for this model.
Let’s make predictions using our final model:
Output:
Random Forest Regressor Model
We can tune parameters in a random forest regressor model in a similar manner to decision trees.
Here we’ll tune four parameters: n_estimators, min_samples_leaf, max_leaf_nodes, and max_depth:
Output:
n_estimators overfitting information:
The model’s performance on unseen (validation) data decreases slightly as n_estimators approaches 20 and then remains relatively constant.
min_samples_leaf overfitting information:
The high, flat validation line suggests that the model is not generalizing well to new data and that changing min_samples_leaf does not significantly improve the model’s performance. The upward trending training line indicates training error is increasing.
This may suggest the model is underfitting the data, as it may be too simple to capture the underlying patterns in the training and validation datasets.
max_leaf_nodes overfitting information:
As max_leaf_nodes increases, the model becomes more complex. Divergence between training and validation errors as max_leaf_nodes increases is a sign of overfitting. We don’t see that here; we see nearly parallel decreasing error lines. This indicates that increasing max_leaf_nodes improves the model’s performance at a similar rate for both training and validation data without overfitting. It also indicates that the model may be underfitting at lower max_leaf_nodes values.
max_depth overfitting information:
As max_depth increases, the model becomes increasingly complex. Both error lines decrease sharply until max_depth of 15, then they reduce further to 20, and then appear to level or decrease at a slower rate.
Random Forest with Custom Parameters
Based on the random forest regressor model results above, we’ll set n_estimators to 17, min_samples_leaf to 3, and max_depth to 20. We will not set max_leaf_nodes and instead allow the model to learn on its own:
Let’s calculate the accuracy of this model:
Output:
Let’s plot the observed and predicted values in a scatter plot:
Output:
Let’s review the Random Forest Feature Importance and look at the weights assigned to different columns, to figure out which columns in the dataset are the most important for this model.
Output:
Create a bar chart of feature importance.
Output:
The variables Dept, Size, Week, and Store are the most important variables in the model.
Let’s now make predictions using our final model.
Output:
Gradient Boost Regressor Model
We can also tune parameters in a gradient boost regressor model. Here, we’ll tune three hyperparameters, n_estimators, max_depth, and learning_rate, and provide a range of values to test for each:
Output:
n_estimators overfitting information:
The error lines fall substantially as the n_estimators parameter increases, indicating the model’s performance on the validation dataset improves as well. The trend line suggests diminishing returns may begin after 350, suggesting that the model is not overfitting as more estimators are added through this point.
max_depth overfitting information:
learning_rate overfitting information:
Validation error decreases substantially through a learning_rate value of 0.2, then decreases slightly as learning_rate approaches 0.3, and increases sharply thereafter. This indicates that the model may be underfitting the data at low values and overfitting at higher values. The model also appears to be sensitive to learning_rate, suggesting small changes may have a large impact on model performance. Lower learning_rate values tend to result in slower but more precise learning while higher rates tend to result in faster but less stable learning.
Gradient Boost with Custom Parameters
Based on the gradient boost regressor model results, we’ll set n_estimators to 300, max_depth to 13, and learning_rate to 0.2 to train a new model:
Calculate Accuracy
Output:
Create Scatter Plot Gradient Boost Model with Custom Parameters
Output:
Let’s review Gradient Boost Feature Importance.
Output:
Let’s create a bar chart of feature importance.
Output:
The variables Size, Type_B, Dept, Week, and Type_A are the most important for this model.
Let’s use the gradient boost regressor model to make predictions using the test dataset:
Output:
5. Performance of the Best Model
The following table shows the WMAE results for each of our models:
And the winner is. . . Gradient Boosting with Custom Parameters!
The tag team champion proved its mettle, outperforming the competition in our sales forecasting showdown. Like a master chess player, the algorithm’s strength lies in its ability to learn from its missteps, continuously refining its strategy with each iteration. It sliced through Walmart’s data, navigating holiday rushes and promotional events with agility. By combining the insights of several “weak learners,” Gradient Boosting created a powerhouse predictor that left rival models in the dust. Sometimes the best way to solve a problem, as the adage goes, is to break it down into smaller pieces and tackle them one at a time. Hats off to Gradient Boosting!
Let’s predict sales on the test dataset using the final Gradient Boosting model:
Output:
6. Potential Future Work
Remove outliers.
Tune additional hyperparameters.
Test other machine learning regression models and evaluate how they perform.