
Predicting the Rossmann Store Sales Using Different Models in R: A Kaggle In-Class Project

March 31, 2021

OK, let us start our project: predicting the sales.

Step 0: Define the root mean square function for checking my models later.

root.mean.square.jpg.png
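
The original function was posted as an image, so here is a minimal sketch of what such a helper might look like. It assumes a plain root mean square error (RMSE); the competition metric may be a percentage variant, and the name rmse is my own choice for illustration.

```r
# Root mean square error between observed and predicted values.
# Assumption: plain RMSE; pairs containing a missing value are dropped.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  keep <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((actual[keep] - predicted[keep])^2))
}

# Example call with made-up numbers
rmse(c(100, 200, 300), c(110, 190, 300))
```

A lower value means the predictions are closer to the observed sales.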

Step 1: Load the data sets and combine them accordingly.

loadingData.Combine.jpg.png
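
As a rough sketch of the combining step, assuming the store attributes live in a separate table keyed by a Store column: the two small data frames below are made-up stand-ins for the real files, which would normally be read with read.csv.

```r
# Toy stand-ins for the training table and the store table
# (all values are made up for illustration).
train <- data.frame(
  Store = c(1, 1, 2),
  Date  = as.Date(c("2015-07-30", "2015-07-31", "2015-07-31")),
  Sales = c(100, 120, 90)
)
store <- data.frame(
  Store = c(1, 2),
  StoreType = c("c", "a"),
  CompetitionDistance = c(1000, 500)
)

# Left join: keep every training row and attach its store attributes.
train_full <- merge(train, store, by = "Store", all.x = TRUE)
summary(train_full)
```

Using all.x = TRUE keeps training rows even when a store has no entry in the store table.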

It is always important to know what kind of data we are dealing with.
Use summary(train) to take a look; the result is below:
train.summary.jpg.png

Step 2: Exploratory data analysis. Take a closer look at the data we have.
First, we can make one large graph that plots every available variable against Sales:
plotAll.Code.jpg.png

1.plot.result.jpg.png

“Customers” appears to have a positive relationship with Sales. However, the test data does not contain “Customers”, so I think dropping it from the later modeling will be helpful.
From the graph, it is easy to see that most of our features are categorical.
I noticed several interesting points. First, sales are higher on weekdays than on weekends, with Sunday showing the lowest sales. Second, holidays influence sales a lot.
How about the other variables? Let us see whether “StoreType” shows anything interesting.
storeType.jpg.png

storeType2.jpg.png

It seems that StoreType “a” is more likely to reach higher sales.
Next, check “CompetitionDistance” together with “StoreType”:
distance.storetype.jpg.png

distance.storetype2.jpg.png

The plot tells us that higher sales go together with a smaller competition distance for StoreType “a”. So “CompetitionDistance” and “StoreType” will be crucial features for modeling.

Step 3: Transforming the data
We can derive “week”, “Day”, and “Month” from “Date”:
trainsformaingDate.jpg.png
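
A minimal sketch of deriving calendar features from a Date column in base R. The column names follow the text above; using the ISO 8601 week number for the week is my assumption, since the original code is only available as an image.

```r
# Two example dates (chosen only for illustration)
dates <- as.Date(c("2015-07-31", "2015-01-01"))

date_features <- data.frame(
  Date  = dates,
  Day   = as.integer(format(dates, "%d")),  # day of the month
  Month = as.integer(format(dates, "%m")),  # month number
  week  = as.integer(format(dates, "%V"))   # ISO 8601 week of the year (assumption)
)
date_features
```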

We can make dummy variables from “StateHoliday” and “StoreType”.
We can also create “WeekEnd” from “DayOfWeek”.
We combine all the holiday indicators into a single variable, “HolidayT”.
Using “HolidayT” and “CompetitionDistance”, we create “DistanceHoliday”.
MakingDummy.jpg.png
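
The exact definitions are in the original image, so the sketch below rests on several assumptions of mine: DayOfWeek values 6 and 7 mark the weekend, “HolidayT” is 1 when either a state holiday or a school holiday applies (the SchoolHoliday column is assumed), and “DistanceHoliday” is the product of “CompetitionDistance” and “HolidayT”. The data values are made up.

```r
# Toy data frame with made-up values
df <- data.frame(
  DayOfWeek = c(1, 6, 7, 3),
  StateHoliday = factor(c("0", "a", "0", "0"), levels = c("0", "a", "b", "c")),
  SchoolHoliday = c(0, 0, 0, 1),
  StoreType = factor(c("a", "b", "c", "d"), levels = c("a", "b", "c", "d")),
  CompetitionDistance = c(500, 1200, 80, 3000)
)

# One 0/1 column per factor level; "- 1" removes the intercept so every level gets a column.
dummies <- cbind(
  model.matrix(~ StateHoliday - 1, data = df),
  model.matrix(~ StoreType - 1, data = df)
)

# Assumption: 6 and 7 are Saturday and Sunday.
df$WeekEnd <- as.integer(df$DayOfWeek %in% c(6, 7))
# Assumption: any state or school holiday counts as a holiday.
df$HolidayT <- as.integer(df$StateHoliday != "0" | df$SchoolHoliday == 1)
# Assumption: a simple interaction between distance and the holiday flag.
df$DistanceHoliday <- df$CompetitionDistance * df$HolidayT

features <- cbind(
  df[, c("WeekEnd", "HolidayT", "DistanceHoliday", "CompetitionDistance")],
  dummies
)
```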

Next, we need to delete some variables that are no longer useful:
deleteVaribales.jpg.png

Our data is now transformed and ready for modeling!

Step 4 : Modeling
We do not want “Customers”, and different models need differently prepared inputs, so we create a data set suited to each model:
modelingSampleSet.jpg.png

trian2Set.jpg.png

The First Model: Linear Regression + Lasso Regression + Ridge Regression
lassoCode.jpg.png

lasso.1.jpg.png

lasso.2.jpg.png

lasso.error.jpg.png

ridge.Code.jpg.png

ridge1.jpg.png

ridge2.jpg.png

ridge.error.jpg.png
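
To make the idea of ridge shrinkage concrete, here is a small base R sketch. It is not the code from the images above: it swaps in the closed-form ridge solution on simulated data, and the function name ridge_fit and the lambda values are hypothetical choices of mine.

```r
set.seed(1)
n <- 200
# Simulated predictors and response (made up for illustration)
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- 3 * X[, "x1"] - 2 * X[, "x2"] + rnorm(n, sd = 0.5)

# Closed-form ridge coefficients on standardized predictors.
# The response is centered, so the intercept is not penalized.
ridge_fit <- function(X, y, lambda) {
  Xs <- scale(X)
  yc <- y - mean(y)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
  drop(beta)
}

beta_ols   <- ridge_fit(X, y, lambda = 0)   # no penalty: ordinary least squares
beta_ridge <- ridge_fit(X, y, lambda = 50)  # penalized: coefficients shrink toward zero
rbind(beta_ols, beta_ridge)
```

Lasso uses an absolute-value penalty instead of a squared one, which has no closed form and can set some coefficients exactly to zero.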

For the linear regression model, we can use the stepwise method to select the variables:
linear.Code.jpg.png
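
A minimal sketch of stepwise selection using step() from base R, which adds or drops terms by AIC. The data is simulated and the Noise column is a hypothetical irrelevant predictor; the original model formula is in the image and is not reproduced here.

```r
set.seed(42)
n <- 300
# Simulated data: two informative flags and one irrelevant column
toy <- data.frame(
  WeekEnd  = rbinom(n, 1, 2 / 7),
  HolidayT = rbinom(n, 1, 0.2),
  Noise    = rnorm(n)
)
toy$Sales <- 5000 - 1500 * toy$WeekEnd - 2000 * toy$HolidayT + rnorm(n, sd = 300)

# Start from the full model and let step() add or remove terms by AIC.
full_model <- lm(Sales ~ WeekEnd + HolidayT + Noise, data = toy)
step_model <- step(full_model, direction = "both", trace = 0)
formula(step_model)
```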

Then we can combine these three regression models with different weights to get our first Kaggle score!
regression.Code.jpg.png

regression.Error.jpg.png

The Second Model: Random Forest
Before fitting the model, we need to select the best parameters:
randomForestSelect.jpg.png

mtry.plot.jpg.png

randomForest.Code.jpg.png

randomForest.plot.jpg.png

I then applied the random forest model to the test file.
I tried the random forest model many times, sometimes with mtry = 10 and sometimes with mtry = 9.
Below is the best Kaggle score using the random forest model:
randomForest.Error.jpg.png

The Third Model: XGBoost
Similarly, before training the model, we need to select the best parameters.
Here, I used a grid search over several parameters:
xgboost.Code.jpg.png

xgboost.bestTune.jpg
Then we apply the xgboost model with the best parameters:
xgboost.model.jpg.png

Using code similar to that shown in the regression model section, I get the predictions from xgboost and submit them to Kaggle for a score.
After several attempts, the best score is:
xgboost.error.jpg.png
The Final Model: Random Forest Combined with XGBoost
We can try some combinations of models. Model 1 (the combination of regressions) may also be helpful here.
After several attempts, I found that 0.75 xgboost + 0.25 random_forest was the best combination on my data.
finalCombination.jpg.png
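
A weighted blend like this can be sketched in a few lines. The function name blend and the prediction vectors are made up for illustration; only the 0.75 and 0.25 weights come from the text above.

```r
# Weighted average of two prediction vectors.
# w_xgb is the weight on the xgboost predictions; the rest goes to random forest.
blend <- function(pred_xgb, pred_rf, w_xgb = 0.75) {
  stopifnot(length(pred_xgb) == length(pred_rf), w_xgb >= 0, w_xgb <= 1)
  w_xgb * pred_xgb + (1 - w_xgb) * pred_rf
}

# Made-up predictions from the two models for three test rows
pred_xgb <- c(5200, 6100, 4300)
pred_rf  <- c(5000, 6500, 4100)

final_pred <- blend(pred_xgb, pred_rf)
final_pred
```

Because the weights sum to 1, the blended prediction always lies between the two model predictions.
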
The final and best Kaggle score comes from this combination of random forest and xgboost:
combination.error.jpg.png

That is the end of the project. We used several models to predict the sales:
lasso regression, ridge regression, linear regression, random forest, and xgboost, and we also applied some combinations.
Here is the rank of my result out of 80 students! I hope this helps anyone who wants to learn machine learning.
rank.jpg.png
