OK, let us start our project: predicting the sales.
Step 0 : Define the root mean square error function for checking my models later.
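As a rough illustration of such a metric, here is a minimal Python sketch. It assumes the plain root mean square error over two equal-length lists; the name `rmse` and the input checks are my own choices, and a percentage-based variant would need a different formula.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    # Average the squared differences, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A lower value means the predictions are closer to the observed sales, so the same function can be used to compare every model later on.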
Step 1 : Load the data sets and combine them accordingly.
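Combining data sets usually means joining them on a shared key. The Python sketch below shows one way to do a left join on in-memory rows. The key name "Store", the helper `merge_on_key`, and the rows themselves are made up for illustration and are not taken from the project data.

```python
def merge_on_key(rows, lookup_rows, key):
    """Left-join rows with lookup_rows on a shared key column."""
    lookup = {r[key]: r for r in lookup_rows}
    merged = []
    for row in rows:
        # Rows without a match in the lookup table keep only their own columns.
        extra = lookup.get(row[key], {})
        merged.append({**extra, **row})
    return merged


# Made-up rows, only to show the shape of the result.
sales = [{"Store": 1, "Sales": 5263}, {"Store": 2, "Sales": 6064}]
stores = [{"Store": 1, "StoreType": "c"}, {"Store": 2, "StoreType": "a"}]
combined = merge_on_key(sales, stores, "Store")
```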
It is always important to know what kind of data we are dealing with.
Use summary(train) to see this; the result is below:
Step 2 : Exploratory data analysis. Take a closer look at the data we have.
First, we can make one large graph of every available variable against Sales.
“Customers” appears to have a positive relationship with Sales. However, the test data does not have “Customers”, so I think it is best to drop “Customers” from the later modeling.
From the graph, it is easy to see that most of our features are categorical.
I noticed several interesting points. First, sales are higher on weekdays than on weekends, and Sunday has the lowest sales. Second, holidays influence sales a lot.
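A comparison like the weekday one can also be checked numerically by averaging Sales per category. The Python sketch below does that on a few made-up rows; the helper name `mean_by_group` and the toy numbers are assumptions for illustration, not results from the project.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows, not real data: two Mondays (1) and one Sunday (7).
toy = [
    {"DayOfWeek": 1, "Sales": 100.0},
    {"DayOfWeek": 1, "Sales": 300.0},
    {"DayOfWeek": 7, "Sales": 50.0},
]
averages = mean_by_group(toy, "DayOfWeek", "Sales")
```

The same helper works for any categorical column, for example “StoreType”.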
How about the other variables? Let us see whether “StoreType” shows anything interesting.
It seems that StoreType “a” is more likely to reach higher sales.
Moreover, check “CompetitionDistance” together with “StoreType”:
The plot tells us that higher sales go with a smaller competition distance for StoreType “a”. So “CompetitionDistance” and “StoreType” will be crucial features for modeling.
Step 3 : Transforming the Data
We can define “Week”, “Day” and “Month” from “Date”;
We can make dummy variables from “StateHoliday” and “StoreType”;
Also, we can create “WeekEnd” from “DayOfWeek”;
We combine all the possible holidays into “HolidayT”;
Using “HolidayT” and “CompetitionDistance”, we create “DistanceHoliday”;
Next, we need to delete the variables that are no longer useful;
Now our data has been transformed and is ready for modeling!
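Some of the transformations above can be sketched in Python as follows. This is only an illustration under several assumptions that do not come from the original code: “Date” is an ISO string, “DayOfWeek” runs from 1 (Monday) to 7 (Sunday), the store types are "a" to "d", “Week” means the ISO week number, and “Date” and “StoreType” are the columns dropped afterwards. “HolidayT” and “DistanceHoliday” are left out because their exact definitions are not given.

```python
from datetime import date


def add_features(row, store_types=("a", "b", "c", "d")):
    """Return a copy of row with date parts, a weekend flag and dummies."""
    d = date.fromisoformat(row["Date"])
    out = dict(row)
    out["Month"] = d.month
    out["Day"] = d.day
    out["Week"] = d.isocalendar()[1]  # ISO week number (assumption)
    # Assumes DayOfWeek is 1 (Monday) to 7 (Sunday).
    out["WeekEnd"] = 1 if row["DayOfWeek"] in (6, 7) else 0
    # One 0/1 dummy column per store type (assumed levels).
    for t in store_types:
        out["StoreType_" + t] = 1 if row["StoreType"] == t else 0
    # Drop the raw columns that the derived features replace.
    for name in ("Date", "StoreType"):
        del out[name]
    return out


# Made-up row for illustration.
example = add_features(
    {"Date": "2015-07-31", "DayOfWeek": 5, "StoreType": "a", "Sales": 5263.0}
)
```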
Step 4 : Modeling
Since we do not want “Customers”, we drop it, and we create data sets suited to the different models.
The First Model : Linear Regression + Lasso Regression + Ridge Regression
For the linear regression model, we can use the stepwise method to select the variables.
Then we combine these three regression models with different weights and get our first Kaggle score!
The Second Model : Random Forest
Before using the raw model, we need to select the best parameters.
Then we apply the random forest model to the test file.
I tried the random forest model many times, sometimes with mtry = 10 and sometimes with mtry = 9.
Below is the best Kaggle score using the random forest model:
The Third Model : XGBoost
Similarly, before training the model, we need to select the best parameters.
Here, I used grid search over several parameters.
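Grid search simply tries every combination of candidate values and keeps the one with the lowest validation error. Below is a generic Python sketch of that loop. The parameter names `max_depth` and `eta`, the candidate values, and the stand-in scoring function are assumptions for illustration; in practice the scorer would train a model and return its validation error.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score), minimizing score_fn over the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    # Try every combination of the candidate values.
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_error(params):
    """Stand-in scorer (hypothetical); replace with real model validation."""
    return (params["max_depth"] - 6) ** 2 + abs(params["eta"] - 0.1)


grid = {"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]}
best, err = grid_search(grid, fake_validation_error)
```

The number of model fits grows multiplicatively with each added parameter, so it helps to keep each list of candidates short.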
Then we apply the XGBoost model with the best parameters.
Using code similar to that shown in the regression model section, I get the result from XGBoost and submit it to Kaggle to get a score.
After several attempts, the best score is:
The Final Model : Random Forest Combined with XGBoost
We can try some combinations of models. Our Model 1 (the combination of regressions) may be helpful at this point.
After several attempts, I found that 0.75 xgboost + 0.25 random_forest is the best combination for my data.
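Blending with these weights is just a weighted average of the two prediction vectors. Here is a small Python sketch of it; the function name `blend` and the prediction values are made up for illustration, and only the 0.75 / 0.25 weights come from the text above.

```python
def blend(predictions, weights):
    """Weighted average of several models' prediction lists."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    # Combine the models row by row.
    return [
        sum(w * p[i] for w, p in zip(weights, predictions))
        for i in range(n)
    ]


# Made-up predictions from the two models.
xgb_pred = [100.0, 200.0]
rf_pred = [120.0, 160.0]
final_pred = blend([xgb_pred, rf_pred], [0.75, 0.25])
```

The same function would also cover the earlier blend of the three regression models, given three prediction lists and three weights.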
So the final and best Kaggle score comes from the combination of random forest and XGBoost:
That is the end of the project. We used several models to predict the sales:
lasso regression, ridge regression, linear regression, random forest and XGBoost, and we also applied some combinations of them.
Here is the rank of my result out of 80 students! I hope this is a good help to anyone who wants to learn machine learning.