OK, let us start our project: predicting the sales.

**Step 0: Define the root mean square function for checking my model later.**
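The author's own function is not shown here, so the following is only a minimal Python sketch of what such a metric could look like. It assumes the plain root mean square error; the function name `rmse` and the exact formula used in the project are assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    # Average the squared differences, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A lower value means the predictions sit closer to the observed sales, so the same function can be reused to compare every model in Step 4.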

**Step 1: Load the data sets and combine them accordingly.**
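The loading code is not shown, so here is a rough Python sketch of one way to combine two tables. It assumes the sales rows and the store attributes live in separate CSV files that share a `Store` key; that key, the helper names, and all the values below are made up for illustration.

```python
import csv
import io

# Toy stand-ins for the two files (invented values, not the real data).
TRAIN_CSV = """Store,Date,Sales
1,2015-07-31,100
2,2015-07-31,200
1,2015-07-30,150
"""
STORE_CSV = """Store,StoreType,CompetitionDistance
1,c,1000
2,a,500
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on_store(train_rows, store_rows):
    """Left-join the store attributes onto each sales row by the 'Store' key."""
    store_by_id = {row["Store"]: row for row in store_rows}
    merged = []
    for row in train_rows:
        combined = dict(row)
        combined.update(store_by_id.get(row["Store"], {}))
        merged.append(combined)
    return merged


merged = merge_on_store(read_rows(TRAIN_CSV), read_rows(STORE_CSV))
```

The same join would be applied to the test rows so that both sets carry identical store features.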

It is always important to know what kind of data we are dealing with.

Use `summary(train)` to take a look; the result is below:

**Step 2: Exploratory data analysis. Take a closer look at the data we have.**

First, we can make one large graph that plots every available variable against Sales:

It seems that “Customers” is positively related to Sales. However, the test data does not include “Customers”, so I think dropping it from the later modeling will be helpful.

From the graph, it is easy to see that most of our features are categorical.

I noticed several interesting points. First, sales are higher on weekdays than on weekends; Sunday has the lowest sales. Second, holidays influence sales a lot.
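A comparison like the weekday one can also be checked numerically by averaging Sales per category. The Python sketch below uses invented toy rows and assumes `DayOfWeek` is coded 1 (Monday) to 7 (Sunday); the helper name `mean_sales_by` is hypothetical.

```python
from collections import defaultdict


def mean_sales_by(rows, key):
    """Return the average of row['Sales'] for each distinct value of row[key]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["Sales"]
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Invented toy rows, only to show the shape of the computation.
toy_rows = [
    {"DayOfWeek": 1, "Sales": 100.0},
    {"DayOfWeek": 1, "Sales": 120.0},
    {"DayOfWeek": 7, "Sales": 10.0},
    {"DayOfWeek": 7, "Sales": 30.0},
]
by_day = mean_sales_by(toy_rows, "DayOfWeek")
```

The same helper works for any categorical column, for example `mean_sales_by(rows, "StoreType")`.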

How about the other variables? Let us see whether “StoreType” shows anything interesting.

It seems that StoreType “a” is more likely to reach higher sales.

Next, look at “CompetitionDistance” together with “StoreType”:

The plot tells us that, for StoreType “a”, higher sales go with a smaller competition distance. So “CompetitionDistance” and “StoreType” will be crucial features for modeling.

**Step 3: Transforming the data**

We can derive “week”, “Day”, and “Month” from “Date”.
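As a minimal Python sketch of this step, the date parts can be pulled out with the standard library. It assumes “Date” is an ISO string such as `2015-07-31` and uses the ISO week number; the author's exact week definition is not shown, and `date_parts` is a hypothetical helper.

```python
from datetime import date


def date_parts(iso_date):
    """Split an ISO 'YYYY-MM-DD' string into week, day-of-month, and month."""
    d = date.fromisoformat(iso_date)
    return {
        "week": d.isocalendar()[1],  # ISO week number (assumed definition)
        "Day": d.day,
        "Month": d.month,
    }


parts = date_parts("2015-07-31")
```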

We can make dummy variables from “StateHoliday” and “StoreType”.
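Dummy variables replace one categorical column with one 0/1 column per level. The Python sketch below shows the idea on toy rows; the helper `one_hot` and the `Column_level` naming scheme are assumptions, not the author's code.

```python
def one_hot(rows, column):
    """Replace `column` in each row with one 0/1 indicator per observed level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other field unchanged, then add the indicators.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


toy = [{"StoreType": "a"}, {"StoreType": "c"}, {"StoreType": "a"}]
encoded = one_hot(toy, "StoreType")
```

For linear models it is common to drop one level per variable to avoid perfectly collinear columns; tree-based models usually do not need that.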

We can also create “WeekEnd” from “DayOfWeek”.
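A weekend flag is a one-line rule. This Python sketch assumes `DayOfWeek` runs from 1 (Monday) to 7 (Sunday) and treats 6 and 7 as the weekend; both the coding and the function name are assumptions.

```python
def is_weekend(day_of_week):
    """Return 1 for Saturday (6) or Sunday (7), else 0."""
    return 1 if day_of_week in (6, 7) else 0


flags = [is_weekend(d) for d in range(1, 8)]
```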

We merge every kind of holiday into a single feature, “HolidayT”.

Using “HolidayT” and “CompetitionDistance”, we create “DistanceHoliday”.
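How these two features are built is not shown, so the Python sketch below is only one plausible reading. It assumes “StateHoliday” uses the string `"0"` for "no holiday", that a hypothetical `SchoolHoliday` 0/1 column also exists, and that “DistanceHoliday” is a simple interaction (product) of the holiday flag and the distance. All three are guesses.

```python
def holiday_t(state_holiday, school_holiday):
    """1 if any kind of holiday applies, else 0 (assumed definition)."""
    return 1 if state_holiday != "0" or school_holiday == 1 else 0


def distance_holiday(holiday_flag, competition_distance):
    """Interaction term: the distance counts only on holidays (assumed definition)."""
    return holiday_flag * competition_distance


flag = holiday_t("a", 0)
interaction = distance_holiday(flag, 570.0)
```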

Next, we need to delete the variables that are no longer useful.

With that, our data is transformed and ready for modeling!

**Step 4: Modeling**

Since we do not want “Customers”, we drop it. We also need to create data sets suited to the different models.

The First Model: Linear Regression + Lasso Regression + Ridge Regression

For the linear regression model, we can use the stepwise method to select the variables.

Then we can combine these three regression models with different weights and get our first Kaggle score!

The Second Model: Random Forest

Before fitting the model, we need to select the best parameters.

Then we apply the random forest model to the test file:

I tried the random forest model many times, sometimes with mtry = 10 and sometimes with mtry = 9.

Below is the best Kaggle score using the random forest model:

The Third Model: XGBoost

Similarly, before training the model, we need to select the best parameters.

Here, I used grid search over several parameters.
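Grid search simply tries every combination of candidate values and keeps the one with the lowest validation error. The Python sketch below shows only the search loop. The scoring function is a toy stand-in for "train the model and return its validation error", and the parameter names `max_depth` and `eta`, while real XGBoost parameters, are not necessarily the ones tuned in this project.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination; return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # lower is better, e.g. a validation error
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a real train-and-validate step."""
    return (params["max_depth"] - 6) ** 2 + (params["eta"] - 0.1) ** 2


grid = {"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
```

The number of fits grows multiplicatively with each added parameter, so keeping the grid small matters when every evaluation means training a full model.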

Then we apply the XGBoost model with the best parameters.

Using code similar to that shown in the regression model section, I get the predictions from XGBoost and submit them to Kaggle to get a score.

After several attempts, the best score is:

The Final Model: Random Forest Combined with XGBoost

We can try some combinations of models. Our Model 1 (the combination of regressions) may also be helpful here.

After several attempts, I found that 0.75 * xgboost + 0.25 * random_forest is the best combination for my data.

So the final and best Kaggle score comes from the combination of random forest and XGBoost:
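The weighted combination above is a per-row average of the two models' predictions. Here is a minimal Python sketch; the function name `blend` and the toy prediction values are made up for illustration.

```python
def blend(xgb_preds, rf_preds, xgb_weight=0.75):
    """Weighted average of two prediction lists; the weights sum to 1."""
    if len(xgb_preds) != len(rf_preds):
        raise ValueError("prediction lists must have the same length")
    rf_weight = 1.0 - xgb_weight
    return [xgb_weight * x + rf_weight * r for x, r in zip(xgb_preds, rf_preds)]


# Invented predictions for two rows.
blended = blend([100.0, 200.0], [80.0, 240.0])
```

Trying other weights only means changing `xgb_weight` and comparing the resulting scores.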

To wrap up the project: we used several models to predict the sales, namely lasso regression, ridge regression, linear regression, random forest, and XGBoost, and we also applied some combinations of them.

Here is the rank of my result out of 80 students! I hope this is helpful to anyone who wants to learn machine learning.
