OK, let us start our project: predicting the sales.

Step 0: Define the Root Mean Square function for checking my model later.

Step 1: Load the data sets and combine them accordingly.

It is always important to know what kind of data we are dealing with. Use summary(train) to see; the result is below:

Step 2: Exploratory data analysis. Take a closer look at the data we have.

First, we can make one large graph of every available variable against Sales.

It seems that “Customers” has a positive relation with Sales. However, the test data does not have “Customers”, so I think dropping “Customers” from the later modeling will be helpful.

From the graph, it is easy to see that most of our features are categorical.

I noticed several interesting points. First, sales are higher on weekdays than on weekends; Sunday has the lowest sales. Second, holidays influence sales a lot. How about the other variables? Let us see whether “StoreType” shows anything interesting.

It seems that “StoreType a” is more likely to reach higher sales.
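
The comparisons above (Sales by day of week, Sales by “StoreType”) come down to averaging Sales within each category. The post's own plotting code is not reproduced here, so the following is only a minimal Python sketch of that idea. The helper name mean_sales_by and the toy rows are made up for illustration and are not the real data.

```python
from collections import defaultdict


def mean_sales_by(rows, key):
    """Return {category: mean of Sales} for one categorical column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["Sales"]
        counts[row[key]] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up toy rows, for illustration only (NOT the competition data).
toy_rows = [
    {"StoreType": "a", "DayOfWeek": 1, "Sales": 6000},
    {"StoreType": "a", "DayOfWeek": 7, "Sales": 2000},
    {"StoreType": "b", "DayOfWeek": 1, "Sales": 3000},
    {"StoreType": "b", "DayOfWeek": 7, "Sales": 1000},
]

by_type = mean_sales_by(toy_rows, "StoreType")
by_day = mean_sales_by(toy_rows, "DayOfWeek")
print(by_type)
print(by_day)
```

The same helper works for any categorical column, which is convenient here because most of the features are categorical.
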
Moreover, check “CompetitionDistance” against “StoreType”:

The plot tells us that, for StoreType “a”, higher sales go with a smaller competition distance. So “CompetitionDistance” and “StoreType” will be crucial features for modeling.

Step 3: Transforming the data

We can define “week”, “Day” and “Month” from “Date”.
We can make dummy variables from “StateHoliday” and “StoreType”.
We can also create “WeekEnd” from “DayOfWeek”.
We combine all the possible holidays into “HolidayT”.
Using “HolidayT” and “CompetitionDistance”, we create “DistanceHoliday”.
Next, we delete some useless variables.

Then our data is transformed and ready for modeling!

Step 4: Modeling

Since we do not want “Customers”, we drop it, and we create data sets suited to the different models.

The First Model: Linear Regression + Lasso Regression + Ridge Regression

For the linear regression model, we can use the stepwise method to select the variables.

Then we can combine these three regression models with different weights, and we get our first Kaggle score!

The Second Model: Random Forest

Before using the raw model, we need to select the best parameters.

Applying the random forest model to the test file, I tried many times, sometimes with mtry = 10, sometimes with mtry = 9.

And below is the best Kaggle score using the random forest model:

The Third Model: XGBoost

Similarly, before training the model, we need to select the best parameters. Here, I used grid search over several parameters. Then we apply the xgboost model with the best parameters.

Using code similar to that shown in the regression model section, I get the predictions from xgboost and submit them to Kaggle to get a score.

Through several attempts, the best score is:

The Final Model: Random Forest combined with XGBoost

We can try some combinations of models.
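
A weighted combination of two models' predictions can be sketched as below, in Python with the standard library only. This is an illustration under assumptions, not the code used in this project: it assumes the Step 0 function is a plain root mean square error (the competition's own metric may differ), it picks the weight on a held-out set rather than from Kaggle submissions, and the names rmse, blend and best_weight as well as the toy numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def blend(pred_a, pred_b, weight_a):
    """Weighted average: weight_a * pred_a + (1 - weight_a) * pred_b."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def best_weight(actual, pred_a, pred_b, steps=20):
    """Try weights 0, 1/steps, ..., 1 and keep the one with the lowest RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(actual, blend(pred_a, pred_b, w)))


# Made-up held-out values and predictions, for illustration only.
actual = [100.0, 200.0, 300.0]
pred_a = [110.0, 210.0, 310.0]  # hypothetical model A, always 10 too high
pred_b = [90.0, 190.0, 290.0]   # hypothetical model B, always 10 too low

w = best_weight(actual, pred_a, pred_b)
print(w, rmse(actual, blend(pred_a, pred_b, w)))
```

Scanning a coarse grid of weights keeps the search simple; a finer grid or more than two models follows the same pattern.
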
Model 1 (the combination of regressions) may be helpful here. Through several attempts, I found that 0.75 xgboost + 0.25 random_forest is the best combination for my data.

So the final and best Kaggle score comes from the combination of random forest and XGBoost.

End of the project. We used several models to predict the sales: Lasso regression, ridge regression, linear regression, random forest and XGBoost, and we applied some combinations of them.

Here is the rank of my result out of 80 students!

I hope this is a good help to anyone who wants to learn machine learning.