MACHINE LEARNING: REGRESSION MODELLING WITH R


👉 Briefly, EDA (Exploratory Data Analysis) is an approach for examining datasets to summarize their main characteristics, often with visual techniques.

EDA

library(ggplot2)

Importing the dataset
t_data = read.csv('C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-7/train.csv')
The "head()" function displays the first n (default value n=6) rows of a dataset.
head(t_data)
   x        y
1 24 21.54945
2 50 47.46446
3 15 17.21866
4 38 36.58640
5 87 87.28898
6 36 32.46387


The "tail()" function displays the last n (default value n=6) rows of a dataset.
If you provide a number inside the function like "tail(t_data,10)" it will display the last ten rows.

> tail(t_data)
      x        y
695  81 81.45545
696  58 58.59501
697  93 94.62509
698  82 88.60377
699  66 63.64869
700  97 94.97527

"str(t_data)" will display information about the structure of the dataset named "t_data", including the class of the object (e.g. "data.frame"), the number of rows and columns, and the names of the columns.
> str(t_data)
'data.frame': 700 obs. of  2 variables:
 $ x: num  24 50 15 38 87 36 12 81 25 5 ...
 $ y: num  21.5 47.5 17.2 36.6 87.3 ...
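
The same pieces of information can also be pulled out individually (a quick base-R sketch):

> dim(t_data)    # number of rows and columns: 700 2
> names(t_data)  # column names: "x" "y"
> class(t_data)  # "data.frame"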

The "summary()" function in R provides a brief description of the main characteristics of a given dataset or other type of object. For example, for a numeric variable, it shows the minimum, first quartile, median, mean, third quartile and maximum values, as well as the standard deviation. For a categorical variable, it shows the frequency of each level.
> summary(t_data)
       x                 y
 Min.   :   0.00   Min.   : -3.84
 1st Qu.:  25.00   1st Qu.: 24.93
 Median :  49.00   Median : 48.97
 Mean   :  54.99   Mean   : 49.94
 3rd Qu.:  75.00   3rd Qu.: 74.93
 Max.   :3530.16   Max.   :108.87
                   NA's   :1

The describe() function from the psych package returns a richer summary for each variable: the number of observations, mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skew, kurtosis and standard error.
>library(psych)
>describe(t_data)
  vars   n  mean     sd median trimmed   mad   min     max   range  skew kurtosis   se
x    1 700 54.99 134.68  49.00   49.91 37.06  0.00 3530.16 3530.16 24.54   630.25 5.09
y    2 699 49.94  29.11  48.97   49.82 36.52 -3.84  108.87  112.71  0.05    -1.14 1.10

This code block checks for missing (NA) values in "t_data" and, if any are found, removes the incomplete rows.
is.na() returns a logical vector with TRUE for missing values.
>numberOfNA = length(which(is.na(t_data)==T))
>if(numberOfNA > 0) {
  cat('Number of missing values found: ', numberOfNA)
  cat('\nRemoving missing values...')
  t_data = t_data[complete.cases(t_data), ]
}

Number of missing values found:  1
Removing missing values...
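
As an aside, base R offers more compact equivalents for counting and dropping NAs; a small sketch with the same effect as the block above (run on the raw data, before the removal):

>sum(is.na(t_data))      # total number of NA values: 1 on the raw data
>colSums(is.na(t_data))  # NA count per column
>na.omit(t_data)         # drops incomplete rows, like complete.cases()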
Histogram of x
>ggplot(aes(x = x), data = t_data) +
  geom_histogram(binwidth = 1, color = 'black', fill = "#F79420") +
  scale_x_continuous(limits = c(0, 105), breaks = seq(0, 105, 5)) +
  scale_y_continuous(limits = c(0, 15), breaks = seq(0, 15, 3)) +
  labs(x = "x",
       y = "Frequency",
       title = "x FREQUENCY") +
  theme_dark()



Histogram of y
>ggplot(aes(x = y), data = t_data) +
  geom_histogram(binwidth = 1, color = 'black', fill = "#F79420") +
  scale_x_continuous(limits = c(0, 120), breaks = seq(0, 120, 5)) +
  scale_y_continuous(limits = c(0, 15), breaks = seq(0, 15, 3)) +
  labs(x = "y",
       y = "Frequency",
       title = "y FREQUENCY")


Check for outliers, dividing the plotting area into two columns.
Both boxplots show no outliers and the distributions are not skewed.
>par(mfrow = c(1, 2))
Boxplot for X
>boxplot(t_data$x, main = 'X',
        sub = paste('Outliers:',
                    paste(boxplot.stats(t_data$x)$out, collapse = ', ')))
Boxplot for Y
>boxplot(t_data$y, main = 'Y',
        sub = paste('Outliers:',
                    paste(boxplot.stats(t_data$y)$out, collapse = ', ')))
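
boxplot.stats() flags points that lie more than 1.5 times the interquartile range beyond the hinges. Essentially the same rule written out by hand (a sketch; boxplot.stats() actually uses fivenum() hinges, which can differ slightly from quantile()):

>iqr    <- IQR(t_data$x)
>bounds <- quantile(t_data$x, c(0.25, 0.75)) + c(-1.5, 1.5) * iqr
>sum(t_data$x < bounds[1] | t_data$x > bounds[2])  # 0 if the boxplot shows no outliers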


Calculate the correlation coefficient between the two variables, t_data$x and t_data$y.
The correlation coefficient is a measure of the linear association between two variables.
In this case, the returned value of 0.9953399 is a correlation coefficient that is very close to 1, which indicates a strong positive linear relationship between the two variables.
> cor(t_data$x, t_data$y)
[1] 0.9953399
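
cor() gives only the point estimate; cor.test() from base R also reports a p-value and a confidence interval for the correlation (a quick sketch):

> cor.test(t_data$x, t_data$y)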

The formula argument of the lm() function is used to specify the variables in the data that are used in the regression model. In this case, the formula y ~ . is used, which means that the variable y is the response variable, and all other variables in the data are used as predictor variables. The dot (.) is used as a shorthand for all the variables except the response variable.
>regressor = lm(formula = y ~.,
               data = t_data)
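
Since t_data has only these two columns, y ~ . is equivalent to writing the predictor out explicitly. A quick sketch to confirm (regressor_x is just an illustrative name):

>regressor_x = lm(formula = y ~ x, data = t_data)
>all.equal(coef(regressor), coef(regressor_x))  # should be TRUE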

When applied to a linear regression model, the summary() function returns a variety of information about the model, such as the coefficients of the predictor variables, their standard errors, t-values, and p-values. It also returns other statistics such as R-squared, Adjusted R-squared, F-statistic, and the residuals. These statistics give an overall idea of how well the model fits the data and the significance of each predictor variable in the model.

 In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero.
 The alternative hypothesis is that the coefficients are not equal to zero
  (i.e. there exists a relationship between the independent variable in question and the dependent variable).
 The p-value for x has three stars, which means x is of very high statistical significance.
 The p-value is less than 2e-16; generally, a value below 0.05 is considered significant.
 R-squared tells us the proportion of variation in the dependent (response) variable that has been explained by this model.
 R-squared is 0.99, which shows that x explains almost all of the variation in y.
> summary(regressor)

Call:
lm(formula = y ~ ., data = t_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1523 -2.0179  0.0325  1.8573  8.9132 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.107265   0.212170  -0.506    0.613    
x            1.000656   0.003672 272.510   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.809 on 697 degrees of freedom
Multiple R-squared:  0.9907, Adjusted R-squared:  0.9907 
F-statistic: 7.426e+04 on 1 and 697 DF,  p-value: < 2.2e-16
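
The quantities printed by summary() can also be extracted programmatically (a sketch using standard summary.lm components):

> s <- summary(regressor)
> coef(s)         # coefficient table: estimates, std. errors, t and p values
> s$r.squared     # multiple R-squared (0.9907 above)
> s$sigma         # residual standard error (2.809 above)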

Visualizing the training set results
The plot below shows there are no outliers.
It clearly shows a linear relationship between x and y, which is continuous in nature.
>ggplot() +
  geom_point(aes(x = t_data$x, y = t_data$y),
             colour = 'red')+
  geom_line(aes(x = t_data$x, y = predict(regressor, newdata = t_data)),
            colour = 'blue')+
  ggtitle('X vs Y')+
  xlab('X')+
  ylab('Y')



Importing test data
test_data = read.csv('C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-7/test.csv')
Predicting the test results
y_pred = predict(regressor, newdata = test_data)
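
predict() can also return a prediction interval for each new observation ("interval" and "level" are standard arguments of predict.lm; a sketch):

y_pred_int = predict(regressor, newdata = test_data,
                     interval = 'prediction', level = 0.95)
head(y_pred_int)  # columns: fit, lwr, upr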
Visualizing the test set results
The plot shows the model is a good fit.
>ggplot() +
  geom_point(aes(x = test_data$x, y = test_data$y),
             colour = 'red') +
  geom_line(aes(x = t_data$x, y = predict(regressor, newdata = t_data)),
            colour = 'blue') +
  ggtitle('X vs Y (Test)') +
  xlab('X') +
  ylab('Y')

Finding accuracy
A simple accuracy measure is the mean of the row-wise min/max ratios between actual and predicted values. (Note that "actual" should really be the response test_data$y rather than test_data$x; with this data the two are nearly identical.)
> compare <- cbind(actual = test_data$x, y_pred)  # combine actual and predicted
> mean(apply(compare, 1, min)/apply(compare, 1, max))
[1] -Inf
The result is -Inf because some rows contain a zero (so the ratio divides by a maximum of 0) and the data include negative values, for which a min/max ratio is meaningless. On strictly positive data the measure behaves as intended; for example, if every row's ratio were 0.9, the mean would be 0.9 (about 90% accuracy):
> mean(c(0.9, 0.9, 0.9, 0.9))
[1] 0.9
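
Because the min/max ratio breaks down here, standard regression error metrics are a safer way to judge accuracy. A base-R sketch (assuming test_data$y contains no missing values):

> rmse <- sqrt(mean((test_data$y - y_pred)^2))  # root mean squared error
> mae  <- mean(abs(test_data$y - y_pred))       # mean absolute error
> c(RMSE = rmse, MAE = mae)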

Check the residual mean and distribution; for a well-specified linear model the residuals should be centred on zero with no visible pattern.
plot(t_data$y, resid(regressor), 
     ylab="Residuals", xlab="y", 
     main="Residual plot") 
mean(regressor$residuals)
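
The residual mean printed above should be essentially zero. Two quick visual checks of the residual distribution (a sketch using base graphics):

hist(resid(regressor), main = "Histogram of residuals", xlab = "Residual")
qqnorm(resid(regressor)); qqline(resid(regressor))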


