MACHINE LEARNING REGRESSION MODELLING WITH R
👉Briefly, EDA (Exploratory Data Analysis) is an approach for examining datasets to summarize their main characteristics, often using visual methods.
EDA
library(ggplot2)
Importing the dataset:
t_data = read.csv('C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-7/train.csv')
The "head()" function displays the first n (default value n=6) rows of a dataset.
head(t_data)
   x        y
1 24 21.54945
2 50 47.46446
3 15 17.21866
4 38 36.58640
5 87 87.28898
6 36 32.46387
The "tail()" function similarly displays the last n rows (default n=6). If you provide a number, e.g. "tail(t_data, 10)", it will display the last ten rows.
> tail(t_data)
x y
695 81 81.45545
696 58 58.59501
697 93 94.62509
698 82 88.60377
699 66 63.64869
700 97 94.97527
The "str()" function compactly displays the internal structure of an object: here, a data frame of 700 observations of 2 numeric variables.
> str(t_data)
'data.frame': 700 obs. of 2 variables:
$ x: num 24 50 15 38 87 36 12 81 25 5 ...
$ y: num 21.5 47.5 17.2 36.6 87.3 ...
The "summary()" function in R provides a brief description of the main characteristics of a given dataset or other type of object. For example, for a numeric variable, it shows the minimum, first quartile, median, mean, third quartile and maximum values, as well as the standard deviation. For a categorical variable, it shows the frequency of each level.
> summary(t_data)
       x                 y
 Min.   :   0.00   Min.   : -3.84
 1st Qu.:  25.00   1st Qu.: 24.93
 Median :  49.00   Median : 48.97
 Mean   :  54.99   Mean   : 49.94
 3rd Qu.:  75.00   3rd Qu.: 74.93
 Max.   :3530.16   Max.   :108.87
                   NA's   :1
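As a quick illustration of the categorical case mentioned above, summary() on a factor returns the count of each level. A minimal sketch with a made-up factor (not part of this dataset):

# Hypothetical factor, only to show summary() on a categorical variable
grp <- factor(c("low", "low", "high", "medium", "high", "high"))
summary(grp)
#   high    low medium
#      3      2      1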
The "describe()" function from the psych package returns a fuller set of descriptive statistics: the number of observations, mean, standard deviation, median, trimmed mean, MAD, minimum, maximum, range, skewness, kurtosis, and standard error.
>library(psych)
>describe(t_data)
vars n mean sd median trimmed mad min max range skew kurtosis se
x 1 700 54.99 134.68 49.00 49.91 37.06 0.00 3530.16 3530.16 24.54 630.25 5.09
y 2 699 49.94 29.11 48.97 49.82 36.52 -3.84 108.87 112.71 0.05 -1.14 1.10
This code block checks for missing values (i.e. NA values) in "t_data" and, if any are found, removes the incomplete rows.
"is.na()" returns a logical vector with TRUE for missing values.
> numberOfNA = sum(is.na(t_data))  # count NA values across the whole data frame
> if (numberOfNA > 0) {
    cat('Number of missing values found: ', numberOfNA)
    cat('\nRemoving missing values...')
    t_data = t_data[complete.cases(t_data), ]  # keep only complete rows
  }
Number of missing values found: 1
Removing missing values...
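As a side note, the same cleanup can be written more compactly with base R's na.omit(); a minimal equivalent sketch:

# Equivalent one-liner: drop every row that contains at least one NA
t_data <- na.omit(t_data)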
Histogram of x
> ggplot(t_data, aes(x = x)) +
    geom_histogram(binwidth = 1, color = 'black', fill = "#F79420") +
    scale_x_continuous(limits = c(0, 105), breaks = seq(0, 105, 5)) +
    scale_y_continuous(limits = c(0, 15), breaks = seq(0, 15, 3)) +
    labs(x = "x",
         y = "Frequency",
         title = "x FREQUENCY") +
    theme_dark()
Histogram of y
> ggplot(t_data, aes(x = y)) +
    geom_histogram(binwidth = 1, color = 'black', fill = "#F79420") +
    scale_x_continuous(limits = c(0, 120), breaks = seq(0, 120, 5)) +
    scale_y_continuous(limits = c(0, 15), breaks = seq(0, 15, 3)) +
    labs(x = "y",
         y = "Frequency",
         title = "y FREQUENCY")
Both boxplots show no outliers, and the distributions are not skewed.
>par(mfrow = c(1, 2))
Boxplot for X
> boxplot(t_data$x, main = 'X',
          sub = paste('Outliers: ',
                      paste(boxplot.stats(t_data$x)$out, collapse = ', ')))
Boxplot for Y
> boxplot(t_data$y, main = 'Y',
          sub = paste('Outliers: ',
                      paste(boxplot.stats(t_data$y)$out, collapse = ', ')))
The code below calculates the correlation coefficient between the two variables, t_data$x and t_data$y.
The correlation coefficient is a measure of the linear association between two variables.
In this case, the returned value of 0.9953399 is very close to 1, indicating a strong positive linear relationship between the two variables.
> cor(t_data$x, t_data$y)
[1] 0.9953399
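If you also want a significance test for the correlation (not shown in the original analysis), base R's cor.test() returns the coefficient together with a p-value and confidence interval; a minimal sketch:

# Pearson correlation test between x and y
# The estimate matches cor(); the p-value tests H0: true correlation = 0
cor.test(t_data$x, t_data$y)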
Fitting the linear regression model (here y ~ . is equivalent to y ~ x, since x is the only other column):
> regressor = lm(formula = y ~ .,
                 data = t_data)
When applied to a linear regression model, the summary() function returns a variety of information about the model, such as the coefficients of the predictor variables, their standard errors, t-values, and p-values. It also returns other statistics such as R-squared, Adjusted R-squared, F-statistic, and the residuals. These statistics give an overall idea of how well the model fits the data and the significance of each predictor variable in the model.
In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero.
The alternative hypothesis is that the coefficients are not equal to zero
(i.e. there exists a relationship between the independent variable in question and the dependent variable).
The p-value for x is marked with three stars, which means x is of very high statistical significance.
The p-value is below 2e-16; generally, values below 0.05 are considered significant.
R-squared tells us the proportion of variation in the dependent (response) variable that has been explained by the model.
Here R-squared is 0.99, meaning the model explains 99% of the variation in the dependent variable (y) using the independent variable (x).
> summary(regressor)
Call:
lm(formula = y ~ ., data = t_data)
Residuals:
Min 1Q Median 3Q Max
-9.1523 -2.0179 0.0325 1.8573 8.9132
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.107265 0.212170 -0.506 0.613
x 1.000656 0.003672 272.510 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.809 on 697 degrees of freedom
Multiple R-squared: 0.9907, Adjusted R-squared: 0.9907
F-statistic: 7.426e+04 on 1 and 697 DF, p-value: < 2.2e-16
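Beyond reading the printed summary, these quantities can also be extracted programmatically, and R-squared can be reproduced from its definition; a short sketch, assuming the regressor object fitted above:

coef(regressor)               # intercept and slope estimates
confint(regressor)            # 95% confidence intervals for the coefficients
summary(regressor)$r.squared  # R-squared as a plain number

# R-squared from its definition: 1 - SS_residual / SS_total
1 - sum(residuals(regressor)^2) / sum((t_data$y - mean(t_data$y))^2)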
Visualizing the training set results
The plot below shows no outliers and a clear linear relationship between x and y, which is continuous in nature.
>ggplot() +
geom_point(aes(x = t_data$x, y = t_data$y),
colour = 'red')+
geom_line(aes(x = t_data$x, y = predict(regressor, newdata = t_data)),
colour = 'blue')+
ggtitle('X vs Y')+
xlab('X')+
ylab('Y')
Importing test data
test_data = read.csv('C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-7/test.csv')
Predicting the test results
y_pred = predict(regressor, newdata = test_data)
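As a side note, predict() also works for ad-hoc inputs, as long as the new data frame has the model's predictor column x; for example, a hypothetical single observation:

# Predict y for a single hypothetical x value
predict(regressor, newdata = data.frame(x = 50))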
Visualizing the test set results
The plot shows the model was a good fit.
>ggplot() +
geom_point(aes(x = test_data$x, y = test_data$y),
colour = 'red') +
geom_line(aes(x = t_data$x, y = predict(regressor, newdata = t_data)),
colour = 'blue') +
ggtitle('X vs Y (Test)') +
xlab('X') +
ylab('Y')
Finding accuracy
As a rough accuracy check, the code below pairs each value with its prediction, divides the smaller by the larger for each row, and averages the ratios. (Note that actual = test_data$y would be the more natural choice here; the original comparison uses the x column.)
> compare <- cbind(actual = test_data$x, y_pred)  # combine actual and predicted
> mean(apply(compare, 1, min) / apply(compare, 1, max))
[1] -Inf
The -Inf result exposes a weakness of this measure: whenever a pair's larger value is exactly zero, the division yields -Inf and the mean degenerates, so the number is uninformative here.
(For reference, if every per-row ratio were 0.9, the mean would be 0.9, i.e. roughly 90% accuracy.)
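A more robust way to quantify accuracy (not in the original post) is to use standard error metrics such as RMSE and MAE, which do not break down around zero; a minimal sketch using only base R:

# Root mean squared error and mean absolute error on the test set
errors <- test_data$y - y_pred
rmse <- sqrt(mean(errors^2, na.rm = TRUE))
mae  <- mean(abs(errors), na.rm = TRUE)
c(RMSE = rmse, MAE = mae)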