R - EDA (EXPLORATORY DATA ANALYSIS)




In this blog post, I'll walk through the exploratory data analysis (EDA) process, a crucial step in data science, using the diabetes dataset.

 

EDA (Exploratory Data Analysis) is a method for examining datasets to highlight their significant characteristics, frequently using visual techniques.

 

EDA can be divided into three categories:

1. Data Comprehension

2. Eliminating Extraneous Data

3. Analyzing Data for Relationships

 

Information About The Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

Let’s start…

library(ggplot2) # Data visualization

library(readr) # reading csv file

 Load dataset

df = read.csv("C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-2/diabetes.csv")
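The absolute path above only exists on my machine. Here is a minimal sketch of a more portable load, under the assumption that you keep the CSV somewhere you can build a path to with file.path(). So that the snippet runs on its own, it writes the first three rows of the dataset to a temporary file; `demo_path` and `demo` are illustrative names, not part of the analysis.

```r
# Illustrative only: write three rows of the dataset to a temporary CSV
# so the example runs without the real diabetes.csv.
demo_path <- file.path(tempdir(), "diabetes_demo.csv")
writeLines(c(
  "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
  "6,148,72,35,0,33.6,0.627,50,1",
  "1,85,66,29,0,26.6,0.351,31,0",
  "8,183,64,0,0,23.3,0.672,32,1"
), demo_path)

# Fail early with a readable message if the file is missing.
if (!file.exists(demo_path)) stop("CSV not found: ", demo_path)

demo <- read.csv(demo_path)
dim(demo)  # rows, columns
```

With the real file you would only change how `demo_path` is built, for example relative to your project folder.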

Structural features of data

> str(df)
'data.frame':     768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...


Display first 6 rows of data

> head(df)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50       1
2           1      85            66            29       0 26.6                    0.351  31       0
3           8     183            64             0       0 23.3                    0.672  32       1
4           1      89            66            23      94 28.1                    0.167  21       0
5           0     137            40            35     168 43.1                    2.288  33       1
6           5     116            74             0       0 25.6                    0.201  30       0

Display last 6 rows of data

> tail(df)
    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
763           9      89            62             0       0 22.5                    0.142  33       0
764          10     101            76            48     180 32.9                    0.171  63       0
765           2     122            70            27       0 36.8                    0.340  27       0
766           5     121            72            23     112 26.2                    0.245  30       0
767           1     126            60             0       0 30.1                    0.349  47       1
768           1      93            70            31       0 30.4                    0.315  23       0

> summary(df)
  Pregnancies        Glucose      BloodPressure    SkinThickness
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    Insulin           BMI        DiabetesPedigreeFunction      Age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
    Outcome
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.349
 3rd Qu.:1.000
 Max.   :1.000

👉You can examine the descriptive statistics of the dataset.

👉When we examine them carefully, we see features that normally cannot be 0 but have a minimum value of 0 (Glucose, BloodPressure, SkinThickness, Insulin, and BMI). These values need to be corrected, but we will ignore them because that is not our topic right now.
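I skip the correction in this post, but if you want to see how widespread the zeros are, here is a minimal sketch. It uses only the six rows printed by head(df), and it assumes that a zero in these five columns means "not recorded", which is my reading of the data and not something the dataset states. The names `demo`, `zero_counts`, and `cleaned` are illustrative.

```r
# First six rows of the affected columns, copied from the head(df) output.
demo <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Count the zeros in each column.
zero_counts <- colSums(demo == 0)
print(zero_counts)

# Assumption: zero means "missing" here, so recode it as NA.
cleaned <- demo
cleaned[cleaned == 0] <- NA
na_counts <- colSums(is.na(cleaned))
```

On the full data frame the same two steps would be applied to df restricted to these five columns, so that legitimate zeros in Pregnancies and Outcome are left alone.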

> library(psych) # describe() comes from the psych package
> describe(df)
                         vars   n   mean     sd median trimmed   mad   min    max  range  skew kurtosis   se
Pregnancies                 1 768   3.85   3.37   3.00    3.46  2.97  0.00  17.00  17.00  0.90     0.14 0.12
Glucose                     2 768 120.89  31.97 117.00  119.38 29.65  0.00 199.00 199.00  0.17     0.62 1.15
BloodPressure               3 768  69.11  19.36  72.00   71.36 11.86  0.00 122.00 122.00 -1.84     5.12 0.70
SkinThickness               4 768  20.54  15.95  23.00   19.94 17.79  0.00  99.00  99.00  0.11    -0.53 0.58
Insulin                     5 768  79.80 115.24  30.50   56.75 45.22  0.00 846.00 846.00  2.26     7.13 4.16
BMI                         6 768  31.99   7.88  32.00   31.96  6.82  0.00  67.10  67.10 -0.43     3.24 0.28
DiabetesPedigreeFunction    7 768   0.47   0.33   0.37    0.42  0.25  0.08   2.42   2.34  1.91     5.53 0.01
Age                         8 768  33.24  11.76  29.00   31.54 10.38 21.00  81.00  60.00  1.13     0.62 0.42
Outcome                     9 768   0.35   0.48   0.00    0.31  0.00  0.00   1.00   1.00  0.63    -1.60 0.02

Data Visualization

Compute correlation matrix

install.packages("ggcorrplot")
library(ggcorrplot)

👉First, let's look at the correlation between features.

👉The three features most strongly correlated with Outcome are Glucose (0.47), BMI (0.29), and Age (0.24).

👉For the visualizations in the EDA section, I will focus on these three features.

> df_cor = round(cor(df),2)
> df_cor
                         Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction   Age Outcome
Pregnancies                     1.00    0.13          0.14         -0.08   -0.07 0.02                    -0.03  0.54    0.22
Glucose                         0.13    1.00          0.15          0.06    0.33 0.22                     0.14  0.26    0.47
BloodPressure                   0.14    0.15          1.00          0.21    0.09 0.28                     0.04  0.24    0.07
SkinThickness                  -0.08    0.06          0.21          1.00    0.44 0.39                     0.18 -0.11    0.07
Insulin                        -0.07    0.33          0.09          0.44    1.00 0.20                     0.19 -0.04    0.13
BMI                             0.02    0.22          0.28          0.39    0.20 1.00                     0.14  0.04    0.29
DiabetesPedigreeFunction       -0.03    0.14          0.04          0.18    0.19 0.14                     1.00  0.03    0.17
Age                             0.54    0.26          0.24         -0.11   -0.04 0.04                     0.03  1.00    0.24
Outcome                         0.22    0.47          0.07          0.07    0.13 0.29                     0.17  0.24    1.00

ggcorrplot(df_cor)
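Instead of reading the ranking off the plot, you can sort the Outcome row of the matrix. The sketch below retypes that row from the rounded output of df_cor so it runs on its own; `outcome_cor`, `ranked`, and `top3` are illustrative names.

```r
# Outcome row of the rounded correlation matrix (Outcome itself left out).
outcome_cor <- c(
  Pregnancies = 0.22, Glucose = 0.47, BloodPressure = 0.07,
  SkinThickness = 0.07, Insulin = 0.13, BMI = 0.29,
  DiabetesPedigreeFunction = 0.17, Age = 0.24
)

# Sort from strongest to weakest correlation and keep the top three names.
ranked <- sort(outcome_cor, decreasing = TRUE)
top3 <- names(ranked)[1:3]
print(top3)
```

With the real matrix, the same vector can be taken as df_cor["Outcome", colnames(df_cor) != "Outcome"].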



Barplot by Outcome

👉You can see how many people have diabetes in the dataset.

ggplot(aes(x = Outcome), data = df) +
  geom_bar(fill='steelblue')
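The bar heights can be cross-checked with a little arithmetic, because the mean of a 0/1 variable is the share of ones. This is a rough sketch based on the rounded mean from summary(df), so the counts are approximate; `n_obs`, `n_diabetic`, and `n_not_diabetic` are illustrative names.

```r
n_obs <- 768           # number of rows reported by str(df)
mean_outcome <- 0.349  # rounded mean of Outcome reported by summary(df)

# Share of ones times the number of rows gives the approximate count of ones.
n_diabetic <- round(n_obs * mean_outcome)
n_not_diabetic <- n_obs - n_diabetic

print(c(diabetic = n_diabetic, not_diabetic = n_not_diabetic))
```

On the loaded data frame, table(df$Outcome) gives the exact counts directly.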


Histogram by Age

👉The density is highest in the 20-40 age range.

👉The distribution is right-skewed, i.e. the mean age (33.24) is greater than the median (29).

👉We can also spot potential outliers in the histogram. For example, there is someone who is 81 years old. We cannot say for certain that this is an outlier, but it is the first place to look.

ggplot(aes(x = Age), data=df) +
  geom_histogram(binwidth=1, color='black', fill = "#F79420") +
  scale_x_continuous(limits=c(20,90), breaks=seq(20,90,5)) +
  scale_y_continuous(limits=c(0,80), breaks=seq(0,80,10)) +
  labs(x = "Age",          # labs() takes x and y, not xlab and ylab
       y = "Frequency",
       title = "AGE FREQUENCY")
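To make "is 81 an outlier?" a bit less of a judgment call, one common convention is Tukey's 1.5 × IQR rule, the same rule that default box plot whiskers are based on. The sketch below applies it to the Age quartiles from summary(df); the helper `iqr_fences` is a hypothetical name I made up for this post, and other cut-offs are equally defensible.

```r
# Tukey's rule: values beyond q1 - k * IQR or q3 + k * IQR are flagged.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# 1st and 3rd quartiles of Age, taken from the summary(df) output.
age_fences <- iqr_fences(q1 = 24, q3 = 41)
print(age_fences)

# Is the oldest patient (81) above the upper fence?
is_flagged <- 81 > age_fences[["upper"]]
print(is_flagged)
```

The same helper can be fed the quartiles of any other feature in the summary table.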



Histogram by BMI

👉The density is highest in the 25-39 BMI range.

👉The distribution is slightly left-skewed (skew = -0.43 in the describe() output), i.e. the median BMI (32.00) is marginally greater than the mean (31.99).

👉Although we cannot call them outliers with certainty, the observations with a BMI over 55 stand out as candidates.

ggplot(aes(x = BMI), data=df) +
  geom_histogram(binwidth=1, color='red', fill = "purple") +
  scale_x_continuous(limits=c(0,70), breaks=seq(0,70,5)) +
  scale_y_continuous(limits=c(0,60), breaks=seq(0,60,5)) +
  labs(x = "BMI", y = "Frequency", title = "BMI FREQUENCY") +
  theme_light()





Histogram by Glucose

👉The density is highest in the 100-130 range.

👉The distribution is slightly right-skewed, i.e. the mean Glucose (120.89) is greater than the median (117).

👉Although we cannot call them outliers with certainty, we can look for outlier values between 40 and 50.

ggplot(aes(x = Glucose), data=df) +
  geom_histogram(binwidth=2, color='yellow', fill = "#7030A0") +
  scale_x_continuous(limits=c(0,200), breaks=seq(0,200,10)) +
  scale_y_continuous(limits=c(0,30), breaks=seq(0,30,2))+
  labs(x = "Glucose",
       y = "Frequency",
       title = "GLUCOSE FREQUENCY")+
  theme_dark()
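The skew statements for the three histograms can be sanity-checked from the describe(df) numbers alone. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / sd, which is a quick stand-in I chose for illustration and not the same estimator as the skew column that describe() prints. The data frame `stats` is an illustrative name.

```r
# Mean, median, and sd copied from the describe(df) output.
stats <- data.frame(
  feature = c("Age", "BMI", "Glucose"),
  mean    = c(33.24, 31.99, 120.89),
  median  = c(29.00, 32.00, 117.00),
  sd      = c(11.76, 7.88, 31.97)
)

# Pearson's second skewness coefficient: positive means mean > median.
stats$pearson_skew <- 3 * (stats$mean - stats$median) / stats$sd
stats$direction <- ifelse(stats$pearson_skew > 0, "right", "left")

print(stats)
```

Age comes out clearly right-skewed, Glucose mildly so, and BMI is so close to zero on this measure that its left skew is better judged from the describe() value.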

👉The Outcome feature is currently an int, so we will convert it to a factor.

> df$Outcome = as.factor(df$Outcome)
> str(df)
'data.frame':     768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
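If the "2 1 2 1 ..." in the str() output looks odd, that is because str() shows the internal level codes of a factor, not the original 0/1 values. Here is a small sketch using the first ten Outcome values shown by str(df). The labels "No diabetes" and "Diabetes" are my own hypothetical choice and are not used elsewhere in this post.

```r
# First ten Outcome values, copied from the str(df) output.
outcome_int <- c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L)

# as.factor() keeps "0" and "1" as levels and stores codes 1 and 2 internally.
outcome_fac <- as.factor(outcome_int)
print(levels(outcome_fac))
print(as.integer(outcome_fac))

# Optional: readable labels for plot legends (hypothetical label text).
outcome_lab <- factor(outcome_int, levels = c(0, 1),
                      labels = c("No diabetes", "Diabetes"))
print(table(outcome_lab))
```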

👉You can see that diabetes becomes more frequent as Glucose increases.
ggplot(data = df) +
  geom_bar(
    mapping = aes(x = Glucose, fill = Outcome),
    position = "dodge",
    size = 2
  )+
  labs(x = "Glucose",
       y = "Frequency",
       title = "Glucose-Outcome")
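Because Glucose takes many distinct values, a numeric view of the same idea is to group it into bins and compute the share of diabetic patients per bin. The sketch below uses only the first ten rows shown by str(df), so it illustrates the mechanics and proves nothing by itself. The cut points 100 and 140 are hypothetical, chosen just to get three groups.

```r
# First ten Glucose and Outcome values, copied from the str(df) output.
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)
outcome <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)

# Hypothetical cut points: (0,100], (100,140], (140,200].
glucose_bin <- cut(glucose, breaks = c(0, 100, 140, 200))

# The mean of a 0/1 variable within each bin is the share of diabetic patients.
share_diabetic <- tapply(outcome, glucose_bin, mean)
print(round(share_diabetic, 2))
```

On the full data, the same cut() and tapply() calls would run on df$Glucose and the original 0/1 Outcome column.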


👉The Relationship of Glucose, BMI, and Age features with Diabetes
install.packages("cowplot")
library(cowplot)
p1 = ggplot(data = df) +
       geom_bar(
         mapping = aes(x = Glucose, fill = Outcome),
         position = "dodge",
         size = 2
              )+
       labs(x = "Glucose",
            y = "Frequency",
            title = "Glucose-Outcome")+
       theme_dark()
p2 = ggplot(data = df) +
  geom_bar(
    mapping = aes(x = BMI, fill = Outcome),
    position = "nudge",
    size = 2
  )+
  labs(x = "BMI",
       y = "Frequency",
       title = "BMI-Outcome")
p3 = ggplot(data = df) +
  geom_bar(
    mapping = aes(x = Age, fill = Outcome),
    position = "stack",
    size = 2
  )+
  labs(x = "Age",
       y = "Frequency",
       title = "Age-Outcome")+
  theme_cowplot()
  

plot_grid(p1, p2, p3, labels = "AUTO", ncol=3)

👉Let's look at Outcome together with its two most correlated features, Glucose and BMI.
ggplot(df, aes(Glucose, 
               BMI,
               color = Outcome,
               shape=Outcome))+
  geom_point()+
  geom_line()

👉Let's check out some outliers.
> boxplot(df)
Glucose outlier detection
👉We can see that the Glucose feature has more outliers among people who do not have diabetes.
ggplot(data = df,
       mapping = aes(x=Glucose)) +
  stat_boxplot(aes(Glucose),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="purple") +
  coord_flip() +
  facet_grid(. ~Outcome)

BMI outlier detection
👉We can see that the BMI feature has more outliers among people who have diabetes.
ggplot(data = df,
       mapping = aes(x=BMI)) +
  stat_boxplot(aes(BMI),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="blue") +
  coord_flip() +
  facet_grid(. ~Outcome)

Age outlier detection
👉We can see that the Age feature has many outliers among people who do not have diabetes.
ggplot(data = df,
       mapping = aes(x=Age)) +
  stat_boxplot(aes(Age),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="orange") +
  coord_flip() +
  facet_grid(. ~Outcome)


In this blog post, I examined EDA, which is an important part of data science and data analysis, using the diabetes dataset. I also typed "EDA" into the GPT-3.5 version released by OpenAI last week, and it gave the following information. You can reach it from the link below.
https://chat.openai.com/chat

👉Exploratory data analysis (EDA) is a critical step in the data analysis process.
It involves summarizing and visualizing the data to gain a better understanding of its characteristics and patterns. 
EDA is an iterative process, and it typically involves the following steps:

1. Import the data and clean it: The first step in EDA is to import the data into R and clean it to remove any errors or missing values.


2. Summarize the data: The next step is to summarize the data using statistical measures such as mean, median, mode, and standard deviation.


3. Visualize the data: The next step is to visualize the data using plots such as histograms, scatter plots, and box plots. These plots can help you understand the distribution of the data and identify any potential outliers or trends.


4. Transform the data: Depending on the characteristics of the data, you may need to transform it using techniques such as normalization or standardization to make it more suitable for analysis.


5. Repeat the process: EDA is an iterative process, and you will likely need to repeat the steps above multiple times to gain a thorough understanding of the data.
By following these steps, you can gain a better understanding of your data and identify any potential issues or trends that may impact your analysis.

