Artificial Intelligence-Data Science-Python-R-Deep Learning

In this blog post, I'll walk through exploratory data analysis (EDA), a crucial step in data science, using the diabetes dataset.

EDA (Exploratory Data Analysis) is a method for examining datasets to highlight their significant characteristics, frequently using visual techniques.

EDA can be divided into three categories:

1. Data Comprehension

2. Eliminating Extraneous Data

3. Analyzing Data for Relationships

Information About The Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

Let’s start…

library(ggplot2) # Data visualization

library(readr) # reading csv file

Load dataset

df = read.csv("C:/Users/Asus/Desktop/VERİ BİLİMİ YÜKSEK LİSANS/R -Data Visualisation/blog/blog-2/diabetes.csv")

Structural features of data

> str(df)
'data.frame':	768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Display first 6 rows of data

> head(df)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50       1
2           1      85            66            29       0 26.6                    0.351  31       0
3           8     183            64             0       0 23.3                    0.672  32       1
4           1      89            66            23      94 28.1                    0.167  21       0
5           0     137            40            35     168 43.1                    2.288  33       1
6           5     116            74             0       0 25.6                    0.201  30       0

Display last 6 rows of data

> tail(df)
    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
763           9      89            62             0       0 22.5                    0.142  33       0
764          10     101            76            48     180 32.9                    0.171  63       0
765           2     122            70            27       0 36.8                    0.340  27       0
766           5     121            72            23     112 26.2                    0.245  30       0
767           1     126            60             0       0 30.1                    0.349  47       1
768           1      93            70            31       0 30.4                    0.315  23       0

> summary(df)
  Pregnancies        Glucose      BloodPressure    SkinThickness
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
    Insulin           BMI        DiabetesPedigreeFunction      Age
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
    Outcome
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.349
 3rd Qu.:1.000
 Max.   :1.000

👉The summary above shows the descriptive statistics of the dataset.

👉Looking carefully, we see features that normally cannot be 0 (Glucose, BloodPressure, SkinThickness, Insulin, and BMI) but have a minimum value of 0. These values need to be corrected, but we will ignore them here because that is outside the scope of this post.
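Although the zeros are not handled in this post, a quick check like the one below shows how many there are. This is only a minimal sketch: it uses the six rows printed by head(df) instead of the full file, and the names sample_df, zero_as_missing, and cleaned are hypothetical. Treating 0 as "missing" in these five columns is an assumption, not something the dataset documents.

```r
# First six rows of five columns, copied from the head(df) output above
sample_df <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Columns where a value of 0 is assumed to mean "not measured"
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the zeros in each of those columns
zero_counts <- colSums(sample_df[zero_as_missing] == 0)
print(zero_counts)

# Recode the zeros as NA so summaries can skip them with na.rm = TRUE
cleaned <- sample_df
cleaned[zero_as_missing] <- lapply(
  cleaned[zero_as_missing],
  function(x) replace(x, x == 0, NA)
)
print(colSums(is.na(cleaned)))
```

On the full data frame, the same two steps could be applied to df directly.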

> library(psych) # describe() comes from the psych package
> describe(df)

                         vars   n   mean     sd median trimmed   mad   min    max  range  skew kurtosis   se
Pregnancies                 1 768   3.85   3.37   3.00    3.46  2.97  0.00  17.00  17.00  0.90     0.14 0.12
Glucose                     2 768 120.89  31.97 117.00  119.38 29.65  0.00 199.00 199.00  0.17     0.62 1.15
BloodPressure               3 768  69.11  19.36  72.00   71.36 11.86  0.00 122.00 122.00 -1.84     5.12 0.70
SkinThickness               4 768  20.54  15.95  23.00   19.94 17.79  0.00  99.00  99.00  0.11    -0.53 0.58
Insulin                     5 768  79.80 115.24  30.50   56.75 45.22  0.00 846.00 846.00  2.26     7.13 4.16
BMI                         6 768  31.99   7.88  32.00   31.96  6.82  0.00  67.10  67.10 -0.43     3.24 0.28
DiabetesPedigreeFunction    7 768   0.47   0.33   0.37    0.42  0.25  0.08   2.42   2.34  1.91     5.53 0.01
Age                         8 768  33.24  11.76  29.00   31.54 10.38 21.00  81.00  60.00  1.13     0.62 0.42
Outcome                     9 768   0.35   0.48   0.00    0.31  0.00  0.00   1.00   1.00  0.63    -1.60 0.02

Data Visualization

Compute correlation matrix

install.packages("ggcorrplot")
library(ggcorrplot)

👉First, let's look at the correlation between features.

👉The three features most strongly correlated with Outcome are Glucose (0.47), BMI (0.29), and Age (0.24).

👉For the visualization in the EDA section, I will specifically focus on these 3 features.

> df_cor = round(cor(df),2)
> df_cor
                         Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction   Age Outcome
Pregnancies                     1.00    0.13          0.14         -0.08   -0.07 0.02                    -0.03  0.54    0.22
Glucose                         0.13    1.00          0.15          0.06    0.33 0.22                     0.14  0.26    0.47
BloodPressure                   0.14    0.15          1.00          0.21    0.09 0.28                     0.04  0.24    0.07
SkinThickness                  -0.08    0.06          0.21          1.00    0.44 0.39                     0.18 -0.11    0.07
Insulin                        -0.07    0.33          0.09          0.44    1.00 0.20                     0.19 -0.04    0.13
BMI                             0.02    0.22          0.28          0.39    0.20 1.00                     0.14  0.04    0.29
DiabetesPedigreeFunction       -0.03    0.14          0.04          0.18    0.19 0.14                     1.00  0.03    0.17
Age                             0.54    0.26          0.24         -0.11   -0.04 0.04                     0.03  1.00    0.24
Outcome                         0.22    0.47          0.07          0.07    0.13 0.29                     0.17  0.24    1.00

ggcorrplot(df_cor)
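To read the ranking off the matrix without scanning it by eye, the Outcome row can be sorted. The sketch below retypes that row from the output above, so it runs without the CSV file; outcome_cor and ranked are hypothetical names.

```r
# Outcome row of the rounded correlation matrix printed above
outcome_cor <- c(
  Pregnancies = 0.22, Glucose = 0.47, BloodPressure = 0.07,
  SkinThickness = 0.07, Insulin = 0.13, BMI = 0.29,
  DiabetesPedigreeFunction = 0.17, Age = 0.24
)

# Sort by absolute correlation, strongest first
ranked <- sort(abs(outcome_cor), decreasing = TRUE)

# The three features most correlated with Outcome
print(head(ranked, 3))
```

With the full data frame loaded, the same vector should be available as df_cor["Outcome", colnames(df_cor) != "Outcome"], as long as cor() is called while Outcome is still numeric.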

Barplot by Outcome
👉This bar plot shows how many patients in the dataset have diabetes (Outcome = 1) and how many do not (Outcome = 0).
ggplot(aes(x = Outcome), data = df) +
  geom_bar(fill='steelblue')
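The counts behind such a bar plot can also be printed as numbers. Here is a small sketch that uses only the twelve Outcome values visible in the head(df) and tail(df) outputs, so the proportions are illustrative and not those of the full dataset; outcome_sample, counts, and shares are hypothetical names.

```r
# Outcome values from the head(df) and tail(df) outputs (12 rows only)
outcome_sample <- c(1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0)

# Absolute counts per class, the numbers a bar plot would draw
counts <- table(outcome_sample)
print(counts)

# Relative share of each class
shares <- round(prop.table(counts), 2)
print(shares)
```

For the whole dataset, table(df$Outcome) gives the same kind of result.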



Histogram by Age
👉Most patients fall in the 20-40 age range.
👉The distribution is right-skewed, i.e. the mean age (33.24) is greater than the median (29).
👉The histogram can also hint at outliers. For example, there is one patient aged 81. We cannot say for certain that this is an outlier, but it is a good first place to look.
ggplot(aes(x = Age), data = df) +
  geom_histogram(binwidth = 1, color = 'black', fill = "#F79420") +
  scale_x_continuous(limits = c(20, 90), breaks = seq(20, 90, 5)) +
  scale_y_continuous(limits = c(0, 80), breaks = seq(0, 80, 10)) +
  labs(x = "Age",   # labs() takes x and y, not xlab and ylab
       y = "Frequency",
       title = "AGE FREQUENCY")
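The "mean greater than median" reading of skewness can be checked numerically. The sketch below uses the twelve Age values from the head(df) and tail(df) outputs and a hand-written moment-based skewness, so its value will not match the skew column of describe(df), which is computed on all 768 rows and may use a different estimator. The name sample_skewness is hypothetical.

```r
# Age values from the head(df) and tail(df) outputs (12 rows only)
age_sample <- c(50, 31, 32, 21, 33, 30, 33, 63, 27, 30, 47, 23)

# Moment-based skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

age_mean   <- mean(age_sample)
age_median <- median(age_sample)
age_skew   <- sample_skewness(age_sample)

# A mean above the median together with positive skewness
# points to a right-skewed distribution
print(c(mean = age_mean, median = age_median, skewness = age_skew))
```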


Histogram by BMI

👉Most values fall in the 25-39 BMI range.
👉The distribution is slightly left-skewed, i.e. the median BMI (32.00) is marginally greater than the mean (31.99).
👉Although we cannot call them outliers with certainty, BMI values above 55 stand out as candidates.

ggplot(aes(x = BMI), data=df) +
  geom_histogram(binwidth=1, color='red', fill = "purple") +
  scale_x_continuous(limits=c(0,70), breaks=seq(0,70,5)) +
  scale_y_continuous(limits=c(0,60), breaks=seq(0,60,5))+
  labs(x = "BMI",   # labs() takes x and y, not xlab and ylab
       y = "Frequency",
       title = "BMI FREQUENCY")+
  theme_light()




Histogram by Glucose
👉Most values fall in the 100-130 range.
👉The distribution is right-skewed, i.e. the mean glucose (120.9) is greater than the median (117.0).
👉Although we cannot call them outliers with certainty, the values between 40 and 50 are worth checking.
ggplot(aes(x = Glucose), data=df) +
  geom_histogram(binwidth=2, color='yellow', fill = "#7030A0") +
  scale_x_continuous(limits=c(0,200), breaks=seq(0,200,10)) +
  scale_y_continuous(limits=c(0,30), breaks=seq(0,30,2))+
  labs(x = "Glucose",   # labs() takes x and y; labels corrected from BMI to Glucose
       y = "Frequency",
       title = "GLUCOSE FREQUENCY")+
  theme_dark()

👉The Outcome feature is currently an int, so we convert it to a factor so that ggplot2 treats it as a categorical variable.
> df$Outcome = as.factor(df$Outcome)
> str(df)
'data.frame':	768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...

👉Diabetes becomes more frequent as Glucose increases.
ggplot(data = df) +
  geom_bar(
    mapping = aes(x = Glucose, fill = Outcome),
    position = "dodge",
    size = 2
  )+
  labs(x = "Glucose",
       y = "Frequency",
       title = "Glucose-Outcome")
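The same trend can be expressed as a diabetes rate per glucose range. This is a rough sketch on the twelve rows shown by head(df) and tail(df); the bin edges 0, 100, 140, and 200 are arbitrary choices made for illustration, and Outcome is kept numeric here (on the converted factor it would first need as.integer(as.character(df$Outcome))).

```r
# Glucose and Outcome from the head(df) and tail(df) outputs (12 rows only)
glucose <- c(148, 85, 183, 89, 137, 116, 89, 101, 122, 121, 126, 93)
outcome <- c(1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0)

# Group glucose into three ranges (bin edges are an illustrative assumption)
glucose_bin <- cut(glucose, breaks = c(0, 100, 140, 200))

# Share of diabetic patients in each range: the mean of a 0/1 variable
diabetes_rate <- tapply(outcome, glucose_bin, mean)
print(round(diabetes_rate, 2))
```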



👉The Relationship of Glucose, BMI, and Age features with Diabetes
install.packages("cowplot")
library(cowplot)
p1 = ggplot(data = df) +
       geom_bar(
         mapping = aes(x = Glucose, fill = Outcome),
         position = "dodge",
         size = 2
              )+
       labs(x = "Glucose",
            y = "Frequency",
            title = "Glucose-Outcome")+
       theme_dark()
p2 = ggplot(data = df) +
  geom_bar(
    mapping = aes(x = BMI, fill = Outcome),
    position = "dodge",   # "nudge" draws both outcomes on top of each other
    size = 2
  )+
  labs(x = "BMI",
       y = "Frequency",
       title = "BMI-Outcome")
p3 = ggplot(data = df) +
  geom_bar(
    mapping = aes(x = Age, fill = Outcome),
    position = "stack",
    size = 2
  )+
  labs(x = "Age",
       y = "Frequency",
       title = "Age-Outcome")+
  theme_cowplot()
  

plot_grid(p1, p2, p3, labels = "AUTO", ncol=3)

👉Let's look at Glucose and BMI, the two features most correlated with Outcome, together.
ggplot(df, aes(Glucose, 
               BMI,
               color = Outcome,
               shape=Outcome))+
  geom_point()+
  geom_line()

👉Let's check out some outliers
> boxplot(df)
Glucose outlier detection
👉We can see that the Glucose feature has more outliers among people who do not have diabetes.
ggplot(data = df,
       mapping = aes(x=Glucose)) +
  stat_boxplot(aes(Glucose),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="purple") +
  coord_flip() +
  facet_grid(. ~Outcome)

BMI outlier detection
👉We can see that the BMI feature has more outliers among people who have diabetes.
ggplot(data = df,
       mapping = aes(x=BMI)) +
  stat_boxplot(aes(BMI),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="blue") +
  coord_flip() +
  facet_grid(. ~Outcome)

Age outlier detection
👉We can see that the Age feature has many outliers among people who do not have diabetes.
ggplot(data = df,
       mapping = aes(x=Age)) +
  stat_boxplot(aes(Age),
               geom="errorbar", linetype=1, width=0.5) +
  geom_boxplot(outlier.color="red",
               notch=FALSE,
               fill="orange") +
  coord_flip() +
  facet_grid(. ~Outcome)
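The red points in these box plots follow the usual 1.5 x IQR rule, and the same points can be listed with base R's boxplot.stats(). The sketch below again uses only the twelve rows from head(df) and tail(df), so the counts are illustrative; outliers_by_group and n_outliers are hypothetical names.

```r
# Age and Outcome from the head(df) and tail(df) outputs (12 rows only)
age     <- c(50, 31, 32, 21, 33, 30, 33, 63, 27, 30, 47, 23)
outcome <- factor(c(1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0))

# For each Outcome group, keep the values beyond 1.5 * IQR from the hinges
outliers_by_group <- lapply(
  split(age, outcome),
  function(x) boxplot.stats(x)$out
)
print(outliers_by_group)

# Number of flagged values per group
n_outliers <- lengths(outliers_by_group)
print(n_outliers)
```

Replacing age with df$Glucose or df$BMI and outcome with df$Outcome would give the per-group outlier counts for the plots above.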


In this blog post, I walked through EDA, an important part of data science and data analysis, using the diabetes dataset.

I also asked the GPT-3.5 model that OpenAI released last week about EDA, and it gave the following answer. You can try it yourself at the link below.
https://chat.openai.com/chat

👉Exploratory data analysis (EDA) is a critical step in the data analysis process. It involves summarizing and visualizing the data to gain a better understanding of its characteristics and patterns. EDA is an iterative process, and it typically involves the following steps:

1. Import the data and clean it: The first step in EDA is to import the data into R and clean it to remove any errors or missing values.

2. Summarize the data: The next step is to summarize the data using statistical measures such as mean, median, mode, and standard deviation.

3. Visualize the data: The next step is to visualize the data using plots such as histograms, scatter plots, and box plots. These plots can help you understand the distribution of the data and identify any potential outliers or trends.

4. Transform the data: Depending on the characteristics of the data, you may need to transform it using techniques such as normalization or standardization to make it more suitable for analysis.

5. Repeat the process: EDA is an iterative process, and you will likely need to repeat the steps above multiple times to gain a thorough understanding of the data.

By following these steps, you can gain a better understanding of your data and identify any potential issues or trends that may impact your analysis.
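As a small illustration of the "summarize" and "transform" steps listed above, here is a sketch on two columns of the six rows printed by head(df). It is one possible way to do it, not the only one; sample_df, stats, and standardized are hypothetical names, and scale() is used here for standardization (mean 0, standard deviation 1).

```r
# Glucose and BMI from the head(df) output (6 rows only)
sample_df <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Summarize: mean, median, and standard deviation per column
stats <- sapply(sample_df, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})
print(round(stats, 2))

# Transform: standardize each column to mean 0 and standard deviation 1
standardized <- as.data.frame(scale(sample_df))
print(round(standardized, 2))
```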
