DO YOU WANT TO VISUALIZE YOUR DATA MORE EASILY? 👉THEN TRY THE DATAEXPLORER PACKAGE 👇

Cleaning and organizing data is often the most time-consuming and tedious part of data science work, and it is commonly said to take about 80% of a project. DataExplorer is one of the tools that aims to reduce that 80% and make it more pleasant. Ease of use is therefore a basic design principle: one function call is typically all you need.

DataExplorer is an R package that provides a set of functions for creating summaries and visualizations of data. It is designed to make it easy for users to quickly get an overview of their data and identify patterns and trends.

Here are some key features of DataExplorer:
  👉Provides a variety of summary statistics and visualizations for different types of data, mainly continuous (numerical) and discrete (categorical) features.
  👉Allows users to easily create plots, tables, and summary statistics for a single variable or for multiple variables.
  👉Offers options for customizing the appearance and formatting of plots and tables.
  👉Can handle large datasets and missing data.

Install and load the package:
> install.packages("DataExplorer")
> library(DataExplorer)

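You can find detailed information about the dataset in my other blog posts. PimaIndiansDiabetes2 is a dataset that ships with the mlbench package, so load it from there (install mlbench first if you do not have it):
> library(mlbench)
> data(PimaIndiansDiabetes2)
> df <- PimaIndiansDiabetes2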
Structural Features
> str(df)

'data.frame': 768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 NA 70 96 ...
 $ triceps : num  35 29 NA 23 35 NA 32 NA 45 NA ...
 $ insulin : num  NA NA NA 94 168 NA 88 NA 543 NA ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

👉The plot_missing function is a useful tool for identifying variables in a dataset that have a large number of missing values, which can be a problem for some machine learning algorithms. By identifying such variables, you can decide whether to exclude them from your analysis or to impute the missing values in some way.
> plot_missing(df)
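
If you want the numbers behind that plot, here is a minimal base R sketch. It uses a toy data frame built from the first ten rows shown in the str() output above, so the counts are for those ten rows only, not for the full dataset. The names toy and missing_summary are just illustrative.

```r
# Toy data: first ten rows of four columns, copied from the str() output above
toy <- data.frame(
  pressure = c(72, 66, 64, 66, 40, 74, 50, NA, 70, 96),
  triceps  = c(35, 29, NA, 23, 35, NA, 32, NA, 45, NA),
  insulin  = c(NA, NA, NA, 94, 168, NA, 88, NA, 543, NA),
  age      = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)
)

# Count and share of missing values per column
missing_summary <- data.frame(
  feature     = names(toy),
  num_missing = colSums(is.na(toy)),
  pct_missing = colMeans(is.na(toy)),
  row.names   = NULL
)

# Worst columns first
missing_summary <- missing_summary[order(-missing_summary$num_missing), ]
print(missing_summary)
```

DataExplorer also has profile_missing(df), which returns a similar table for the whole data frame.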


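👉The plot_bar function in the DataExplorer package draws a bar chart of category frequencies for every discrete (categorical) variable in the data frame. In df the only categorical variable is diabetes, so you get a single chart.
> plot_bar(df)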
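
Each bar is simply a category count. As a rough base R sketch, assuming a toy factor made from the first ten diabetes values in the str() output above (the name toy_diabetes is illustrative):

```r
# First ten diabetes labels from the str() output (codes 2 1 2 1 2 1 2 1 2 2)
toy_diabetes <- factor(
  c("pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "pos"),
  levels = c("neg", "pos")
)

# Frequency of each category: these are the bar heights
counts <- table(toy_diabetes)
print(counts)

# Same information as proportions
print(prop.table(counts))
```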


👉The plot_histogram function in the DataExplorer package draws a histogram for each continuous (numerical) variable in the data frame it receives. In this case, the function is applied to a subset of the columns of df.

👉Only 3 features are plotted here because the other numerical columns contain missing values (NA), which gave me an error. Depending on your package versions, rows with NA may instead be dropped with a warning, so plot_histogram(df) may also work for you.

> plot_histogram(df[c("age", "pregnant", "pedigree")])
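
Instead of typing the NA-free column names by hand, you can pick them programmatically. This is a small base R sketch on toy data taken from the first ten rows of the str() output above, so which columns count as complete refers to those ten rows only. The names toy and complete_cols are illustrative.

```r
# Toy data: first ten rows of four columns, copied from the str() output above
toy <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  pressure = c(72, 66, 64, 66, 40, 74, 50, NA, 70, 96),
  insulin  = c(NA, NA, NA, 94, 168, NA, 88, NA, 543, NA),
  age      = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)
)

# Columns without any missing value
has_na <- colSums(is.na(toy)) > 0
complete_cols <- names(toy)[!has_na]
print(complete_cols)

# Base R hist() drops NA values before binning
h <- hist(toy$insulin, plot = FALSE)
print(sum(h$counts))  # number of non-missing insulin values
```

With the real data you could then call plot_histogram(df[complete_cols]) after computing complete_cols from df.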



👉The plot_boxplot function in the DataExplorer package draws a box plot for each numerical variable in df, split by the variable named in the by argument. by can be categorical or numerical. When it is numerical, such as age, the package first cuts it into ranges (five equal ranges, according to the package documentation) and draws one box per range.
> plot_boxplot(df, by = "age")



> plot_boxplot(df, by = "diabetes")
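
To see what the grouping does, here is a rough base R sketch on toy data from the first ten rows of the str() output above. The cut() call only approximates how a numerical by variable is split into ranges. It is an assumption for illustration, not DataExplorer's internal code.

```r
# Toy data: first ten rows of three columns, copied from the str() output above
toy <- data.frame(
  glucose  = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  age      = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  diabetes = factor(
    c("pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "pos"),
    levels = c("neg", "pos")
  )
)

# Categorical grouping: median glucose per diabetes class
# (the median is the middle line of each box)
med_by_class <- tapply(toy$glucose, toy$diabetes, median)
print(med_by_class)

# Numerical grouping: cut age into 5 equal-width ranges first
age_bin <- cut(toy$age, breaks = 5)
print(table(age_bin))
print(tapply(toy$glucose, age_bin, median))
```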


👉plot_scatterplot is a function from the DataExplorer package that draws a scatter plot of every other variable in the data frame against one fixed variable. subset(df) returns a subset of df based on the criteria passed to subset. Because no criteria are specified here, it returns the entire data frame, so passing df directly gives the same result. by = "diabetes" names the fixed variable: every other feature is plotted against "diabetes".

> plot_scatterplot(
    subset(df), 
    by = "diabetes")


👉plot_correlation is a function from the DataExplorer package that draws a correlation heatmap: a grid in which each cell shows the correlation between a pair of variables in the data frame.
> plot_correlation(df)
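
Missing values matter here too. The base R sketch below uses toy data from the first ten rows of the str() output above and shows how the default cor() returns NA for any pair that involves a column with missing values, while use = "pairwise.complete.obs" uses the rows available for each pair. The names toy, cor_default and cor_pairwise are illustrative.

```r
# Toy data: first ten rows of three columns, copied from the str() output above
toy <- data.frame(
  pressure = c(72, 66, 64, 66, 40, 74, 50, NA, 70, 96),
  insulin  = c(NA, NA, NA, 94, 168, NA, 88, NA, 543, NA),
  age      = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)
)

# Default: any NA in a column makes its correlations NA
cor_default <- cor(toy)
print(cor_default)

# Pairwise: each cell uses the rows where both columns are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_pairwise, 2))
```

plot_correlation accepts a cor_args list that is passed on to cor(), so plot_correlation(df, cor_args = list(use = "pairwise.complete.obs")) should apply the same idea to the heatmap.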



You can follow me on LinkedIn and GitHub.

👉LINKEDIN👈

👉GITHUB👈



