EDA

Description of EDA (Exploratory Data Analysis)

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

Role / Importance

Exploratory data analysis (EDA) is a key component of any data science task, yet it is frequently undervalued.
At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis.
This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents.
This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems.
EDA should be part of the way data science operates in your organization.

PROBLEM

About data set

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Data Description

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U / ml)
BMI: Body mass index (weight in kg / (height in m) ^ 2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

1 indicates diabetes is present

Perform exploratory data analysis and give your analytical observations using numerical and graphical summaries for the given data set.

Source Code

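# Read the data set (path as used on the author's machine)
diabet <- read.csv('C:/Semester 6/Data Science/EDA/diabetes.csv')

# First look: first six rows, structure, per-column summary
head(diabet)
str(diabet)
summary(diabet)

# Indexing by rows and columns
diabet[1:10, ]
diabet[, 1:2]
diabet[1:10, 1:2]

# Subsetting (the columns are numeric, so compare with numbers, not strings)
newdata1 <- subset(diabet, Outcome == 1)
newdata1
newdata2 <- subset(diabet, Pregnancies == 1 & Outcome == 0)
newdata2
newdata3 <- subset(diabet, Pregnancies == 1 | Outcome == 0, select = c(1, 2))
newdata3

# Sorting by BMI, ascending and then descending
newdata4 <- diabet[order(diabet$BMI), ]
newdata4
newdata5 <- diabet[order(-diabet$BMI), ]
newdata5

# Mean BMI for each Outcome group
newdata6 <- aggregate(BMI ~ Outcome, data = diabet, FUN = mean)
newdata6

# Column names and number of NA values per column
names(diabet)
colSums(is.na(diabet))

# Distribution of BMI
hist(diabet$BMI, col = 'red')
plot(diabet$BMI)
boxplot(diabet$BMI)
mean(diabet$BMI)
median(diabet$BMI)
max(diabet$BMI)
min(diabet$BMI)
boxplot(newdata2$SkinThickness)

# Frequency tables and charts; diabet is reused with $ instead of
# reading the file a second time and calling attach()
class(diabet$BMI)
outcome_count <- table(diabet$Outcome)
outcome_count
barplot(outcome_count, col = 2)
pie(outcome_count)
pregnancy_count <- table(diabet$Pregnancies)
pregnancy_count
barplot(pregnancy_count)
pie(pregnancy_count)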

Output

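read.csv reads the CSV file into the data frame diabet.

head displays the first 6 rows of the dataset.

str displays the structure of the dataset: the number of observations and variables, and each column's name, type, and first few values.

summary gives, for each numeric column, the minimum, first quartile, median, mean, third quartile, and maximum.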
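What these three first-look functions return can be sketched on a small data frame. The name toy and all of its values are made up for illustration and are not taken from diabetes.csv.

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  Pregnancies = c(2, 0, 4, 1, 3, 0, 5, 1),
  BMI         = c(22.5, 30.1, 27.4, 35.0, 24.8, 29.3, 41.2, 26.0),
  Outcome     = c(0, 0, 1, 1, 0, 0, 1, 0)
)

first_rows <- head(toy)           # first 6 of the 8 rows
str(toy)                          # prints dimensions, column names, types, first values
bmi_summary <- summary(toy$BMI)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

print(first_rows)
print(bmi_summary)
```

Note that str reports structure (types and sizes), not text strings, which is why it is usually run before summary.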

Displays rows 1 to 10 of the dataset.

Displays columns 1 and 2 of the dataset.

Displays the first 10 rows of columns 1 and 2.
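The [rows, columns] indexing used here can be sketched on a small data frame. The name toy and its values are made up for illustration; leaving one side of the comma empty means "all rows" or "all columns".

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  Pregnancies = c(2, 0, 4, 1, 3),
  Glucose     = c(95, 110, 140, 88, 127),
  BMI         = c(22.5, 30.1, 27.4, 35.0, 24.8)
)

rows_1_to_3  <- toy[1:3, ]      # rows 1 to 3, all columns
cols_1_and_2 <- toy[, 1:2]      # all rows, columns 1 and 2
block        <- toy[1:3, 1:2]   # rows 1 to 3 of columns 1 and 2

print(dim(rows_1_to_3))
print(dim(cols_1_and_2))
print(dim(block))
```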

Displays the rows whose Outcome is 1.

Displays the rows whose Outcome is 0 and whose Pregnancies value is 1.

Displays the rows whose Pregnancies value is 1 or whose Outcome is 0 (the condition uses |, so either one is enough), keeping only the 1st and 2nd columns.
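The difference between & (both conditions must hold) and | (either condition is enough) in subset can be sketched on a small data frame. The name toy and its values are made up for illustration.

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  Pregnancies = c(1, 1, 2, 0, 3),
  Glucose     = c(90, 120, 150, 100, 95),
  Outcome     = c(0, 1, 0, 0, 1)
)

# AND: only rows with Pregnancies == 1 and Outcome == 0
both <- subset(toy, Pregnancies == 1 & Outcome == 0)

# OR: rows with Pregnancies == 1, or Outcome == 0, or both;
# select = c(1, 2) keeps only the first two columns
either <- subset(toy, Pregnancies == 1 | Outcome == 0, select = c(1, 2))

print(both)
print(either)
```

In this sketch the AND filter keeps one row while the OR filter keeps four, which shows how much wider the OR condition is.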

The dataset is sorted by BMI in ascending order (whole rows are reordered, not just the BMI column).

The dataset is sorted by BMI in descending order.
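How order produces both sort directions can be sketched on a small data frame. The name toy, the id column, and the values are made up for illustration. order returns row positions, and indexing with them reorders whole rows.

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  id  = c("a", "b", "c", "d"),
  BMI = c(31.2, 24.8, 40.5, 27.0)
)

ascending  <- toy[order(toy$BMI), ]    # smallest BMI first
descending <- toy[order(-toy$BMI), ]   # negating the key reverses the order

print(ascending)
print(descending)
```

order(toy$BMI, decreasing = TRUE) is an equivalent way to write the descending sort, and it also works for non-numeric keys.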

aggregate groups the rows of diabet by Outcome and calculates the mean BMI of each group.
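The formula form of aggregate can be sketched on a small data frame. The name toy and its values are made up for illustration. BMI ~ Outcome reads as "BMI grouped by Outcome".

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  BMI     = c(20, 30, 25, 35),
  Outcome = c(0, 0, 1, 1)
)

# One row per Outcome value, holding the mean BMI of that group
group_means <- aggregate(BMI ~ Outcome, data = toy, FUN = mean)
print(group_means)
```

Swapping FUN = mean for FUN = median or FUN = sd would give other per-group summaries from the same call.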

names displays the column names of the diabet table.

colSums(is.na(diabet)) counts the missing (NA) values in each column of diabet; a count of 0 means that column has no NA entries.
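The NA count can be sketched on a small data frame. The name toy and its values are made up for illustration. The second count rests on an assumption that is not stated in this report: that a data set may record a missing measurement as 0 rather than NA, in which case is.na alone would not find it.

```r
# Toy data frame with made-up values (not taken from diabetes.csv)
toy <- data.frame(
  Glucose = c(110, NA, 95, 0),
  BMI     = c(0, 27.5, NA, NA)
)

# is.na gives a TRUE/FALSE table; colSums counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Assumption for illustration: zeros may stand for missing measurements
zero_counts <- colSums(toy == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)
```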

hist computes a histogram of the given data values. If plot = TRUE, the resulting object of class "histogram" is plotted by plot.histogram, before it is returned.
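The object that hist returns can be inspected without drawing anything by passing plot = FALSE. The sketch below uses a made-up vector bmi and hand-picked breaks, neither of which comes from diabetes.csv.

```r
# Made-up BMI values for illustration (not taken from diabetes.csv)
bmi <- c(18.5, 22.0, 24.5, 27.0, 31.5, 33.0, 36.5, 41.0)

# plot = FALSE returns the "histogram" object instead of drawing it;
# the breaks give the bins (15, 25], (25, 35] and (35, 45]
h <- hist(bmi, breaks = seq(15, 45, by = 10), plot = FALSE)

print(class(h))
print(h$breaks)
print(h$counts)   # number of values falling in each bin
```

Calling plot(h, col = "red") afterwards draws the stored histogram.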