EDA (Exploratory Data Analysis) IN R - Medical Data

                         




Description of EDA (Exploratory Data Analysis)

  • Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

  • Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

  • A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

  • Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. 

Role / Importance

  • Exploratory data analysis (EDA) is a key, and frequently undervalued, component of any data science task.

  • At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. 

  • This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents. 

  • This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems. 

  • EDA should be part of the way data science operates in your organization.



PROBLEM

About data set 

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Data Description

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U / ml)

  • BMI: Body mass index (weight in kg / (height in m) ^ 2)

  • DiabetesPedigreeFunction: Diabetes pedigree function

  • Age: Age (years)

  • Outcome: Class variable (0 or 1); 1 indicates diabetes is present

Perform exploratory data analysis and give your analytical observations using numerical and graphical summaries for the given data set.


Source Code


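# Load the data (adjust the path to wherever diabetes.csv is stored)
diabet <- read.csv('C:/Semester 6/Data Science/EDA/diabetes.csv')

# First look: first six rows, structure, and summary statistics
head(diabet)
str(diabet)
summary(diabet)

# Indexing: rows 1-10, columns 1-2, then both together
diabet[1:10, ]
diabet[, 1:2]
diabet[1:10, 1:2]

# Subsetting: the columns are numeric, so compare with numbers, not strings
newdata1 <- subset(diabet, Outcome == 1)
newdata1
newdata2 <- subset(diabet, Pregnancies == 1 & Outcome == 0)
newdata2
newdata3 <- subset(diabet, Pregnancies == 1 | Outcome == 0, select = c(1, 2))
newdata3

# Sorting by BMI: ascending, then descending
newdata4 <- diabet[order(diabet$BMI), ]
newdata4
newdata5 <- diabet[order(-diabet$BMI), ]
newdata5

# Mean BMI for each Outcome group
newdata6 <- aggregate(BMI ~ Outcome, data = diabet, FUN = mean)
newdata6

# Column names and number of missing values (NA) per column
names(diabet)
colSums(is.na(diabet))

# Distribution of BMI: plots and basic statistics
hist(diabet$BMI, col = 'red')
plot(diabet$BMI)
boxplot(diabet$BMI)
mean(diabet$BMI)
median(diabet$BMI)
max(diabet$BMI)
min(diabet$BMI)

# Skin thickness for patients with one pregnancy and Outcome 0
boxplot(newdata2$SkinThickness)

# Read the data again and attach it so columns can be used by name
data <- read.csv('C:/Semester 6/Data Science/EDA/diabetes.csv')
attach(data)
data
class(BMI)

# Frequency table, bar graph, and pie chart for Outcome
table(Outcome)
count <- table(Outcome)
barplot(count, col = 2)
pie(count)

# Frequency table, bar graph, and pie chart for Pregnancies
table(Pregnancies)
count <- table(Pregnancies)
barplot(count)
pie(count)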

Output


Data read 

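head displays the first six rows of the dataset.

str shows the structure of the data frame: the number of observations and variables, and the type and first few values of each column.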



summary gives the minimum, quartiles, median, mean, and maximum of each column of the diabet dataset.


Displays rows 1 to 10 of the dataset.


Displays columns 1 and 2 of the dataset.



Displays the first 10 rows of columns 1 and 2.


Displays the rows whose Outcome is 1.





Displays the rows whose Outcome is 0 and whose Pregnancies value is 1.













Displays only the 1st and 2nd columns for the rows whose Pregnancies value is 1 or whose Outcome is 0. The condition uses |, a logical OR, so a row needs to satisfy only one of the two tests.
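
The difference between & and | in subset is easy to miss, so here is a minimal sketch on a toy data frame. The values are made up for illustration and are not rows from diabetes.csv; the names toy, both and either are hypothetical.

```r
# Toy data frame with made-up values (NOT rows from diabetes.csv)
toy <- data.frame(
  Pregnancies = c(1, 1, 0, 3, 1),
  Glucose     = c(85, 140, 99, 120, 110),
  Outcome     = c(0, 1, 0, 1, 0)
)

# AND: a row is kept only if both conditions hold
both <- subset(toy, Pregnancies == 1 & Outcome == 0)

# OR: a row is kept if at least one condition holds;
# select = c(1, 2) keeps only the first two columns
either <- subset(toy, Pregnancies == 1 | Outcome == 0, select = c(1, 2))

nrow(both)     # 2
nrow(either)   # 4
names(either)  # "Pregnancies" "Glucose"
```

The OR version always returns at least as many rows as the AND version.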


All rows of the dataset are sorted by BMI in ascending order.


Sorts the rows by BMI in descending order.
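
Negating the column, as in order(-diabet$BMI), only works for numeric columns. As a small sketch on made-up values (the toy data frame and its id column are assumptions for illustration), the decreasing argument of order gives the same descending sort and also works for character columns:

```r
# Toy data frame with made-up values; id marks the original row
toy <- data.frame(id = 1:4, BMI = c(33.6, 26.6, 43.1, 28.1))

# Ascending sort by BMI
asc <- toy[order(toy$BMI), ]

# Descending sort by BMI without negating the column
desc <- toy[order(toy$BMI, decreasing = TRUE), ]

asc$id   # 2 4 1 3
desc$id  # 3 1 4 2
```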


aggregate groups the rows of diabet by Outcome and calculates the mean BMI of each group.
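
To make the grouping step concrete, the sketch below runs the same aggregate call on a toy data frame. The values are made up so that the group means are easy to check by hand; tapply is shown only as an alternative way to get the same numbers.

```r
# Toy data frame with made-up values (NOT rows from diabetes.csv)
toy <- data.frame(
  BMI     = c(22, 26, 30, 34, 38, 30),
  Outcome = c(0, 0, 0, 1, 1, 1)
)

# One row per Outcome value, with the mean BMI of that group
by_outcome <- aggregate(BMI ~ Outcome, data = toy, FUN = mean)
by_outcome$BMI  # 26 34

# tapply returns the same means as a named vector
tapply(toy$BMI, toy$Outcome, mean)
```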


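names returns the column names of the diabet table.


is.na marks every missing value in diabet as TRUE, and colSums adds these up column by column, giving the number of missing values in each column. A count of 0 for every column means that the table contains no NA entries.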
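
A count of zero NA values does not by itself show that nothing is missing: a BMI of 0, for example, is not a possible measurement, so zeros in such columns may be standing in for missing data. That is an assumption to check against the data source, not something the code can confirm. A minimal sketch on made-up values that counts both NA entries and zeros per column:

```r
# Toy data frame with made-up values (NOT rows from diabetes.csv)
toy <- data.frame(
  Glucose = c(148, 85, NA, 89),
  BMI     = c(33.6, 0, 23.3, 28.1),
  Insulin = c(0, 0, 94, NA)
)

# Number of NA entries in each column
na_counts <- colSums(is.na(toy))
na_counts    # Glucose 1, BMI 0, Insulin 1

# Number of exact zeros in each column (NA entries are skipped)
zero_counts <- colSums(toy == 0, na.rm = TRUE)
zero_counts  # Glucose 0, BMI 1, Insulin 2
```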

 hist computes a histogram of the given data values. If plot = TRUE, the resulting object of class "histogram" is plotted by plot.histogram, before it is returned.



Shows the histogram of BMI from the diabet dataset.

The 30-40 range of BMI has the highest frequency.



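Plots the BMI column of diabet against the row index.



The points show no trend with row order. Because this plot only sets BMI against the row index, it cannot show whether BMI and diabetes are related; for that, BMI has to be compared across the Outcome groups, as the aggregate output above does.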
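
Checking whether BMI and diabetes are related needs a comparison between the Outcome groups. The sketch below does this on made-up values (not rows from diabetes.csv) with group means and a correlation coefficient; the commented boxplot call is the graphical version of the same comparison.

```r
# Toy data frame with made-up values (NOT rows from diabetes.csv)
toy <- data.frame(
  BMI     = c(22, 25, 28, 31, 33, 36, 39, 42),
  Outcome = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Mean BMI within each Outcome group
group_means <- tapply(toy$BMI, toy$Outcome, mean)
group_means  # 0: 27, 1: 37

# Correlation between BMI and the 0/1 Outcome;
# a positive value means higher BMI goes with Outcome 1
r <- cor(toy$BMI, toy$Outcome)

# Graphical version: one box per Outcome group
# boxplot(BMI ~ Outcome, data = toy)
```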


Boxplot for BMI

There are outliers above 50 and below 10.

These values can be removed and the remaining data taken for further analysis.

The box is roughly symmetric above and below the median.
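
The points a boxplot draws as outliers can also be listed and removed in code. This is a minimal sketch using boxplot.stats, which applies the same rule as boxplot (points more than 1.5 times the box height beyond the box). The bmi vector holds made-up values, and dropping outliers this way is one possible choice, not the only one.

```r
# Made-up BMI values with one very low and one very high point
bmi <- c(0, 24.5, 27.1, 29.8, 31.2, 32.0, 33.6, 35.4, 38.9, 67.1)

# boxplot.stats returns the box statistics and the outlying points
stats <- boxplot.stats(bmi)
stats$out  # 0.0 67.1

# Keep only the values that are not flagged as outliers
bmi_clean <- bmi[!bmi %in% stats$out]
length(bmi_clean)  # 8
```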



Calculates the mean, median, max and min value for BMI



When the mean and median are close to each other, the data are roughly symmetric.

Hence, BMI is symmetric and a normal distribution can reasonably be fitted.
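
Symmetry can be checked numerically as well as by eye. The sketch below compares mean and median and computes a simple moment-based skewness on two made-up vectors; the helper skewness is a hypothetical function written here for illustration, not part of base R.

```r
# Simple moment-based skewness: about 0 for symmetric data,
# positive when the right tail is longer
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Made-up values: one symmetric vector, one right-skewed vector
symmetric <- c(20, 25, 28, 30, 32, 34, 36, 39, 44)
skewed    <- c(20, 21, 22, 23, 24, 26, 30, 40, 64)

mean(symmetric) - median(symmetric)  # 0
skewness(symmetric)                  # 0

mean(skewed) - median(skewed)        # 6
skewness(skewed) > 0                 # TRUE

# A normal Q-Q plot is the usual graphical check:
# qqnorm(symmetric); qqline(symmetric)
```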

Takes the SkinThickness column from newdata2 and draws its box plot.

There are no outliers for skin thickness.






Bar Graph

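The bar graph shows the number of patients in each Outcome class. In the standard 768-row version of this dataset, 500 patients have Outcome 0 and 268 have Outcome 1, so most of the patients are not diabetic.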
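
Which class is in the majority can be read from the counts directly instead of being judged from bar heights. A minimal sketch on a made-up Outcome vector (not the real column):

```r
# Made-up Outcome values: 0 = no diabetes, 1 = diabetes
outcome <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

# Counts per class and the same counts as proportions
counts <- table(outcome)
shares <- prop.table(counts)

counts[["0"]]             # 7
counts[["1"]]             # 3
names(which.max(counts))  # "0", the majority class
```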




Pie Chart

Plots the Outcome counts as a pie chart.






table(Pregnancies) counts how many patients have each number of pregnancies.



Plots Bar graph for pregnancy table








Plots pie chart for pregnancy 



0, 1 and 2 are the most frequent numbers of pregnancies. The table covers all patients, not only diabetic ones.
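
To say something about diabetic patients specifically, the Pregnancies counts have to be split by Outcome. The sketch below does this on a toy data frame with made-up values; the names all_counts, cross and diabetic_counts are hypothetical.

```r
# Toy data frame with made-up values (NOT rows from diabetes.csv)
toy <- data.frame(
  Pregnancies = c(0, 1, 1, 2, 2, 5, 7, 1, 0, 8),
  Outcome     = c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1)
)

# Counts over all patients, and the most frequent value
all_counts <- table(toy$Pregnancies)
names(all_counts)[all_counts == max(all_counts)]  # "1"

# Two-way table: rows are Pregnancies values, columns are Outcome
cross <- table(toy$Pregnancies, toy$Outcome, dnn = c("Pregnancies", "Outcome"))
cross["1", "0"]  # 3

# Counts for diabetic patients only (Outcome == 1)
diabetic_counts <- table(toy$Pregnancies[toy$Outcome == 1])
sum(diabetic_counts)  # 5
```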


