Description of EDA (Exploratory Data Analysis)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
Role / Importance
Exploratory data analysis (EDA) is a key component of any data science task, yet it is frequently undervalued.
At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis.
This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents.
This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems.
EDA should be part of the way data science operates in your organization.
PROBLEM
About data set
The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Data Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U / ml)
BMI: Body mass index (weight in kg / (height in m) ^ 2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
1 indicates diabetes is present
Perform exploratory data analysis and give your analytical observations using numerical and graphical summaries for the given dataset.
Source Code
Output
Data read
head displays the first 6 rows of the dataset
str shows the structure of the dataset (the type and first few values of each column)
summary gives summary statistics for each column of the diabet dataset
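The three inspection calls described above can be sketched as follows. Since the original source code is not reproduced here, this is only an illustration: the small `diabet` data frame below holds a handful of illustrative rows with a subset of the columns, standing in for the full dataset.

```r
# Illustrative stand-in for the diabet dataset (a few rows, subset of columns)
diabet <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10),
  Glucose     = c(148, 85, 183, 89, 137, 116, 78, 115),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3),
  Age         = c(50, 31, 32, 21, 33, 30, 26, 29),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0)
)

first_rows <- head(diabet)  # first 6 rows by default
print(first_rows)

str(diabet)                 # column types and the first few values of each

col_summary <- summary(diabet)  # min, quartiles, mean and max per column
print(col_summary)
```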
Displays rows 1 to 10 of the dataset
Displays columns 1 and 2 of the dataset
Displays the first 10 rows of columns 1 and 2
The rows whose Outcome is 1 are displayed
The rows whose Outcome is 0 and whose Pregnancies value is 1 are displayed
For the rows whose Outcome is 0 and whose Pregnancies value is 1, only the 1st and 2nd columns are displayed
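A minimal sketch of the row, column and conditional selections described above, using base R indexing and `subset`. The data frame is an illustrative stand-in, the variable names (`positives`, `neg_one_preg`, `neg_one_preg_cols`) are hypothetical, and rows 1 to 3 are used instead of 1 to 10 because the toy table is short.

```r
# Illustrative stand-in for the diabet dataset
diabet <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10),
  Glucose     = c(148, 85, 183, 89, 137, 116, 78, 115),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0)
)

diabet[1:3, ]      # a range of rows, all columns
diabet[, 1:2]      # all rows, columns 1 and 2
diabet[1:3, 1:2]   # a range of rows from columns 1 and 2

# Rows where Outcome is 1
positives <- subset(diabet, Outcome == 1)

# Rows where Outcome is 0 and Pregnancies is 1
neg_one_preg <- subset(diabet, Outcome == 0 & Pregnancies == 1)

# Same filter, keeping only the first two columns
neg_one_preg_cols <- subset(diabet, Outcome == 0 & Pregnancies == 1,
                            select = c(Pregnancies, Glucose))
print(neg_one_preg_cols)
```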
The dataset is sorted (i.e. all rows are reordered together).
Sorts the BMI values
aggregate takes the diabet table and computes the mean of the BMI data
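The sorting and aggregation steps might look like the sketch below. Two points are assumptions, because the original code is not shown: the whole table is ordered by BMI, and the mean BMI is grouped by Outcome. The data frame is an illustrative stand-in.

```r
# Illustrative stand-in for the diabet dataset
diabet <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0)
)

# Reorder all rows by ascending BMI (assumed sort key)
sorted_rows <- diabet[order(diabet$BMI), ]

# Sort only the BMI values
sorted_bmi <- sort(diabet$BMI)

# Mean BMI per Outcome group (assumed grouping variable)
bmi_by_outcome <- aggregate(BMI ~ Outcome, data = diabet, FUN = mean)
print(bmi_by_outcome)
```

`order` returns row positions, so indexing with it keeps each patient's values together, whereas `sort` returns just the sorted vector.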
names returns the column names (titles) of the diabet table
colSums forms column sums for numeric arrays; here it is applied to the 0 elements of the diabet table
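One plausible reading of the `colSums` step is counting how many entries in each column are 0, which is a common check on this kind of medical data. That reading is an assumption, and the sketch below uses an illustrative stand-in data frame.

```r
# Illustrative stand-in for the diabet dataset
diabet <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0)
)

col_names <- names(diabet)   # column names of the table
print(col_names)

# diabet == 0 is a logical matrix; summing each column counts the zeros
zero_counts <- colSums(diabet == 0)
print(zero_counts)
```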
hist computes a histogram of the given data values. If plot = TRUE, the resulting object of class "histogram" is plotted by plot.histogram, before it is returned.
Shows the histogram for BMI from diabet dataset
30-40 range of BMI has highest Frequency.
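As a rough sketch of how the histogram and its most frequent bin can be obtained: `hist` with `plot = FALSE` returns the "histogram" object without drawing, and `plot` then draws it. The BMI values are an illustrative stand-in, so the bins will not match the figure discussed above.

```r
# Illustrative BMI values standing in for diabet$BMI
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3)

# Compute the histogram without plotting
h <- hist(bmi, plot = FALSE)

# Bin with the highest frequency, reported as a (lower, upper] range
top <- which.max(h$counts)
modal_range <- c(h$breaks[top], h$breaks[top + 1])
print(modal_range)

# Draw the histogram from the stored object
plot(h, main = "Histogram of BMI", xlab = "BMI")
```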
plot draws the BMI column of the diabet dataset
No clear relationship between diabetes and BMI is visible in the plot
Boxplot for BMI
There are outliers above 50 and below 10.
The observations above 50 and below 10 can be removed and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
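The outlier check and the removal suggested above can be sketched as follows. The BMI vector is illustrative, with two extreme values included on purpose, and the cut-offs 10 and 50 are taken from the observation above rather than computed.

```r
# Illustrative BMI values, including two deliberately extreme entries
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 0, 67.1)

# Points beyond 1.5 times the hinge spread, as drawn by boxplot()
outliers <- boxplot.stats(bmi)$out
print(outliers)

boxplot(bmi, main = "Boxplot of BMI", ylab = "BMI")

# Keep only the values between the cut-offs for further analysis
bmi_trimmed <- bmi[bmi >= 10 & bmi <= 50]
print(bmi_trimmed)
```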
Calculates the mean, median, max and min value for BMI
The BMI data are roughly symmetric, so a normal distribution can be fitted.
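A quick numerical check of symmetry is to compare the mean with the median: when they are close, the distribution is roughly symmetric. The sketch below uses illustrative BMI values, so the numbers are not those of the full dataset.

```r
# Illustrative BMI values standing in for diabet$BMI
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3)

bmi_stats <- c(
  mean   = mean(bmi),
  median = median(bmi),
  min    = min(bmi),
  max    = max(bmi)
)
print(bmi_stats)

# Gap between mean and median; a small gap suggests symmetry
gap <- bmi_stats[["mean"]] - bmi_stats[["median"]]
print(gap)
```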
It retrieves the skin thickness data from newdata2 and draws its box plot
No outliers for skin thickness
Bar Graph
Most of the people are Diabetic
Pie Chart
Plots the count
Considers table pregnancies
Plots Bar graph for pregnancy table
Plots pie chart for pregnancy
Among diabetic patients, 0, 1 and 2 are the most frequent numbers of pregnancies.
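The count-based charts can be sketched with `table`, `barplot` and `pie`. The data frame is an illustrative stand-in, and restricting the pregnancy table to rows with Outcome equal to 1 is an assumption based on the observation about diabetic patients.

```r
# Illustrative stand-in for the diabet dataset
diabet <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0)
)

# Number of patients in each Outcome class
outcome_counts <- table(diabet$Outcome)
barplot(outcome_counts, main = "Outcome", xlab = "Outcome", ylab = "Count")
pie(outcome_counts, main = "Outcome")

# Pregnancy counts among diabetic patients (assumed filter: Outcome == 1)
preg_counts <- table(diabet$Pregnancies[diabet$Outcome == 1])
barplot(preg_counts, main = "Pregnancies (Outcome = 1)",
        xlab = "Number of pregnancies", ylab = "Count")
pie(preg_counts, main = "Pregnancies (Outcome = 1)")
```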