EDA
Description of EDA (Exploratory Data Analysis)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
Role / Importance
Exploratory data analysis (EDA) is a key component of any data science task, and one that is frequently undervalued.
At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis.
This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents.
This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems.
EDA should be part of the way data science operates in your organization.
PROBLEM
About dataset
The data set contains bone measurements of birds, together with the ecological group of each bird, which reflects its living habits.
There are 420 birds in this dataset. Each bird is represented by 10 measurements (features):
Length and Diameter of Humerus
Length and Diameter of Ulna
Length and Diameter of Femur
Length and Diameter of Tibiotarsus
Length and Diameter of Tarsometatarsus
All measurements are continuous floating-point values in millimetres (mm). The skeletons in this dataset come from the collections of the Natural History Museum of Los Angeles County. They belong to 21 orders, 153 genera, and 245 species.
According to their living environments and living habits, birds are classified into different ecological groups. The 6 ecological groups used in this dataset are:
Swimming Birds
Wading Birds
Terrestrial Birds
Raptors
Scansorial Birds
Singing Birds
Each bird has a number for its ecological group:
1: Swimming Birds
2: Wading Birds
3: Terrestrial Birds
4: Raptors
5: Scansorial Birds
6: Singing Birds
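As a minimal sketch, the numeric codes can be turned into readable labels with factor(). The small data frame below is an invented stand-in for bird.csv, and the column name type is assumed from the code later in this post.

```r
# Hypothetical stand-in for a few rows of bird.csv (values are invented).
demo <- data.frame(
  huml = c(80.8, 61.9, 33.1, 18.5),
  type = c(1, 2, 6, 6)
)

# Map the numeric codes 1..6 to the ecological group names listed above.
group_names <- c("Swimming", "Wading", "Terrestrial",
                 "Raptors", "Scansorial", "Singing")
demo$group <- factor(demo$type, levels = 1:6, labels = group_names)

# Count birds per group; groups with no rows show up as 0.
print(table(demo$group))
```

Keeping the labels as a factor means that tables and plots show group names instead of bare numbers.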
SOURCE CODE
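# Load the dataset (adjust the path to wherever bird.csv is stored)
bird <- read.csv('C:/SK/bird.csv')
head(bird)       # first six rows
str(bird)        # column names and types
summary(bird)    # per-column summary statistics
bird[1:10, ]     # first 10 rows, all columns
bird[, 1:2]      # all rows, first 2 columns
bird[1:10, 1:2]  # first 10 rows of columns 1 and 2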
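# Rows for a single ecological group (type codes are listed above)
Swimming_birds <- subset(bird, type == 1)
Swimming_birds
Wading_birds <- subset(bird, type == 2)
Wading_birds
Terrestrial_birds <- subset(bird, type == 3)
Terrestrial_birds
# Thresholds are written as numbers; a quoted threshold would be compared as text
newdata1 <- subset(bird, ulnaw >= 5 & type == 4)   # raptors with ulna diameter >= 5 mm
newdata1
newdata2 <- subset(bird, tibl >= 50 & type == 6)   # singing birds with tibiotarsus length >= 50 mm
newdata2
# Sort all rows by humerus length, ascending
newdata3 <- bird[order(bird$huml), ]
newdata3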
plot(bird$huml,bird$humw)
plot(bird$ulnal,bird$ulnaw)
plot(bird$feml,bird$femw)
plot(bird$tibl,bird$tibw)
plot(bird$tarl,bird$tarw)
# Ten boxplots on one page (2 rows x 5 columns).
# The vertical axis of a boxplot shows the measured value, not a frequency.
par(mfrow=c(2,5))
boxplot(bird$huml,xlab="Length of Humerus",ylab="mm")
boxplot(bird$humw,xlab="Diameter of Humerus",ylab="mm")
boxplot(bird$ulnal,xlab="Length of Ulna",ylab="mm")
boxplot(bird$ulnaw,xlab="Diameter of Ulna",ylab="mm")
boxplot(bird$feml,xlab="Length of Femur",ylab="mm")
boxplot(bird$femw,xlab="Diameter of Femur",ylab="mm")
boxplot(bird$tibl,xlab="Length of Tibiotarsus",ylab="mm")
boxplot(bird$tibw,xlab="Diameter of Tibiotarsus",ylab="mm")
boxplot(bird$tarl,xlab="Length of Tarsometatarsus",ylab="mm")
boxplot(bird$tarw,xlab="Diameter of Tarsometatarsus",ylab="mm")
# Mean, median, max and min for one column
# (na.rm = TRUE skips missing values, in case the file contains any)
stats <- function(x) c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
                       max = max(x, na.rm = TRUE), min = min(x, na.rm = TRUE))
# Bone lengths
sapply(bird[c("huml", "ulnal", "feml", "tibl", "tarl")], stats)
#********************************************************
# Bone diameters
sapply(bird[c("humw", "ulnaw", "femw", "tibw", "tarw")], stats)
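# Number of birds in each ecological group
# (bird is already loaded above, so it is not read again and attach() is not needed)
count <- table(bird$type)
count
par(mfrow = c(1, 1))  # back to one plot per page
barplot(count, col = 2)
pie(count)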
OUTPUT
The first 10 rows of columns 1 and 2 are displayed.
The rows whose type is 1 (swimming birds) are displayed.
The rows whose type is 2 (wading birds) are displayed.
The rows whose type is 3 (terrestrial birds) are displayed.
The rows in which the ulna diameter (ulnaw) is >= 5 mm and type is 4 (raptors) are displayed.
The rows in which the tibiotarsus length (tibl) is >= 50 mm and type is 6 (singing birds) are displayed.
The dataset is sorted by humerus length (huml) in ascending order, i.e. all rows are reordered.
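One detail worth checking in filters like these: in R, a threshold written in quotes is compared as text rather than as a number, because the numeric side is converted to a string first. The sketch below uses invented values, not rows from bird.csv, to show the difference.

```r
# Invented ulna diameters (mm), purely for illustration.
ulnaw <- c(3.2, 5.0, 9.7, 12.4)

# Quoted threshold: values are converted to text and compared character by character.
as_text <- ulnaw >= "5"

# Unquoted threshold: an ordinary numeric comparison.
as_number <- ulnaw >= 5

print(data.frame(ulnaw, as_text, as_number))
```

Here 12.4 passes the numeric filter but fails the text one, because the character "1" sorts before "5".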
Histogram
Histogram of length of Humerus and Diameter of Humerus
Humerus lengths in the 0-50 mm range have the highest frequency, as do diameters in the 2-4 mm range.
Histogram of length of Ulna and Diameter of Ulna
Ulna lengths in the 0-50 mm range have the highest frequency, as do diameters in the 0-2 mm range.
Histogram of length of Femur and Diameter of Femur
Femur lengths in the 20-30 mm range have the highest frequency, as do diameters in the 1-2 mm range.
Histogram of length of Tibiotarsus and Diameter of Tibiotarsus
Tibiotarsus lengths in the 20-40 mm range have the highest frequency, as do diameters in the 1-2 mm range.
Histogram of length of Tarsometatarsus and Diameter of Tarsometatarsus
Tarsometatarsus lengths in the 20-40 mm range have the highest frequency, as do diameters in the 1-2 mm range.
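The source code section does not list the calls behind these histograms. One way they could be drawn is with hist(); the sketch below is an assumption about how that might look, using randomly generated right-skewed values in place of the real huml and humw columns.

```r
set.seed(1)  # make the invented data reproducible

# Invented stand-ins for humerus length and diameter (mm).
huml <- rlnorm(200, meanlog = 4, sdlog = 0.6)
humw <- rlnorm(200, meanlog = 1.3, sdlog = 0.5)

# Two panels side by side; restore the previous layout afterwards.
old_par <- par(mfrow = c(1, 2))
h_len <- hist(huml, main = "Length of Humerus",
              xlab = "Length (mm)", ylab = "Frequency")
h_dia <- hist(humw, main = "Diameter of Humerus",
              xlab = "Diameter (mm)", ylab = "Frequency")
par(old_par)

# The bin with the largest count is the "highest frequency" range.
top <- which.max(h_len$counts)
cat("Most frequent length range:",
    h_len$breaks[top], "-", h_len$breaks[top + 1], "mm\n")
```

With the real data, bird$huml and bird$humw would replace the generated vectors.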
Scatter Plot
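Humerus length vs. diameter: moderately correlated.
Ulna length vs. diameter: moderately correlated.
Femur length vs. diameter: highly correlated.
Tibiotarsus length vs. diameter: moderately correlated.
Tarsometatarsus length vs. diameter: no correlation.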
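Labels such as "moderately" or "highly" correlated can be backed by a number using cor(). A minimal sketch with invented data follows; on the real data the same call would take columns such as bird$feml and bird$femw. The argument use = "complete.obs" is included on the assumption that some measurements might be missing.

```r
set.seed(42)  # reproducible invented data

# Invented bone lengths (mm) and a diameter that grows with length plus noise.
feml <- runif(100, min = 10, max = 100)
femw <- 0.08 * feml + rnorm(100, sd = 0.5)
femw[5] <- NA  # simulate one missing measurement

# Pearson correlation, skipping rows where either value is missing.
r <- cor(feml, femw, use = "complete.obs")
cat("Pearson r =", round(r, 2), "\n")
```

Values of r close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.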
Box plot
Boxplot of length of Humerus and Diameter of Humerus
Outliers appear above 190 mm in humerus length and above 30 mm in diameter.
Data above 190 mm in length and above 30 mm in diameter can be removed, and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
Boxplot of length of Ulna and Diameter of Ulna
Outliers appear above 200 mm in ulna length and above 9 mm in diameter.
Data above 200 mm in length and above 9 mm in diameter can be removed, and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
Boxplot of length of Femur and Diameter of Femur
Outliers appear above 85 mm in femur length and above 8 mm in diameter.
Data above 85 mm in length and above 8 mm in diameter can be removed, and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
Boxplot of length of Tibiotarsus and Diameter of Tibiotarsus
Outliers appear above 150 mm in tibiotarsus length and above 8.2 mm in diameter.
Data above 150 mm in length and above 8.2 mm in diameter can be removed, and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
Boxplot of length of Tarsometatarsus and Diameter of Tarsometatarsus
Outliers appear above 95 mm in tarsometatarsus length and above 7 mm in diameter.
Data above 95 mm in length and above 7 mm in diameter can be removed, and the remaining data used for further analysis.
The box is roughly symmetric above and below the median.
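The points drawn beyond the whiskers can also be listed directly instead of being read off the plot. As a sketch, boxplot.stats() returns them in its out element, using the same 1.5 x IQR rule that boxplot() applies by default. The values below are invented.

```r
# Invented femur lengths (mm) with two unusually large values.
feml <- c(21, 24, 25, 27, 28, 30, 31, 33, 35, 38, 90, 117)

# Values falling outside the whiskers (more than 1.5 * IQR beyond the hinges).
outliers <- boxplot.stats(feml)$out
print(outliers)

# Keep only the values that are not flagged, for further analysis.
feml_trimmed <- feml[!feml %in% outliers]
print(length(feml_trimmed))
```

Whether flagged values should actually be dropped is a judgment call; a very large bird may be a genuine measurement rather than an error.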
Finding mean, median, max, min of bone length
Finding mean, median, max, min of bone diameter
The mean is typically greater than the median, which indicates that the distributions are right-skewed.
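The link between a mean above the median and right skew can be checked on a small example. The numbers below are invented: most values are small and a few are large, which pulls the mean upward while the median stays put.

```r
# Invented bone lengths (mm): many small values, a long tail to the right.
x <- c(20, 22, 25, 27, 30, 32, 35, 60, 110, 180)

m   <- mean(x)
med <- median(x)
cat("mean =", m, " median =", med, "\n")

# A mean above the median is a common sign of right skew.
right_skewed <- m > med
print(right_skewed)
```

This comparison is a quick heuristic rather than a formal test; a histogram of the same column is a useful cross-check.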
Most of the birds are singing and swimming birds
More than half of the birds are classified as singing or swimming birds. Terrestrial birds form the smallest group.
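Statements such as "more than half" can be read off a proportion table. A sketch using prop.table(), with invented group codes rather than the real type column:

```r
# Invented group codes (1 = Swimming ... 6 = Singing), not the real data.
type <- c(1, 1, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6)

counts <- table(type)
shares <- prop.table(counts)      # each count divided by the total
print(round(100 * shares, 1))     # percentage per group

# Combined share of swimming (1) and singing (6) birds.
combined <- sum(shares[c("1", "6")])
cat("Swimming + singing:", round(100 * combined, 1), "%\n")
```

On the real data, table(bird$type) would take the place of the invented vector.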