EDA (Exploratory Data Analysis) IN R - Birds Data

                         




Description of EDA (Exploratory Data Analysis)

  • Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

  • Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

  • A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

  • Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. 

Role / Importance

  • Exploratory data analysis (EDA) is a key component of any data science task, and it is frequently undervalued.

  • At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis. 

  • This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents. 

  • This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems. 

  • EDA should be part of the way data science operates in your organization.



PROBLEM

About dataset

The data set contains bone measurements of birds, together with their ecological groups and living habits.

The dataset contains 420 birds. Each bird is represented by 10 measurements (features):

  • Length and Diameter of Humerus

  • Length and Diameter of Ulna

  • Length and Diameter of Femur

  • Length and Diameter of Tibiotarsus

  • Length and Diameter of Tarsometatarsus

All measurements are continuous floating-point values in millimetres (mm). The skeletons in this dataset come from the collections of the Natural History Museum of Los Angeles County. They belong to 21 orders, 153 genera and 245 species.

According to their living environments and living habits, birds are classified into different ecological groups. There are 8 ecological groups of birds; the six that are coded in this dataset are:

  • Swimming Birds

  • Wading Birds

  • Terrestrial Birds

  • Raptors

  • Scansorial Birds

  • Singing Birds

Each bird has a number for its ecological group:

  • 1: Swimming Birds

  • 2: Wading Birds

  • 3: Terrestrial Birds

  • 4: Raptors

  • 5: Scansorial Birds

  • 6: Singing Birds
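
The numeric codes above can be turned into readable labels while exploring the data. The following is a minimal sketch, assuming the group is stored as the integers 1 to 6 (as the `type` column is used in the source code of this report). The vector `type_codes` and its values are made up for illustration.

```r
# Hypothetical example values; in the real data this would be bird$type
type_codes <- c(1, 6, 3, 6, 2, 4, 5, 1)

# Group names in the order of their codes (1 to 6)
group_names <- c("Swimming", "Wading", "Terrestrial",
                 "Raptors", "Scansorial", "Singing")

# Map code i to the i-th group name
group <- factor(type_codes, levels = 1:6, labels = group_names)

# Counts per named group
table(group)
```

With labels in place, tables and plots show group names instead of bare numbers.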




SOURCE CODE

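# Load the bird bone measurements (adjust the path to your copy of bird.csv)
bird <- read.csv('C:/SK/bird.csv')

head(bird)        # first six rows

str(bird)         # column names and types

summary(bird)     # per-column summary statistics

bird[1:10,]       # first 10 rows, all columns

bird[,1:2]        # all rows, columns 1 and 2

bird[1:10,1:2]    # first 10 rows of columns 1 and 2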

Swimming_birds<-subset(bird,bird$type=="1")

Swimming_birds

Wading_birds<-subset(bird,bird$type=="2")

Wading_birds

Terrestrial_birds<-subset(bird,bird$type=="3")

Terrestrial_birds

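# Raptors (type 4) with an ulna diameter of at least 5 mm.
# The threshold must be a number: a quoted "5" would be compared as text.
newdata1<-subset(bird,bird$ulnaw>=5 & bird$type==4)

newdata1

# Singing birds (type 6) with a tibiotarsus length of at least 50 mm
newdata2<-subset(bird,bird$tibl>=50 & bird$type==6)

newdata2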

newdata3<-bird[order(bird$huml), ]

newdata3
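
A note on the thresholds used in the filters above: when a numeric column is compared with a quoted number, R converts the numbers to text and compares them character by character, which can silently drop rows. The sketch below shows the difference on a made-up vector (the values are not taken from the bird data).

```r
# Made-up diameters in mm, for illustration only
ulnaw <- c(3, 10, 50)

# Numbers are converted to "3", "10", "50" and compared as text
as_text <- ulnaw >= "5"

# Plain numeric comparison
as_number <- ulnaw >= 5

print(as_text)
print(as_number)
```

In the text comparison the value 10 is rejected because "10" sorts before "5", so numeric thresholds are the safer choice.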



plot(bird$huml,bird$humw)

plot(bird$ulnal,bird$ulnaw)

plot(bird$feml,bird$femw)

plot(bird$tibl,bird$tibw)

plot(bird$tarl,bird$tarw)


# Arrange the ten boxplots in a 2 x 5 grid
par(mfrow=c(2,5))

# The y-axis of a boxplot shows the measured values, not a frequency
boxplot(bird$huml,xlab="Length of Humerus (mm)",ylab="Value (mm)")

boxplot(bird$humw,xlab="Diameter of Humerus (mm)",ylab="Value (mm)")

boxplot(bird$ulnal,xlab="Length of Ulna (mm)",ylab="Value (mm)")

boxplot(bird$ulnaw,xlab="Diameter of Ulna (mm)",ylab="Value (mm)")

boxplot(bird$feml,xlab="Length of Femur (mm)",ylab="Value (mm)")

boxplot(bird$femw,xlab="Diameter of Femur (mm)",ylab="Value (mm)")

boxplot(bird$tibl,xlab="Length of Tibiotarsus (mm)",ylab="Value (mm)")

boxplot(bird$tibw,xlab="Diameter of Tibiotarsus (mm)",ylab="Value (mm)")

boxplot(bird$tarl,xlab="Length of Tarsometatarsus (mm)",ylab="Value (mm)")

boxplot(bird$tarw,xlab="Diameter of Tarsometatarsus (mm)",ylab="Value (mm)")


mean(bird$huml)

median(bird$huml)

max(bird$huml)

min(bird$huml)


mean(bird$ulnal)

median(bird$ulnal)

max(bird$ulnal)

min(bird$ulnal)


mean(bird$feml)

median(bird$feml)

max(bird$feml)

min(bird$feml)


mean(bird$tibl)

median(bird$tibl)

max(bird$tibl)

min(bird$tibl)



mean(bird$tarl)

median(bird$tarl)

max(bird$tarl)

min(bird$tarl)


#********************************************************

    

mean(bird$humw)

median(bird$humw)

max(bird$humw)

min(bird$humw)


mean(bird$ulnaw)

median(bird$ulnaw)

max(bird$ulnaw)

min(bird$ulnaw)


mean(bird$femw)

median(bird$femw)

max(bird$femw)

min(bird$femw)


mean(bird$tibw)

median(bird$tibw)

max(bird$tibw)

min(bird$tibw)



mean(bird$tarw)

median(bird$tarw)

max(bird$tarw)

min(bird$tarw)



# Back to one plot per page (the boxplots above used a 2 x 5 grid)
par(mfrow=c(1,1))

# Count the birds in each ecological group
count<-table(bird$type)

count

barplot(count,col=2)

pie(count)


OUTPUT






The first 10 rows of columns 1 and 2 are displayed.

The rows whose type is 1 (swimming birds) are displayed.


The rows whose type is 2 (wading birds) are displayed.


The rows whose type is 3 (terrestrial birds) are displayed.


The rows in which the ulna diameter is >= 5 mm and the type is 4 (raptors) are displayed.

The rows in which the tibiotarsus length is >= 50 mm and the type is 6 (singing birds) are displayed.


The dataset is sorted by humerus length in ascending order (all rows are reordered).



Histogram


Histograms of humerus length and diameter

Humerus lengths in the 0-50 mm range have the highest frequency, as do diameters in the 2-4 mm range.


Histograms of ulna length and diameter

Ulna lengths in the 0-50 mm range have the highest frequency, as do diameters in the 0-2 mm range.


Histograms of femur length and diameter

Femur lengths in the 20-30 mm range have the highest frequency, as do diameters in the 1-2 mm range.


Histograms of tibiotarsus length and diameter

Tibiotarsus lengths in the 20-40 mm range have the highest frequency, as do diameters in the 1-2 mm range.


Histograms of tarsometatarsus length and diameter

Tarsometatarsus lengths in the 20-40 mm range have the highest frequency, as do diameters in the 1-2 mm range.
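
The source code section does not list the calls that produce these histograms. The following is a minimal sketch of how such a pair could be drawn and how the bin with the highest frequency could be read off. It runs on a made-up data frame `toy` that only borrows the column names `huml` and `humw`; the helper `modal_bin` is hypothetical and not part of the original report.

```r
# Toy stand-in for the bird data; the values are random, not real measurements
set.seed(1)
toy <- data.frame(huml = rexp(200, rate = 1 / 60),
                  humw = rexp(200, rate = 1 / 4))

# Draw the two histograms side by side and keep the returned bin counts
old_par <- par(mfrow = c(1, 2))
h_len <- hist(toy$huml, main = "Length of Humerus", xlab = "Length (mm)")
h_dia <- hist(toy$humw, main = "Diameter of Humerus", xlab = "Diameter (mm)")
par(old_par)

# Hypothetical helper: lower and upper edge of the most frequent bin
modal_bin <- function(h) {
  i <- which.max(h$counts)
  c(lower = h$breaks[i], upper = h$breaks[i + 1])
}

modal_bin(h_len)
modal_bin(h_dia)
```

For the real data, `toy` would be replaced by `bird`, and the same pattern would apply to the other four bones.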


Scatter Plot

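Humerus length vs. diameter: moderately correlated.

Ulna length vs. diameter: moderately correlated.


Femur length vs. diameter: highly correlated.


Tibiotarsus length vs. diameter: moderately correlated.



Tarsometatarsus length vs. diameter: no correlation.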
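
The verbal labels above are read from the plots. A correlation coefficient can back them with a number. The sketch below uses `cor()` on made-up length and diameter pairs; the vectors `len` and `dia` are assumptions for illustration and do not come from the bird data.

```r
# Made-up length/diameter pairs (mm); one missing value on purpose
len <- c(20, 35, 50, 65, 80, 95, NA)
dia <- c(1.9, 3.2, 4.1, 5.8, 6.5, 8.1, 9.0)

# Pearson correlation, skipping pairs that contain a missing value
r <- cor(len, dia, use = "complete.obs")
r
```

For the real data the call would look like `cor(bird$huml, bird$humw, use = "complete.obs")`, repeated for each bone.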


Box plot



Boxplots of humerus length and diameter

Outliers appear above 190 mm in humerus length and above 30 mm in diameter.

These values can be removed and the remaining data used for further analysis.

The box is symmetric above and below the median.


Boxplots of ulna length and diameter

Outliers appear above 200 mm in ulna length and above 9 mm in diameter.

These values can be removed and the remaining data used for further analysis.

The box is symmetric above and below the median.


Boxplots of femur length and diameter

Outliers appear above 85 mm in femur length and above 8 mm in diameter.

These values can be removed and the remaining data used for further analysis.

The box is symmetric above and below the median.


Boxplots of tibiotarsus length and diameter

Outliers appear above 150 mm in tibiotarsus length and above 8.2 mm in diameter.

These values can be removed and the remaining data used for further analysis.

The box is symmetric above and below the median.


Boxplots of tarsometatarsus length and diameter

Outliers appear above 95 mm in tarsometatarsus length and above 7 mm in diameter.

These values can be removed and the remaining data used for further analysis.

The box is symmetric above and below the median.
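
The outlier thresholds above are read off the plots. They can also be obtained programmatically with `boxplot.stats()`, which applies the same whisker rule that `boxplot()` uses by default. This is a minimal sketch on a made-up vector `x`, not on the bird measurements.

```r
# Made-up lengths (mm) with one extreme value
x <- c(1:10, 100)

# Whisker rule used by boxplot(): points beyond 1.5 times the box height
stats <- boxplot.stats(x)

# Values that would be drawn as separate outlier points
stats$out

# Keep only the values inside the whiskers
x_trimmed <- x[!x %in% stats$out]
x_trimmed
```

Whether outliers should actually be dropped depends on the analysis; large birds may simply have large bones.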


Finding mean, median, max, min of bone length






Finding mean, median, max, min of bone diameter




The mean is typically greater than the median, which indicates that the measurements follow right-skewed distributions.
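
The mean and median comparison can be done for all columns at once instead of one call per statistic. The sketch below is one possible way, shown on a made-up data frame `toy` that borrows the column names `huml` and `humw`. The argument `na.rm = TRUE` is included on the assumption that some measurements may be missing; without it a single `NA` makes the result `NA`.

```r
# Made-up measurements (mm); one missing value on purpose
toy <- data.frame(
  huml = c(10, 12, 15, 20, 80, NA),
  humw = c(1, 1.5, 2, 2.5, 9, 3)
)

# One column of statistics per measurement
stats <- sapply(toy, function(x) c(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE)
))
stats

# TRUE where the mean exceeds the median (a hint of right skew)
stats["mean", ] > stats["median", ]
```

For the real data, `toy` would be replaced by the ten measurement columns of `bird`.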











Most of the birds are singing or swimming birds.





More than half of the birds are classified as singing or swimming birds. Terrestrial birds form the smallest group.
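
The "more than half" statement can be checked from the group counts with `prop.table()`. This is a minimal sketch on made-up type codes; the vector `type_codes` is an assumption and the resulting shares are not those of the bird data.

```r
# Made-up type codes (1 = swimming, 6 = singing), for illustration only
type_codes <- c(1, 1, 1, 2, 3, 4, 5, 6, 6, 6)

# Counts per group, then each group's share of the total
counts <- table(type_codes)
shares <- prop.table(counts)
round(shares, 2)

# Combined share of swimming (1) and singing (6) birds
share_swim_sing <- sum(shares[c("1", "6")])
share_swim_sing > 0.5
```

For the real data, `type_codes` would be replaced by `bird$type`.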









