PREDICTION LOGISTIC REGRESSION IN R - BIRD DATA SET

   


Description of Logistic Regression

Logistic regression is a statistical method that predicts a categorical outcome, most often a binary one, from prior observations in a data set. It has become an important tool in machine learning, where it lets an application classify incoming data based on historical data. As more relevant data comes in, the model's predicted classifications should improve. Logistic regression can also support data preparation by sorting records into predefined buckets during the extract, transform, load (ETL) process, which stages the information for analysis.

Role / Importance

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
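
To make "P(Y=1) as a function of X" concrete, here is a minimal sketch of the logistic function, which maps a linear predictor to a probability between 0 and 1. The coefficients b0 and b1 are hypothetical values chosen for illustration; they are not estimates from the bird data.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical intercept and slope, for illustration only
b0 <- -2
b1 <- 0.5

x <- c(0, 4, 8)
p <- logistic(b0 + b1 * x)   # P(Y = 1 | x) for each x
print(p)

# Base R provides the same function as plogis()
print(all.equal(p, plogis(b0 + b1 * x)))
```

At x = 4 the linear predictor is zero, so the predicted probability is exactly 0.5; larger x pushes it toward 1 and smaller x toward 0.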


PROBLEM : BIRD DATA SET

Source Code

bird <- read.csv('C:/sk/bird.csv')

#Blank cells were not converted to NA by the read above,

#so re-read the file and list the blank strings in na.strings.

bird<-read.csv(file.choose(),header = TRUE,na.strings=c(""," ","NA"))

class(bird$type)

bird$type<-as.factor(bird$type)#convert the categorical variable to a factor

boxplot(bird)

#Splitting

set.seed(1234)

pd<-sample(2,nrow(bird),replace = TRUE, prob=c(0.8,0.2))#two samples with distribution 0.8 and 0.2

trainingset<-bird[pd==1,]#first partition

validationset<-bird[pd==2,]#second partition

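#Model fitting: regress type on all other columns (data= makes attach() unnecessary)

#With a binomial family, glm() treats the first level of a factor response as 0 and all other levels as 1

model1 <- glm(type~.,family=binomial(link='logit'),data=trainingset)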

summary(model1)

pred <- predict(model1,newdata=validationset,type='response')

pred_status <- ifelse(pred >=0.75,1,0)#classify as 1 when the predicted probability is at least 0.75

pred_status

#confusion matrix: rows are predicted status, columns are the actual type

cf1<-table(pred_status,validationset$type)

cf1

Output

Blank cells were not converted to NA on the first read, so the file was read again with na.strings=c(""," ","NA").

The na.strings argument applies to the body of the file: every field that matches one of the listed strings is replaced with NA.
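
The effect of na.strings can be checked without the original file. The sketch below is only an illustration: the CSV text and its values are hypothetical stand-ins for bird.csv, passed through the text argument of read.csv(). Note that blank fields in numeric columns are read as NA by default, so the difference shows up in character columns such as type.

```r
# Hypothetical in-memory CSV standing in for bird.csv; the type column has
# one empty cell and one cell that holds a single space
csv_text <- "huml,type\n80.5,SW\n77.6,\n70.1, \n65.2,W\n"

raw   <- read.csv(text = csv_text, header = TRUE)
clean <- read.csv(text = csv_text, header = TRUE,
                  na.strings = c("", " ", "NA"))

# Count missing values in the character column under each setting
print(sum(is.na(raw$type)))
print(sum(is.na(clean$type)))
```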




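Convert categorical variable to factor

A factor is how R handles categorical variables. In R versions before 4.0.0, read.table() and read.csv() encode character columns as factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so character columns stay as character vectors and the type column is converted explicitly with as.factor().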
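
A minimal sketch of the conversion, using a hypothetical character vector in place of the type column (the labels are made up for illustration):

```r
# Hypothetical stand-in for a categorical column read as plain character
type_chr <- c("SW", "W", "SW", "T", "W")
print(class(type_chr))

# as.factor() stores the distinct values as levels plus integer codes
type_fac <- as.factor(type_chr)
print(class(type_fac))
print(levels(type_fac))   # distinct categories in sorted order
print(table(type_fac))    # number of observations per category
```

Model functions such as glm() rely on the factor levels to decide how a categorical variable is coded, so it helps to inspect levels() before fitting.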


In total, 400 data variables are present in the data.

The values in every column lie between 0 and 100. Outliers are present in huml, humw, ulnal, ulnaw, feml, femw, tibl, tibw, tarl and tarw.
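
The outliers that boxplot() draws as individual points can also be listed numerically. The sketch below assumes the default boxplot rule (points more than 1.5 times the hinge spread beyond the hinges) and uses hypothetical numbers, not the bird measurements.

```r
# Hypothetical measurements standing in for one numeric column such as huml
x <- c(52, 55, 58, 60, 61, 63, 64, 66, 70, 98)

# boxplot.stats() returns the values that boxplot() would plot as outliers
out <- boxplot.stats(x)$out
print(out)

# For a whole data frame, count the outliers in every numeric column
df <- data.frame(a = x, b = as.numeric(1:10))
n_out <- sapply(df[sapply(df, is.numeric)],
                function(col) length(boxplot.stats(col)$out))
print(n_out)
```

Here the hinges of x are 58 and 66, so the fences sit at 46 and 78 and only the value 98 is flagged.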



We split the data into two chunks: a training set (about 80% of the rows) and a validation set (about 20%). The training set is used to fit the model, which is then tested on the validation set.
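
The split-fit-validate workflow can be sketched end to end on simulated data. Everything below is an assumption made for illustration: the data are generated rather than taken from the bird file, the response is strictly binary, and the cutoff is 0.5 rather than the 0.75 used in the source code.

```r
set.seed(1234)

# Simulated binary data standing in for the bird measurements (hypothetical)
n  <- 500
x  <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x))
df <- data.frame(x = x, y = factor(y))

# Assign each row to partition 1 (about 80%) or partition 2 (about 20%)
pd    <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[pd == 1, ]
valid <- df[pd == 2, ]

# Fit on the training set, predict probabilities on the validation set
fit  <- glm(y ~ x, family = binomial(link = "logit"), data = train)
prob <- predict(fit, newdata = valid, type = "response")

# Fixing the levels keeps the table 2 x 2 even if one class is never predicted
pred <- factor(ifelse(prob >= 0.5, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are observed classes
cm <- table(predicted = pred, actual = valid$y)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```

Accuracy is the share of validation rows on the diagonal of the confusion matrix. Raising the cutoff makes the model more reluctant to predict class 1, which trades false positives for false negatives.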


