PREDICTION
Description of Logistic Regression
Logistic regression is a statistical method that predicts a categorical outcome, most often a binary one, from prior observations in a data set. It has become an important tool in machine learning, where it lets an application classify incoming data based on historical data. As more relevant training data becomes available, the model's predictions generally improve. Logistic regression can also play a role in data preparation, by sorting records into predefined buckets during the extract, transform, load (ETL) process to stage the information for analysis.
Role / Importance
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
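As a minimal sketch of what "P(Y=1) as a function of X" means: the model passes a linear combination of the predictors through the logistic (sigmoid) function, which always returns a value between 0 and 1. The coefficients b0 and b1 below are made-up values for illustration only, not estimates from any data set.

```r
# Logistic (sigmoid) function: maps any real number to the interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
b0 <- -1.0   # intercept
b1 <- 0.5    # slope for a single predictor x

x <- c(-2, 0, 2, 4)
p <- logistic(b0 + b1 * x)   # P(Y = 1 | x) under these assumed coefficients
print(round(p, 3))

# Base R's plogis() computes the same function
stopifnot(isTRUE(all.equal(p, plogis(b0 + b1 * x))))
```

Larger values of b0 + b1 * x give probabilities closer to 1, and smaller values give probabilities closer to 0.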
PROBLEM : BIRD DATA SET
Source Code
# Read the data, treating blank or single-space cells as NA.
# (By default only the string "NA" is read as missing in text columns.)
# Use file.choose() in place of the path to pick the file interactively.
bird <- read.csv('C:/sk/bird.csv', header = TRUE, na.strings = c("", " ", "NA"))
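# Check how type was read, then store it as a factor (categorical variable)
class(bird$type)
bird$type <- as.factor(bird$type)
# Boxplot of every column to inspect value ranges and spot outliers
boxplot(bird)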
# Splitting: assign each row to partition 1 or 2 with probabilities 0.8 and 0.2
set.seed(1234)
pd <- sample(2, nrow(bird), replace = TRUE, prob = c(0.8, 0.2))
trainingset <- bird[pd == 1, ]    # first partition (training)
validationset <- bird[pd == 2, ]  # second partition (validation)
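# Model fitting: logistic regression of type on all other columns
model1 <- glm(type ~ ., family = binomial(link = 'logit'), data = trainingset)
summary(model1)
# Predicted probabilities for the validation set
pred <- predict(model1, newdata = validationset, type = 'response')
# Classify as 1 when the predicted probability is at least 0.75
pred_status <- ifelse(pred >= 0.75, 1, 0)
pred_status
# Confusion matrix: predicted class (rows) against actual type (columns)
cf1 <- table(pred_status, validationset$type)
cf1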
Output
By default, blank cells are not changed to NA when the file is read, so read.csv is called with na.strings = c(""," ","NA"). The na.strings argument lists the strings in the body of the file that should be replaced with NA.
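The effect of na.strings can be checked on a small example. The sketch below reads a made-up, in-memory CSV (not the bird file) twice, once with the defaults and once with the blank and single-space strings listed as missing.

```r
# A tiny in-memory CSV with one blank cell and one single-space cell (made-up data)
csv_text <- "id,type\n1,SW\n2,\n3, \n4,W"

# Default settings: only the literal string "NA" is treated as missing
raw <- read.csv(text = csv_text, strip.white = FALSE)

# Same data, with blank and single-space cells also treated as missing
clean <- read.csv(text = csv_text, strip.white = FALSE,
                  na.strings = c("", " ", "NA"))

sum(is.na(raw$type))    # missing values found with the defaults
sum(is.na(clean$type))  # missing values found with na.strings set
```

Comparing the two counts shows how many cells would otherwise have been kept as empty text rather than NA.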
Convert the categorical variable to a factor.
A factor is how R represents categorical variables. Before R 4.0.0, read.table() and read.csv() encoded text columns as factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so text columns are read as character and type is converted explicitly with as.factor().
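A short sketch of the conversion, using made-up labels in place of the real type column:

```r
# Made-up labels standing in for a categorical column such as type
type_chr <- c("SW", "W", "SW", "T", "W")

class(type_chr)                   # stored as plain text
type_fac <- as.factor(type_chr)
class(type_fac)                   # now a factor
levels(type_fac)                  # the distinct categories
table(type_fac)                   # number of rows in each category
```

Modelling functions such as glm() use the factor levels to decide how the response is coded, so it is worth checking levels() before fitting.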
The data set contains 400 observations in total.
The values in every column lie between 0 and 100. The boxplot shows outliers in huml, humw, ulnal, ulnaw, feml, femw, tibl, tibw, tarl and tarw.
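The same checks can be done numerically rather than by eye. The sketch below uses made-up measurements, not values from the bird data. boxplot.stats() applies the rule that boxplot() uses by default, flagging points more than 1.5 times the hinge spread beyond the hinges.

```r
# Made-up measurements; the last value is deliberately extreme
x <- c(10, 12, 11, 13, 12, 14, 11, 95)

range(x)                           # smallest and largest value

# Points that a default boxplot would draw as outliers
outliers <- boxplot.stats(x)$out
outliers
```

Applied to one column at a time (for example bird$huml), this would list the values that appear as isolated points in the boxplot.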
We split the data into two parts: a training set and a validation set. The training set is used to fit the model, which is then tested on the validation set.
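One common way to summarise the test on the validation set is the accuracy computed from a confusion matrix such as cf1. The sketch below uses made-up actual and predicted labels, so the numbers say nothing about the bird model itself.

```r
# Made-up actual and predicted labels (1 = positive class, 0 = negative class)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: rows are predicted classes, columns are actual classes
cf <- table(predicted, actual)
cf

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy <- sum(diag(cf)) / sum(cf)
accuracy
```

This calculation assumes the table is square, with the same classes in the same order on both axes. If the model never predicts one of the classes, the table loses a row and diag() no longer picks out the correct predictions.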