
 COMPUTER SCIENCE SPACE

DATA PREPARATION - CHI SQUARE TEST

The chi-squared test is a statistical hypothesis test whose null hypothesis is that the observed frequencies for a categorical variable match the expected frequencies for that variable. The test calculates a statistic that follows a chi-squared distribution, named for the Greek capital letter Chi (Χ), pronounced “ki” as in kite.
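
As a rough sketch of how such a test can be run in Python, scipy's chi2_contingency can be used; the gender-by-education contingency table below is entirely made up for illustration.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = education level
observed = np.array([[60, 54, 46],
                     [40, 44, 53]])

# chi2_contingency returns the statistic, p-value, degrees of freedom and
# the expected frequencies under the null hypothesis of independence
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p-value = {p:.3f}, dof = {dof}")

# A small p-value (e.g. < 0.05) suggests the observed frequencies differ from
# the expected ones, i.e. the two variables are unlikely to be independent.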

Examples :

1 - Is gender independent of education level? > 

2 - Are student choices dependent on grades? >

3 - Does gender affect preferred holiday? >

DATA PREPARATION - CORRELATION COEFFICIENT

A correlation is a relationship between two variables. 

Typically, we take x to be the independent variable and y to be the dependent variable, so the data are represented by a collection of ordered pairs (x, y). 

Mathematically, the strength and direction of a linear relationship between two variables is represented by the correlation coefficient. Suppose that there are n ordered pairs (x, y) that make up a sample from a population.
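
As an illustrative sketch, the sample Pearson correlation coefficient r can be computed directly from its usual formula; the small x and y arrays here are invented purely for the example.

import numpy as np

# Invented sample of n ordered pairs (x, y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Pearson's r:
# r = (n*sum(xy) - sum(x)*sum(y)) /
#     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))

print(f"r = {r:.4f}")
print(f"numpy check: {np.corrcoef(x, y)[0, 1]:.4f}")  # should agree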

Example :

DATA PREPARATION - CORRELATION COEFFICIENT >

EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis-testing task. 

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. 
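
A minimal first-pass EDA sketch with pandas and matplotlib; "data.csv" is a placeholder file name, not one of the linked data sets.

import pandas as pd
import matplotlib.pyplot as plt

# "data.csv" is a placeholder; substitute any data set of interest
df = pd.read_csv("data.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # column types
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary statistics for numeric columns

# Quick graphical summaries: a histogram for each numeric column
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()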

Examples :

1 - EDA - MEDICAL DATA >

2 - EDA - BIRDS DATA >

DECISION TREE

The decision tree is one of the most popular and powerful tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
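
For instance, a minimal decision-tree classifier can be sketched with scikit-learn; the bundled iris data is used here as a stand-in for the linked example data sets.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Each internal node tests one attribute; each leaf holds a class label
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=iris.feature_names))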

Examples :

1 - DECISION TREE >

2 - DECISION TREE >

NAIVE BAYES

Naive Bayes is a probabilistic technique for constructing classifiers. 

The characteristic assumption of the naive Bayes classifier is to consider that the value of a particular feature is independent of the value of any other feature, given the class variable.

Despite the oversimplified assumption mentioned previously, naive Bayes classifiers often give good results in complex real-world situations. 

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification and that the classifier can be trained incrementally.
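
A minimal Gaussian naive Bayes sketch with scikit-learn, again using the bundled iris data as a stand-in for the linked data sets.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian NB treats each feature as conditionally independent given the class
model = GaussianNB()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Incremental training on further batches is possible via model.partial_fit(...)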

Examples :

1 - Diabetes Dataset >

2 - Iris Dataset >

PRINCIPAL COMPONENT ANALYSIS

Principal component analysis, proposed by Hotelling (1933), is one of the most familiar methods of multivariate analysis; it uses the spectral decomposition of a correlation or covariance matrix.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
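
A minimal PCA sketch with scikit-learn; the features are standardised first, which corresponds to working with the correlation matrix rather than the raw covariance matrix.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardising first means PCA works on the correlation matrix of the data
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)   # principal-component scores

print("explained variance ratio:", pca.explained_variance_ratio_)
print("first five component scores:\n", scores[:5])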

Examples :

1 - Iris Data Set >

2 - Diabetes Data Set >

MODEL VALIDATION

Model validation is the process of evaluating a trained model on a test data set; this gives an estimate of the generalisation ability of the trained model. 

There are many ways to obtain the training and test data sets for model validation, for example (a k-fold sketch follows this list):

  • 3-way holdout method of getting training, validation and test data sets.

  • k-fold cross-validation with independent test data set.

  • Leave-one-out cross-validation with independent test data set.
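
A minimal sketch of k-fold cross-validation combined with an independent test set, using scikit-learn; the logistic-regression model and iris data are only placeholders.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out an independent test set, then cross-validate on the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("cross-validation accuracy per fold:",
      cross_val_score(model, X_train, y_train, cv=cv))

# Final check of generalisation on the untouched test set
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))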

Examples :

1 - Model Validation >

2 - K-Fold Cross Validation >

3 - Repeated K-Fold Cross Validation >

LOGISTIC REGRESSION

Logistic regression is a statistical method used to predict a categorical outcome based on prior observations of a data set. Logistic regression has become an important tool in the discipline of machine learning. The approach allows an algorithm used in a machine learning application to classify incoming data based on historical data. As more relevant data comes in, the algorithm should get better at predicting classifications within data sets. Logistic regression can also play a role in data preparation by allowing data sets to be put into predefined buckets during the extract, transform, load (ETL) process in order to stage the information for analysis.
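
A minimal logistic-regression sketch with scikit-learn; the bundled breast-cancer data set is used only as a convenient binary-classification stand-in for the linked examples.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling helps the solver converge; the fitted model outputs class probabilities
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("class probabilities for the first test sample:", model.predict_proba(X_test[:1]))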

Examples :

1 - Bank Loan >

2 - Bird Data Set >

TIME SERIES

Time series analysis allows both descriptive and predictive analytics. Many industries, mine included, have very noisy time-based datasets and many dashboards filled with time series data. Being able to separate trend, seasonality and error, and then predict where we will be in x units of time, is very powerful from a decision-making point of view.

Time-series analysis is a very important concept in data science.

It is carried out in two domains, the frequency domain and the time domain.
Both play a vital role in intensive computational analysis and in optimisation.
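
A minimal sketch of separating trend, seasonality and residual in the time domain with statsmodels; the monthly series below is synthetic, generated just for illustration.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: linear trend + yearly seasonality + noise
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=60, freq="MS")
values = (np.linspace(100, 160, 60)
          + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
          + rng.normal(0, 2, 60))
series = pd.Series(values, index=index)

# Separate the series into trend, seasonal and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))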

Examples :

1 - TIME SERIES : Jobs >

CLUSTERING

Clustering is an unsupervised learning method, that is, a method in which we draw inferences from datasets consisting of input data without labelled responses. Generally, it is used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
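
A minimal k-means clustering sketch with scikit-learn, using the bundled iris features and deliberately ignoring the labels, since clustering is unsupervised.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)   # labels are ignored: unsupervised setting
X_std = StandardScaler().fit_transform(X)

# Group the observations into three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_std)

print("cluster sizes:", [list(labels).count(c) for c in range(3)])
print("cluster centres (standardised units):\n", kmeans.cluster_centers_)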

Examples :

1 - Clustering - Birds Data Set >

2 - Clustering - Iris Data Set >

ASSOCIATION

Association rule mining finds interesting associations and relationships among large sets of data items. A rule shows how frequently an item set occurs in a transaction. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
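
A minimal association-rule sketch, assuming the third-party mlxtend package is installed; the toy baskets below are invented and are not the linked Groceries data.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Invented transactions: each row is one shopping basket
transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter", "eggs"],
]

# One-hot encode the baskets into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent item sets and the association rules derived from them
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])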

Examples :

1 - Association - Groceries Data Set
