DATA PREPARATION - CHI SQUARE TEST

What is the Chi Square Test?

The Chi-Squared test is a statistical hypothesis test whose null hypothesis is that the observed frequencies for a categorical variable match the expected frequencies for that variable. The test calculates a statistic that follows a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.

FORMULA
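
The formula itself is not reproduced in this post. The standard Pearson chi-square statistic, which matches the description above, can be written as follows, where O_i is the observed frequency in category (or table cell) i and E_i is the corresponding expected frequency. The symbols O_i and E_i are notation chosen for this sketch.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```

For a test of independence on a contingency table, the expected frequency of a cell is (row total x column total) / grand total.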

Types of Chi Square Test

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

A chi-square goodness of fit test determines whether sample data match a population. For more details on this type, see: Goodness of Fit Test.
A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.
- A very small chi square test statistic means that your observed data fit your expected data extremely well. In a test for independence, the expected data are the counts implied by independence, so a small statistic gives no evidence of a relationship between the variables.
- A very large chi square test statistic means that the data do not fit the expected counts well. In a test for independence, this is evidence that the variables are related.
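
To make the link between the size of the statistic and the quality of the fit concrete, here is a minimal Python sketch. The function name `chi_square_statistic` and both 2 x 2 tables are hypothetical, made up purely for illustration. Expected counts are derived from the row and column totals, as in a test for independence.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells of a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count for this cell, based on the row and column totals.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical tables, not taken from the survey in this post.
matches_expected = [[25, 25], [25, 25]]   # observed counts equal expected counts
far_from_expected = [[45, 5], [5, 45]]    # observed counts far from expected counts

small_statistic = chi_square_statistic(matches_expected)
large_statistic = chi_square_statistic(far_from_expected)
print("close fit:", small_statistic)
print("poor fit:", large_statistic)
```

The first table yields a statistic of 0 because every cell equals its expected count, while the second yields a much larger value.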

Role and Importance of Chi Square Test

The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables. The test procedure described in this lesson is appropriate when the following conditions are met:

The sampling method is simple random sampling.
The variables under study are each categorical.
If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.

PROBLEM

Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey results are summarized in the following table:

	High School	Bachelors	Masters	Ph.D.	Total
Female	60	54	46	41	201
Male	40	44	53	57	194
Total	100	98	99	98	395

Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?
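
Before running the test, it can help to check the condition that every expected cell count is at least 5. The following Python sketch computes the expected counts under independence from the survey table. The names `observed`, `levels`, and `expected_counts` are assumptions of this sketch.

```python
# Observed counts from the survey table, one list per gender.
observed = {
    "Female": [60, 54, 46, 41],
    "Male": [40, 44, 53, 57],
}
levels = ["High School", "Bachelors", "Masters", "Ph.D."]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    return {
        name: [row_total * col_total / grand_total for col_total in col_totals]
        for name, row_total in zip(table, row_totals)
    }


expected = expected_counts(observed)
for name, counts in expected.items():
    for level, count in zip(levels, counts):
        print(f"{name:6s} {level:12s} {count:6.2f}")

# The chi-square approximation is usually considered adequate when this is >= 5.
smallest = min(min(counts) for counts in expected.values())
print("smallest expected count:", round(smallest, 2))
```

Every expected count here is well above 5, so the condition is satisfied for this table.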

HYPOTHESES

H0: Gender and education level are independent
H1: Gender and education level are not independent

SOURCE CODE
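
The original listing is not reproduced in this post. As a stand-in, here is a minimal Python sketch of the test of independence computed by hand from the survey table. The choice of language, the function name `chi_square_independence`, and the variable names are assumptions of this sketch, not the original code. The critical value 7.815 is the standard chi-square table entry for 3 degrees of freedom at the 5% level.

```python
# Observed counts from the survey table (rows: Female, Male;
# columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


statistic, df = chi_square_independence(observed)

# Standard table value for df = 3 at the 5% significance level.
CRITICAL_VALUE_5PCT_DF3 = 7.815
reject_null = statistic > CRITICAL_VALUE_5PCT_DF3

print(f"chi-square statistic = {statistic:.4f}")
print(f"degrees of freedom   = {df}")
print(f"reject H0 at 5%      = {reject_null}")
```

The statistic comes out slightly above the critical value, which agrees with the p-value of 0.04589 being just under 0.05.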

OUTPUT

Interpretation of Result

P-value (probability value)

The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct.
The p-value is used as an alternative to rejection points: it is the smallest level of significance at which the null hypothesis would be rejected.
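
As an illustration, the p-value for this problem can be reproduced with the Python standard library alone. The helper `chi2_upper_tail` is a hypothetical name. It relies on the closed-form expression for the chi-square upper tail that holds for whole-number degrees of freedom. The inputs are the statistic computed from the survey table (about 8.0061) and its 3 degrees of freedom.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-like terms.
        total = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum.
    total = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


statistic = 8.0061   # chi-square statistic for the survey table (rounded)
df = 3               # (2 - 1) * (4 - 1)
alpha = 0.05         # significance level used in this post

p_value = chi2_upper_tail(statistic, df)
print(f"p-value = {p_value:.5f}")        # roughly 0.0459
print("reject H0:", p_value < alpha)
```

If SciPy is available, `scipy.stats.chi2.sf(statistic, df)` should give the same upper-tail probability.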

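df (degrees of freedom)

For a contingency table, df = (number of rows - 1) x (number of columns - 1). The 2 x 4 table above gives df = (2 - 1) x (4 - 1) = 3.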

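X-squared value (chi-square value)

This is the test statistic: the sum over all cells of (observed - expected)^2 / expected. For the table above it is about 8.006.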

Dimnames

In R, the dimnames() function sets or queries the row and column names of a matrix.

Conclusion

p-value = 0.04589
The p-value (0.04589) is less than the significance level of 0.05, so we reject the null hypothesis: the variables (Gender & Degrees) are not independent of each other.
Therefore, the data collected above indicate a relationship between the gender of an individual and the level of education that they have obtained.

Computer Science Space
