DATA PREPARATION - CHI SQUARE TEST
What Is the Chi-Square Test?
The chi-square (chi-squared) test is a statistical hypothesis test whose null hypothesis is that the observed frequencies of a categorical variable match the expected frequencies for that variable. The test calculates a statistic that, under the null hypothesis, approximately follows a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.
FORMULA
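The Pearson chi-square statistic is conventionally written as below. The symbols are standard textbook notation and are an assumption here: $O_i$ is the observed count in cell $i$, $E_i$ is the count expected under the null hypothesis, and, for a test of independence, the expected count of the cell in row $r$ and column $c$ is built from the row total $R_r$, the column total $C_c$ and the grand total $N$.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i},
\qquad
E_{rc} = \frac{R_r \, C_c}{N}
```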
Types of Chi-Square Test
There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
- A chi-square goodness of fit test determines whether sample data match a population distribution.
- A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.
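- A very small chi-square test statistic means that the observed data fit the expected data extremely well, so there is no evidence against the null hypothesis. In a test for independence, this means no evidence of a relationship between the variables.
- A very large chi-square test statistic means that the observed data do not fit the expected data well, which is evidence against the null hypothesis. In a test for independence, this points to a relationship between the variables.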
Role and Importance of Chi Square Test
The chi-square test for independence is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables. The test procedure described in this lesson is appropriate when the following conditions are met:
- The sampling method is simple random sampling.
- The variables under study are each categorical.
- If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.
PROBLEM
Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:
Gender | High School | Bachelors | Masters | Ph.D. | Total
---|---|---|---|---|---
Female | 60 | 54 | 46 | 41 | 201
Male | 40 | 44 | 53 | 57 | 194
Total | 100 | 98 | 99 | 98 | 395
Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?
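Before running the test, it can help to check the expected-count condition on this table. The following is a minimal sketch in plain Python (standard library only); the variable names are illustrative. It computes the count expected in each cell if gender and education level were independent, then checks that every expected count is at least 5.

```python
# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Rule of thumb from the conditions above: every expected count is at least 5.
min_expected = min(min(row) for row in expected)
condition_met = min_expected >= 5

for label, row in zip(["Female", "Male"], expected):
    print(label, [round(e, 2) for e in row])
print("smallest expected count:", round(min_expected, 2))
print("condition met:", condition_met)
```

Every expected count here is well above 5, so the chi-square approximation is reasonable for this table.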
HYPOTHESES
H0: Gender is independent of education level
H1: Gender is not independent of education level
Interpretation of Result
P-value (probability value)
The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct.
The p-value is used as an alternative to rejection points: it gives the smallest level of significance at which the null hypothesis would be rejected.
df (degrees of freedom)
For a test of independence on a contingency table, the degrees of freedom are (number of rows - 1) × (number of columns - 1). The survey table has 2 rows and 4 columns, so df = (2 - 1) × (4 - 1) = 3.
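X-squared value (chi-square value)
The X-squared value is the chi-square test statistic: the sum, over all cells of the table, of the squared difference between the observed and expected counts, divided by the expected count.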
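The statistic and its degrees of freedom for the survey table can be reproduced by hand. This is a minimal sketch in plain Python, assuming the observed counts from the problem table; the function name `chi_square_statistic` is hypothetical and not part of any library.

```python
# Observed counts (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp

    # Degrees of freedom: (rows - 1) * (columns - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


statistic, df = chi_square_statistic(observed)
print(round(statistic, 3), df)
```

For this table the statistic comes out at roughly 8.006 with 3 degrees of freedom.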
Dimnames
In R, the dimnames() function sets or queries the row and column names of a matrix.
Conclusion
p-value = 0.04589
The p-value (0.04589) is less than the significance level of 0.05, so we reject the null hypothesis that gender and education level are independent.
The data collected above therefore indicate a statistically significant relationship between the gender of an individual and the level of education that they have obtained.
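The decision above can be checked end to end with a short script. This is a sketch in plain Python under stated assumptions: the helper names `chi2_sf` and `independence_test` are hypothetical, and the p-value is computed from the closed-form upper tail of the chi-squared distribution for integer degrees of freedom rather than from a statistics library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        return math.exp(-half) * sum(
            half ** k / math.factorial(k) for k in range(df // 2)
        )
    # Odd df: erfc term plus a finite sum over half-integer powers of x/2.
    tail = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def independence_test(table, alpha=0.05):
    """Return (statistic, df, p_value, reject_h0) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    p_value = chi2_sf(statistic, df)
    return statistic, df, p_value, p_value < alpha


# Survey counts (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

statistic, df, p_value, reject_h0 = independence_test(observed)
print(f"X-squared = {statistic:.4f}, df = {df}, p-value = {p_value:.5f}")
print("reject H0 (independence):", reject_h0)
```

The p-value lands just under 0.05, which is why the null hypothesis of independence is rejected at the 5% level but would not be at, say, the 1% level.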