DATA PREPARATION - CHI SQUARE TEST
What is the Chi Square Test?
The Chi-Squared test is a statistical hypothesis test whose null hypothesis is that the observed frequencies for a categorical variable match the expected frequencies for that variable. The test calculates a statistic that, under the null hypothesis, approximately follows a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.
FORMULA
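The formula itself is not shown under this heading. For reference, the statistic usually meant here is Pearson's chi-square statistic, written below in LaTeX. The symbols are notation assumed for this sketch: $O_i$ is the observed count in cell $i$, $E_i$ is the expected count in cell $i$, $k$ is the number of cells, and for a contingency table $R_i$, $C_j$ and $N$ are the row total, column total and grand total.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
E_{ij} = \frac{R_i \, C_j}{N} \quad \text{(expected count under independence)}
```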
Types of Chi Square Test
There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
- A chi-square goodness of fit test determines whether sample data match a population distribution. For more details on this type, see: Goodness of Fit Test.
- A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.
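The size of the test statistic shows how well the observed data fit the expected data:
- A very small chi square test statistic means that your observed data fit your expected data extremely well. In a test for independence, where the expected counts are computed assuming no relationship, this means there is no evidence of a relationship between the variables.
- A very large chi square test statistic means that the data do not fit the expected counts very well. In a test for independence, this is evidence that the variables are related.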
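To make the "small versus large statistic" idea concrete, here is a minimal Python sketch (standard library only). The function name and the counts are hypothetical and chosen purely for illustration; they do not come from the post.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 100 observations over four equally likely categories.
expected = [25, 25, 25, 25]
close_fit = [24, 26, 25, 25]   # observed counts close to the expected ones
poor_fit = [40, 10, 40, 10]    # observed counts far from the expected ones

print(chi_square_statistic(close_fit, expected))  # small statistic
print(chi_square_statistic(poor_fit, expected))   # large statistic
```

The first call gives a value near zero because the observed counts almost match the expected ones; the second gives a much larger value because every category deviates by 15.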
Role and Importance of Chi Square Test
The chi-square test for independence is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables. The test procedure described in this lesson is appropriate when the following conditions are met:
- The sampling method is simple random sampling.
- The variables under study are each categorical.
- If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.
PROBLEM
In the dataset "Popular Kids," students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A two-way table separating the students by grade and by choice of most important factor is shown below:
Goals | Grade 4 | Grade 5 | Grade 6 | Total |
---|---|---|---|---|
Grades | 49 | 50 | 69 | 168 |
Popular | 24 | 36 | 38 | 98 |
Sports | 19 | 22 | 28 | 69 |
Total | 92 | 108 | 135 | 335 |
Question: Investigate whether the students' choices depend on their grade, using a significance level of 0.05.
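Before running the test, it is worth checking the condition that every expected cell count is at least 5. The following is a minimal Python sketch (standard library only) that computes the expected counts under independence as row total × column total / grand total. The function and variable names are hypothetical; only the observed counts come from the table above.

```python
# Observed counts from the table: rows are goals, columns are grades 4, 5, 6.
observed = {
    "Grades":  [49, 50, 69],
    "Popular": [24, 36, 38],
    "Sports":  [19, 22, 28],
}

def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    return {
        name: [rt * ct / grand_total for ct in col_totals]
        for name, rt in zip(table, row_totals)
    }

expected = expected_counts(observed)
for goal, counts in expected.items():
    print(goal, [round(c, 2) for c in counts])

# The smallest expected count decides whether the "at least 5" condition holds.
smallest = min(min(counts) for counts in expected.values())
print("All expected counts >= 5:", smallest >= 5)
```

For this table the smallest expected count is about 18.9 (Sports, grade 4), so the condition is comfortably met.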
HYPOTHESES
H0: Student choices are independent of grade
H1: Student choices are dependent on grade
Interpretation of Result
P-value (probability value)
The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct.
The p-value is used as an alternative to rejection points: it is the smallest level of significance at which the null hypothesis would be rejected.
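As an illustration of the p-value as an upper-tail probability, the Python sketch below (standard library only) evaluates the chi-square survival function using a closed form that holds only for even degrees of freedom. The function name is hypothetical, and the inputs are assumptions of this example: a statistic of about 1.5126 (computed from the Popular Kids table, not quoted in the post) and 4 degrees of freedom.

```python
import math

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square variable with even df.

    For even df the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = statistic / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Assumed inputs: statistic of about 1.5126 with 4 degrees of freedom.
p = chi_square_p_value_even_df(1.5126, 4)
print(round(p, 4))  # 0.8244
```

For odd degrees of freedom this shortcut does not apply, and a statistics library would be the practical choice.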
df (degrees of freedom)
The degrees of freedom are the number of cell counts in the table that are free to vary once the row and column totals are fixed. For a contingency table with r rows and c columns, df = (r - 1)(c - 1); for the 3 x 3 table above, df = (3 - 1)(3 - 1) = 4.
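For a test of independence on a table with r rows and c columns, the degrees of freedom are (r - 1)(c - 1). A minimal Python sketch, with a hypothetical function name:

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)

# The Popular Kids table has 3 goals (rows) and 3 grades (columns); totals are excluded.
print(degrees_of_freedom(3, 3))  # 4
```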
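X-squared value (chi-square value)
The X-squared value is the chi-square test statistic: the sum, over all cells of the table, of the squared difference between the observed and expected counts divided by the expected count.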
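The statistic for the Popular Kids table can be reproduced by hand. The Python sketch below (standard library only, hypothetical function name) adds up (observed - expected)^2 / expected over every cell, with expected counts taken as row total × column total / grand total.

```python
# Observed counts from the Popular Kids table (rows: goals; columns: grades 4, 5, 6).
observed = [
    [49, 50, 69],  # Grades
    [24, 36, 38],  # Popular
    [19, 22, 28],  # Sports
]

def chi_square_independence_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected_count = row_total * col_total / grand_total
            statistic += (obs - expected_count) ** 2 / expected_count
    return statistic

statistic = chi_square_independence_statistic(observed)
print(round(statistic, 4))  # 1.5126
```

A statistic this small relative to 4 degrees of freedom is consistent with the p-value of 0.8244 reported in the conclusion.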
Dimnames
In R, the dimnames() function sets or queries the row and column names of a matrix.
Conclusion
p-value = 0.8244
The p-value (0.8244) is greater than the significance level of 0.05, so we fail to reject the null hypothesis of independence.
Therefore, the data provide no evidence that the students' choice of most important goal depends on their grade level; the variables (Goals and Grade) appear to be independent of each other.