DATA PREPARATION IN R - CHI SQUARE TEST - Does Gender affect Preferred Holiday?

DATA PREPARATION


Description of Chi Square Test

The chi-squared test is a statistical hypothesis test whose null hypothesis is that the observed frequencies for a categorical variable match the expected frequencies for that variable. The test calculates a statistic that, under the null hypothesis, approximately follows a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.


Formula

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

  • A chi-square goodness of fit test determines whether sample data match a population. For more details on this type, see: Goodness of Fit Test.

  • A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.

    • A very small chi-square test statistic means that your observed data fit the expected data (the counts you would expect if the variables were independent) extremely well. In other words, there is no evidence of a relationship.

    • A very large chi-square test statistic means that the observed data do not fit the expected data well. In other words, there is evidence of a relationship.
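The statistic behind the test for independence can be written out as a sketch in the usual textbook notation. The symbols are assumptions not used elsewhere in this post: O is an observed count, E an expected count, r and c the number of rows and columns, and N the grand total.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
```

Each cell contributes its squared gap between observed and expected count, scaled by the expected count, so large gaps push the statistic up.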



Role / Importance

The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

The test procedure described in this lesson is appropriate when the following conditions are met:

  • The sampling method is simple random sampling.

  • The variables under study are each categorical.

  • If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.
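The expected-count condition can be checked before running the test. The following is a minimal sketch using the holiday counts from the problem in this post; the object names `holiday`, `expected` and `all_ok` are illustrative.

```r
# Observed counts, filled column by column: Beach (Men, Women), then Cruise
holiday <- matrix(c(209, 225, 280, 248), nrow = 2,
                  dimnames = list(Gender = c("Men", "Women"),
                                  Holiday = c("Beach", "Cruise")))

# chisq.test() returns the expected counts under independence in $expected
expected <- chisq.test(holiday)$expected

# TRUE when every cell has an expected count of at least 5
all_ok <- all(expected >= 5)

print(round(expected, 2))
print(all_ok)
```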

PROBLEM

"Which holiday do you prefer?"

 

          Beach   Cruise
Men       209     280
Women     225     248

Does Gender affect Preferred Holiday?

The significance level is 0.05.


Hypotheses

H0: Gender and preferred holiday are independent

H1: Gender and preferred holiday are dependent



Source Code
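A minimal sketch of the analysis, assuming the counts are entered as a 2 x 2 matrix (the names `holiday` and `result` are illustrative). With its default settings, chisq.test() applies Yates' continuity correction to 2 x 2 tables, which is consistent with the p-value of 0.1499 reported in the conclusion.

```r
# Observed counts, filled column by column: Beach (Men, Women), then Cruise
holiday <- matrix(c(209, 225, 280, 248), nrow = 2)

# Label rows and columns so the table is easier to read
dimnames(holiday) <- list(Gender = c("Men", "Women"),
                          Holiday = c("Beach", "Cruise"))

# Chi-square test of independence (Yates' correction is on by default for 2 x 2)
result <- chisq.test(holiday)

print(holiday)
print(result)
```

The individual pieces are available as `result$statistic`, `result$parameter` (the degrees of freedom) and `result$p.value`.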

Output

Interpretation of Result

P-value (probability value)

The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct.

The p-value is used as an alternative to rejection points (critical values): it gives the smallest level of significance at which the null hypothesis would be rejected.
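As a sketch of how a p-value follows from a chi-square statistic, the upper-tail probability can be computed with pchisq(). The statistic 2.073 is an illustrative value, assumed here to be roughly what the corrected test gives for the holiday table; the variable names are also illustrative.

```r
x_squared <- 2.073   # assumed test statistic, for illustration only
df <- 1              # degrees of freedom for a 2 x 2 table
alpha <- 0.05        # significance level used in this post

# Probability of a chi-squared value at least this large under H0
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# H0 is rejected only when the p-value falls below alpha
reject_h0 <- p_value < alpha

print(p_value)
print(reject_h0)
```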

 

df (degrees of freedom)
The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. For a contingency table with r rows and c columns, df = (r - 1)(c - 1), so the 2 x 2 table in this problem has 1 degree of freedom.
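To illustrate what "free to vary" means for the holiday table, the sketch below fixes the row and column totals and shows that choosing a single cell determines the other three. The totals are computed from the counts in the problem; all variable names are illustrative.

```r
# Marginal totals of the holiday table
row_totals <- c(Men = 489, Women = 473)
col_totals <- c(Beach = 434, Cruise = 528)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (length(row_totals) - 1) * (length(col_totals) - 1)

# Once one cell is chosen, the totals force the remaining cells
men_beach    <- 209
men_cruise   <- row_totals[["Men"]] - men_beach
women_beach  <- col_totals[["Beach"]] - men_beach
women_cruise <- row_totals[["Women"]] - women_beach

print(df)
print(c(men_cruise, women_beach, women_cruise))
```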

  

X-squared value (chi-square value)
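The X-squared value is the test statistic itself. As a rough sketch of where it comes from, it can be computed by hand from the observed and expected counts, with and without Yates' continuity correction (the correction that chisq.test() applies by default to 2 x 2 tables). Variable names are illustrative.

```r
# Observed counts, filled column by column: Beach (Men, Women), then Cruise
observed <- matrix(c(209, 225, 280, 248), nrow = 2)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2_plain <- sum((observed - expected)^2 / expected)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

print(c(plain = x2_plain, yates = x2_yates))
```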

Dimnames

The dimnames() function sets or queries the row and column names of a matrix.
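A short sketch of both uses, setting and querying, on the holiday counts (the names `holiday`, `labels` and `women_cruise` are illustrative):

```r
# Matrix without labels, filled column by column
holiday <- matrix(c(209, 225, 280, 248), nrow = 2)

# Set: first list element names the rows, second names the columns
dimnames(holiday) <- list(c("Men", "Women"), c("Beach", "Cruise"))

# Query: returns the list of row names and column names
labels <- dimnames(holiday)

# Labelled cells can then be accessed by name
women_cruise <- holiday["Women", "Cruise"]

print(labels)
print(women_cruise)
```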


Conclusion

p-value = 0.1499

The p-value (0.1499) is greater than the significance level of 0.05, so the test gives no significant evidence that preferred holiday depends on gender. The two variables (Gender and Preferred Holiday) cannot be shown to depend on each other.

In other words, the data collected above do not show that men and women differ in their preference for beach holidays or cruises.

Hence, we fail to reject the null hypothesis that gender and preferred holiday are independent.

