2012 General Social Survey 4 variable report

Statistics for Social Research SOC310/SWK310

4. Two Categorical Variables: abany and advfront.

DUE DATE: This exam is due on Thursday, December 18, by 5:00 p.m. Remember, exams will not be accepted late, unless you are struck by an incapacitating illness or injury. Even then, I will want to see evidence of just how far you got on the exam before you were stricken.

OUTSIDE HELP: The exam is open book, open notes, and has no time limit other than the due date. You may discuss the exam with anyone before you begin the computer analysis. Once you have started, you may only ask questions of the instructor or the TA.

GENERAL INSTRUCTIONS: For this exam you will be working with four or more variables from the 2012 General Social Survey. A portion of the data from this survey is available on the class web page in a file named gss12.sav. At least two of your variables must be categorical in nature (i.e., either nominal or ordinal), while at least two others must be interval or ratio.

QUESTIONS: As you work through this exam, you will be writing a report on the data and your findings. Such a report typically includes graphs, statistical tables, and text. The text is critical insofar as it highlights, interprets, and summarizes what the graphs and statistics reveal. Answer the following questions in order and in prose form. Report and interpret fully the values of relevant statistics in the text. Try to come as close to a real-world, nontechnical explanation of the meaning of each number as possible. And as always, do not forget to label your numbers wherever appropriate.

d. Test the hypothesis that there is a relationship between the two variables in the population from which the sample is drawn

by Jacob Stephens. Published October 27, 2014

Here is a description of the assignment, and my answers are below the assignment.

Fall 2014

Final Examination

As with the midterm, you will be fine provided you plan ahead carefully and leave yourself plenty of time to complete the exam. The computer analysis may take anywhere from fifteen minutes to two or three hours, while it will take several more hours to answer all the questions.

Before anything else, you should check your variables to make sure that values that should be treated as missing have indeed been defined as such in SPSS. You will then describe your variables and test the relationships between certain of them based on information obtained from the FREQUENCIES, CROSSTABS, TTEST, MEANS, SCATTERPLOT and REGRESSION commands.

In addition to your answers to the questions that follow, you will need to turn in all of the relevant output generated by SPSS. Be sure to get rid of any output that is not relevant to the analyses that you have conducted to answer the exam questions. Take care also to document which specific section(s) of your output you are referring to as you answer each question.

For starters:
1. What population does your sample represent?
2. How large is your sample?
3. How was the sample drawn?
For each of your variables:
1. What is the name of each variable in SPSS?
2. What does each variable aim to measure and how is that operationalized?
3. What level of measurement does each variable achieve? Explain.
For one interval/ratio level variable of your choosing:
1. Generate a frequency distribution and a bar chart or histogram for your chosen variable. If appropriate, describe the shape of the distribution.
2. Report and interpret the values for all of the measures of central tendency that are appropriate for your variable.
3. Report and interpret the values for all of the measures of dispersion that are appropriate for your variable.
For your two categorical variables:
1. State a research hypothesis concerning the relationship between the two categorical variables you have chosen. In doing so, specify theoretically which variable is the independent variable and which is the dependent variable, as well as why you expect there to be a relationship between the two of them.
2. Generate a cross-classification table for the two variables.
3. Describe the pattern and size of any observed relationship between your two variables by making the appropriate percentage comparisons.
4. Test the hypothesis that there is a relationship between the two variables in the population from which the sample is drawn.
5. Report and interpret an appropriate chi-square-based measure of association assessing the strength of the observed relationship. Explain why you chose the measure of association you did.
6. Report and interpret an appropriate PRE measure of association assessing the strength of the observed relationship. Explain why you chose the measure of association you did.
For one of your categorical variables and one of your interval/ratio level variables:
1. State a research hypothesis concerning the relationship between the two variables that you have chosen. Be sure to state clearly which is the independent variable, which is the dependent variable, and why you expect there to be a relationship between them.
  
  [NOTE: You have a couple of choices here: if the categorical variable has just two categories, or if you wish to focus on just two of its categories, then your research hypothesis can be of the sort expected in a standard difference of means test (t-test); if the categorical variable has more than two categories (and you wish to treat it that way), then your research hypothesis should be of the sort appropriate for an ANOVA test.]
2. Report the overall mean on the interval/ratio level variable as well as the means on that same variable for each of the subgroups defined by the categorical variable. What do those subgroup means suggest about whether there is a relationship between your two variables?
3. Test the hypothesis that there is a relationship between the two variables in the population from which the samples are drawn.
4. Assess the strength of this relationship using statistics appropriate to the sort of analysis you have conducted.
For your two interval/ratio level variables:
1. Briefly explain which variable is the independent variable, which variable is the dependent variable, and why you expect a relationship between the two.
2. Produce a scatterplot summarizing the relationship between the two variables. Does it look like there is a relationship there? If so, how would you describe what that relationship looks like?
3. Report the coefficients of the OLS regression line describing the relationship between the two variables. Explain what these coefficients tell you.
4. Calculate, by hand, predicted values of the dependent variable for two nonzero values of the independent variable.
5. Test the hypothesis that there is a relationship between the two variables in the population from which the sample is drawn.
6. Give and interpret the values of Pearson’s r and R-squared for this relationship.

a. U.S. Americans

b. 1974 people

c. multistage cluster sampling

a. What is the name of each variable in SPSS?

abany, advfront, adults, and age

b. What does each variable aim to measure?

· abany aims to measure whether the respondent thinks abortion should be legal if a woman wants one for any reason. abany is operationalized by asking respondents to answer “yes” or “no” in response to the question, “Is there any reason you think abortion should be legal?”

· advfront aims to measure how much people think science research is necessary and how much it should be funded by the government. advfront is operationalized by having respondents answer either “strongly agree”, “agree”, “disagree”, or “strongly disagree”, according to their corresponding opinion.

· adults aims to measure the number of adults (age 18 and older) living in the respondent’s household. adults is operationalized by having the respondent report how many people age 18 and older live in the household.

· age aims to measure the age of the respondent in years. age is operationalized by having the respondent enter a number for their age.

· abany achieves nominal measurement by having answers without associated magnitude: yes and no.

· advfront achieves ordinal measurement by having answers that measure magnitude, but the distance between these measurements is not numeric; there is no numeric difference between agree and disagree.

· adults achieves ratio measurement by having a meaningful zero value (no adults 18 years and above), and by having numeric differences between answers.

· age achieves ratio level measurement by having a meaningful zero value (zero means that there is no age, and the person is not yet one year old), and by having numeric differences between answers.

Here is a frequency distribution followed by a bar chart for the abany variable.

ABORTION IF WOMAN WANTS FOR ANY REASON
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	YES	554	28.1	44.4	44.4
	NO	694	35.2	55.6	100.0
	Total	1248	63.2	100.0
Missing	IAP	666	33.7
	DK	40	2.0
	NA	20	1.0
	Total	726	36.8
Total		1974	100.0

Statistics
ABORTION IF WOMAN WANTS FOR ANY REASON
N	Valid	1248
	Missing	726
Mode		2

The mode is a measure of central tendency computed by determining which response is most common. A mode of 2 (no) for abany means that the response “no” was given most frequently.
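The percentages and the mode in the frequency table can be reproduced from the raw counts. The following is a minimal sketch, assuming only the counts shown in the table above; the variable names are my own and are not SPSS output.

```python
# Counts copied from the abany frequency table above.
valid_counts = {"YES": 554, "NO": 694}
missing_counts = {"IAP": 666, "DK": 40, "NA": 20}

n_valid = sum(valid_counts.values())
n_total = n_valid + sum(missing_counts.values())

# "Percent" divides by all cases; "Valid Percent" divides by non-missing cases only.
percent = {k: 100 * v / n_total for k, v in valid_counts.items()}
valid_percent = {k: 100 * v / n_valid for k, v in valid_counts.items()}

# The mode is the category with the largest count.
mode = max(valid_counts, key=valid_counts.get)

for k in valid_counts:
    print(f"{k}: {percent[k]:.1f}% of all cases, {valid_percent[k]:.1f}% of valid cases")
print("Mode:", mode)
```

Because abany is nominal, the mode is the only measure of central tendency this calculation needs.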

Statistics
ABORTION IF WOMAN WANTS FOR ANY REASON
N	Valid	1248
	Missing	726

Minimum		1
Maximum		2
Percentiles	25	1.00
	50	2.00
	75	2.00

Percentile is a measure of dispersion. Here the 25th percentile is 1 (yes), found by locating the case 25% of the way up through a list of cases ordered from smallest to largest value. The 50th and 75th percentiles are found the same way. A 25th percentile of 1 (yes) means that at least 25% of the cases have a value of 1 (yes) or less. A 50th percentile of 2 (no) means that at least 50% of the cases have a value of 2 (no) or less.
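The percentile values can be checked against the cumulative frequencies. This is a rough sketch that assumes a simple rule (the smallest code whose cumulative share of valid cases reaches the requested percentage); SPSS may use a different interpolation rule, although with only two codes the result is the same.

```python
# Value codes and counts from the abany frequency table (1 = yes, 2 = no).
counts = {1: 554, 2: 694}
n_valid = sum(counts.values())


def percentile(p):
    """Return the smallest code whose cumulative share of valid cases reaches p percent."""
    cumulative = 0
    for code in sorted(counts):
        cumulative += counts[code]
        if 100 * cumulative / n_valid >= p:
            return code
    return max(counts)


quartiles = {p: percentile(p) for p in (25, 50, 75)}
print(quartiles)
```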

a. abany and advfront are my two categorical variables. abany is the independent variable, because a person’s view on whether abortion ought to be legal is more likely to affect their opinion of how important science research is than the other way around. The logic here is that someone who wants abortion to be legal will want better science in order to have smoother abortions, and so will think more money ought to be put into science research.

b. Crosstab between advfront and abany:

SCI RSCH IS NECESSARY AND SHOULD BE SUPPORTED BY FEDERAL GOVT * ABORTION IF WOMAN WANTS FOR ANY REASON Crosstabulation
			ABORTION IF WOMAN WANTS FOR ANY REASON		Total
			YES	NO
SCI RSCH IS NECESSARY AND SHOULD BE SUPPORTED BY FEDERAL GOVT	Strongly agree	Count	80	68	148
	Strongly agree	% within ABORTION IF WOMAN WANTS FOR ANY REASON	27.9%	20.7%	24.0%
	Agree	Count	175	215	390
	Agree	% within ABORTION IF WOMAN WANTS FOR ANY REASON	61.0%	65.3%	63.3%
	Disagree	Count	23	37	60
	Disagree	% within ABORTION IF WOMAN WANTS FOR ANY REASON	8.0%	11.2%	9.7%
	Strongly disagree	Count	9	9	18
	Strongly disagree	% within ABORTION IF WOMAN WANTS FOR ANY REASON	3.1%	2.7%	2.9%
Total		Count	287	329	616
Total		% within ABORTION IF WOMAN WANTS FOR ANY REASON	100.0%	100.0%	100.0%

c. 61.0 percent of respondents who want abortion legal for any reason agree that science research is necessary, while 65.3 percent of those who do not want it legal agree. This begins to lead me to reject my anticipated results, because a larger share of those who do not want abortion legal agree with supporting science. However, in the strongly agree row my hypothesis begins to look true again: the percentage of abany “yes” respondents who strongly agree (27.9%) is 7.2 percentage points higher than the percentage of “no” respondents who do (20.7%).
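The column percentages behind these comparisons can be recomputed from the counts in the crosstab. This is a small sketch using only the observed counts above; the dictionary layout is my own choice.

```python
# Observed counts from the crosstab: rows are advfront, columns are abany.
observed = {
    "Strongly agree": {"YES": 80, "NO": 68},
    "Agree": {"YES": 175, "NO": 215},
    "Disagree": {"YES": 23, "NO": 37},
    "Strongly disagree": {"YES": 9, "NO": 9},
}

# Percentage within each abany column (the independent variable).
col_totals = {col: sum(row[col] for row in observed.values()) for col in ("YES", "NO")}
col_pct = {
    label: {col: 100 * row[col] / col_totals[col] for col in row}
    for label, row in observed.items()
}

# Percentage-point gap between the YES and NO columns in the top row.
gap = col_pct["Strongly agree"]["YES"] - col_pct["Strongly agree"]["NO"]

for label, pcts in col_pct.items():
    print(f"{label}: YES {pcts['YES']:.1f}%, NO {pcts['NO']:.1f}%")
print(f"Strongly agree gap: {gap:.1f} percentage points")
```

Percentaging within the categories of the independent variable and comparing across them is what makes the YES and NO columns directly comparable despite their different sizes.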

I. Assumptions

a. L of M: 1 nominal (abany), 1 ordinal (advfront), both treated as categorical

b. Sampling: random

c. sample size: 616, 4 x 2 table

d. Population distributions: N/A

II. Hypotheses:

a. Ho: there is no relationship between abany and advfront in the population

b. Ha: there is a relationship between abany and advfront in the population

III. Test Statistic

a. chi-squared

b. df = 3

c. X^2 = 5.504

Chi-Square Tests
	Value	df	Asymp. Sig. (2-sided)
Pearson Chi-Square	5.504^a	3	.138
Likelihood Ratio	5.515	3	.138
Linear-by-Linear Association	3.177	1	.075
N of Valid Cases	616
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 8.39.

IV. p-value and interpretation

a. p = .138

b. If Ho were true, the probability of getting a chi-squared value at least as large as the one we got (5.504) is .138.

V. Conclusion

a. Given p = .138, we fail to reject the Ho and do not find support for Ha.
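The chi-squared value and its p-value can be recomputed by hand from the crosstab counts. The sketch below assumes the usual Pearson formula (observed minus expected, squared, over expected) and uses a closed-form upper-tail probability that holds only for 3 degrees of freedom; the function name chi2_sf_df3 is my own.

```python
import math

# Observed counts: rows are advfront (strongly agree .. strongly disagree),
# columns are abany (yes, no).
observed = [[80, 68], [175, 215], [23, 37], [9, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
min_expected = float("inf")
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were unrelated.
        expected = row_totals[i] * col_totals[j] / n
        min_expected = min(min_expected, expected)
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}, min expected = {min_expected:.2f}")
```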

e. Cramer’s V, a X^2-based measure of association. I chose this measure because it is based on X^2 and is suitable for tables bigger than 2×2: it corrects a problem that comes up with Phi in tables bigger than 2×2. Here we observe a very weak relationship, because the value of Cramer’s V (.095) is on the low end of the “weak relationship” range, which extends from 0 to .3.

Symmetric Measures
		Value	Approx. Sig.
Nominal by Nominal	Phi	.095	.138
	Cramer’s V	.095	.138
N of Valid Cases		616
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
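Cramer’s V follows directly from the chi-squared value and the table dimensions. This is a minimal sketch using the reported X^2 = 5.504 and N = 616; it also shows why Phi and Cramer’s V coincide here, since the smaller table dimension minus one equals 1.

```python
import math

chi2 = 5.504   # Pearson chi-squared from the Chi-Square Tests table
n = 616        # number of valid cases
n_rows, n_cols = 4, 2

phi = math.sqrt(chi2 / n)
# Cramer's V divides by min(rows - 1, columns - 1), which is 1 for a 4 x 2 table.
cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

print(f"Phi = {phi:.3f}, Cramer's V = {cramers_v:.3f}")
```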

f. Gamma, a PRE measure of association. The value of gamma between abany and advfront is .156, which means that we make 15.6% fewer errors predicting the order of pairs of cases on advfront when we know whether the respondent thinks abortion should be legal for any reason than when we don’t. I chose this measure of association because I have an ordinal level variable as well as a nominal level variable. Ordinal measures of association are appropriate for a mixed case like this (one ordinal and one nominal level variable) as long as the nominal level variable has only two categories, as abany does (Agresti and Finlay 246).

Symmetric Measures
		Value	Asymp. Std. Error^a	Approx. T^b	Approx. Sig.
Ordinal by Ordinal	Gamma	.156	.074	2.088	.037
N of Valid Cases		616
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
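Gamma can be recomputed from the crosstab by counting concordant and discordant pairs. The sketch below assumes the category order shown in the crosstab (strongly agree to strongly disagree, yes to no); reversing either ordering would flip the sign.

```python
# Observed counts: rows are advfront (strongly agree .. strongly disagree),
# columns are abany (yes, no).
observed = [[80, 68], [175, 215], [23, 37], [9, 9]]

n_rows, n_cols = len(observed), len(observed[0])
concordant = 0
discordant = 0
for i in range(n_rows):
    for j in range(n_cols):
        for k in range(i + 1, n_rows):
            for l in range(n_cols):
                pairs = observed[i][j] * observed[k][l]
                if l > j:
                    concordant += pairs   # lower row and further-right column
                elif l < j:
                    discordant += pairs   # lower row but further-left column

gamma = (concordant - discordant) / (concordant + discordant)
print(f"concordant = {concordant}, discordant = {discordant}, gamma = {gamma:.3f}")
```

Pairs tied on either variable are ignored, which is why gamma tends to come out larger than other ordinal measures computed on the same table.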

a. research hypothesis between abany (dependent variable) and age (independent variable):

Ho: There is no relationship between abany and age. μ1 – μ2 ≤ 0

Ha: There is a relationship between abany and age. μ1 – μ2 > 0

I expect there to be a relationship between these two variables because as someone gets older, their perspectives tend to change, and there may be a common direction in which the beliefs tilt. Another, more realistic reason is that the older respondents matured during a time when the common belief was against abortion, while younger respondents matured when pro-abortion was a more common belief.

b. The mean for age is 48.19. For Yes the mean age is 46.63, and for No the mean age is 48.88. There may be a slight relationship between the two variables, but age doesn’t seem to have much effect on abany, because the mean age for No is only 2.25 years higher than the mean age for Yes.

Descriptives
	ABORTION IF WOMAN WANTS FOR ANY REASON			Statistic	Std. Error
AGE OF RESPONDENT	YES	Mean		46.63	.691
		95% Confidence Interval for Mean	Lower Bound	45.27
			Upper Bound	47.99
		5% Trimmed Mean		46.18
		Median		46.00
		Variance		263.871
		Std. Deviation		16.244
		Minimum
		Maximum
		Range
		Interquartile Range
		Skewness		.310	.104
		Kurtosis		-.531	.207
	NO	Mean		48.88	.689
		95% Confidence Interval for Mean	Lower Bound	47.52
			Upper Bound	50.23
		5% Trimmed Mean		48.45
		Median		49.00
		Variance		327.119
		Std. Deviation		18.086
		Minimum
		Maximum
		Range
		Interquartile Range
		Skewness		.236	.093
		Kurtosis		-.918	.186

I. Assumptions

a. L of M: 1 nominal (abany), 1 ratio (age).

b. M of S: random

c. Sample size: 1248 valid responses = n (variables from same sample, so there’s only one sample size to report)

d. Population distribution: N/A

II. Hypotheses

a. Ho: there is no relationship between abany and age, μ1 – μ2 ≤ 0

b. Ha: there is a relationship between abany and age, μ1 – μ2 > 0

III. Test

One-Sample Test
	Test Value = 0
	t	df	Sig. (2-tailed)	Mean Difference	95% Confidence Interval of the Difference
					Lower	Upper
ABORTION IF WOMAN WANTS FOR ANY REASON	110.598	1247	.000	1.556	1.53	1.58
AGE OF RESPONDENT	120.908	1968	.000	48.193	47.41	48.98

IV. p-value and interpretation

a. p < .001 (SPSS displays .000)

b. If Ho were true (μ1 – μ2 ≤ 0), then the probability of getting a sample difference of means as far above 0 as we got (2.25 years) is less than .001.

V. Conclusion

a. We reject Ho and therefore find support for the alternative Ha (μ1 – μ2 > 0).
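The One-Sample Test table compares each variable’s mean with 0 rather than comparing the two group means with each other. As a rough cross-check, a difference-of-means statistic can be sketched from the group means and standard errors in the Descriptives output. This is not SPSS output: it assumes an unequal-variance (Welch) t statistic and, because both groups are large, uses the normal distribution to approximate the p-value.

```python
import math

# Group means and standard errors of the mean from the Descriptives output.
mean_yes, se_yes = 46.63, 0.691
mean_no, se_no = 48.88, 0.689

diff = mean_no - mean_yes
# Standard error of the difference between two independent means.
se_diff = math.sqrt(se_yes ** 2 + se_no ** 2)
t_stat = diff / se_diff

# Normal approximation to the two-sided p-value (an assumption for large samples).
p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"difference = {diff:.2f} years, t = {t_stat:.3f}, two-sided p = {p_two_sided:.3f}")
```

The t value from this sketch is far smaller than the one-sample values in the table, because it measures the 2.25-year gap between the groups instead of the distance of each mean from 0.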

d. This relationship is very weak, because the gamma value is .067.

Symmetric Measures
		Value	Asymp. Std. Error^a	Approx. T^b	Approx. Sig.
Ordinal by Ordinal	Gamma	.067	.033	2.021	.043
N of Valid Cases		1243
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

a. My two variables are age (ratio) and adults (ratio). Age is the independent variable, because it affects the number of adults age 18 and older living in one’s household.

b. Scatterplot of age (x) vs. adults (y)

The younger the respondent, the more adult members their household tends to have.

c. The unstandardized coefficient for age of respondent is -.012, which tells us that the slope is negative: for each additional year of age, the predicted number of people 18 and older in one’s household decreases by .012. The unstandardized coefficient under column B, in the (Constant) row, has a value of 2.473. This is the y-intercept of the line, the predicted number of adults in the household when age is 0.

Coefficients^a
Model		Unstandardized Coefficients		Standardized Coefficients	t	Sig.
		B	Std. Error	Beta
1	(Constant)	2.473	.053		47.022	.000
	AGE OF RESPONDENT	-.012	.001	-.258	-11.771	.000
a. Dependent Variable: HOUSEHOLD MEMBERS 18 YRS AND OLDER

d. y = -.012x + 2.473.

y = -.012(21) + 2.473

y = 2.221 people. The predicted number of people 18 and older in a 21-year-old’s household is 2.221.

y = -.012(55) + 2.473

y = 1.813 people. The predicted number of people 18 and older in a 55-year-old’s household is 1.813.
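The same hand calculation can be written as a short function. This is a sketch using the rounded coefficients from the Coefficients table; the function name predict_adults is my own.

```python
INTERCEPT = 2.473   # (Constant) from the Coefficients table
SLOPE = -0.012      # coefficient for AGE OF RESPONDENT


def predict_adults(age):
    """Predicted number of household members 18 and older for a given age."""
    return INTERCEPT + SLOPE * age


for age in (21, 55):
    print(f"age {age}: predicted adults = {predict_adults(age):.3f}")
```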

I. assumptions

a. L of M: two ratio (from the same sample): adults and age

b. M of S: random

c. sample size: n = 1953 people with valid answers on both variables

d. Population distribution: N/A because of the central limit theorem (sample size over 30).

II. hypotheses test:

a. Ho: there is no relationship, population slope = 0

b. Ha: there is a relationship, population slope ≠ 0

III. test statistic

a. t = -11.771 (the t for AGE OF RESPONDENT in the Coefficients table, which tests whether the slope is 0)


	Test Value = 0
	t	df	Sig. (2-tailed)	Mean Difference	95% Confidence Interval of the Difference
					Lower	Upper
AGE OF RESPONDENT	120.908	1968	.000	48.193	47.41	48.98
HOUSEHOLD MEMBERS 18 YRS AND OLDER	101.007	1957	.000	1.891	1.85	1.93

IV. interpret p-value

a. p < .001 (SPSS displays .000)

b. If Ho were true (population slope = 0), then the probability of getting a slope at least as far from 0 as the one we got (-.012 household members 18 and older per year of age) is less than .001.

V. conclusion

a. We reject Ho and therefore find support for the alternative Ha (population slope ≠ 0).

f. R square is .066, meaning that age explains about 6.6% of the variation in the number of household members 18 and older. Pearson’s r is -.258, meaning there is a negative relationship, but it is weak and far from perfect.

Model Summary and Parameter Estimates
Dependent Variable: HOUSEHOLD MEMBERS 18 YRS AND OLDER
Equation	Model Summary					Parameter Estimates
Equation	R Square	F	df1	df2	Sig.	Constant	b1
Linear	.066	138.559	1	1951	.000	2.473	-.012
The independent variable is AGE OF RESPONDENT.
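With a single predictor, R square, Pearson’s r, and the t for the slope are all tied to the F statistic in the table above. The following sketch assumes the standard bivariate identities R² = F·df1 / (F·df1 + df2) and |t| = √F; the sign of r is taken from the negative slope.

```python
import math

f_stat = 138.559
df1, df2 = 1, 1951

# R squared recovered from F and its degrees of freedom.
r_squared = f_stat * df1 / (f_stat * df1 + df2)

# With one predictor, the slope's t statistic is the square root of F (in absolute value).
t_slope_abs = math.sqrt(f_stat)

# Pearson's r is the square root of R squared, signed like the slope (negative here).
r = -math.sqrt(r_squared)

print(f"R squared = {r_squared:.3f}, |t| = {t_slope_abs:.3f}, r = {r:.3f}")
```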

Correlations
		HOUSEHOLD MEMBERS 18 YRS AND OLDER	AGE OF RESPONDENT
HOUSEHOLD MEMBERS 18 YRS AND OLDER	Pearson Correlation	1	-.258^**
	Sig. (2-tailed)		.000
	Sum of Squares and Cross-products	1342.611	-7374.530
	Covariance	.686	-3.778
	N	1958	1953
AGE OF RESPONDENT	Pearson Correlation	-.258^**	1
	Sig. (2-tailed)	.000
	Sum of Squares and Cross-products	-7374.530	615657.277
	Covariance	-3.778	312.834
	N	1953	1969
**. Correlation is significant at the 0.01 level (2-tailed).
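The slope and Pearson’s r can also be recovered from the covariance and variances in the Correlations table. This sketch assumes the means reported in the one-sample tables (1.891 adults, 48.193 years), which are based on slightly more cases than the 1953 used in the regression, so the intercept is only approximate.

```python
import math

cov_xy = -3.778       # covariance of age and adults
var_age = 312.834     # variance of AGE OF RESPONDENT
var_adults = 0.686    # variance of HOUSEHOLD MEMBERS 18 YRS AND OLDER
mean_age = 48.193
mean_adults = 1.891

# OLS slope: covariance divided by the variance of the independent variable.
slope = cov_xy / var_age
# The OLS line passes through the point of means.
intercept = mean_adults - slope * mean_age
# Pearson's r: covariance divided by the product of the standard deviations.
r = cov_xy / math.sqrt(var_age * var_adults)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```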

I wrote this report for the Gordon College course Statistics for Social Research. Find the syllabus below.

SOC310-Syllabus Download
