Significance of Sample Size; Types of Error; Normal Distribution, a Review; Z Scores and the Normal Table
Review of material from previous week: Variance
The variance ($s^2$) is the sum of the squared deviations from the sample mean, divided by N-1, where N is the number of cases. (Note: in the formula for the population variance, the denominator is N rather than N-1.) Formula for sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$. To find the variance and its square root, the standard deviation, use the command Analyze/Descriptive Statistics/Frequencies and move the variable of interest into the Variables box. Click on Statistics, check standard deviation and variance, and then click OK.

Review from previous week: Standard deviation (s or SD) and normal distribution
The standard deviation is the square root of the variance. It measures how much a typical case departs from the mean of the distribution. As a rough rule of thumb, the size of the SD is about one-sixth of the range. The standard deviation becomes meaningful in the context of the normal distribution. In a normal distribution: (1) the mean, median and mode of responses for a variable coincide; (2) the distribution is symmetrical in that it divides into two equal halves at the mean, so that 50% of the scores fall below the mean and 50% above it (the sum of the probabilities of all categories of outcomes = 1.0, so the total area = 1); (3) 68.26% of the scores fall within plus or minus one standard deviation of the mean, and 95.44% fall within 2 SDs.
The standard normal curve (scores expressed in standard deviation units) has a standard deviation of 1. Thus we are able to use the SD to assess the relative standing of a score within a distribution, to say that it is 2 SDs above or below the average, for example. The normal distribution has a skewness equal to zero. Some normal curves in nature: heights within gender; LDL cholesterol.

Review from last week: Histogram with superimposed normal curve
Histogram of the vehicle weight variable with a superimposed curve. The curve is what a normal distribution of a variable with the same mean and standard deviation would look like. The vehicle weight distribution has a positive skew and is flatter (more platykurtic) than the normal curve.

Definition: Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
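The 68.26% and 95.44% figures quoted in the review can be checked numerically. The sketch below is illustrative only; it uses Python's standard-library `statistics.NormalDist`, and the helper name `share_within` is made up for this example. The exact areas come out slightly higher (about 68.27% and 95.45%) because the slide values are obtained by doubling table entries that were already rounded to four places.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/-{k} SD: {share_within(k):.2%}")
```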
Descriptive vs. Inferential Statistics
Descriptive statistics: such measures as the mean, standard deviation, and correlation coefficient, when used to summarize sample data and reduce it to manageable proportions.
Inferential (sampling) statistics: use of sample characteristics to infer parameters or properties of a population on the basis of known sample results. Based on probability theory. Statistical inference is the process of estimating parameters from statistics. Inferential statistics require that certain assumptions be met about the nature of the population.

Assumptions of Inferential Statistics
Let's consider an example of an application of parametric statistics, the t-test. Suppose we have drawn a sample of Chinese Americans and a sample of Korean Americans and have a belief that the two populations are likely to differ in their attitudes toward aging. The purpose of the t-test is to determine if the means of the populations from which these two samples were drawn are significantly different. There are certain assumptions that have to be met before you can perform this test: (1) you must have at least interval level data; (2) the data from both samples have to represent scores on the same variable, that is, you can't measure attitudes toward aging in different ways in different populations; (3) the populations from which the samples are drawn are normally distributed with respect to the variable; (4) the variances of the two populations are equal; (5) the samples have been randomly drawn from comprehensive sampling frames, that is, each element or unit in the population has an equal chance of being selected (random sampling permits us to invoke statistical theories about the relationships of sample and population characteristics).

Population Parameters vs. Sample Statistics
The purpose of statistical generalizations is to infer the parameters of a population from statistics known about a sample drawn from the population.
Greek letters usually refer to population characteristics: population standard deviation = σ, population mean = µ.
Roman letters usually refer to sample characteristics: sample standard deviation = s, sample mean = $\bar{X}$.
The formula for the variance in a population is: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
The formula for the variance in a sample is: $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
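The difference between the two variance formulas (N versus N-1 in the denominator) is easy to see on a small data set. The following is a minimal sketch; the function names and the five scores are illustrative and not part of the lecture's SPSS workflow.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


scores = [5, 6, 7, 8, 9]  # hypothetical scores, mean = 7

print("population variance:", population_variance(scores))
print("sample variance:    ", sample_variance(scores))
print("sample SD:          ", round(math.sqrt(sample_variance(scores)), 4))
```

With only five cases the two results differ noticeably; as N grows, the gap between dividing by N and by N-1 becomes negligible.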
Frequency Distributions vs. Probability Distributions
The general way in which statistical hypothesis testing is done is to compare obtained frequencies to theoretical probabilities. Compare the probability distribution for the number of heads in two, four and 12 coin flips vs. an actual example of coin flipping (from D. Lane, History of Normal Distribution). For two flips the possible outcomes are: tails, tails; heads, heads; and tails, heads or heads, tails.
Formula for binomial probabilities: $P(x) = \frac{N!}{x!\,(N - x)!}\, p^x (1 - p)^{N - x}$, where N is the number of flips, x the number of heads, and p the probability of heads on a single flip.

Comparing Obtained to Theoretical Outcomes
If you did a sample experiment and you got, say, two heads in two flips 90% of the time, you would say that there was a very strong difference between the obtained frequencies and the expected frequencies, or between the obtained frequency distribution and the probability distribution. Over time, if we were to carry out lots and lots of coin-flipping experiments, the occasions when we got 90% occurrence of two heads in two flips would start to be balanced out by results in the opposite direction, and eventually with enough cases our obtained frequency distribution would start to look like the theoretical probability distribution. For an infinite number of experiments, the frequency and probability distributions would be identical.

Significance of Sample Size
The larger the sample size, the greater the probability that the obtained sample mean will approximate the population mean (this is the law of large numbers). The Central Limit Theorem adds that the larger the sample size, the more closely the sampling distribution of the sample mean approximates the shape of a normal distribution centered on the population mean.
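The comparison between a theoretical probability distribution and obtained frequencies can be sketched in a few lines. This is an illustration under simple assumptions (a fair coin, a seeded pseudo-random generator); the function names are invented for the example.

```python
import random
from math import comb


def binomial_probability(heads, flips, p=0.5):
    """Theoretical probability of exactly `heads` heads in `flips` flips."""
    return comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)


def simulate_two_heads(experiments, seed=0):
    """Obtained relative frequency of two heads in two flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        flips = [rng.random() < 0.5 for _ in range(2)]  # True means heads
        if all(flips):
            hits += 1
    return hits / experiments


# Theoretical distribution for two flips: 0, 1 or 2 heads.
for k in range(3):
    print(k, "heads:", binomial_probability(k, 2))

# Obtained frequencies drift toward the theoretical 0.25 as experiments increase.
for n in (10, 1_000, 100_000):
    print(n, "experiments:", simulate_two_heads(n))
```

Small runs can land far from 0.25, which mirrors the point above: only with many experiments does the obtained frequency distribution settle near the probability distribution.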
Reasoning about Populations from Sample Statistics
Parameters are fixed values and are generally unknown. Statistics vary from one sample to another, and are known or can be computed. In testing hypotheses, we make assumptions about parameters and then ask how likely our sample statistics would be if the assumptions we made were true. It's useful to think of a hypothesis as a prediction about an event that will occur in the future, stated in such a way that we can reject that prediction. We might reason that if what we assume about the population and our sampling procedures is correct, then our sample results will usually fall within some specified range of outcomes.
If our sample results fall outside this range into a critical region, we must reject our assumptions. For example, if we assume that two populations, say males and females, have the same views on increasing the state sales tax, but we obtain results from our randomly drawn samples indicating that their mean scores on the attitude toward sales tax measure are so different that this difference falls into the far reaches of a distribution of such sample differences, we would have to reject our assumption that the populations do not differ. But our decision would have a lot to do with how we defined the far reaches of this distribution, called the critical region.
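How far out the "far reaches" begin depends on the risk level chosen in advance. As a hedged illustration, assume the test statistic is a Z score whose sampling distribution is standard normal when the assumption of no difference is true; that is an assumption of this sketch, and the helper names are hypothetical.

```python
from statistics import NormalDist


def two_tailed_critical_z(alpha):
    """Z value that cuts off alpha/2 of the area in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def in_critical_region(z, alpha=0.05):
    """True when the obtained Z falls in either tail beyond the cutoff."""
    return abs(z) > two_tailed_critical_z(alpha)


for alpha in (0.05, 0.01):
    print(f"alpha={alpha}: reject when |Z| > {two_tailed_critical_z(alpha):.2f}")

print(in_critical_region(2.2))   # beyond the .05 cutoff
print(in_critical_region(1.5))   # inside the expected range
```

A stricter alpha pushes the cutoff further into the tails, so the same obtained result can be rejected at .05 and retained at .01.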
Reasoning about Populations from Sample Statistics, contd
We can say that we have carried out statistical hypothesis testing if: (1) we have allowed for all potential outcomes of our experiment or survey results ahead of the test; (2) we have committed beforehand to a set of procedures or requirements that we will use to determine if the hypothesis should be rejected; and (3) we agree in advance on which outcomes would mean that the hypothesis should be rejected.
Probability theory lets us assess the risk of error and take these risks into account in making a determination about whether the hypothesis should be rejected.
Types of Error
Error risks are of two types. Type I error, also called alpha (α) error, is the risk of rejecting the null hypothesis (H0: the hypothesis of no difference between two populations, or no difference between the sample mean and the population mean) when it is in fact true (more likely when we set our confidence level too low). Type II error, or beta (β) error, is the risk of failing to reject a null hypothesis when it is in fact false (more likely when we set our confidence level too high).
When we report the results of our test, it is often expressed in terms of the degree of confidence we have in our result (for example, we are confident that there is less than a 5% or 2.5% or 1% probability that the result we got was obtained by chance and that in fact we should fail to reject the null hypothesis). This is usually referred to as the confidence level or the significance level.

Why We are Willing to Generalize from Sample Data
Why should we generalize on the basis of limited information? (1) Time and cost factors. (2) Inability to define a population and list all of its elements.
Random sampling: every member of the population has an equal chance of being selected for the sample. Theoretically, to do this requires that you have a list of all the members of the population. To survey the full-time faculty at UPC, for example, you might obtain a list of all the faculty, number them from one to N, and then use a random number table to draw the numbered cases for your sample.
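Drawing numbered cases from a list can be sketched with a pseudo-random generator standing in for the random number table. The population size, sample size and function name below are hypothetical; the sketch covers drawing both without and with replacement.

```python
import random


def draw_sample(population_size, sample_size, replace=False, seed=None):
    """Draw case numbers from a list numbered 1..population_size."""
    rng = random.Random(seed)
    case_numbers = range(1, population_size + 1)
    if replace:
        # With replacement: the same case number may be drawn more than once.
        return rng.choices(case_numbers, k=sample_size)
    # Without replacement: every drawn case number is distinct.
    return rng.sample(case_numbers, k=sample_size)


# Hypothetical faculty list of 500 members, sample of 50.
sample = draw_sample(500, 50, seed=1)
print(sorted(sample))
```

Fixing the seed makes the draw reproducible, which is useful when the sample has to be documented or re-created later.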
Random sampling can be done with or without replacement. SPSS will draw a random sample of a desired size for you from your list of cases (Data/Select Cases/Random Sample of Cases).

Normal Distribution, a Review
The normal curve is an essential component of decision-making by which you can generalize your sample results to population parameters. Notion of the area under the normal curve: the area between the curve and the baseline, which contains 100% of the cases.
Characteristics of the normal curve, contd
A constant proportion of the area under the curve will lie between the mean and a given point on the baseline expressed in standard score units (Zs), and this holds in both directions (both above and below the mean). That is, for any given distance in standard (sigma) scores the area under the curve (proportion of cases) will be the same both above and below the mean. The most commonly occurring scores cluster around the mean, where the curve is the highest, while the extremely high or extremely low scores occur in the tails and become increasingly rare (the height of the curve is lower and in the limit approaches the baseline). The total area (sum of individual probabilities) sums to 1.0.

Table of the Area under the Normal Curve
Tables of the Area under the Normal Curve are available in your supplemental readings, p. 469 in Kendrick and p. 299 in Levin and Fox, and can be found on the Web. You can use this table to find out the area under the normal curve (the proportion of cases) which theoretically is likely to fall between the population mean and some score expressed in standard units or Z scores. For example, let's find what proportion of cases in a normal distribution would lie between the population mean and a standard score of 2.2 (that is, a score on the variable that is 2.2 standard deviations above the mean, also called a Z score).

Z Scores and the Normal Table
In the normal table you look up the Z score of 2.2, and to the right of that you will find the proportional area between the mean and Z, which is .4861. Thus 48.61% of the cases in the normal distribution lie between the mean and Z = 2.2.
What proportion of cases lie below this? Add 50% to 48.61% (because 50% of the cases lie below the mean). What proportion of cases lie above this? 100% (100% of cases) minus (50% + 48.61%), or 1.39% of cases. What proportion of cases lie between -2.2 and +2.2? Some tables will express the values in percentages, some in proportions.
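The table lookups for Z = 2.2 can be reproduced without a printed table. The sketch below relies on Python's standard-library `statistics.NormalDist`; the variable and function names are illustrative.

```python
from statistics import NormalDist


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean and Z."""
    return NormalDist().cdf(abs(z)) - 0.5


z = 2.2
between_mean_and_z = area_mean_to_z(z)          # the table entry for Z = 2.2
below = 0.5 + between_mean_and_z                # lower half plus the table area
above = 1.0 - below                             # what is left in the upper tail
between_minus_and_plus = 2 * between_mean_and_z  # symmetry around the mean

print(f"mean to Z:        {between_mean_and_z:.4f}")
print(f"below Z:          {below:.4f}")
print(f"above Z:          {above:.4f}")
print(f"between -Z and Z: {between_minus_and_plus:.4f}")
```

The last line answers the question about the interval from -2.2 to +2.2: because the curve is symmetrical, the area is twice the table entry.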
Using the Mean and Standard Deviation to Find Where a Particular Value Might Fall
Let's consider the vehicle weight variable from the Cars.sav file. From a previous analysis we learned that the distribution looked like the histogram on the right, and it had the sample statistics reported in the table, including a mean of 2969.56 and a standard deviation of 849.827.
What would be the weight of a vehicle that was one standard deviation above the mean? Add one SD to the mean, and you get 3819.387. One standard deviation below the mean? Subtract one SD from the mean, and you get 2119.733. What percent of vehicles have weights between those two values, assuming a random, representative sample and no measurement error? 68.26%.
What would be the weight of a vehicle that was two standard deviations above the mean? Two standard deviations below the mean? What percent of vehicles have weights between those two values, assuming a random, representative sample and no measurement error? 95.44%.

Z Scores
The Z score expresses the relationship between the mean score on the variable and the score in question in terms of standardized units (units of the standard deviation). Thus from the calculations we just did we can say that the vehicle weight of 3819.387 has a Z score of +1 and the weight of 2119.733 has a Z score of -1.
Turning the question around, suppose we wanted to know where in the distribution we would find a car that weighed 4500 pounds. To answer that question we would need to find the Z score for that value. The computing formula for finding a Z score is $Z = \frac{X - \bar{X}}{s}$. Thus, the Z score for the vehicle weight of 4500 pounds (X) is (4500 - 2969.56 (the mean)) / 849.827 (the standard deviation), or Z = 1.80.
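The same arithmetic can be written as two small helper functions, one converting a raw weight to a Z score and one going the other way. The mean and standard deviation are the vehicle weight statistics quoted above; the function and constant names are made up for this sketch.

```python
MEAN_WEIGHT = 2969.56   # sample mean of vehicle weight
SD_WEIGHT = 849.827     # sample standard deviation of vehicle weight


def z_score(x, mean=MEAN_WEIGHT, sd=SD_WEIGHT):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (x - mean) / sd


def raw_score(z, mean=MEAN_WEIGHT, sd=SD_WEIGHT):
    """Raw score that lies z standard deviations from the mean."""
    return mean + z * sd


for z in (-2, -1, 1, 2):
    print(f"Z = {z:+d}: weight = {raw_score(z):.3f}")

for weight in (4500, 1000):
    print(f"weight = {weight}: Z = {z_score(weight):.2f}")
```

Running it also fills in the two-standard-deviation questions, since `raw_score(2)` and `raw_score(-2)` give the weights that bound the middle 95.44% of the distribution.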
What about a 1000-lb car? (Z = -2.32)

How to Interpret and Use the Z Score: Table of the Area under the Normal Curve
Suppose we know that a vehicle weight has a Z score of +1 (is 1 SD above the mean). Where does that score stand in relation to the other scores? Let's think of the distribution image again. Recall that we said that half of the cases fall below the mean, that 34.13% of the cases fall between the mean and one SD below it, and that 34.13% of the cases fall between the mean and one SD above it. So if a vehicle weight has a Z score of +1, what proportion of cases are above it and what proportion are below it? Let's look at the next slide.
Table of the Area under the Normal Curve, continued
Consider the Z score of 1.00. .3413 of scores lie between Z and the mean; .1587 of scores lie above a Z of 1.00, and .8413 lie below it. Now suppose Z was -1.0. .3413 of scores would still lie between Z and the mean; what percent of scores would lie above it and below it? Remember that the normal distribution is symmetrical.

Sampling Distribution of Sample Means and the Standard Error of the Mean
The characteristics of populations, or parameters, are usually not known. All we can do is estimate them from sample statistics. What gives us confidence that a sample of, say, 100 or 1000 people permits us to generalize to millions of people? The key concept is the notion that theoretically we could draw all possible samples from the population of interest, and that for the sample statistics that we collect, such as the mean, there will be a sampling distribution with its own mean and standard deviation. In the case of the mean, this is called the sampling distribution of sample means, and its mean is represented as $\mu_{\bar{X}}$.
Characteristics: (1) it approximates a normal curve; (2) its mean is equal to the population mean; (3) its standard deviation is smaller than that of the population (the sample mean is more stable than the scores which comprise it).
We can also estimate the standard deviation of the sampling distribution of sample means, which would give us an indicator of the amount of variability in the distribution of sample means.
This value, known as the standard error of the mean, is represented by the symbol $\sigma_{\bar{X}}$. Basically, it tells you how much statistics can be expected to deviate from parameters when sampling randomly from the population.

Estimating the Standard Error
The standard error of the mean is hypothetical and unknowable; consequently we estimate it with sample statistics using the formula: standard deviation of the sample divided by the square root of the sample size. (SPSS uses N in the denominator; Levin and Fox advocate N-1 for obtaining an unbiased estimate of the standard error. It makes little difference with large N.) As you will quickly note, the standard error is very sensitive to sample size, such that the larger the sample size, the smaller the error. And the smaller the error, the greater the homogeneity in the sampling distribution of sample means (that is, if the standard error is small relative to the range, the sample means aren't all over the place). The standard error is of importance primarily because it is used in the calculation of other inferential statistics, and when it is small it increases the confidence you can have that your sample statistics are representative of population parameters.
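The sensitivity of the standard error to sample size is easy to demonstrate. The sketch below uses the vehicle weight standard deviation (849.827) with hypothetical sample sizes, and offers both denominators mentioned above: N (the default here) and N-1. The function name and the `unbiased` flag are illustrative, not SPSS options.

```python
import math


def standard_error(sd, n, unbiased=False):
    """Estimated standard error of the mean: sd / sqrt(N), or sd / sqrt(N - 1)."""
    denominator = n - 1 if unbiased else n
    return sd / math.sqrt(denominator)


sd = 849.827  # sample standard deviation of vehicle weight

# Hypothetical sample sizes: quadrupling N halves the standard error.
for n in (25, 100, 400):
    print(f"N = {n:>3}: SE = {standard_error(sd, n):8.3f}   "
          f"(N-1 version: {standard_error(sd, n, unbiased=True):8.3f})")
```

The two versions differ by several units at N = 25 and by only a fraction of a unit at N = 400, which is the "little difference with large N" noted above.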
Finding Z Scores and the Standard Error with SPSS
Let's calculate Z scores and standard errors for the variables companx1 and companx3 in the Lesson3.sav data set: (1) Go to Analyze/Descriptive Statistics/Descriptives. (2) Move the variables companx3 (difficulty understanding the technical aspects of computers) and companx1 (fear of making mistakes) into the Variables window using the black arrow. (3) Click on Save standardized values as variables (this creates two new variables whose data are expressed in Z scores rather than raw scores). (4) Click Options and check S.E. Mean as well as mean, s, etc. (5) Click Continue and then OK. (6) Go to the Output viewer to see descriptive statistics for the variables. (7) Go to the Data Editor and note the new variables which have been added in the right-most columns.

Compare S.E.s, Raw Scores to Z Scores
Note that the standard errors of the two variables are about the same, although the range is larger for difficulty understanding.
[Output tables: Z scores; raw scores]

Point Estimates, Confidence Intervals
A point estimate is an obtained sample value such as a mean, which can be expressed in terms of ratings, percentages, etc. For example, the polls that are released showing a race between two political candidates are based on point estimates of the percentage of people in the population who intend to vote for, or at least favor, one or the other candidate.
Confidence level and confidence interval: a confidence interval is a range that the researcher constructs around the point estimate of its corresponding population parameter, often expressed in the popular literature as a margin of error of plus or minus some number of points, percentages, etc.
This range becomes narrower as the standard error becomes smaller, which in turn becomes smaller as the sample size becomes larger.
Confidence levels
Confidence levels are usually expressed in terms like .05, .01, etc. in the scholarly literature, and 5%, 1%, etc. in the popular press. They are also called significance levels. They represent the likelihood that the population parameter which corresponds to the point estimate falls outside that range. To turn it around the other way, they represent the probability that if you constructed 100 confidence intervals around the point estimates from samples of the same size, about 95 (or 99) of them would contain the true percentage of people in the population who preferred Candidate A (or Candidate B).

Using the Sample Standard Error to Construct a Confidence Interval
Since the mean of the sampling distribution of sample means for a particular variable equals the population mean for that variable, we can try to estimate how likely it is that the population mean falls within a certain range, using the sample statistics. We will use the standard error of the mean from our sample to construct a confidence interval around our sample mean such that there is a 95% likelihood that the range we construct contains the population mean.

Calculating the Standard Error with SPSS
Let's consider the variable vehicle weight from the Cars.sav data file. Let's find the mean and the standard error of the mean for vehicle weight. Go to Analyze/Descriptive Statistics/Frequencies, then click the Statistics button and request the mean and the standard error of the mean (S.E. Mean).
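Once the mean and the standard error are in hand, the confidence interval follows directly. The sketch below is a Z-based interval under stated assumptions: the mean and standard deviation are the vehicle weight statistics quoted earlier, but the sample size of 400 is hypothetical (the slides do not give N), and the function name is invented for the example.

```python
import math
from statistics import NormalDist


def confidence_interval(mean, sd, n, confidence=0.95):
    """Z-based confidence interval: mean +/- z * (sd / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin


# Vehicle weight statistics from the slides; n = 400 is an assumed sample size.
low, high = confidence_interval(mean=2969.56, sd=849.827, n=400)
print(f"95% CI: {low:.2f} to {high:.2f}")

low99, high99 = confidence_interval(mean=2969.56, sd=849.827, n=400, confidence=0.99)
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

Asking for more confidence widens the interval, and a larger sample narrows it, because the standard error in the margin shrinks with the square root of N.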
A Word about the t Distribution
Levin and Fox and some other authors advocate using the t distribution to construct confidence intervals and find significance levels when the sample size is small. When it is large, say over 50, the Z and t distributions are very similar. Think of t as a standard score like Z. When comparing an obtained sample mean to a known or assumed population mean, t is computed as the sample mean minus the population mean, divided by an unbiased estimate of the standard error (the sample standard deviation divided by the square root of N-1): $t = \frac{\bar{X} - \mu}{s / \sqrt{N - 1}}$.
The t table is entered by looking up values of t for the sample size minus one (N-1, also known as the degrees of freedom in this case) and the significance level (the area under the curve corresponding to the alpha level: .05, .01, .005, or 1 minus the degree of confidence). Suppose we had a sample size of 31 and an obtained value of t of 2.75. Entering the t distribution at df = 30 and moving across, we find that a t of 2.75 corresponds to a significance or alpha level (area in the tails of the distribution) of only .01 (two-tailed, .005 in each tail), which means that there is only 1 chance in 100 of obtaining a sample mean like ours given the known population mean. (The t-test in practice is adjusted for whether it is one-tailed or two-tailed.) Levin and Fox provide examples of setting confidence intervals for means using the t distribution rather than Z.
[Figure: t distribution]

A Word about the Concept Degrees of Freedom (DF)
In scholarly journals you will see references to DF when the results of statistical tests are reported. What does this mean? Degrees of freedom generally is calculated as the number of independent cases (which are free to vary) in the sample from which we are computing a statistic. For example, suppose we had the following data: 5, 6, 7, 8, 9.
And we calculated their mean as 35/5 = 7. We could change any four of those numbers, e.g., we could change them to 1, 2, 3, and 4, but the fifth one would have to make their total come out to 35 (it would have to be 25) so that the mean of the five numbers would remain 7. Thus the degrees of freedom in computing the mean is N-1, or in this case 4.
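Both ideas, the t statistic and the single constrained value behind N-1 degrees of freedom, can be sketched briefly. The t formula follows the description above (standard deviation computed with N, divided by the square root of N-1); the population mean of 6 used in the call is hypothetical, and the function names are illustrative. Looking up the resulting t in a table is left out, since Python's standard library has no t distribution.

```python
import math
from statistics import mean, pstdev


def t_statistic(sample, population_mean):
    """t = (sample mean - population mean) / (s / sqrt(N - 1)), with s computed using N."""
    n = len(sample)
    standard_error = pstdev(sample) / math.sqrt(n - 1)
    return (mean(sample) - population_mean) / standard_error


def last_value_for_mean(free_values, target_mean):
    """The one value that is no longer free once the mean and the other values are fixed."""
    n = len(free_values) + 1
    return target_mean * n - sum(free_values)


data = [5, 6, 7, 8, 9]
print("t against a hypothetical population mean of 6:", round(t_statistic(data, 6), 3))
print("degrees of freedom:", len(data) - 1)
print("fifth value forced by 1, 2, 3, 4 and a mean of 7:", last_value_for_mean([1, 2, 3, 4], 7))
```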