# Mathematics: Statistics

Mathematics support for students at the University of Suffolk.

# Getting started with Statistics

### Statistical Software

Learning Services does not offer specialist support workshops for statistical software. If you are expected to use either of these software packages as part of your course, then lectures will be provided in their use. However, there are lots of digital resources available for you to use and we have dedicated pages for:

It is also possible to use Excel to perform some common statistical tasks and we have a lynda.com video to get you started:
Statistics with Excel

You might also want to check out the R-Project, an open-source alternative to SPSS, although we are unfortunately unable to offer any support.

If you want to create surveys to collect data for your studies then try:

Survey Monkey produce a guide to Smart Survey Design

Again, we are unable to provide support for these tools. Please do check that you are complying with data protection laws and ethical standards for your discipline before sending out surveys (e.g. BPS - Ethics). Your course lecturers will be able to provide further guidance.

## Statistics Guide for Beginners

The following guide for beginners is credited to Maureen Haaker, Lecturer in the Dept of CYPE, in the Faculty of Arts, Business and Applied Social Science. Please acknowledge her in citations (Haaker, M, (2015), Statistics Guide for Beginners, University of Suffolk).

Statistics Guide for Beginners

## Statistical questions

Understanding statistical questions, types of statistical questions, hypothesis testing, categorical data.  Examples, videos and tests

#### Descriptive Statistics

Descriptive statistics describe what a set of data looks like. We shall look at measurements that can be calculated from a dataset:

• Measures of Central Tendency - mean, median and mode
• Measures of Dispersion - variance and standard deviation

As part of the dissemination and awareness process we'll be sharing videos from external sources.

The following video is from the Khan Academy, and walks through how to calculate a number of descriptive statistics: the mean, mode and median.

These measure averages and are regularly used as the starting point in understanding the dataset.

#### Measures of Central Tendency

The Mean

The mean is a value calculated from the data by finding the total of all the values and dividing by how many values there are. You may see a formula similar to

$\bar{x}=\frac{\sum x}{n}$

This formula says that 'x bar' (the sample mean of a set of data) equals the sum of x (add up the values) divided by n (the number of values). When talking about the mean of a population, it is often referred to using the Greek symbol µ (mu).

Example:

Find the mean of 25, 23, 18, 26 and 30.

$mean = \frac{25+23+18+26+30}{5} = \frac{122}{5} = 24.4$
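The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula using the values from the example; the variable names are our own.

```python
# Values from the worked example above
values = [25, 23, 18, 26, 30]

total = sum(values)   # the "sum of x" in the formula
n = len(values)       # the number of values
mean = total / n

print(f"total = {total}, n = {n}, mean = {mean}")
```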

The mean can be heavily affected by extremely small or large values. For example consider the wages of the following 10 employees:

Wage (£000s): £18 £18 £23 £24 £27 £27.5 £28 £28 £34 £250

$\frac{18 + 18 + 23 + 24 + 27 + 27.5 + 28 + 28 + 34 + 250}{10} = 47.75$

The mean wage would be £47,750. The employee earning £250,000 has heavily skewed the mean upwards: 9 out of 10 employees earn far less than this. You should be careful when using the mean with skewed data such as this.

We can look at a trimmed mean instead. This removes a fixed proportion of the largest and smallest values from the dataset before calculating the mean. For example, a 10% trimmed mean for the data in the previous example removes the top 10% and bottom 10% of the values. For our 10 values this means ignoring the single largest and single smallest values, leaving 8 values.

$\frac{18 + 23 + 24 + 27 + 27.5 + 28 + 28 + 34}{8} = 26.19$
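As a rough sketch, the ordinary mean and the 10% trimmed mean of the wage data can be compared in Python. The helper `trimmed_mean` is a hypothetical function written for this illustration (it is not a built-in), and it assumes `percent` is a whole number.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    ordered = sorted(values)
    drop = len(ordered) * percent // 100   # how many values to drop from each end
    kept = ordered[drop:len(ordered) - drop]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

wages = [18, 18, 23, 24, 27, 27.5, 28, 28, 34, 250]  # in £000s

plain_mean = sum(wages) / len(wages)
ten_percent_trimmed = trimmed_mean(wages, 10)

print(f"mean = {plain_mean}")
print(f"10% trimmed mean = {ten_percent_trimmed:.2f}")
```

With the single £250,000 wage (and one £18,000 wage) removed, the trimmed mean sits much closer to what most employees earn.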

The Median

The median is the value for which 50% of the data are smaller and 50% are larger. If the numbers are written in order, it is the value in the middle.

Example: find the median of 25, 23, 18, 26 and 30.

First write the values in order of size: 18, 23, 25, 26, 30. Then identify the value in the middle, which is 25.

Example: find the median of the wages from the previous example.

18, 18, 23, 24, 27, 27.5, 28, 28, 34, 250

Note that this time there is no single middle value: there are 5 values to the left of the middle and 5 values to the right. In this case the median is the value halfway between the two values either side of the middle, 27 and 27.5. To find the halfway point, add the two numbers together and divide by 2: (27 + 27.5) / 2 = 27.25, so the median is £27,250. Note that the median is not skewed as heavily as the mean.
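Both cases (an odd and an even number of values) can be sketched in Python. The function `median` below is written out by hand for illustration; Python's standard library offers `statistics.median`, which is used here only as a cross-check.

```python
import statistics

def median(values):
    """Middle value of the sorted data, or the midpoint of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # odd count: one middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: halfway point

small = [25, 23, 18, 26, 30]
wages = [18, 18, 23, 24, 27, 27.5, 28, 28, 34, 250]  # in £000s

print(median(small), statistics.median(small))
print(median(wages), statistics.median(wages))
```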

The Mode

The mode is the average that represents the value, or values, that appear most often (the top of the pops). Sometimes there is not a mode and sometimes there is more than one mode. If there are two modes the data is referred to as bi-modal.
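As an illustration, the wage data from the earlier example happens to be bi-modal: £18,000 and £28,000 each appear twice. A minimal sketch using `statistics.multimode` from Python's standard library (Python 3.8 or later) shows this:

```python
from statistics import multimode

wages = [18, 18, 23, 24, 27, 27.5, 28, 28, 34, 250]  # in £000s

# multimode returns every value that ties for the highest count
modes = multimode(wages)
print(modes)
```

When every value appears exactly once, `multimode` returns all of the values, which is usually read as the data having no mode.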

The next 9 minute video covers measures of central tendency.

#### Measures of Dispersion

Dispersion measures how 'spread-out' the data are.

Variance

The variance measures the average squared distance from the mean. First the mean of the dataset is calculated. Then the distance of each point from the mean is squared, the squared distances are summed, and the total is divided by the number of values. The larger the variance, the more spread out the data are from the mean.

$Variance=s^2=\frac{\sum (x-\bar{x})^2}{n}$

Example: Find the variance of 20, 25, 8, 22 and 40

- First find the mean

$\frac{20 + 25+ 8+ 22+ 40}{5}=23$

- Subtract 23 from each of the values

$-3, 2, -15, -1, 17$

- Square each of these values

$9, 4, 225, 1, 289$

(note that the square of a negative number is positive)

- Find the average of these values

$variance = \frac{9+ 4+ 225+ 1+ 289}{5}=105.6$
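The steps above can be followed one at a time in Python. This is a sketch of the divide-by-n formula given earlier; `statistics.pvariance` from the standard library uses the same formula and is included only as a cross-check.

```python
import statistics

data = [20, 25, 8, 22, 40]

mean = sum(data) / len(data)              # step 1: find the mean
deviations = [x - mean for x in data]     # step 2: subtract the mean from each value
squared = [d ** 2 for d in deviations]    # step 3: square each deviation
variance = sum(squared) / len(data)       # step 4: average the squared deviations

print(f"mean = {mean}")
print(f"deviations = {deviations}")
print(f"squared = {squared}")
print(f"variance = {variance}")
print(f"cross-check = {statistics.pvariance(data)}")
```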

Standard Deviation

The standard deviation is the square root of the variance.

$stdev=s=\sqrt{\frac{\sum (x-\bar{x})^2}{n}}$

For the data in the previous example the standard deviation would be:

$\sqrt{105.6}=10.3$

In descriptive statistics it describes how spread out the data are. When data are normally distributed (see later tab) it can be used to work out probabilities and proportions of items in the data set.

You may see the following formula for standard deviation:

$stdev=s=\sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}$

This formula is used in inferential statistics and is discussed in a later tab.
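To see how much the choice of divisor matters, both versions can be computed for the example data. In Python's standard library, `statistics.pstdev` divides by n and `statistics.stdev` divides by n - 1; the sketch below simply compares the two.

```python
import statistics

data = [20, 25, 8, 22, 40]

population_sd = statistics.pstdev(data)   # square root of the sum of squares divided by n
sample_sd = statistics.stdev(data)        # square root of the sum of squares divided by n - 1

print(f"dividing by n:     {population_sd:.2f}")
print(f"dividing by n - 1: {sample_sd:.2f}")
```

The n - 1 version is always a little larger, and the gap shrinks as the number of values grows.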

Khan Academy Video - Range, Variance and Standard Deviation

The following 13 minute video covers topics on dispersion.

### Distributions

In this guide we will look at:

• what a distribution is
• the normal distribution
• skewed distributions

Below is a collection of data on the colours of smarties found in a box. Frequency records how many times each colour occurred.

| Smartie Colour | Frequency |
|---|---|
| Red | 28 |
| Orange | 42 |
| Yellow | 45 |
| Green | 37 |
| Blue | 29 |
| Purple | 19 |
| TOTAL | 200 |

This is known as a Frequency Distribution

This distribution can be shown graphically:

A probability distribution looks at the relative frequency of each of these colours. The probability of choosing a yellow smartie at random would be 45/200 = 0.225 (there are 45 yellow smarties out of 200). The probability distribution for the smarties looks like:

| Smartie Colour | Probability |
|---|---|
| Red | 0.14 |
| Orange | 0.21 |
| Yellow | 0.225 |
| Green | 0.185 |
| Blue | 0.145 |
| Purple | 0.095 |
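Turning a frequency distribution into a probability distribution is just a matter of dividing each frequency by the total. A short Python sketch using the smartie counts (the dictionary layout is our own choice for the illustration):

```python
frequencies = {
    "Red": 28,
    "Orange": 42,
    "Yellow": 45,
    "Green": 37,
    "Blue": 29,
    "Purple": 19,
}

total = sum(frequencies.values())

# relative frequency = frequency / total number of smarties
probabilities = {colour: count / total for colour, count in frequencies.items()}

for colour, probability in probabilities.items():
    print(f"{colour:<7} {probability:.3f}")
```

The probabilities always add up to 1, which is a useful check on the arithmetic.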

We can also have frequency distributions for continuous data. Heights are recorded in the frequency table:

| Height (m) | Frequency |
|---|---|
| 1 - 1.1 | 1 |
| 1.1 - 1.2 | 5 |
| 1.2 - 1.3 | 8 |
| 1.3 - 1.4 | 16 |
| 1.4 - 1.5 | 15 |
| 1.5 - 1.6 | 12 |
| 1.6 - 1.7 | 11 |
| 1.7 - 1.8 | 3 |
| 1.8 - 1.9 | 2 |
| 1.9 - 2 | 0 |

We can show this frequency distribution in a histogram:
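Where no charting tool is to hand, a rough text version of a histogram can be sketched in Python by printing one symbol per observation in each height band. This is only an illustration of the idea; the layout and variable names are our own.

```python
# (height band in metres, frequency) pairs from the table above
height_bins = [
    ("1 - 1.1", 1), ("1.1 - 1.2", 5), ("1.2 - 1.3", 8), ("1.3 - 1.4", 16),
    ("1.4 - 1.5", 15), ("1.5 - 1.6", 12), ("1.6 - 1.7", 11), ("1.7 - 1.8", 3),
    ("1.8 - 1.9", 2), ("1.9 - 2", 0),
]

total = sum(count for _, count in height_bins)
tallest_bar = max(height_bins, key=lambda pair: pair[1])

for label, count in height_bins:
    # one '#' per person recorded in this height band
    print(f"{label:<9} | {'#' * count} ({count})")

print(f"total observations: {total}")
```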

This is informed by a Level 5 Adult Nurse student who asked the question as an outcome of the module she is studying on Research for Practice (Qualitative Methods and Tools).

• What is the difference between descriptive and inferential statistics?
• What are descriptive statistics?

The videos answer these questions, and give some really useful background to the different types of data, based on examples, what tests you can perform based on the type of data, and how to apply it to your research or when reading research papers.

As part of the dissemination and awareness process we'll be sharing videos from external sources. To start the process I've selected the Statistics 101 Course on YouTube, by Brandon Foltz (https://www.youtube.com/user/BCFoltz/videos). He regularly publishes videos around statistical techniques. I'll admit, these are more directed towards the final year / dissertation students.

The following video explores the concept of correlation. Given it is part of a series, the start makes reference to the previous video. Therefore, I'd suggest, sit back, relax and give it time.

Correlation is a very useful statistical measure for exploring how strongly two data sets are related. A correlation between them does not, by itself, show that one causes the other. Excel (and Google Spreadsheet) includes a built-in function for calculating the correlation coefficient. The following video outlines how this is done.
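For readers who would like to see what sits behind a correlation coefficient, here is a minimal Python sketch of the Pearson correlation coefficient. The data are hypothetical (made-up hours of study and test scores) and the function `pearson_r` is written for this illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how the two variables move together around their means
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # how much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)

# Hypothetical data, invented for this example
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 55, 61, 64, 70]

r = pearson_r(hours_studied, test_scores)
print(f"r = {r:.3f}")
```

A value close to +1 or -1 indicates a strong linear relationship and a value close to 0 indicates a weak one. Excel's `CORREL` function reports the same quantity.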

A common requirement is the use of statistics to help us either accept or reject a hypothesis. (A hypothesis is a proposed explanation for a phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. Wikipedia, http://en.wikipedia.org/wiki/Hypothesis)

Accepting or rejecting a null hypothesis often involves a t-test (as a rule of thumb, where the sample size is less than 30) or a z-test (where the sample size is greater than 30).

The following video walks through an illustration of what this means and how to complete a T Test. Although the software application often calculates the actual number, it is really useful for us to understand how it is calculated to help our interpretation. The following video is from the Khan Academy (https://www.khanacademy.org/).
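As a rough sketch of the arithmetic behind a one-sample t-test, the t statistic can be computed by hand in Python. The sample reuses the five values from the mean example earlier in this guide; the null hypothesis value of 20 is hypothetical and chosen purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

sample = [25, 23, 18, 26, 30]   # values reused from the mean example
null_mean = 20                  # hypothetical value under the null hypothesis

n = len(sample)
standard_error = stdev(sample) / sqrt(n)   # stdev uses the n - 1 formula
t_statistic = (mean(sample) - null_mean) / standard_error
degrees_of_freedom = n - 1

print(f"t = {t_statistic:.2f} with {degrees_of_freedom} degrees of freedom")
```

The t statistic says how many standard errors the sample mean lies from the hypothesised mean. Turning it into a p-value needs t-distribution tables or statistical software, which is not shown here.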

Currently, one of the most frequent requests we receive is, "What is a p-value? What does it mean?"

What is a P value? (printable guide)

The following videos have been selected to answer these questions:

• what is a p-value?
• when and where are they used?
• what do they show?

The videos use lots of terminology. A key term to remember is the null hypothesis, which "refers to a general statement or default position that there is no relationship between two measured phenomena." If we reject the null hypothesis (indicated by a low p-value), we can accept the alternate hypothesis.

Statstutor
Information about using statistics, including videos, examples and PowerPoint presentations on many aspects of statistics

Mathscentre
Help with statistics

Suffolk Observatory