Why summarise data
The analogous graphical representation for an ordinal variable does not have spaces between the bars in order to emphasize that there is an inherent order. In contrast, Figure 2 below illustrates a relative frequency bar chart of the distribution of treatment with antihypertensive medications.
This graphical representation corresponds to the tabular presentation in the last column of Table 2 above. Consider the graphical representation of the data in Table 3 above, comparing the relative frequency of antihypertensive medications between men and women.
It would appropriately look like the figure shown below. Note that a range of 0 - 40 was chosen for the vertical axis.
However, one can visually mislead the reader regarding the comparison by using a vertical scale that is either too expansive or too restrictive.
These bar charts display the same relative frequencies. A distinguishing feature of bar charts for dichotomous and non-ordered categorical variables is that the bars are separated by spaces to emphasize that they describe non-ordered categories. When one is dealing with ordinal variables, however, the appropriate graphical format is a histogram.
A histogram is similar to a bar chart, except that the adjacent bars abut one another in order to reinforce the idea that the categories have an inherent order.
The frequency histogram below summarizes the blood pressure data that was presented in a tabular format in Table 4 on the previous page. Note that the vertical axis displays the frequencies or numbers of participants classified in each category. This histogram immediately conveys the message that the majority of participants are in the lower two categories of the distribution.
A small number of participants are in the Stage II hypertension category. The histogram below is a relative frequency histogram for the same data. Note that the figure is the same, except for the vertical axis, which is scaled to accommodate relative frequencies instead of frequencies. The data values for these ten participants are shown in the table below.
The rightmost column contains the body mass index (BMI) computed using the height and weight measurements. Larger sample sizes produce more precise results and therefore carry more weight. However, there is a point at which increasing the sample size will not materially increase the precision of the analysis.
Sample size computations will be discussed in detail in a later module. However, for a large sample, inspection of the individual data values does not provide a meaningful summary, and summary statistics are necessary.
The two key components of a useful summary for a continuous variable are a measure of a typical, or average, value and a measure of the variability in the data. In biostatistics, the term 'average' is a very general term that can be addressed by several statistics. The one that is most familiar is the sample mean, which is computed by summing all of the values and dividing by the sample size. For the sample of diastolic blood pressures in the table above, the sample mean is the sum of the ten values divided by 10.
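As a minimal sketch of that calculation in Python: the ten diastolic values below are hypothetical stand-ins, not the values from the table, and the function name `sample_mean` is our own.

```python
# Hypothetical diastolic blood pressures (mm Hg) for ten participants.
# These are illustrative values only, not the data from the table.
diastolic = [62, 64, 67, 70, 71, 72, 75, 76, 81, 81]


def sample_mean(values):
    """Sum all of the values and divide by the sample size."""
    return sum(values) / len(values)


x_bar = sample_mean(diastolic)
print(f"n = {len(diastolic)}, X bar = {x_bar:.1f}")  # n = 10, X bar = 71.9
```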
To simplify the formulas for sample statistics (and for population parameters), we usually denote the variable of interest as "X". X is simply a placeholder for the variable being analyzed. The X with the bar over it represents the sample mean, and it is read as "X bar". When reporting summary statistics for a continuous variable, the convention is to report one more decimal place than the number of decimal places measured. Systolic and diastolic blood pressures, total serum cholesterol and weight were measured to the nearest integer; therefore, the summary statistics are reported to the nearest tenths place.
Height was measured to the nearest quarter inch (hundredths place); therefore, the summary statistics are reported to the nearest thousandths place. Body mass index was computed to the nearest tenths place, so its summary statistics are reported to the nearest hundredths place.
A second measure of a typical value is the sample median. When there is an odd number of observations in the sample, the median is the value that holds as many values above it as below it in the ordered data set.
When there is an even number of observations in the sample, the median is the mean of the two middle values in the ordered data set. Half of the diastolic blood pressures are above 71 and half are below. In this case, the sample mean and the sample median are very similar. The mean and median provide different information about the average value of a continuous variable. Suppose instead that the sample of 10 diastolic blood pressures included one extreme value.
The extreme value affects the computation of the mean, whereas the median is unaffected by extreme or outlying values. For this reason, the median is preferred over the mean when there are extreme values (either very small or very large values relative to the others).
When there are no extreme values, the mean is the preferred measure of a typical value, in part because each observation is considered in the computation of the mean. When there are no extreme values in a sample, the mean and median of the sample will be close in value. Below we provide a more formal method to determine when values are extreme and thus when the median should be used. If the mean and median are very different, it suggests that there are outliers affecting the mean.
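The effect of a single extreme value can be sketched in Python with the standard `statistics` module. Both samples below are hypothetical; the second simply replaces the largest value of the first with an extreme reading of 140.

```python
from statistics import mean, median

# Hypothetical diastolic blood pressures (illustrative values only).
typical = [62, 64, 67, 70, 71, 72, 75, 76, 81, 81]

# Same sample, but the last value is replaced by one extreme reading.
with_outlier = typical[:-1] + [140]

for label, sample in (("typical", typical), ("with outlier", with_outlier)):
    print(f"{label:>12}: mean = {mean(sample):.1f}, median = {median(sample):.1f}")

# A gap between mean and median hints that outliers are pulling the mean.
gap = mean(with_outlier) - median(with_outlier)
print(f"mean minus median with the outlier: {gap:.1f}")
```

In this sketch the median is the same for both samples, while the mean shifts upward by several units once the extreme value is included.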
A third measure of a "typical" value for a continuous variable is the mode, which is defined as the most frequent value. In Table 8 above the mode of the diastolic blood pressures is 81.
A tally plot is a kind of frequency graph that you can sketch in a notebook.
The tally plot in the preceding figure shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle. So here the mean and standard deviation would be good summary values to represent the data.
The first bin, labelled 18, contains values up to 18; there are two such values in the dataset, one of them 17. Note that the same bins were used for the second dataset. Both samples had the same range. The data in the second sample are clearly not normally distributed. For these data the median and inter-quartile range would be appropriate summary statistics.
A histogram is like a bar chart. The bars represent the frequency of values in the data sample that correspond to various size classes (bins).
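The binning step behind a tally plot or histogram can be sketched in Python. The sample, the bin width of 3 and the helper name `bin_label` are all assumptions made for illustration; only the idea of labelling each bin by its upper bound, starting at 18, comes from the text.

```python
import math
from collections import Counter

# Hypothetical sample (illustrative values only).
values = [17, 18, 21, 22, 22, 24, 25, 25, 26, 27, 29, 30, 33]

first_upper = 18  # upper bound (and label) of the first bin
bin_width = 3     # assumed width of every later bin


def bin_label(value, first_upper, width):
    """Return the upper bound of the bin that holds `value`."""
    if value <= first_upper:
        return first_upper
    steps = math.ceil((value - first_upper) / width)
    return first_upper + steps * width


# Count how many values fall in each bin.
tally = Counter(bin_label(v, first_upper, bin_width) for v in values)

# Print a text tally plot, one row per bin, including empty bins.
for upper in range(first_upper, max(tally) + 1, bin_width):
    print(f"{upper:>3} | {'x' * tally[upper]}")
```

Each row of the printout is one bin, so turning the tally on its side gives the bars of a histogram.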
Generally the bars are drawn without gaps between them to highlight the fact that the x-axis represents a continuous variable. There is little difference between a tally plot and a histogram, but the latter can be produced easily using a computer (you can sketch one in a notebook too). To make a histogram you follow the same general procedure as for a tally plot, with a few subtle differences.
You can draw a histogram by hand or use your spreadsheet. The following histograms were drawn using the same data as for the tally plots in the preceding section. The first histogram shows normally distributed data. In both these examples the bars are shown with a small gap; more properly, the bars should be touching. The x-axis shows the size classes as a range under each bar.
You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes.
Visualizing the shape of your data samples is usually your main goal. However, it is possible to characterize the shape of a data distribution using shape statistics. There are two, skewness and kurtosis, which are used in conjunction with each other. The skewness of a sample is a measure of how central the average is in relation to the overall spread of values.
The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s. A positive value indicates a longer tail towards the larger values; a negative value indicates the opposite. The larger the magnitude of the value, the more skewed the sample is.
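The exact formula is not reproduced in this text. As a hedged sketch, one common form that uses only n and s is the sum of cubed deviations from the mean divided by (n - 1) times s cubed; treat that choice, and the function name `skewness`, as assumptions rather than the author's definition. The three samples are hypothetical.

```python
from statistics import mean, stdev


def skewness(values):
    """Assumed form: sum((x - mean)^3) / ((n - 1) * s^3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return sum((x - x_bar) ** 3 for x in values) / ((n - 1) * s ** 3)


# Hypothetical samples (illustrative values only).
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 2, 2, 3, 10]
long_left_tail = [1, 8, 9, 9, 10, 10]

for label, sample in (
    ("symmetric", symmetric),
    ("long right tail", long_right_tail),
    ("long left tail", long_left_tail),
):
    print(f"{label:>15}: skewness = {skewness(sample):+.2f}")
```

Other software may use slightly different corrections for sample size, so the numbers can differ a little while the sign keeps the same meaning.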
The kurtosis of a sample is a measure of how pointed the distribution is (see drawing the distribution). It is also a way to think about how clustered the values are around the middle. The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s.
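The kurtosis formula is likewise not shown here. A minimal sketch, assuming the common "excess" form (the sum of fourth-power deviations divided by (n - 1) times s to the fourth power, minus 3) so that a normal distribution scores near zero; the formula choice, the function name `kurtosis` and both samples are assumptions for illustration.

```python
from statistics import mean, stdev


def kurtosis(values):
    """Assumed form: sum((x - mean)^4) / ((n - 1) * s^4) - 3."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return sum((x - x_bar) ** 4 for x in values) / ((n - 1) * s ** 4) - 3


# Hypothetical samples (illustrative values only).
pointed = [5, 5, 5, 5, 5, 5, 5, 5, 1, 9]  # most values sit on the middle
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # values spread evenly

print(f"pointed: kurtosis = {kurtosis(pointed):+.2f}")
print(f"   flat: kurtosis = {kurtosis(flat):+.2f}")
```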
A positive result indicates a pointed distribution, which will probably also have a low dispersion. A negative result indicates a flat distribution, which will probably have high dispersion. The higher the magnitude of the value, the more extreme the pointedness or flatness of the distribution.
You should always summarize a sample of data values to make them more easily understood by you and others. At the very least you need to show an average, the dispersion around that average, and the replication (how large the sample is). The shape of the data (its distribution) is also important because the shape determines which summary statistics are most appropriate to describe the sample.
Your data may or may not be normally distributed. The shape of the data also leads you towards the most appropriate ways of analyzing the data, that is, which statistical tests you can use.
For data that are approximately normally distributed, about 68% of values lie within one standard deviation of the average. Similarly, approximately 95% lie within two standard deviations. Nearly all (99.7%) lie within three standard deviations.
In Module 3, we use Excel to summarize the data in the polling station list.
On the chart we have shaded the area to show what data is within three standard deviations. The standard deviation gives us a standardized way of knowing what is normal, what is extra large or what is extra small. We know that Fran the Fox is short.
Similar to standard deviation, variance measures how tightly or loosely numbers are spread around the average.
So, a larger variance means data is spread further out from the average, and a smaller variance means they are more tightly grouped around the average. The variance is the average of the squared differences (or deviations) of each number from the average (the mathematical formula is at the end of this note). If you want to perform your own calculations, here is the heights dataset. The data along with some calculations are available as an Excel file or an Open Spreadsheets file.
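The definition of variance can be sketched directly in Python. The values below are hypothetical and are not the heights dataset; the function name and its `sample` flag are our own. With `sample=False` the squared deviations are averaged over N (population variance); with `sample=True` they are divided by N-1 (sample variance).

```python
from math import sqrt
from statistics import pvariance, variance

# Hypothetical values (illustrative only, not the heights dataset).
values = [2, 4, 4, 4, 5, 5, 7, 9]


def variance_from_definition(data, sample=False):
    """Average of squared deviations; divide by N-1 for a sample."""
    n = len(data)
    average = sum(data) / n
    squared_deviations = [(x - average) ** 2 for x in data]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


pop_var = variance_from_definition(values)
samp_var = variance_from_definition(values, sample=True)

# The standard deviation is the square root of the variance.
print(f"population: variance = {pop_var:.3f}, sd = {sqrt(pop_var):.3f}")
print(f"    sample: variance = {samp_var:.3f}, sd = {sqrt(samp_var):.3f}")

# Cross-check against Python's standard library.
print(pvariance(values), variance(values))
```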
The formula for the sample standard deviation looks complicated, but the important change is to divide by N-1 instead of N when calculating a sample variance. Remember that the standard deviation is just the square root of the variance, so the formula for calculating the variance is the same formula without the square root.
To find the median, if you prefer, you can just count in from both ends of the ordered list until you meet in the middle.
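Counting in from both ends is equivalent to sorting the values and taking the middle position. A minimal Python sketch follows, with hypothetical values and a function name of our own choosing; for an even number of values it takes the mean of the two middle values.

```python
def median_of(values):
    """Middle value of the ordered list (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[mid]
    # Even count: average the two values either side of the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical values (illustrative only).
odd_sample = [7, 3, 9, 1, 5]
even_sample = [7, 3, 9, 1]

print(median_of(odd_sample))   # 5
print(median_of(even_sample))  # 5.0
```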
The mode is the number that is repeated more often than any other number. In the series of values 2, 3, 4, 5, 4, 4, 6, 10, 12, the mode is 4.
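The same series can be checked in Python by counting how often each value occurs. This is a small sketch using `collections.Counter`; it returns a list so that ties (more than one most frequent value) are reported rather than hidden.

```python
from collections import Counter

# The series of values used in the text.
values = [2, 3, 4, 5, 4, 4, 6, 10, 12]

# Count occurrences of each value, then keep the most frequent one(s).
counts = Counter(values)
highest_count = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest_count)

print(modes)  # [4]
```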
We are skipping the calculation for the standard deviation for this module, because we want to focus on it as a concept and not get caught up in the formula. The formulas for the standard deviation and variance are at the end of this module for those who may want them.
The mean is the fair share measure. The mean is also called the balancing point of a distribution. If we measure the distance between each data point and the mean, the distances are balanced on each side of the mean. The median is the physical center of the data when we make an ordered list. It has the same number of values above it as below it. We need to use a graph to determine the shape of the distribution.
By looking at the shape, we can determine which measure of center best describes the data. Use the mean as a measure of center only for distributions that are reasonably symmetric with a central peak.