#102 Let Your Data Do the Talking – Start Here!

  • Writer: Ashish J. Edward
  • Sep 6, 2023
  • 13 min read

Updated: Oct 11, 2024

Descriptive statistics might sound like a mouthful, but we're stripping away the fluff to give you the essentials in plain talk. These are the nuts and bolts that make sense of the data chaos.



Measures of Central Tendency


Think of these as the anchors of your data. First up, the mean (average) – it's like adding up all the numbers and then dividing by how many there are. The median (middle value) – it's the number that sits right in the middle when all your values are lined up. The mode (most common value) – it's like the MVP, showing up more times than anyone else. These measures give you an idea of what's typical and what's raising eyebrows.


Measures of Dispersion


Now, imagine your data as a gang of friends. The range is like measuring how far they're spread out from each other, from the introvert at the corner to the life of the party. The standard deviation – it's a bit like the GPS of your data, telling you the average distance of each value from the group's mean. It's all about knowing if your gang's tight-knit or more like scattered stars.


Frequency Distributions


This is all about counting heads in the data crowd. How often does each value show up? It's like spotting the regulars and the newcomers at a party. Some values might be the life of the party, showing up often, while others might be wallflowers, hardly making an appearance.


These aren't just tools for math whizzes. They're your secret weapons to decode data, your compass to navigate through numbers. They're here to untangle the mess, making data's message loud and clear. Descriptive statistics isn't about fancy formulas; it's about revealing the stories hidden behind the digits. So let's dive deeper into descriptive statistics and cover each of these points.


Measures of Central Tendency


Mean (Average) : This is the classic go-to. Imagine all your data values meeting up for a party – the mean is the sum of everyone's contribution divided by the number of guests. It's like the average score of a basketball team's points in a season. But beware, outliers can be party crashers. If your team scores 90, 100, 110, and 2 points in four games, the mean is 302 / 4 = 75.5, well below three of the four scores, because that 2-point game skews things. So, use the mean when you've got a well-behaved dataset.
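To see the party crasher in action, here is a minimal sketch using Python's standard `statistics` module. The four scores come from the example above; the variable names are just illustrative.

```python
from statistics import mean

# The four game scores from the basketball example
scores = [90, 100, 110, 2]

# Mean of all four games: (90 + 100 + 110 + 2) / 4
mean_all = mean(scores)

# Mean with the 2-point outlier left out, for comparison
mean_without_outlier = mean(scores[:-1])

print(mean_all)              # 75.5
print(mean_without_outlier)  # 100
```

One low game pulls the mean down by almost 25 points, which is why the mean works best on data without extreme values.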


Properties of Mean

  • Dependent on the magnitude of the data points, but it's independent of the position of the data points.

  • Sensitive to variation.

  • Used when we need to measure the central tendency that should reflect the total of the scores.

  • It is also used to calculate some values of variation like standard deviation & variance.

  • Enjoys participation from all data points. Extreme values influence the mean.


Median (Middle Value) : Here's where things get sorted. If you line up all your data from smallest to largest, the median is the value right in the middle. If your family earns $30,000, $40,000, $45,000, $100,000, and $2,000,000 a year, the median gives you $45,000. That way, the millionaire uncle doesn't throw the average off. Median comes to the rescue when your data has some wild rides. Median is the “positional average” of a data set, sitting at the (N+1)/2th position of the sorted data. If the number of data points (N) is odd, that position is a whole number and the middle value is the Median (in data set 2,3,4,5,6, N = 5, the position is 3 and the Median is 4). If N is even, the position falls halfway between the two middle values, so the Median is their average (in data set 2,3,4,5,6,7, N = 6, the position is 3.5 and the Median is (4+5)/2 = 4.5).
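The family-income example and the odd/even cases can be checked with a short sketch, again assuming Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Family incomes from the example above
incomes = [30_000, 40_000, 45_000, 100_000, 2_000_000]

income_median = median(incomes)  # middle value of the sorted list
income_mean = mean(incomes)      # pulled upward by the $2,000,000 income

# Odd and even cases from the text
median_odd = median([2, 3, 4, 5, 6])      # single middle value
median_even = median([2, 3, 4, 5, 6, 7])  # average of the two middle values

print(income_median, income_mean)  # 45000 443000
print(median_odd, median_even)     # 4 4.5
```

The mean of $443,000 describes nobody in the family, while the median of $45,000 stays with the typical earner.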


Properties of Median

  • It’s a positional value, hence is independent of the magnitude of the data.

  • Far less sensitive than the mean to variation at the extremes of the data.

  • Best measure of central tendency when the data is skewed as it is not affected by extreme values.

  • Does not represent all data points but represents the central location of the data array.


Mode (Most Common Value) : Think popularity contest. The mode is the value that shows up the most in your data. If you ask a classroom their favourite ice cream flavor, and chocolate gets the most votes, chocolate is your mode. This one's handy when you're looking for what's the crowd favourite.


Mode is the data point that has the maximum frequency in any dataset. A dataset can be uni-modal, bi-modal or multimodal i.e. data can have more than one mode.


2,4,4,6,6,8,8,8,10. Mode is 8 – Unimodal

2,4,4,6,8,8,10,11. Two Modes – 4 & 8 – Bi-modal

2,4,4,6,6,8,8,10. Three modes – 4,6 and 8 – Multi-modal.
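The three data sets above can be run through `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. This is only a quick sketch; the variable names are illustrative.

```python
from statistics import multimode

# The three data sets from the examples above
unimodal = [2, 4, 4, 6, 6, 8, 8, 8, 10]
bimodal = [2, 4, 4, 6, 8, 8, 10, 11]
multimodal = [2, 4, 4, 6, 6, 8, 8, 10]

# multimode returns a list of all values sharing the top frequency
modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)
modes_multi = multimode(multimodal)

print(modes_uni)    # [8]
print(modes_bi)     # [4, 8]
print(modes_multi)  # [4, 6, 8]
```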


When to use Mean, Median & Mode?


So, when do you throw each of these into the ring? Let's break it down:


  • Use the Mean when your data is fairly balanced, no big surprises. It's like calculating the average temperature for a week – each day's temp has its say, no matter how hot or cold.

  • Use the Median when there are wild outliers messing with the party. If you're finding the middle income in a town – the median won't get thrown off by a random billionaire moving in.

  • Use the Mode when you're searching for the star of the show, the most frequent value. If you want to know the most popular pet in a neighborhood – if everyone's got a dog, dogs are the mode.


In a nutshell, these measures of central tendency aren't just numbers; they're your data's heartbeat. The mean, median, and mode are like different lenses through which you can understand your data's rhythm. When your data throws a party, the mean calculates the average fun, the median makes sure nobody's getting too wild, and the mode points out who's hogging the spotlight :). So, whether you're a student, a researcher, or a business owner, remember that understanding when to use each of these measures is like having a roadmap to navigate through your data adventures.


Visual representation of Mean, Median & Mode


The best visual representation of mean, median, and mode is through a histogram.

A histogram is a bar graph showing the distribution of a dataset. Each bar represents a value range, and the bar height indicates frequency.


Measures of Dispersion


Time to cut through the data fog and get down to business – we're diving into measures of dispersion.


Range : If you're looking at temperatures in a week and the highest is 90°F while the lowest is 50°F, your range is 40°F. Simple as that. It's your quick gauge of data spread. If you're looking at the ages of concert-goers and the youngest is 20 while the oldest is 60, your range is 40 years. Easy peasy. Range = Highest Value – Lowest Value
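As a tiny sketch of the formula, assume a hypothetical week of temperatures in which only the 50°F low and the 90°F high come from the example; the other readings are made up for illustration.

```python
# Hypothetical daily highs in °F; only 50 and 90 come from the example
temps_f = [50, 62, 71, 90, 85, 77, 68]

# Range = Highest Value - Lowest Value
temp_range = max(temps_f) - min(temps_f)

print(temp_range)  # 40
```

Note that the range depends only on the two extreme readings, so the five values in between have no effect on it.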


Interquartile Range (IQR) : This is a measure of statistical dispersion that describes the range within which the middle 50% of the data values lie. It is the difference between the third quartile (Q3) and the first quartile (Q1). IQR is useful for identifying outliers and understanding the variability of a dataset.

IQR = Q3 – Q1, where Q3 is the third quartile (the 75th percentile) & Q1 is the first quartile (the 25th percentile)


Data set (odd number of data points) = 2,6,9,12,18,19,27,15,7,5,1 ; Sort data in ascending order = 1,2,5,6,7,9,12,15,18,19,27 ; Median = 9 (the middle value, 6th of 11)

Q1 = median of the lower half (1,2,5,6,7) = 5 ; Q3 = median of the upper half (12,15,18,19,27) = 18

IQR = Q3 – Q1 = 18 – 5 = 13 ; let's play around with the data set by using an even number of data points. The two examples that follow show how to calculate the IQR in that case; each half is simply the lower or upper half of the sorted data.


Data set (even number of data points) = 1,2,5,6,7,9,12,15,18,19 ; Q1 = (1,2,5,6,7) = 5 ; Q3 = (9,12,15,18,19) = 15 ; IQR = 15- 5 = 10

Data set (even number of data points) = 1,2,5,6,7,9,12,15 ; Q1 = (1,2,5,6) = (2+5)/2 = 3.5 ; Q3 = (7,9,12,15) = (9+12)/2 = 10.5 ; IQR = 7
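The three IQR examples above all use the same recipe: sort, split into a lower and an upper half (dropping the middle value when the count is odd), and take the median of each half. Here is a sketch of that recipe; `quartiles_by_halves` is a hypothetical helper name, not a library function. Keep in mind that spreadsheet and library quartile functions often interpolate between values instead, so they may report slightly different quartiles for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the number of values is odd, the middle value belongs to
    neither half, matching the worked examples in the text.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]    # smallest half of the sorted data
    upper = data[-half:]   # largest half of the sorted data
    return median(lower), median(upper)

# The three data sets from the examples above
q1_a, q3_a = quartiles_by_halves([2, 6, 9, 12, 18, 19, 27, 15, 7, 5, 1])
q1_b, q3_b = quartiles_by_halves([1, 2, 5, 6, 7, 9, 12, 15, 18, 19])
q1_c, q3_c = quartiles_by_halves([1, 2, 5, 6, 7, 9, 12, 15])

print(q3_a - q1_a)  # 13
print(q3_b - q1_b)  # 10
print(q3_c - q1_c)  # 7.0
```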


Variance : This is like the big sibling of standard deviation. It's the average of the squared differences from the mean. Because the differences are squared, variance is measured in squared units: if your team's basketball scores have a variance of 25, the standard deviation is √25 = 5, meaning the scores typically sit about 5 points away from the mean. Squaring also makes large deviations count more heavily than small ones.

Variance (σ²) for a Population = ∑ (xᵢ - μ)² / N ; xᵢ is each individual data point ; μ is the Population Mean ; N is the number of data points in the population.

Variance (s²) for a Sample = ∑ (xᵢ - x̄)² / (n - 1) ; xᵢ is each individual data point ; x̄ is the Sample Mean ; n is the number of data points in the sample.


Standard Deviation: Think of this as your data's compass. It tells you how far a typical data point wanders away from the mean (average). Imagine you're assessing test scores. If the mean is 75 and a score is 90, that's a deviation of 15; the standard deviation sums up all such deviations in a single number, in the same units as the data. A larger standard deviation means your data is all over the map.

Standard Deviation = √Variance
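To make the two variance formulas concrete, here is a sketch that computes them by hand and then cross-checks against Python's `statistics` module. The four test scores are hypothetical, chosen so the mean is 75.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical test scores with a mean of 75
scores = [60, 75, 90, 75]

n = len(scores)
mean_score = sum(scores) / n

# Sum of squared differences from the mean
squared_diffs = sum((x - mean_score) ** 2 for x in scores)

pop_variance = squared_diffs / n           # population formula: divide by N
sample_variance = squared_diffs / (n - 1)  # sample formula: divide by n - 1

pop_sd = sqrt(pop_variance)        # Standard Deviation = sqrt(Variance)
sample_sd = sqrt(sample_variance)

print(pop_variance, sample_variance)  # 112.5 150.0

# The standard library gives the same numbers
assert pop_variance == pvariance(scores)
assert sample_variance == variance(scores)
assert abs(pop_sd - pstdev(scores)) < 1e-9
assert abs(sample_sd - stdev(scores)) < 1e-9
```

Dividing by n - 1 instead of N always gives a slightly larger figure; the gap shrinks as the data set grows.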


Now, let's get tactical – when to use these tools:

  • Use the Range when you want a quick peek at how spread out your data is. It's like measuring the span of your friend group's ages – the oldest and youngest in the group.

  • Use the Standard Deviation when you're seeking the nitty-gritty of individual data points' deviation from the mean. If you're checking out the consistency of a baker's dozen of cupcake sizes, standard deviation comes in handy.

  • Use the Variance when you're all in for a deep dive into data spread. Variance goes the extra mile by considering squared differences, painting a detailed picture of your data's variability.


In a nutshell, these measures of dispersion aren't just stats jargon; they're your data's storytellers. They reveal the spread, unveiling patterns beyond averages. Whether you're a student, a researcher, or a curious mind, mastering when to employ these tools is like having a secret map to decode the mysteries hidden in your data.


Example : Practical usage of Measures of Dispersion


Say a company aims to gauge the overall well-being and happiness of its employees to identify areas for improvement in the workplace environment. The company asks each employee to rate their level of happiness at work on a scale of 1 to 100. The scores collected are: 70, 85, 90, 60, 75, 80, 95, 50. Based on these scores, what insights can the company gain about its workplace environment, and what steps could be taken to make improvements?


In this case, statistics can help the company understand the spread and central tendencies of employee happiness scores. By calculating the average (mean), the company can get a general idea of how happy employees are. The range and standard deviation can show how much the scores vary, indicating whether most employees feel similarly or there's a wide disparity in happiness levels.


Range = Highest Score - Lowest Score = 95 – 50 = 45

Mean = (70 + 85 + 90 + 60 + 75 + 80 + 95 + 50) / 8 = 605 / 8 = 75.625

Variance (population formula, since every employee was surveyed) = [(70-75.625)² + (85-75.625)² + ... + (50-75.625)²] / 8 = 1621.875 / 8 = 202.73 (approx.)

Standard Deviation = √Variance = √202.734375 = 14.24 (approx.)

Interquartile Range (IQR) : Sorted Data: 50, 60, 70, 75, 80, 85, 90, 95 ; Q1 = (50,60,70,75) = (60+70)/2 = 65 ; Q3 = (80,85,90,95) = (85+90)/2 = 87.5 ; IQR = 87.5 - 65 = 22.5


Interpretation and Decision-Making Based on the Data :


Range (45): The wide range indicates a significant disparity in employee satisfaction. HR may need to investigate the extreme low and high scores to understand the reasons behind them.

Variance (202.73): A high variance suggests that the scores are spread out from the mean. This could mean that different departments or roles have varying levels of satisfaction.

Standard Deviation (14.24): A typical score sits about 14 points away from the mean of 75.625. HR should aim to reduce this by implementing programs that target the specific needs of less satisfied employees.

IQR (22.5): The IQR suggests that the middle 50% of scores are moderately spread out. Programs targeting this group could be more generalized in nature.


Based on these findings, HR could decide to implement targeted engagement programs for specific groups. They could also conduct follow-up surveys to measure the effectiveness of these programs and aim to reduce the standard deviation over time.
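The measures for the happiness survey can be re-computed from the eight scores with a short sketch. It assumes the population formulas (dividing by N, since every employee was surveyed) and quartiles taken as the medians of the lower and upper halves, as in the earlier IQR examples; tools that interpolate quartiles may report slightly different values.

```python
from statistics import mean, median, pstdev, pvariance

# Happiness scores from the example
scores = [70, 85, 90, 60, 75, 80, 95, 50]

data = sorted(scores)
half = len(data) // 2
q1 = median(data[:half])   # median of the lower half
q3 = median(data[-half:])  # median of the upper half

summary = {
    "range": max(data) - min(data),
    "mean": mean(data),
    "variance": pvariance(data),  # population formula: divide by N
    "std_dev": pstdev(data),      # square root of the population variance
    "iqr": q3 - q1,
}

# Print each measure rounded to two decimals
for name, value in summary.items():
    print(f"{name:<9}{value:.2f}")
```

Swapping `pvariance` and `pstdev` for `variance` and `stdev` would treat the eight scores as a sample of a larger workforce instead.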


So, these are not just theoretical concepts for the books; they have real relevance in the real world, my friend.


Frequency Distribution


A frequency distribution is like a pattern that shows you how often each possible value appears in your data – it's all about counting and seeing the trends. You use graphs and tables to lay it out – it's like a visual map of how often things show up at different points.


1. Ungrouped Frequency Distributions: Think of this like counting how many times each specific value appears in your data. For instance, if you're tracking the favourite colours of a group of people and you find 10 people like blue, 8 like red, and 5 like green – that's an ungrouped frequency distribution for the colours.


The steps to create an ungrouped frequency table are as below:


  • Set Up Your Table: Picture a table with two columns. The first one gets the variable's name, and the second is labelled "Frequency." For each value, you're going to give it a row in this table.

  • Fill in Values: Put your data values in the first column. If it's numbers, no worries, just drop them in. If it's something like ratings or categories, don't stress about the order.

  • Count the Frequencies: Now comes the fun part – count how many times each value shows up in your data. That's your frequency! Put the frequency in the second column.
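The three steps above map neatly onto `collections.Counter` from Python's standard library. In this sketch the list of responses is rebuilt from the favourite-colour counts in the example (10 blue, 8 red, 5 green); in practice it would be your raw data.

```python
from collections import Counter

# Raw responses rebuilt from the favourite-colour example
responses = ["blue"] * 10 + ["red"] * 8 + ["green"] * 5

# Counter tallies how many times each value appears
frequency = Counter(responses)

# Two-column table: value and frequency, most frequent first
print(f"{'Colour':<8}Frequency")
for colour, count in frequency.most_common():
    print(f"{colour:<8}{count}")
```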


2. Grouped Frequency Distributions: Imagine you have a lot of data points, and instead of counting each individual value, you group them into ranges. For example, if you're measuring the ages of participants in a marathon, and you group them into age ranges like 20-29, 30-39, and so on, that's a grouped frequency distribution. The steps to create one are as below:


  • Split into Intervals: Imagine taking a bunch of numbers and dividing them into groups called "class intervals." This helps organize your data buddies into manageable packs.

  • Calculate Range: Find the highest and lowest values in your data and subtract them. This gives you the range – the full spread of your data.

  • Pick Interval Width: There's no fixed rule, but a simple formula can help. You divide the range by the square root of your sample size. It's like finding a good chunk to group your data.

  • Define Intervals: For each interval, you set a lower limit and an upper limit. Think of it as setting up ranges. The first interval starts with your lowest value and goes up by the interval width.

  • Build Your Table: Picture a table with two columns – one for the variable name and one for "Frequency." Each row represents an interval.

  • Fill Frequencies: Put the frequency, or how many numbers are in each interval, in the "Frequency" column next to the interval.
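The steps above can be sketched as follows. Several details here are assumptions made for illustration rather than fixed rules: the ages are hypothetical, the interval width is rounded up to a whole number, each interval includes its lower limit but not its upper limit, and intervals are added until the highest value is covered. `grouped_frequency` is a hypothetical helper name.

```python
import math

def grouped_frequency(values):
    """Group numeric values into class intervals of equal width.

    Width follows the rule of thumb from the text: range divided by
    the square root of the sample size (rounded up here).
    Returns a dict mapping (lower, upper) to the count of values v
    with lower <= v < upper.
    """
    data = sorted(values)
    low, high = data[0], data[-1]
    width = math.ceil((high - low) / math.sqrt(len(data))) or 1
    table = {}
    lower = low
    while lower <= high:  # keep adding intervals until the max is covered
        upper = lower + width
        table[(lower, upper)] = sum(1 for v in data if lower <= v < upper)
        lower = upper
    return table

# Hypothetical marathon participant ages
ages = [22, 25, 27, 31, 34, 35, 38, 41, 44, 47, 52, 55, 58, 61, 63, 66]
age_table = grouped_frequency(ages)

for (lower, upper), count in age_table.items():
    print(f"{lower}-{upper - 1:<6}{count}")
```

With 16 ages ranging from 22 to 66, the width works out to 44 / 4 = 11, giving the intervals 22-32, 33-43, 44-54, 55-65 and 66-76.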


When to Use a Grouped Frequency Distribution


Use a grouped frequency distribution when you're dealing with a large data set and/or when the data is spread out over a broad range. Grouping helps you see patterns more clearly: it simplifies the analysis by reducing the noise and focuses on the bigger picture by showing trends and patterns.



3. Relative Frequency Table: A relative frequency table helps you understand the proportion of each value compared to the whole. It's like showing how much each piece contributes to the entire pie. The steps to make one are as below:


  • Set Up Your Table: Picture a table with two columns – one for the variable's name and another for "Relative Frequency."

  • Calculate Relative Frequencies: For each value, calculate its frequency (how often it appears), and then divide it by the total number of values. This gives you the relative frequency.

  • Fill in the Table: In the "Relative Frequency" column, put the calculated relative frequencies next to their corresponding values.

Example: Let's say you're counting the number of fruits people like: apples, bananas, and oranges. If 10 people like apples, 15 like bananas, and 5 like oranges, the total is 30. For apples, the relative frequency is 10/30, which is about 0.33 or 33%. Similarly, for bananas, it's 15/30 or 50%, and for oranges, it's 5/30 or about 16.67%.
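The fruit example translates into a few lines of Python. This is a minimal sketch using the counts from the example; the variable names are illustrative.

```python
# Frequencies from the fruit example
counts = {"apples": 10, "bananas": 15, "oranges": 5}

# Total number of responses
total = sum(counts.values())

# Relative frequency = frequency / total
relative = {fruit: n / total for fruit, n in counts.items()}

print(f"{'Fruit':<10}Relative Frequency")
for fruit, share in relative.items():
    print(f"{fruit:<10}{share:.2%}")
```

The relative frequencies always add up to 1 (or 100%), which is a handy check that no value was missed.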


When to Use a Relative Frequency Table


When you want to understand the proportion or percentage of each value compared to the whole. They're handy for making comparisons and spotting trends. For instance, if you're looking at survey responses about movie preferences, a relative frequency table could show you the percentage of people who like action, romance, comedy, etc. It's all about putting values into perspective!


4. Cumulative Frequency Table: A cumulative frequency table shows you the running total of frequencies as you move through your data. It's like building up the numbers as you go along. The steps are as below:


  • Set Up Your Table: Picture a table with two columns – one for the variable's name and another for "Cumulative Frequency."

  • Start from the Lowest: In the first row, put the lowest value of your data in the variable column.

  • Calculate Cumulative Frequencies: For each value, calculate its frequency (how often it appears), and add it to the cumulative frequency of the previous row (gives you the running total).

  • Fill in the Table: Put the calculated cumulative frequencies in the "Cumulative Frequency" column next to their corresponding values.
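The steps above can be sketched with `collections.Counter` for the frequencies and `itertools.accumulate` for the running total. The ratings list is hypothetical, made up for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical survey ratings on a 1-5 scale
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

frequency = Counter(ratings)
values = sorted(frequency)  # start from the lowest value

# Running total of the frequencies, in value order
cumulative = list(accumulate(frequency[v] for v in values))

print(f"{'Rating':<8}{'Frequency':<11}Cumulative Frequency")
for value, running_total in zip(values, cumulative):
    print(f"{value:<8}{frequency[value]:<11}{running_total}")
```

The last cumulative frequency always equals the total number of data points, here 10.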


When to Use a Cumulative Frequency Table


Cumulative frequency tables are useful when you want to understand the progression and distribution of data as it accumulates. If you're looking at a series of data points over time or through a sequence, a cumulative frequency table helps you see how things build up step by step. When working with ordinal data or data with an inherent order, cumulative frequency tables help you understand how many data points are ranked below or equal to a certain value.


Descriptive Statistics in MS Excel



Visualizing Descriptive Statistics in MS Excel


  • Histogram : Select the Data: Highlight the range of cells containing the data you want to analyze. ; Go to 'Insert' Tab: Click on the 'Insert' tab in the ribbon. ; Choose 'Histogram': In the 'Charts' section, click on 'Histogram'. ; Customize: Right-click on various elements of the chart to customize labels, titles, and colors.

  • Box Plot (Box and Whisker Plot) : Select the Data: Highlight the range of cells containing the data. ; Go to 'Insert' Tab: Click on the 'Insert' tab in the ribbon. ; Choose 'Box and Whisker Plot': In the 'Charts' section, click on 'Insert Statistic Chart' and then choose 'Box and Whisker'. ; Customize: You can customize the box plot by right-clicking on its elements.

  • Scatter Plot : Select the Data: Highlight the range of cells containing the data. ; Go to 'Insert' Tab: Click on the 'Insert' tab in the ribbon. ; Choose 'Scatter Plot': In the 'Charts' section, click on 'Scatter'. ; Customize: Add trendlines or modify the axis labels as needed.

  • Pie Chart : Select the Data: Highlight the range of cells containing the data. ; Go to 'Insert' Tab: Click on the 'Insert' tab in the ribbon. ; Choose 'Pie Chart': In the 'Charts' section, click on 'Pie'. ; Customize: Add labels, titles, and change colours as needed.

  • Line Graph : Select the Data: Highlight the range of cells containing the data. ; Go to 'Insert' Tab: Click on the 'Insert' tab in the ribbon. ; Choose 'Line Chart': In the 'Charts' section, click on 'Line'. ; Customize: Add axis labels, titles, and other elements as needed.


Customization Tips


Use the "Chart Elements" button (+) next to the chart to add titles, labels, and legends.

Right-click on the chart to format elements like axes, data series, and gridlines.

Use the "Design" and "Format" tabs that appear when you click on the chart for more customization options.


By visually representing your data, you can make your findings more accessible and easier to understand, which is particularly useful in business settings for decision-making.

