Descriptive statistics is a branch of statistics that deals with the analysis and summary of data. It involves the use of statistical measures to describe and summarize the characteristics of a dataset, including measures of central tendency and measures of variability. Descriptive statistics provides a powerful set of tools and techniques for researchers to better understand their data, identify patterns and trends, and draw meaningful conclusions.
Central Tendency
Measures of central tendency are used in descriptive statistics to describe the typical or central value of a dataset. The mean, also known as the arithmetic mean or average, is one such measure and is calculated by adding up all the values in the dataset and dividing by the total number of values. The median is another measure of central tendency and is the middle value in a dataset when the values are arranged in order. The mode is the value that occurs most frequently in the dataset. Measures of central tendency provide valuable information about the center or typical value of the dataset.
Variability
Measures of variability, or measures of dispersion, in descriptive statistics are used to describe the spread or dispersion of the data around the central value. One such measure is the variance, which is the average of the squared differences of each value from the mean. The standard deviation is another measure of variability and is the square root of the variance. Measures of variability provide valuable information about the range and distribution of the data.
In this article, we will focus on the Orders table of the Northwind database, which contains information about customer orders, such as the order date, customer ID, the shipped date, and the order total.
To calculate descriptive statistics for the Orders table, we will use several SQL commands, including SELECT, AVG, SUM, COUNT, MIN, MAX, and STDEV. Letโs take a closer look at each of these commands.
Measuring Central Tendency In SQL
AVG Command
The AVG command is used to calculate the average value of a column. It takes a column name as input and returns the average value of that column. Here is an example of an AVG command that calculates the average order total for all orders:
SELECT AVG(OrderTotal)
FROM Orders;
This command calculates the average value of the OrderTotal column for all rows in the Orders table.
SUM Command
The SUM command is used to calculate the sum of a column. It takes a column name as input and returns the sum of that column. Here is an example of a SUM command that calculates the total sales for all orders:
SELECT SUM(OrderTotal)
FROM Orders;
This command calculates the sum of the OrderTotal column for all rows in the Orders table.
COUNT Command
The COUNT command is used to count the number of rows in a table. It takes a column name as input and returns the number of rows that have a non-null value in that column. Here is an example of a COUNT command that counts the number of orders:
SELECT COUNT(OrderID)
FROM Orders;
This command counts the number of rows in the Orders table that have a non-null value in the OrderID column.
MIN Command
The MIN command is used to find the minimum value in a column. It takes a column name as input and returns the smallest value in that column. Here is an example of a MIN command that finds the earliest order date:
SELECT MIN(OrderDate)
FROM Orders;
This command finds the smallest value in the OrderDate column for all rows in the Orders table.
MAX Command
The MAX command is used to find the maximum value in a column. It takes a column name as input and returns the largest value in that column. Here is an example of a MAX command that finds the latest shipped date:
SELECT MAX(ShippedDate)
FROM Orders;
This command finds the largest value in the ShippedDate column for all rows in the Orders table.
Measuring Variability in SQL
SQL server provides several built-in functions to measure variability in a dataset. Variability refers to how spread out the data is around the mean or central tendency. The most commonly used measures of variability in SQL server are variance and standard deviation.
VAR Command
To calculate variance in SQL server, we can use the VAR function. The VAR function calculates the variance of a dataset by taking the sum of squared deviations of each value from the mean and dividing it by the total number of values minus one. The syntax for the VAR function is as follows:
SELECT VAR(OrderTotal) AS variance
FROM Orders;
Here, OrderTotalrefers to the column of the dataset for which we want to calculate the variance, and Orders refers to the name of the table that contains the dataset. The result of the VAR function is the variance of the dataset.
STDEV Command
The STDEV command is used to calculate the standard deviation of a column. It takes a column name as input and returns the standard deviation of that column. Here is an example of a STDEV command that calculates the standard deviation of the order totals:
SELECT STDEV(OrderTotal)
FROM Orders;
This command calculates the standard deviation of the OrderTotal column for all rows in the Orders table.
PERCENTILE_CONT Command
In addition, to these basic SQL commands, there are also other SQL commands that can be used to calculate more complex descriptive statistics, such as the median and percentiles.
The PERCENTILE_CONT command is used to calculate the continuous percentile rank of a column. It takes two arguments as input: the percentile value (between 0 and 1) and the column name. Here is an example of a PERCENTILE_CONT command that calculates the 75th percentile of the order totals:
SELECT OrderID,PERCENTILE_CONT(0.25)
WITHIN GROUP (ORDER BY Freight) OVER() AS "75th Percentile"
FROM Orders;
--
SELECT OrderID,PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY Freight) OVER() AS "75th Percentile"
FROM Orders;
--
SELECT OrderID,PERCENTILE_CONT(0.75)
WITHIN GROUP (ORDER BY Freight) OVER() AS "75th Percentile"
FROM Orders;
This command calculates the 75th percentile of the order totals in the Orders table. Note the OVER() command that calculate the percentile oer the whole column.
You can also calculate percentile grouped by, or partitioned by, OrderID as follows:
SELECT OrderID,
PERCENTILE_CONT(0.25)
WITHIN GROUP (ORDER BY Freight) OVER(PARTITION BY OrderID)
AS "25th Percentile"
FROM Orders;
--
SELECT OrderID,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY Freight) OVER(PARTITION BY OrderID)
AS "Median"
FROM Orders;
--
SELECT OrderID,
PERCENTILE_CONT(0.75)
WITHIN GROUP (ORDER BY Freight) OVER(PARTITION BY OrderID)
AS "75th Percentile"
FROM Orders;
In conclusion, descriptive statistics is an important tool for analyzing and summarizing data. SQL commands can be used to calculate basic and complex descriptive statistics for a dataset, including the mean, median, mode, variance, standard deviation, and percentiles. The Northwind database provides a useful dataset for practicing these SQL commands and developing proficiency in using descriptive statistics to analyze data.
Leave a Reply