Distribution of the variables with Generate series function in SQL

When you explore new data, is very important to look at the distribution of the variables in order to find outliers, errors, weird 9999 data, and so on. You can do that using graphs as a histogram, but if you to extract data into bins from a relational database, you must use some function such as genarate_series.

In this example, we want to summarize the distribution of the number of questions with the tag “dropbox” on Stack Overflow, per day, by binning the data.

The generate_series function in SQL has this syntax:

generate_series(from, to, step)

We had this table bellow:

Sample of StackOverflow table with questions asked on Stack Overflow with certain tags

idtagdatequestion_countquestion_pctunanswered_countunanswered_pct
1paypal2018-09-25180500.00109375784020.001751857
2amazon-elb2018-09-2514520.0000885610.000116972
3amazon-mws2018-09-257060.00004282780.000058
Stack Overflow table sample

So we want to summarize the distribution of the number of questions with the tag “dropbox” on Stack Overflow per day by binning the data.

Let’s see the steps to do that:

Create CTE Bins table to aggregate the questions and compare question_count to lower and upper

WITH bins AS (

SELECT generate_series(2200, 3050, 50) AS lower,

generate_series(2250, 3100, 50) AS upper),

— Subset stackoverflow to just tag “dropbox

dropbox AS (

SELECT question_count

FROM stackoverflow

WHERE tag=’dropbox’)

— Select columns for results and

Summarize question_count

SELECT lower, upper, count(question_count)

FROM bins

Join to dropbox we also created above, keeping all rows from the bins table in the join

LEFT JOIN dropbox

— Compare question_count to lower and upper

ON question_count >= lower

AND question_count < upper

— Group by lower and upper to count values in each bin

GROUP BY lower, upper

— Order by lower to put bins in order

ORDER BY lower;

This will be the result:

loweruppercount
220022500
225023000
2300235022
2350240039
2400245054
2450250053
2500255045
2550260041
2600265046
2650270057
2700275044
2750280050
2800285062
2850290061
29002950108
29503000159
3000305098
3050310044
Table resulting from the query above

We can notice there are 159 days that have between 2950 and 3000 questions with the “Dropbox” tag. Understanding the distributions of the variables is one of the super important tasks in an exploratory analysis.

References

Numeric Data Types and Summary Functions. Course EXPLORATORY DATA ANALYSIS IN SQL by Christina Maimone (Data Scientist). DataCamp material.

Leave a Reply

Your email address will not be published. Required fields are marked *