Python is a programming language that lets you work quickly and integrate systems more effectively, and PostgreSQL is the world's most advanced open source database. The two work very well together. This article describes how to make the most of PostgreSQL and its interactive client, psql, when solving a simple problem. As tempting as it is to throw Python code at a problem, it's not always the best choice.
SQL comes with quite a lot of processing power, and integrating SQL into your workflow often means writing fewer lines of code. As Edsger Dijkstra said, lines of code are lines spent:
The practice is pervaded by the reassuring illusion that programs are just devices like any others, the only difference admitted being that their manufacture might require a new type of craftsmen, viz. programmers. From there it is only a small step to measuring "programmer productivity" in terms of "number of lines of code produced per month." This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
— Dijkstra, "On the cruelty of really teaching computing science."
By using SQL, you write less code, so you can write applications in less time.
A simple use case
To examine how PostgreSQL and Python work well together, we'll use the New York Stock Exchange (NYSE) "Daily NYSE Group Volume in NYSE Listed" dataset. To download the data, go to the Facts & Figures Interactive Viewer, click on Market Activity, then click on Daily NYSE Group Volume in NYSE Listed. Then click on the "Excel" symbol at the top to save the "factbook.xls" file to your computer; despite its name, it is actually a CSV file that uses Tab as a separator. Open it, remove the headings, and save it as "factbook.csv", then load it into a PostgreSQL table.
Loading the data set
Here's what the data looks like. The figures use commas as thousands separators and include dollar signs, so we can't readily process them as numbers.
2010 1/4/2010 1,425,504,460 4,628,115 $38,495,460,645
2010 1/5/2010 1,754,011,750 5,394,016 $43,932,043,406
2010 1/6/2010 1,655,507,953 5,494,460 $43,816,749,660
2010 1/7/2010 1,797,810,789 5,674,297 $44,104,237,184
To handle this, we can create an ad-hoc table definition that loads everything as text; once the data is loaded, it's transformed into proper SQL data types, thanks to alter table commands.
begin;

-- load everything as text first: the raw figures aren't valid numbers yet
create table factbook
 (
   year    int,
   date    date,
   shares  text,
   trades  text,
   dollars text
 );

\copy factbook from 'factbook.csv' with delimiter E'\t' null ''

-- then convert in place: strip the separators (and the leading dollar sign)
alter table factbook
   alter shares
    type bigint
   using replace(shares, ',', '')::bigint,

   alter trades
    type bigint
   using replace(trades, ',', '')::bigint,

   alter dollars
    type bigint
   using substring(replace(dollars, ',', '') from 2)::bigint;

commit;
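To see what those using expressions do, you can evaluate them against sample values from the file before committing to the alter table; a quick sketch:

select replace('1,425,504,460', ',', '')::bigint
         as shares,
       substring(replace('$38,495,460,645', ',', '') from 2)::bigint
         as dollars;
-- replace() strips the thousands separators, and
-- substring(... from 2) drops the leading dollar sign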
We can use PostgreSQL's copy functionality to stream the data from the CSV file into our table. The \copy variant is a psql-specific command that initiates client/server streaming of the data, reading a local file and sending its contents through any established PostgreSQL connection.
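For comparison, here's what both variants look like; the server-side path below is hypothetical, and the server-side form requires that the PostgreSQL server process itself can read the file:

-- server-side copy: the path is resolved on the database server,
-- which needs permission to read the file
copy factbook from '/path/on/the/server/factbook.csv'
     with delimiter E'\t' null '';

-- client-side \copy: psql reads the local file and streams it
-- through the established connection
\copy factbook from 'factbook.csv' with delimiter E'\t' null ''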
Application code and SQL
There is a lot of data in this file, so let's use the data from February 2017 for this example. The following query lists all entries in the month of February 2017:
\set start '2017-02-01'

select date,
       to_char(shares, '99G999G999G999') as shares,
       to_char(trades, '99G999G999') as trades,
       to_char(dollars, 'L99G999G999G999') as dollars
  from factbook
 where date >= date :'start'
   and date  < date :'start' + interval '1 month'
 order by date;
We use the psql application to run this query, and psql supports the use of variables. The \set command assigns the value '2017-02-01' to the variable start, and then we can reuse the variable with the expression :'start'.
Writing date :'start' is equivalent to date '2017-02-01'; this is called a decorated literal expression in PostgreSQL. It allows us to set the data type of the literal value so that the PostgreSQL query parser won't have to guess or infer it from the context.
This SQL query also uses the interval data type to compute the end of the month, which is, of course, the last day of February in our example. Adding an interval value of 1 month to the first day of the month gives us the first day of the next month, so we use the "less than" (<) strict operator to exclude that day from our result set.
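Both tricks are easy to check interactively in psql; here's a minimal sketch. Note that adding an interval to a date yields a timestamp:

\set start '2017-02-01'

select date :'start' as first_day,
       date :'start' + interval '1 month' as next_month,
       date :'start' + interval '1 month' - interval '1 day' as last_day;
-- first_day is 2017-02-01; next_month and last_day are the
-- timestamps 2017-03-01 00:00:00 and 2017-02-28 00:00:00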
The to_char() function (documented in the PostgreSQL docs section on Data Type Formatting Functions) converts a number to its text representation, with detailed control over the conversion. The format is composed of template patterns. We'll use the following patterns:
- 9: value with the specified number of digits
- L: currency symbol (uses locale)
- G: group separator (uses locale)
Other template patterns for numeric formatting are available; see the PostgreSQL docs for reference.
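For instance, here's the first dollars figure from the dataset formatted with the same pattern; the exact currency symbol and separator depend on your locale settings (lc_monetary and lc_numeric):

select to_char(38495460645, 'L99G999G999G999') as dollars;
-- with an en_US locale, this returns something like: $ 38,495,460,645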
Here's the result of our query:
date │ shares │ trades │ dollars
════════════╪═════════════════╪═════════════╪══════════════════
2017-02-01 │ 1,161,001,502 │ 5,217,859 │ $ 44,660,060,305
2017-02-02 │ 1,128,144,760 │ 4,586,343 │ $ 43,276,102,903
2017-02-03 │ 1,084,735,476 │ 4,396,485 │ $ 42,801,562,275
2017-02-06 │ 954,533,086 │ 3,817,270 │ $ 37,300,908,120
2017-02-07 │ 1,037,660,897 │ 4,220,252 │ $ 39,754,062,721
2017-02-08 │ 1,100,076,176 │ 4,410,966 │ $ 40,491,648,732
2017-02-09 │ 1,081,638,761 │ 4,462,009 │ $ 40,169,585,511
2017-02-10 │ 1,021,379,481 │ 4,028,745 │ $ 38,347,515,768
2017-02-13 │ 1,020,482,007 │ 3,963,509 │ $ 38,745,317,913
2017-02-14 │ 1,041,009,698 │ 4,299,974 │ $ 40,737,106,101
2017-02-15 │ 1,120,119,333 │ 4,424,251 │ $ 43,802,653,477
2017-02-16 │ 1,091,339,672 │ 4,461,548 │ $ 41,956,691,405
2017-02-17 │ 1,160,693,221 │ 4,132,233 │ $ 48,862,504,551
2017-02-21 │ 1,103,777,644 │ 4,323,282 │ $ 44,416,927,777
2017-02-22 │ 1,064,236,648 │ 4,169,982 │ $ 41,137,731,714
2017-02-23 │ 1,192,772,644 │ 4,839,887 │ $ 44,254,446,593
2017-02-24 │ 1,187,320,171 │ 4,656,770 │ $ 45,229,398,830
2017-02-27 │ 1,132,693,382 │ 4,243,911 │ $ 43,613,734,358
2017-02-28 │ 1,455,597,403 │ 4,789,769 │ $ 57,874,495,227
(19 rows)
The dataset has data for only 19 days in February 2017 (the days the NYSE was open). What if we want to display an entry for each calendar day and fill in the missing dates with either matching data or a zero figure?
Here's a typical Python implementation of that:
#! /usr/bin/env python3

import sys
import psycopg2
import psycopg2.extras
from calendar import Calendar

CONNSTRING = "dbname=yesql application_name=factbook"


def fetch_month_data(year, month):
    "Fetch a month of data from the database"
    date = "%d-%02d-01" % (year, month)
    sql = """
  select date, shares, trades, dollars
    from factbook
   where date >= date %s
     and date  < date %s + interval '1 month'
order by date;
"""
    pgconn = psycopg2.connect(CONNSTRING)
    curs = pgconn.cursor()
    curs.execute(sql, (date, date))

    res = {}
    for (date, shares, trades, dollars) in curs.fetchall():
        res[date] = (shares, trades, dollars)

    return res


def list_book_for_month(year, month):
    """List all days for given month, and for each
    day list fact book entry.
    """
    data = fetch_month_data(year, month)
    cal = Calendar()
    print("%12s | %12s | %12s | %12s" %
          ("day", "shares", "trades", "dollars"))
    print("%12s-+-%12s-+-%12s-+-%12s" %
          ("-" * 12, "-" * 12, "-" * 12, "-" * 12))

    for day in cal.itermonthdates(year, month):
        if day.month != month:
            continue

        if day in data:
            shares, trades, dollars = data[day]
        else:
            shares, trades, dollars = 0, 0, 0

        print("%12s | %12s | %12s | %12s" %
              (day, shares, trades, dollars))


if __name__ == '__main__':
    year = int(sys.argv[1])
    month = int(sys.argv[2])

    list_book_for_month(year, month)
In this implementation, we use the above SQL query to fetch our result set and store it in a dictionary keyed by date. We can then loop over the calendar's list of days, retrieve matching data when we have it, and fall back to a default (zeroes) when we don't.
Below is the output when running the program. As you can see, we opted for an output similar to the psql output, which makes it easier to compare the effort needed to reach the same result.
$ ./factbook-month.py 2017 2
day | shares | trades | dollars
-------------+--------------+--------------+-------------
2017-02-01 | 1161001502 | 5217859 | 44660060305
2017-02-02 | 1128144760 | 4586343 | 43276102903
2017-02-03 | 1084735476 | 4396485 | 42801562275
2017-02-04 | 0 | 0 | 0
2017-02-05 | 0 | 0 | 0
2017-02-06 | 954533086 | 3817270 | 37300908120
2017-02-07 | 1037660897 | 4220252 | 39754062721
2017-02-08 | 1100076176 | 4410966 | 40491648732
2017-02-09 | 1081638761 | 4462009 | 40169585511
2017-02-10 | 1021379481 | 4028745 | 38347515768
2017-02-11 | 0 | 0 | 0
2017-02-12 | 0 | 0 | 0
2017-02-13 | 1020482007 | 3963509 | 38745317913
2017-02-14 | 1041009698 | 4299974 | 40737106101
2017-02-15 | 1120119333 | 4424251 | 43802653477
2017-02-16 | 1091339672 | 4461548 | 41956691405
2017-02-17 | 1160693221 | 4132233 | 48862504551
2017-02-18 | 0 | 0 | 0
2017-02-19 | 0 | 0 | 0
2017-02-20 | 0 | 0 | 0
2017-02-21 | 1103777644 | 4323282 | 44416927777
2017-02-22 | 1064236648 | 4169982 | 41137731714
2017-02-23 | 1192772644 | 4839887 | 44254446593
2017-02-24 | 1187320171 | 4656770 | 45229398830
2017-02-25 | 0 | 0 | 0
2017-02-26 | 0 | 0 | 0
2017-02-27 | 1132693382 | 4243911 | 43613734358
2017-02-28 | 1455597403 | 4789769 | 57874495227
PostgreSQL advanced functions
The same thing can be accomplished with a single SQL query, without any application code "spent" on solving the problem:
  select cast(calendar.entry as date) as date,
         coalesce(shares, 0) as shares,
         coalesce(trades, 0) as trades,
         to_char(
             coalesce(dollars, 0),
             'L99G999G999G999'
         ) as dollars
    from /*
          * Generate the target month's calendar then LEFT JOIN
          * each day against the factbook dataset, so as to have
          * every day in the result set, whether or not we have a
          * book entry for the day.
          */
         generate_series(date :'start',
                         date :'start' + interval '1 month'
                                       - interval '1 day',
                         interval '1 day'
         )
         as calendar(entry)
         left join factbook
                on factbook.date = calendar.entry
order by date;
In this query, we use several basic SQL and PostgreSQL techniques that might be new to you:
- SQL accepts comments written either in the -- comment style, running from the opening to the end of the line, or C-style within a /* comment */ block. As with any programming language, comments are best used to note intentions, which otherwise might be tricky to reverse engineer from the code alone.
- generate_series() is a PostgreSQL set-returning function, for which the documentation reads: "Generate a series of values, from start to stop with a step size of step." As PostgreSQL knows its calendar, it's easy to generate all days of any given month with the first day of the month as a single parameter in the query. generate_series() is inclusive on both ends, much like the BETWEEN operator, so we exclude the first day of the next month with the expression - interval '1 day' (see the short example after this list).
- The cast(calendar.entry as date) expression transforms the generated calendar.entry, which is the result of the generate_series() function call, into the date data type. We need the cast because generate_series() returns a set of timestamp entries, and plain dates are what we want in this exercise.
- The left join between our generated calendar table and the factbook table keeps every calendar row and associates a factbook row with it only when the date columns of both tables have the same value. When a calendar date is not found in factbook, the factbook columns (year, date, shares, trades, and dollars) are filled in with NULL values instead.
- coalesce returns the first of its arguments that is not null. So the expression coalesce(shares, 0) as shares is either the number of shares we found in the factbook table for this date, or 0 when we found no entry and the left join filled in the factbook columns with NULL values.
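As promised in the list above, here's the inclusivity of generate_series() made visible on a three-day range:

select d::date as day
  from generate_series(date '2017-02-01',
                       date '2017-02-03',
                       interval '1 day')
       as t(d);
-- returns 2017-02-01, 2017-02-02 and 2017-02-03:
-- both the start and the stop values are included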
Finally, here's the result of this query:
date │ shares │ trades │ dollars
════════════╪════════════╪═════════╪══════════════════
2017-02-01 │ 1161001502 │ 5217859 │ $ 44,660,060,305
2017-02-02 │ 1128144760 │ 4586343 │ $ 43,276,102,903
2017-02-03 │ 1084735476 │ 4396485 │ $ 42,801,562,275
2017-02-04 │ 0 │ 0 │ $ 0
2017-02-05 │ 0 │ 0 │ $ 0
2017-02-06 │ 954533086 │ 3817270 │ $ 37,300,908,120
2017-02-07 │ 1037660897 │ 4220252 │ $ 39,754,062,721
2017-02-08 │ 1100076176 │ 4410966 │ $ 40,491,648,732
2017-02-09 │ 1081638761 │ 4462009 │ $ 40,169,585,511
2017-02-10 │ 1021379481 │ 4028745 │ $ 38,347,515,768
2017-02-11 │ 0 │ 0 │ $ 0
2017-02-12 │ 0 │ 0 │ $ 0
2017-02-13 │ 1020482007 │ 3963509 │ $ 38,745,317,913
2017-02-14 │ 1041009698 │ 4299974 │ $ 40,737,106,101
2017-02-15 │ 1120119333 │ 4424251 │ $ 43,802,653,477
2017-02-16 │ 1091339672 │ 4461548 │ $ 41,956,691,405
2017-02-17 │ 1160693221 │ 4132233 │ $ 48,862,504,551
2017-02-18 │ 0 │ 0 │ $ 0
2017-02-19 │ 0 │ 0 │ $ 0
2017-02-20 │ 0 │ 0 │ $ 0
2017-02-21 │ 1103777644 │ 4323282 │ $ 44,416,927,777
2017-02-22 │ 1064236648 │ 4169982 │ $ 41,137,731,714
2017-02-23 │ 1192772644 │ 4839887 │ $ 44,254,446,593
2017-02-24 │ 1187320171 │ 4656770 │ $ 45,229,398,830
2017-02-25 │ 0 │ 0 │ $ 0
2017-02-26 │ 0 │ 0 │ $ 0
2017-02-27 │ 1132693382 │ 4243911 │ $ 43,613,734,358
2017-02-28 │ 1455597403 │ 4789769 │ $ 57,874,495,227
(28 rows)
Note that we replaced 60 lines of Python code with a simple SQL query. Down the road, that means less code to maintain and a more efficient implementation, too: the Python version implements a hash join by hand (building a dictionary, then probing it in a loop over the calendar), while PostgreSQL picks a Merge Left Join over two ordered relations.
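If you're curious about the plan on your own setup, prefix the query with explain; the exact plan depends on your PostgreSQL version, data, and settings, so treat the Merge Left Join mentioned above as one possible outcome:

explain (costs off)
  select calendar.entry::date, shares, trades, dollars
    from generate_series(date '2017-02-01',
                         date '2017-02-28',
                         interval '1 day')
         as calendar(entry)
         left join factbook
                on factbook.date = calendar.entry;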
Computing weekly changes
Imagine the analytics department now wants us to provide the weekly difference for each day. This means we need to add a column with the change, calculated as a percentage of the dollars column, between each date and the same day of the previous week.
I'm using the "week-over-week percentage difference" example because it's a classic analytics need (although maybe mostly in marketing circles) and because (in my experience) a developer's first reaction is rarely to write a SQL query to do all the math.
Also, the calendar isn't very helpful in computing weeks, but for PostgreSQL, this task is as easy as spelling the word week:
with computed_data as
(
  select cast(date as date)   as date,
         to_char(date, 'Dy')  as day,
         coalesce(dollars, 0) as dollars,
         lag(dollars, 1)
           over(
             partition by extract('isodow' from date)
                 order by date
           )
         as last_week_dollars
    from /*
          * Generate the month calendar, plus a week before
          * so that we have values to compare dollars against
          * even for the first week of the month.
          */
         generate_series(date :'start' - interval '1 week',
                         date :'start' + interval '1 month'
                                       - interval '1 day',
                         interval '1 day'
         )
         as calendar(date)
         left join factbook using(date)
)
  select date, day,
         to_char(
             coalesce(dollars, 0),
             'L99G999G999G999'
         ) as dollars,
         case when dollars is not null
               and dollars <> 0
              then round( 100.0
                          * (dollars - last_week_dollars)
                          / dollars
                        , 2)
          end
         as "WoW %"
    from computed_data
   where date >= date :'start'
order by date;
To implement this case in SQL, we need window functions, which appeared in the SQL:2003 standard but are often skipped in SQL classes. The last things executed in a SQL statement are the window functions, well after join operations and where clauses. So, if we want to see a full week before the first of February, we need to extend our calendar selection a week into the past and then, once again, restrict the data we return to the caller.
That's why we use a common table expression (the WITH part of the query) to fetch the extended data set we need, including the last_week_dollars computed column.
The expression extract('isodow' from date) is a standard SQL feature that computes the day of the week following ISO rules. Used in the partition by clause of the window definition, it makes a row a peer of any other row sharing the same isodow. The lag() window function can then refer to the previous peer's dollars value when ordered by date; that's the number we want to compare to the current dollars value.
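For instance, 2017-02-01 and 2017-02-08 are both Wednesdays, so they land in the same partition; a quick check:

select extract('isodow' from date '2017-02-01') as feb_1,
       extract('isodow' from date '2017-02-08') as feb_8;
-- both columns return 3: ISO weekday numbering runs
-- from 1 (Monday) to 7 (Sunday), and both dates are Wednesdays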
The computed_data result set is then used in the main part of the query as a relation we fetch data from, and the computation is easier this time: we simply apply a classic difference-percentage formula to the dollars and last_week_dollars columns. For 2017-02-08, for instance, that's round(100.0 * (40491648732 - 44660060305) / 40491648732, 2), which gives the -10.29 shown below.
Here's the result from running this query:
date │ day │ dollars │ WoW %
════════════╪═════╪══════════════════╪════════
2017-02-01 │ Wed │ $ 44,660,060,305 │ -2.21
2017-02-02 │ Thu │ $ 43,276,102,903 │ 1.71
2017-02-03 │ Fri │ $ 42,801,562,275 │ 10.86
2017-02-04 │ Sat │ $ 0 │ ¤
2017-02-05 │ Sun │ $ 0 │ ¤
2017-02-06 │ Mon │ $ 37,300,908,120 │ -9.64
2017-02-07 │ Tue │ $ 39,754,062,721 │ -37.41
2017-02-08 │ Wed │ $ 40,491,648,732 │ -10.29
2017-02-09 │ Thu │ $ 40,169,585,511 │ -7.73
2017-02-10 │ Fri │ $ 38,347,515,768 │ -11.61
2017-02-11 │ Sat │ $ 0 │ ¤
2017-02-12 │ Sun │ $ 0 │ ¤
2017-02-13 │ Mon │ $ 38,745,317,913 │ 3.73
2017-02-14 │ Tue │ $ 40,737,106,101 │ 2.41
2017-02-15 │ Wed │ $ 43,802,653,477 │ 7.56
2017-02-16 │ Thu │ $ 41,956,691,405 │ 4.26
2017-02-17 │ Fri │ $ 48,862,504,551 │ 21.52
2017-02-18 │ Sat │ $ 0 │ ¤
2017-02-19 │ Sun │ $ 0 │ ¤
2017-02-20 │ Mon │ $ 0 │ ¤
2017-02-21 │ Tue │ $ 44,416,927,777 │ 8.28
2017-02-22 │ Wed │ $ 41,137,731,714 │ -6.48
2017-02-23 │ Thu │ $ 44,254,446,593 │ 5.19
2017-02-24 │ Fri │ $ 45,229,398,830 │ -8.03
2017-02-25 │ Sat │ $ 0 │ ¤
2017-02-26 │ Sun │ $ 0 │ ¤
2017-02-27 │ Mon │ $ 43,613,734,358 │ ¤
2017-02-28 │ Tue │ $ 57,874,495,227 │ 23.25
(28 rows)
Have fun writing code, and as SQL is code, have fun writing SQL!
This article is based on an excerpt from Dimitri Fontaine's book Mastering PostgreSQL in Application Development, which explains how to replace thousands of lines of code with simple queries. The book goes into greater detail on these topics and provides many other examples so you can master PostgreSQL and issue the SQL queries that fetch exactly the result set you need.