Subroutines
A function is a set of code statements that are executed in sequence to affect some outcome. In the most general sense, a function can be almost anything, e.g. the square root operation \(x \mapsto \sqrt{x}\), an instruction to draw the Mona Lisa with a given color palette, a call to align the residue sequence of homologous proteins across evolutionarily distant organisms, an instruction to update the status on your Facebook profile, a command to update the balance on a customer's savings account, etc.
Immediately we can see a division in the example functions listed above: the first three take an argument (a number, a color palette, amino acid sequences) and return an output (a number, a painting, a sequence alignment), while the last two take an argument (text string, a transaction) and make a change to a variable that lives at some other place in the codebase (a side effect).
Pure Functions
The first kind of function (termed a pure function) takes input and gives an output in a way that's completely self-contained, whereas the second kind takes an input, performs an operation that changes things in the outside world, possibly using other things from the outside world. Given the same inputs, a pure function will return the same answer every single time.
It is easy (relatively) to reason about programs which are built using pure functions, and in fact there exist entire programming languages which were built to embrace this style (Haskell, Lisp, OCaml, F#, etc.). While side effects are necessary for a program that "does something" (i.e. user interfaces), they're more difficult to reason through, and should be avoided when possible.
Pure function
A pure function is a map from a set of inputs to an output \[f : I \to O\]
such that every element in \(I\) has one and only one image in \(O\)
where \(I\) and \(O\) are sets of data, other functions, etc.
For example, we can define a function that takes two numbers (\(x\), \(n\)) and returns \(x\) raised to the power \(n\).
Power
Implemented in OCaml, the power function (for positive reals
x
andn
) is given by
1let power x n = x ** n ;;
This function maps \(f : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+\) with no side effects.
We might want a function, prime
, that takes a given integer and determines whether or not it is prime. A particularly inefficient algorithm for achieving this is a brute force test for divisibility by all numbers less than or equal to our integer.
Brute force primality in Python
1 2 3 4 5def prime(candidate): for number in range(2, candidate): if candidate % number == 0: return False return True
This function maps \(f : \mathbb{N} \to \{\text{True},\text{False}\}\) with no side effects. If \(x \mod n = 0\) for any number \(n \in [0, N-1]\), our function breaks out of the
for
loop, returnsFalse
, and terminates the function. If it makes it through the list without finding any divisors, the loop finishes without returning anything, and thereturn True
statement is executed.
Side effects
Suppose the global variable
a
is initially set to7
,
1a = 7
and we have the two functions
add_a
, which returns the valuex + a
, andsquare_a
, which reassigns the square ofa
toa
.
1 2 3 4 5 6def add_a(x): return x + a def square_a(): global a a = a * a
If we call
add_a(3)
without callingsquare_a
, we'll get10
, every time we call it. However, if we callsquare_a
first,add_a(3)
will return52
, and if we callsquare_a
twice,add_a(3)
will return2404
. This is a consequence of the fact thatadd_a
depends on the global variablea
that is not an explicit argument ofadd_a
. This makes the function susceptible to changes ina
that are outside the scope ofadd_a
. If we write a program consisting of calls toadd_a
andsquare_a
, the result depends crucially on the sequence in which we called the functions.Thus, in order to reason about the output of
add_a
, we need to know the full history of our environment, specifically, how many timessquare_a
has been called. The flagrant use of global variables can lead to situations where debugging requires a full-blown forensic analysis. We can largely avoid these problems by making our function dependencies explicit, thus isolating the task of debugging to single functions, and promoting a more modular style of programming.
Variables and Scope
In writing a function, it is important to stay clear about what variables are accessible at different levels of code. If we try to reference at the top level, a variable that is defined deeper in the code, or expect a certain change to have been made to our variable, we'll run into problems.
Kronecker Delta
For example, consider the function
kronecker(n)
below which is meant to return the Kronecker tensor \(\delta\) of rank \(n\). (In computing, the Kronecker tensor \(\delta\) of rank \(n\) is simply the identity matrix of order \(n \times n\); that is, if \(i=j\), then \(\delta_{ij} = 1\), otherwise \(\delta_{ij} = 0\).
1 2 3 4 5 6 7 8 9 10 11 12 13 14def kronecker(n): tensor = [] for i in range(n): row = [] for j in range(n): def calculate(i, j): if i == j: var = 1 else: var = 0 calculate(i, j) row.append(var) tensor.append(row) return tensor
If we call
kronecker(3)
, we expect to get
1 2 3[[1, 0, 0], [0, 1, 0], [0, 0, 1]]
However, if we call
kronecker(3)
as implemented above, we find that we have a scope error.What went wrong? In each loop over the
j
index, we calculate a new tensor elementvar
(using the functioncalculate(i, j)
) and append it to our newly formed listrow
. At the end of each loop overi
,row
is appended to the listtensor
, to accumulate our results.The problem is that we expect
var
to be defined at the level of thej
loop after we define it incalculate(i, j)
. However,var
is defined at the deepest level of our code, and so, won't be defined at a higher level. The usual convention is that variables are available to code at the same level at which they're defined, and to all levels below that, but not at levels above their definition (see diagram below).
One way to fix our
kronecker
function would be
1 2 3 4 5 6 7 8 9 10 11 12def kronecker(n): tensor = [] for i in range(n): row = [] for j in range(n): if i == j: var = 1 else: var = 0 row.append(var) tensor.append(row) return tensor
Recursive functions
Functions need not be encapsulations of useful snippets of code, whose sequence of events is explicitly outlined, e.g. "loop over this list of numbers, and print the square of each number." Functions can also be defined as a set of rules such that any given input maps to a single output. In this way, we can carefully define a computation in our code, but leave the explicit computation up to the computer. This method is called recursion, and works by declaring a set of defined base cases in terms of which all other arguments to the function can be computed.
Recursive factorial function
For a much abused, but clear example, let us consider the integer factorial function. One way to define \(n!\) is \(f(n) = n! = n \times (n-1) \times \ldots 3 \times 2 \times 1\), so that for instance,
\[7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1\]
This is a fine definition, but it is tedious to write out, especially as our arguments get bigger. However, there is a clear pattern in the definition of \(n!\). Consider \(7!\) as written above, we see that we can also write this as \(7 \times 6!\). This insight allows us to define the integer factorial function as follows
\[f(n) = \begin{cases} n\times f(n-1) & \text{ if $n > 1$} \\ 1 & \text{ if $n = 1$} \\ \end{cases}\]
To see how this simplifies our code, let us implement both factorials in Python.
Implemented without recursion, we have the following code to calculate \(n!\)
1 2 3 4 5def factorial(n): result = 1 for number in range(1, n + 1): result = result * number return result
with recursion we have
1 2 3 4 5def factorial(n): if n == 1: return 1 else: return n * factorial(n - 1)
Without using recursion, we have to make use of some unsavory programming practices. For one, we have used a sort of global variable, in
result
, whose value gets updated with every run through the loop. We also have to specify a range which opens us up to a potential fencepost (off-by-one) error.In the recursive case, we stipulate only the essentials: the base case, and the recursive definition. Here, any mistake is glaringly obvious because it will come from the mathematical definition, not from a subtle detail of our implementation. Using recursion, we are more likely to end up with clean code.
Here we have seen the benefit of recursion on organization and avoiding mistakes. However, the structure of the factorial calculation is quite simple, just a string of numbers that we multiply together. Let's try to calculate the Fibonacci numbers, a problem with a slightly richer structure.
Fibonacci numbers
As we did with the integer factorial function, let's write a recursive definition for the \(n\)th Fibonacci number:
\[f(n) = \begin{cases} f(n-1) + f(n-2) & \text{ if } n \geq 2 \\ 1 & \text{ if } n \in\{0, 1\} \\ \end{cases}\]
For example, to find the third Fibonacci number, this definition yields
\[\begin{align} f(3) &= f(2) + f(1) \\ &= \left[f(1) + f(0)\right] + f(1) \\ &= 1 + 1 + 1 \\ &= 3 \end{align}\]
For the fourth, we have
\[\begin{align} f(4) &= f(3) + f(2) \\ &= \left[f(2) + f(1)\right] + \left[f(1)+f(0)\right] \\ &= \left[\left[f(1) + f(0)\right] + f(1)\right] + \left[f(1)+f(0)\right] \\ &= 1+1+1+1+1 \\ &= 5 \end{align}\]
We can implement this function in Python as follows
1 2 3 4 5def fib(number): if number in [0, 1]: return 1 else: return fib(number - 2) + fib(number - 1)
This code works fine for computing
fib(n)
whenn
is small, however, it is cripplingly slow asn
grows. From the cases we tried above, we can see that the computation branches out quickly.Let's inspect the diagram for \(f(5)\) below. At the first branch, we have \(f(5)\to f(4)+f(3)\). When \(f(4)\) branches, it leads to computation of \(f(3)\) and \(f(2)\). However, we already needed to compute \(f(3)\) from the first branching. Once we compute \(f(3)\) a single time, it makes little sense to do it again.
As we compute bigger and bigger Fibonacci numbers, we'll find more and more of these cases, where the tree repeats computations that have already been performed (how many numbers must we compute in total?). Unless we cut these branches from the tree, we are going to waste a lot of processor time repeating things we've already done.
The Ackermann function is a computable function which grows very, very quickly as its inputs grow. For example, while \( A(1,2),\) \(A (2,2),\) and \(A(3,2) \) are equal to \(4,7,\) and \(29,\) respectively, \( A(4,2) \approx 2 \times 10^{19728} \).
The Ackermann function can be defined as follows: \[ A(m,n) = \begin{cases} n+1 & \mbox{if } m = 0 \\ A(m-1, 1) & \mbox{if } m > 0 \mbox{ and } n = 0 \\ A(m-1, A(m, n-1)) & \mbox{if } m > 0 \mbox{ and } n > 0. \end{cases} \]
What is the value of \( A(3,6)? \)
Memoization
As we noted, the recursive Fibonacci definition we implemented above leads to a lot of redundant computation. We can save time by remembering the answer to computations after the first time we perform them. This way, fib(n)
will be computed a maximum of one time for each value of n
. This method of halting computation of previously obtained values is known as "memoization" and can often be employed to obtain significant speedups.
Memoization of the recursive Fibonacci function
First let us make a lookup table called
fib_history
, and change our code so thatfib
consults this table before committing to a new computation, i.e. it will only execute a new branch if the parent of that branch has not been computed before.
1 2 3 4 5 6fib_history = {1:1, 0:1}; def fib_memo(n): if n not in fib_history: fib_history[n] = fib_memo(n - 2) + fib_memo(n - 1) return fib_history[n]
Here we see that
fib_memo
asks, politely, whether the value offib_memo(n)
is yet infib_history
, and only if that answer is no, will it enter the recursion. In this way, we only ever evaluatefib_memo
a maximum ofn
times in determiningfib_memo(n)
.Revisiting the tree representation of our computation allows us to visualize the cost savings:
``
fib_history
accumulates knowledge over time, so we can even use the results from previous calls to benefit future calls. For example, if we've calledfib_memo(67)
in the past, we don't need to make 68 calls to findfib_memo(68)
as nearly all the work has already been performed.Below, we plot the performance of
fib
againstfib_memo
for moderate values ofn
. It is clear thatfib
's runtime grows exponentially withn
whereasfib_memo
has essentially constant cost for this range ofn
.
Default values
It is common to find functions tend to have the same value for some of their arguments, each time they're called. Or, you might have a function whose behavior you'd like to be able to modify, if needed, but which in most cases will run without the need for adjustments. In both of these cases, it is convenient to have default values for function arguments.
An optional argument helps true love last
Suppose you're writing an app to automatically answer your text messages when you're busy. Also suppose there exists some method by which an incoming text message can trigger functions, namely one you've called
send_text
. The default return ofsend_text()
is"Sorry, can't talk now, in a meeting."
.
1 2 3def send_text(number): message = "Sorry, can't talk now, in a meeting." sms.message(number, message)
This works fine most of the time, but can get you in trouble with your significant other, who might think you're exploring other options and ignoring them. In this case, it is convenient to call the function with an optional argument
is_honeybear
to indicate that your significant other is the recipient. In this case the callsend_text(is_honeybear = True)
, prompts the app to append the message so that it reads"Sorry, can't talk now, in a meeting. I love you so much!"
.
1 2 3 4 5def send_text(number, is_honeybear=False): message = "Sorry, can't talk now, in a meeting." alt_message = " I love you so much!" to_send = message + (alt_message if is_honeybear else "") sms.message(number, message)
In the first case, where somebody other than the significant other is calling,
send_text
uses the default value foris_honeybear
which isFalse
.
In practice, default values tend to reveal themselves as you begin to use a function. In the design phase, you usually make the function and arguments as simple as possible, but after use in the codebase, it becomes clear that sometimes you need to override certain values.
As a concrete example, suppose you've written a function that reads in a tabular dataset (a spreadsheet with column names), and by default, expects columns to be separated by tabs:
1 2 3 4import csv def read_data(filename): with file(filename) as fp: return csv.reader(fp, delimiter='\t')
If we call
read_data
on a dataset like the filesample_data.txt
1 2 3 4 5 6 7Product Amount UnitPrice wagon 2 30.00 shovel 1 10.00 fertilizer 22 1.50 . . . . . . . . .
It will work fine, because the argument for the separator tells
pandas.read_csv
to expect columns separated by tabs, as we find in the file.However, you start receiving similarly formatted spreadsheets from a new client to analyze, but instead of separating the columns by tabs, they're separated by commas. Because the tasks are so similar we don't want to write a separate function to handle comma-separated and tab-separated data.
We can fix this by giving
read_data
an optional argument for the separator,separator
. For convenience, can set the default value to the most common argument we expect to use in importing datasets, the tab separator\t
.
1 2 3 4import csv def read_data(filename, separator='\t'): with file(filename) as fp: return csv.reader(fp, delimiter=separator)
Thus, whenever we call
read_data
on tab separated data, we can just provide the filename. When we need to specify the separator to be something different, commas for instance, we can callread_data('sample_data.txt', separator=',')
.