Subroutines

A function is a set of code statements that are executed in sequence to affect some outcome. In the most general sense, a function can be almost anything, e.g. the square root operation $x \mapsto \sqrt{x}$, an instruction to draw the Mona Lisa with a given color palette, a call to align the residue sequence of homologous proteins across evolutionarily distant organisms, an instruction to update the status on your Facebook profile, a command to update the balance on a customer's savings account, etc.

wikipedia

Immediately we can see a division in the example functions listed above: the first three take an argument (a number, a color palette, amino acid sequences) and return an output (a number, a painting, a sequence alignment), while the last two take an argument (text string, a transaction) and make a change to a variable that lives at some other place in the codebase (a side effect).

Pure Functions

The first kind of function (termed a pure function) takes input and gives an output in a way that's completely self-contained, whereas the second kind takes an input, performs an operation that changes things in the outside world, possibly using other things from the outside world. Given the same inputs, a pure function will return the same answer every single time.

It is easy (relatively) to reason about programs which are built using pure functions, and in fact there exist entire programming languages which were built to embrace this style (Haskell, Lisp, OCaml, F#, etc.). While side effects are necessary for a program that "does something" (i.e. user interfaces), they're more difficult to reason through, and should be avoided when possible.

Pure function

A pure function is a map from a set of inputs to an output \[f : I \to O\]

such that every element in $I$ has one and only one image in $O$

where $I$ and $O$ are sets of data, other functions, etc.

For example, we can define a function that takes two numbers ($x$, $n$) and returns $x$ raised to the power $n$.

Power

Implemented in OCaml, the power function (for positive reals x and n) is given by
1
let power x n = x ** n ;;
This function maps $f : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ with no side effects.

We might want a function, prime, that takes a given integer and determines whether or not it is prime. A particularly inefficient algorithm for achieving this is a brute force test for divisibility by all numbers less than or equal to our integer.

Brute force primality in Python
1
2
3
4
5
def prime(candidate):
    for number in range(2, candidate):
        if candidate % number == 0:
            return False
    return True
This function maps $f : \mathbb{N} \to \{\text{True},\text{False}\}$ with no side effects. If $x \mod n = 0$ for any number $n \in [0, N-1]$, our function breaks out of the for loop, returns False, and terminates the function. If it makes it through the list without finding any divisors, the loop finishes without returning anything, and the return True statement is executed.

Side effects

Suppose the global variable a is initially set to 7,
1
a = 7
and we have the two functions add_a, which returns the value x + a, and square_a, which reassigns the square of a to a.
1
2
3
4
5
6
def add_a(x):
    return x + a

def square_a():
    global a
    a = a * a
If we call add_a(3) without calling square_a, we'll get 10, every time we call it. However, if we call square_a first, add_a(3) will return 52, and if we call square_a twice, add_a(3) will return 2404. This is a consequence of the fact that add_a depends on the global variable a that is not an explicit argument of add_a. This makes the function susceptible to changes in a that are outside the scope of add_a. If we write a program consisting of calls to add_a and square_a, the result depends crucially on the sequence in which we called the functions.

Thus, in order to reason about the output of add_a, we need to know the full history of our environment, specifically, how many times square_a has been called. The flagrant use of global variables can lead to situations where debugging requires a full-blown forensic analysis. We can largely avoid these problems by making our function dependencies explicit, thus isolating the task of debugging to single functions, and promoting a more modular style of programming.

Variables and Scope

In writing a function, it is important to stay clear about what variables are accessible at different levels of code. If we try to reference at the top level, a variable that is defined deeper in the code, or expect a certain change to have been made to our variable, we'll run into problems.

Kronecker Delta

For example, consider the function kronecker(n) below which is meant to return the Kronecker tensor $\delta$ of rank $n$. (In computing, the Kronecker tensor $\delta$ of rank $n$ is simply the identity matrix of order $n \times n$; that is, if $i=j$, then $\delta_{ij} = 1$, otherwise $\delta_{ij} = 0$.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
def kronecker(n):
    tensor = []
    for i in range(n):
        row = []
        for j in range(n):
            def calculate(i, j):
                if i == j:
                    var = 1
                else:
                    var = 0
            calculate(i, j)
            row.append(var)
    tensor.append(row)
    return tensor
If we call kronecker(3), we expect to get
1
2
3
[[1, 0, 0],
 [0, 1, 0],
 [0, 0, 1]]
However, if we call kronecker(3) as implemented above, we find that we have a scope error.

What went wrong? In each loop over the j index, we calculate a new tensor element var (using the function calculate(i, j)) and append it to our newly formed list row. At the end of each loop over i, row is appended to the list tensor, to accumulate our results.

The problem is that we expect var to be defined at the level of the j loop after we define it in calculate(i, j). However, var is defined at the deepest level of our code, and so, won't be defined at a higher level. The usual convention is that variables are available to code at the same level at which they're defined, and to all levels below that, but not at levels above their definition (see diagram below).

One way to fix our kronecker function would be
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def kronecker(n):
    tensor = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                var = 1
            else:
                var = 0
            row.append(var)
        tensor.append(row)
    return tensor

Recursive functions

Functions need not be encapsulations of useful snippets of code, whose sequence of events is explicitly outlined, e.g. "loop over this list of numbers, and print the square of each number." Functions can also be defined as a set of rules such that any given input maps to a single output. In this way, we can carefully define a computation in our code, but leave the explicit computation up to the computer. This method is called recursion, and works by declaring a set of defined base cases in terms of which all other arguments to the function can be computed.

Recursive factorial function

For a much abused, but clear example, let us consider the integer factorial function. One way to define $n!$ is $f(n) = n! = n \times (n-1) \times \ldots 3 \times 2 \times 1$, so that for instance,

\[7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1\]

This is a fine definition, but it is tedious to write out, especially as our arguments get bigger. However, there is a clear pattern in the definition of $n!$. Consider $7!$ as written above, we see that we can also write this as $7 \times 6!$. This insight allows us to define the integer factorial function as follows

\[f(n) = \begin{cases} n\times f(n-1) & \text{ if $n > 1$} \\ 1 & \text{ if $n = 1$} \\ \end{cases}\]

To see how this simplifies our code, let us implement both factorials in Python.

Implemented without recursion, we have the following code to calculate $n!$
1
2
3
4
5
def factorial(n):
    result = 1
    for number in range(1, n + 1):
        result = result * number
    return result
with recursion we have
1
2
3
4
5
def factorial(n):
    if n == 1:
        return 1
    else:
        return n * factorial(n - 1)
Without using recursion, we have to make use of some unsavory programming practices. For one, we have used a sort of global variable, in result, whose value gets updated with every run through the loop. We also have to specify a range which opens us up to a potential fencepost (off-by-one) error.

In the recursive case, we stipulate only the essentials: the base case, and the recursive definition. Here, any mistake is glaringly obvious because it will come from the mathematical definition, not from a subtle detail of our implementation. Using recursion, we are more likely to end up with clean code.

You have in your wallet $300, which you want to spend completely. You decide to spend all of it by buying food from a fancy restaurant with the following menu:

    tofu scramble: $1
    pancakes: $5
    brunch combo: $20
    saffron infused peach tea: $50
    truffles: $100
    caviar: $200

Let $O$ be the number of different ways that you can spend exactly $300. What are the last 3 digits of $O$?

Here we have seen the benefit of recursion on organization and avoiding mistakes. However, the structure of the factorial calculation is quite simple, just a string of numbers that we multiply together. Let's try to calculate the Fibonacci numbers, a problem with a slightly richer structure.

Fibonacci numbers

As we did with the integer factorial function, let's write a recursive definition for the $n$th Fibonacci number:

\[f(n) = \begin{cases} f(n-1) + f(n-2) & \text{ if } n \geq 2 \\ 1 & \text{ if } n \in\{0, 1\} \\ \end{cases}\]

For example, to find the third Fibonacci number, this definition yields

\[\begin{align} f(3) &= f(2) + f(1) \\ &= \left[f(1) + f(0)\right] + f(1) \\ &= 1 + 1 + 1 \\ &= 3 \end{align}\]

For the fourth, we have

\[\begin{align} f(4) &= f(3) + f(2) \\ &= \left[f(2) + f(1)\right] + \left[f(1)+f(0)\right] \\ &= \left[\left[f(1) + f(0)\right] + f(1)\right] + \left[f(1)+f(0)\right] \\ &= 1+1+1+1+1 \\ &= 5 \end{align}\]

We can implement this function in Python as follows
1
2
3
4
5
def fib(number):
    if number in [0, 1]:
        return 1
    else:
        return fib(number - 2) + fib(number - 1)
This code works fine for computing fib(n) when n is small, however, it is cripplingly slow as n grows. From the cases we tried above, we can see that the computation branches out quickly.

Let's inspect the diagram for $f(5)$ below. At the first branch, we have $f(5)\to f(4)+f(3)$. When $f(4)$ branches, it leads to computation of $f(3)$ and $f(2)$. However, we already needed to compute $f(3)$ from the first branching. Once we compute $f(3)$ a single time, it makes little sense to do it again.

As we compute bigger and bigger Fibonacci numbers, we'll find more and more of these cases, where the tree repeats computations that have already been performed (how many numbers must we compute in total?). Unless we cut these branches from the tree, we are going to waste a lot of processor time repeating things we've already done.

Memoization

As we noted, the recursive Fibonacci definition we implemented above leads to a lot of redundant computation. We can save time by remembering the answer to computations after the first time we perform them. This way, fib(n) will be computed a maximum of one time for each value of n. This method of halting computation of previously obtained values is known as "memoization" and can often be employed to obtain significant speedups.

Memoization of the recursive Fibonacci function

First let us make a lookup table called fib_history, and change our code so that fib consults this table before committing to a new computation, i.e. it will only execute a new branch if the parent of that branch has not been computed before.
1
2
3
4
5
6
fib_history = {1:1, 0:1};

def fib_memo(n):
    if n not in fib_history:
        fib_history[n] = fib_memo(n - 2) + fib_memo(n - 1)
    return fib_history[n]
Here we see that fib_memo asks, politely, whether the value of fib_memo(n) is yet in fib_history, and only if that answer is no, will it enter the recursion. In this way, we only ever evaluate fib_memo a maximum of n times in determining fib_memo(n).

Revisiting the tree representation of our computation allows us to visualize the cost savings:

``fib_history accumulates knowledge over time, so we can even use the results from previous calls to benefit future calls. For example, if we've called fib_memo(67) in the past, we don't need to make 68 calls to find fib_memo(68) as nearly all the work has already been performed.

Below, we plot the performance of fib against fib_memo for moderate values of n. It is clear that fib's runtime grows exponentially with n whereas fib_memo has essentially constant cost for this range of n.

Default values

It is common to find functions tend to have the same value for some of their arguments, each time they're called. Or, you might have a function whose behavior you'd like to be able to modify, if needed, but which in most cases will run without the need for adjustments. In both of these cases, it is convenient to have default values for function arguments.

An optional argument helps true love last

Suppose you're writing an app to automatically answer your text messages when you're busy. Also suppose there exists some method by which an incoming text message can trigger functions, namely one you've called send_text. The default return of send_text() is "Sorry, can't talk now, in a meeting.".
1
2
3
def send_text(number):
    message = "Sorry, can't talk now, in a meeting."
    sms.message(number, message)
This works fine most of the time, but can get you in trouble with your significant other, who might think you're exploring other options and ignoring them. In this case, it is convenient to call the function with an optional argument is_honeybear to indicate that your significant other is the recipient. In this case the call send_text(is_honeybear = True), prompts the app to append the message so that it reads "Sorry, can't talk now, in a meeting. I love you so much!".
1
2
3
4
5
def send_text(number, is_honeybear=False):
    message = "Sorry, can't talk now, in a meeting."
    alt_message = " I love you so much!"
    to_send = message + (alt_message if is_honeybear else "")
    sms.message(number, message)
In the first case, where somebody other than the significant other is calling, send_text uses the default value for is_honeybear which is False.

In practice, default values tend to reveal themselves as you begin to use a function. In the design phase, you usually make the function and arguments as simple as possible, but after use in the codebase, it becomes clear that sometimes you need to override certain values.

As a concrete example, suppose you've written a function that reads in a tabular dataset (a spreadsheet with column names), and by default, expects columns to be separated by tabs:
1
2
3
4
import csv
def read_data(filename):
    with file(filename) as fp:
        return csv.reader(fp, delimiter='\t')
If we call read_data on a dataset like the file sample_data.txt
1
2
3
4
5
6
7
Product     Amount  UnitPrice
wagon       2       30.00
shovel      1       10.00
fertilizer  22      1.50
.           .       .
.           .       .
.           .       .
It will work fine, because the argument for the separator tells pandas.read_csv to expect columns separated by tabs, as we find in the file.

However, you start receiving similarly formatted spreadsheets from a new client to analyze, but instead of separating the columns by tabs, they're separated by commas. Because the tasks are so similar we don't want to write a separate function to handle comma-separated and tab-separated data.

We can fix this by giving read_data an optional argument for the separator, separator. For convenience, can set the default value to the most common argument we expect to use in importing datasets, the tab separator \t.
1
2
3
4
import csv
def read_data(filename, separator='\t'):
    with file(filename) as fp:
        return csv.reader(fp, delimiter=separator)
Thus, whenever we call read_data on tab separated data, we can just provide the filename. When we need to specify the separator to be something different, commas for instance, we can call read_data('sample_data.txt', separator=',').

Contents