# Subroutines

A function is a set of code statements that are executed in sequence to affect some outcome. In the most general sense, a function can be almost anything, e.g. the square root operation $x \mapsto \sqrt{x}$, an instruction to draw the Mona Lisa with a given color palette, a call to align the residue sequence of homologous proteins across evolutionarily distant organisms, an instruction to update the status on your Facebook profile, a command to update the balance on a customer's savings account, etc.

Immediately we can see a division in the example functions listed above: the first three take an argument (a number, a color palette, amino acid sequences) and return an output (a number, a painting, a sequence alignment), while the last two take an argument (text string, a transaction) and make a change to a variable that lives at some other place in the codebase (a side effect).

## Pure Functions

The first kind of function (termed a pure function) takes input and gives an output in a way that's completely self-contained, whereas the second kind takes an input, performs an operation that changes things in the outside world, possibly using other things from the outside world. Given the same inputs, a pure function will return the same answer every single time.

It is easy (relatively) to reason about programs which are built using pure functions, and in fact there exist entire programming languages which were built to embrace this style (Haskell, Lisp, OCaml, F#, etc.). While side effects are necessary for a program that "does something" (i.e. user interfaces), they're more difficult to reason through, and should be avoided when possible.

Pure function

A pure function is a map from a set of inputs to an output $f : I \to O$

such that every element in $I$ has one and only one image in $O$

where $I$ and $O$ are sets of data, other functions, etc.

For example, we can define a function that takes two numbers ($x$, $n$) and returns $x$ raised to the power $n$.

Power

Implemented in OCaml, the power function (for positive reals

`x`

and`n`

) is given by

1`let power x n = x ** n ;;`

This function maps $f : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ with no side effects.

We might want a function, `prime`

, that takes a given integer and determines whether or not it is prime. A particularly inefficient algorithm for achieving this is a brute force test for divisibility by all numbers less than or equal to our integer.

Brute force primality in Python

1 2 3 4 5`def prime(candidate): for number in range(2, candidate): if candidate % number == 0: return False return True`

This function maps $f : \mathbb{N} \to \{\text{True},\text{False}\}$ with no side effects. If $x \mod n = 0$ for any number $n \in [0, N-1]$, our function breaks out of the

`for`

loop, returns`False`

, and terminates the function. If it makes it through the list without finding any divisors, the loop finishes without returning anything, and the`return True`

statement is executed.

Side effectsSuppose the global variable

`a`

is initially set to`7`

,

1`a = 7`

and we have the two functions

`add_a`

, which returns the value`x + a`

, and`square_a`

, which reassigns the square of`a`

to`a`

.

1 2 3 4 5 6`def add_a(x): return x + a def square_a(): global a a = a * a`

If we call

`add_a(3)`

without calling`square_a`

, we'll get`10`

, every time we call it. However, if we call`square_a`

first,`add_a(3)`

will return`52`

, and if we call`square_a`

twice,`add_a(3)`

will return`2404`

. This is a consequence of the fact that`add_a`

depends on the global variable`a`

that is not an explicit argument of`add_a`

. This makes the function susceptible to changes in`a`

that are outside the scope of`add_a`

. If we write a program consisting of calls to`add_a`

and`square_a`

, the result depends crucially on the sequence in which we called the functions.Thus, in order to reason about the output of

`add_a`

, we need to know the full history of our environment, specifically, how many times`square_a`

has been called. The flagrant use of global variables can lead to situations where debugging requires a full-blown forensic analysis. We can largely avoid these problems by making our function dependencies explicit, thus isolating the task of debugging to single functions, and promoting a more modular style of programming.

## Variables and Scope

In writing a function, it is important to stay clear about what variables are accessible at different levels of code. If we try to reference at the top level, a variable that is defined deeper in the code, or expect a certain change to have been made to our variable, we'll run into problems.

Kronecker Delta

For example, consider the function

`kronecker(n)`

below which is meant to return the Kronecker tensor $\delta$ of rank $n$. (In computing, the Kronecker tensor $\delta$ of rank $n$ is simply the identity matrix of order $n \times n$; that is, if $i=j$, then $\delta_{ij} = 1$, otherwise $\delta_{ij} = 0$.

1 2 3 4 5 6 7 8 9 10 11 12 13 14`def kronecker(n): tensor = [] for i in range(n): row = [] for j in range(n): def calculate(i, j): if i == j: var = 1 else: var = 0 calculate(i, j) row.append(var) tensor.append(row) return tensor`

If we call

`kronecker(3)`

, we expect to get

1 2 3`[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`

However, if we call

`kronecker(3)`

as implemented above, we find that we have a scope error.What went wrong? In each loop over the

`j`

index, we calculate a new tensor element`var`

(using the function`calculate(i, j)`

) and append it to our newly formed list`row`

. At the end of each loop over`i`

,`row`

is appended to the list`tensor`

, to accumulate our results.The problem is that we expect

`var`

to be defined at the level of the`j`

loop after we define it in`calculate(i, j)`

. However,`var`

is defined at the deepest level of our code, and so, won't be defined at a higher level. The usual convention is that variables are available to code at the same level at which they're defined, and to all levels below that, but not at levels above their definition (see diagram below).

One way to fix our

`kronecker`

function would be

1 2 3 4 5 6 7 8 9 10 11 12`def kronecker(n): tensor = [] for i in range(n): row = [] for j in range(n): if i == j: var = 1 else: var = 0 row.append(var) tensor.append(row) return tensor`

## Recursive functions

Functions need not be encapsulations of useful snippets of code, whose sequence of events is explicitly outlined, e.g. "loop over this list of numbers, and print the square of each number." Functions can also be defined as a set of rules such that any given input maps to a single output. In this way, we can carefully define a computation in our code, but leave the explicit computation up to the computer. This method is called recursion, and works by declaring a set of defined base cases in terms of which all other arguments to the function can be computed.

Recursive factorial function

For a much abused, but clear example, let us consider the integer factorial function. One way to define $n!$ is $f(n) = n! = n \times (n-1) \times \ldots 3 \times 2 \times 1$, so that for instance,

$7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1$

This is a fine definition, but it is tedious to write out, especially as our arguments get bigger. However, there is a clear pattern in the definition of $n!$. Consider $7!$ as written above, we see that we can also write this as $7 \times 6!$. This insight allows us to define the integer factorial function as follows

$f(n) = \begin{cases} n\times f(n-1) & \text{ if \$n > 1\$} \\ 1 & \text{ if \$n = 1\$} \\ \end{cases}$

To see how this simplifies our code, let us implement both factorials in Python.

Implemented without recursion, we have the following code to calculate $n!$

1 2 3 4 5`def factorial(n): result = 1 for number in range(1, n + 1): result = result * number return result`

with recursion we have

1 2 3 4 5`def factorial(n): if n == 1: return 1 else: return n * factorial(n - 1)`

Without using recursion, we have to make use of some unsavory programming practices. For one, we have used a sort of global variable, in

`result`

, whose value gets updated with every run through the loop. We also have to specify a range which opens us up to a potential fencepost (off-by-one) error.In the recursive case, we stipulate only the essentials: the base case, and the recursive definition. Here, any mistake is glaringly obvious because it will come from the mathematical definition, not from a subtle detail of our implementation. Using recursion, we are more likely to end up with clean code.

Here we have seen the benefit of recursion on organization and avoiding mistakes. However, the structure of the factorial calculation is quite simple, just a string of numbers that we multiply together. Let's try to calculate the Fibonacci numbers, a problem with a slightly richer structure.

Fibonacci numbers

As we did with the integer factorial function, let's write a recursive definition for the $n$th Fibonacci number:

$f(n) = \begin{cases} f(n-1) + f(n-2) & \text{ if } n \geq 2 \\ 1 & \text{ if } n \in\{0, 1\} \\ \end{cases}$

For example, to find the third Fibonacci number, this definition yields

$\begin{aligned} f(3) &= f(2) + f(1) \\ &= \left[f(1) + f(0)\right] + f(1) \\ &= 1 + 1 + 1 \\ &= 3 \end{aligned}$

For the fourth, we have

$\begin{aligned} f(4) &= f(3) + f(2) \\ &= \left[f(2) + f(1)\right] + \left[f(1)+f(0)\right] \\ &= \left[\left[f(1) + f(0)\right] + f(1)\right] + \left[f(1)+f(0)\right] \\ &= 1+1+1+1+1 \\ &= 5 \end{aligned}$

We can implement this function in Python as follows

1 2 3 4 5`def fib(number): if number in [0, 1]: return 1 else: return fib(number - 2) + fib(number - 1)`

This code works fine for computing

`fib(n)`

when`n`

is small, however, it is cripplingly slow as`n`

grows. From the cases we tried above, we can see that the computation branches out quickly.Let's inspect the diagram for $f(5)$ below. At the first branch, we have $f(5)\to f(4)+f(3)$. When $f(4)$ branches, it leads to computation of $f(3)$ and $f(2)$. However, we already needed to compute $f(3)$ from the first branching. Once we compute $f(3)$ a single time, it makes little sense to do it again.

As we compute bigger and bigger Fibonacci numbers, we'll find more and more of these cases, where the tree repeats computations that have already been performed (how many numbers must we compute in total?). Unless we cut these branches from the tree, we are going to waste a lot of processor time repeating things we've already done.

$A(1,2),$ $A (2,2),$ and $A(3,2)$ are equal to $4,7,$ and $29,$ respectively, $A(4,2) \approx 2 \times 10^{19728}$.

The Ackermann function is a computable function which grows very, very quickly as its inputs grow. For example, whileThe Ackermann function can be defined as follows: $A(m,n) = \begin{cases} n+1 & \mbox{if } m = 0 \\ A(m-1, 1) & \mbox{if } m > 0 \mbox{ and } n = 0 \\ A(m-1, A(m, n-1)) & \mbox{if } m > 0 \mbox{ and } n > 0. \end{cases}$

What is the value of $A(3,6)?$

## Memoization

As we noted, the recursive Fibonacci definition we implemented above leads to a lot of redundant computation. We can save time by remembering the answer to computations after the first time we perform them. This way, `fib(n)`

will be computed a maximum of one time for each value of `n`

. This method of halting computation of previously obtained values is known as "memoization" and can often be employed to obtain significant speedups.

Memoization of the recursive Fibonacci function

First let us make a lookup table called

`fib_history`

, and change our code so that`fib`

consults this table before committing to a new computation, i.e. it will only execute a new branch if the parent of that branch has not been computed before.

1 2 3 4 5 6`fib_history = {1:1, 0:1}; def fib_memo(n): if n not in fib_history: fib_history[n] = fib_memo(n - 2) + fib_memo(n - 1) return fib_history[n]`

Here we see that

`fib_memo`

asks, politely, whether the value of`fib_memo(n)`

is yet in`fib_history`

, and only if that answer is no, will it enter the recursion. In this way, we only ever evaluate`fib_memo`

a maximum of`n`

times in determining`fib_memo(n)`

.Revisiting the tree representation of our computation allows us to visualize the cost savings:

``

`fib_history`

accumulates knowledge over time, so we can even use the results from previous calls to benefit future calls. For example, if we've called`fib_memo(67)`

in the past, we don't need to make 68 calls to find`fib_memo(68)`

as nearly all the work has already been performed.Below, we plot the performance of

`fib`

against`fib_memo`

for moderate values of`n`

. It is clear that`fib`

's runtime grows exponentially with`n`

whereas`fib_memo`

has essentially constant cost for this range of`n`

.

## Default values

It is common to find functions tend to have the same value for some of their arguments, each time they're called. Or, you might have a function whose behavior you'd like to be able to modify, if needed, but which in most cases will run without the need for adjustments. In both of these cases, it is convenient to have default values for function arguments.

An optional argument helps true love lastSuppose you're writing an app to automatically answer your text messages when you're busy. Also suppose there exists some method by which an incoming text message can trigger functions, namely one you've called

`send_text`

. The default return of`send_text()`

is`"Sorry, can't talk now, in a meeting."`

.

1 2 3`def send_text(number): message = "Sorry, can't talk now, in a meeting." sms.message(number, message)`

This works fine most of the time, but can get you in trouble with your significant other, who might think you're exploring other options and ignoring them. In this case, it is convenient to call the function with an optional argument

`is_honeybear`

to indicate that your significant other is the recipient. In this case the call`send_text(is_honeybear = True)`

, prompts the app to append the message so that it reads`"Sorry, can't talk now, in a meeting. I love you so much!"`

.

1 2 3 4 5`def send_text(number, is_honeybear=False): message = "Sorry, can't talk now, in a meeting." alt_message = " I love you so much!" to_send = message + (alt_message if is_honeybear else "") sms.message(number, message)`

In the first case, where somebody other than the significant other is calling,

`send_text`

uses the default value for`is_honeybear`

which is`False`

.

In practice, default values tend to reveal themselves as you begin to use a function. In the design phase, you usually make the function and arguments as simple as possible, but after use in the codebase, it becomes clear that sometimes you need to override certain values.

As a concrete example, suppose you've written a function that reads in a tabular dataset (a spreadsheet with column names), and by default, expects columns to be separated by tabs:

1 2 3 4`import csv def read_data(filename): with file(filename) as fp: return csv.reader(fp, delimiter='\t')`

If we call

`read_data`

on a dataset like the file`sample_data.txt`

1 2 3 4 5 6 7`Product Amount UnitPrice wagon 2 30.00 shovel 1 10.00 fertilizer 22 1.50 . . . . . . . . .`

It will work fine, because the argument for the separator tells

`pandas.read_csv`

to expect columns separated by tabs, as we find in the file.However, you start receiving similarly formatted spreadsheets from a new client to analyze, but instead of separating the columns by tabs, they're separated by commas. Because the tasks are so similar we don't want to write a separate function to handle comma-separated and tab-separated data.

We can fix this by giving

`read_data`

an optional argument for the separator,`separator`

. For convenience, can set the default value to the most common argument we expect to use in importing datasets, the tab separator`\t`

.

1 2 3 4`import csv def read_data(filename, separator='\t'): with file(filename) as fp: return csv.reader(fp, delimiter=separator)`

Thus, whenever we call

`read_data`

on tab separated data, we can just provide the filename. When we need to specify the separator to be something different, commas for instance, we can call`read_data('sample_data.txt', separator=',')`

.