
nnia2nd_assignment

Kernel: Python 3 (Ubuntu Linux)

Exercise Sheet 2: Linear Algebra and TensorFlow Basics (Deadline: 14 Nov 14:00)

Linear Algebra (11 points)

For theoretical tasks you are encouraged to write LaTeX. Jupyter notebooks support it: simply type an expression between dollar signs or inside blocks like \begin{equation} ... \end{equation}.

Alternatively, you can upload the solutions in written form as images and paste them inside cells. But if you do this, make sure that the images have high quality, so we can read them without any problems.

1. Orthogonal vectors (1 point)

Let $a, b$ be non-zero vectors in $\mathbb{R}^n$. Show that $a^T b = 0$ if and only if $a$ and $b$ are orthogonal vectors (i.e. the angle between them equals $\frac{\pi}{2} + k \pi$ radians).

Hint: be careful to prove 2 directions: 1.) $a^T b = 0 \implies$ orthogonality and 2.) orthogonality $\implies a^T b = 0$.

2. Interpretation of matrix multiplication (2 points)

Let $$A=\begin{bmatrix} 2 & 4 \\ 4 & 2 \\ 2 & 0 \end{bmatrix} \qquad B=\begin{bmatrix} 0.6 & -0.8 \\ 0.8 & 0.6 \end{bmatrix}$$

a. Are the column vectors of B orthogonal? Orthonormal? Is B a rotation matrix? Justify your answers. (0.5 points)

b. Calculate $A \cdot B$. (0.5 points)

c. What is the interpretation of a matrix multiplication when the column vectors of B are orthonormal (you may want to draw a simple picture visualizing the row vectors of A and the column vectors of B, but you don't need to submit it)? (1 point)

3. Property of the $X^T X$ matrix (1 point)

Let $X \in \mathbb{R}^{m \times n}$ be a matrix obtained by stacking all training vectors (also called the "design matrix"), like in the lecture. Obviously, it can be an arbitrary matrix. However, the covariance matrix $C = X^T X$ has some interesting properties, namely it is always a positive semidefinite matrix ($v^T C v \geq 0$ for all non-zero $v \in \mathbb{R}^n$). Prove this statement.

Hint: $v^T C v$ can be represented as a squared norm of some vector.

4. Eigendecomposition (4 points)

Consider the following matrix: $$M=\begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$ Is the matrix M:

a. Symmetric? What does it imply for its eigendecomposition? (0.5 points)

b. Singular? What does it imply for its eigendecomposition? (0.5 points)

c. Find the eigendecomposition of M (3 points).

5. Rayleigh-Ritz principle (3 points)

In the lecture we saw that the exact solution for the decoding matrix D in the PCA problem is a matrix that contains the $l$ eigenvectors corresponding to the largest eigenvalues of $X^T X$, where X is a design matrix. But where does it come from?

If you want to see the full derivation of PCA, you can find it in the Deep Learning Book, chapter 2.12. In this exercise we would like to focus only on the Rayleigh-Ritz principle, which allows us to solve optimization problems of the type $\min_{\|v\|_2 = 1} v^T C v$ or $\max_{\|v\|_2 = 1} v^T C v$ using the eigendecomposition of C.

So prove that $\min_{\|v\|_2 = 1} v^T C v = \lambda_{min}$, where $\lambda_{min}$ is the smallest eigenvalue of C. For which vector $v$ is this minimum attained? (The proof for $\max_{\|v\|_2 = 1} v^T C v = \lambda_{max}$ is analogous.)

Answer - 1

Given $a, b \in \mathbb{R}^n$, $a^T b$ is just the inner product $\langle a, b \rangle$.

To prove both directions it is sufficient to show that for $a, b \in \mathbb{R}^n$, $\langle a, b \rangle = \|a\| \, \|b\| \cos\theta$.

Using the Law of Cosines and the bilinearity of the inner product, we have:

$$\begin{aligned} \|a-b\|^{2} &= \|a\|^2 + \|b\|^2 - 2 \|a\| \|b\| \cos\theta \\ &= \langle a,a \rangle + \langle b,b \rangle - 2 \langle a,b \rangle \\ &= \langle a-b, a-b \rangle \end{aligned}$$

Equating the two expansions, and noting that $\cos\theta = 0$ for all $\theta \in \{\frac{\pi}{2} + k\pi : k \in \mathbb{Z}\}$, we have our result.
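Spelling out the final step for clarity:

$$\|a\|^2 + \|b\|^2 - 2\|a\|\|b\|\cos\theta = \langle a,a \rangle + \langle b,b \rangle - 2\langle a,b \rangle \implies a^T b = \langle a,b \rangle = \|a\| \, \|b\| \cos\theta.$$

Since $a$ and $b$ are non-zero, $\|a\| \, \|b\| > 0$, so $a^T b = 0 \iff \cos\theta = 0 \iff \theta = \frac{\pi}{2} + k\pi$, which gives both directions at once.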

Answer - 2

a.

(i) $B_{:,1}^T B_{:,2} = 0.6 \cdot (-0.8) + 0.8 \cdot 0.6 = -0.48 + 0.48 = 0$

(ii) They are orthogonal since their dot product equals 0. They are orthonormal since, in addition, each column has unit norm: $\sqrt{0.6^2 + 0.8^2} = \sqrt{0.8^2 + 0.6^2} = 1$.

(iii) A square matrix $X$ is a rotation matrix iff $\det(X) = 1$ and $X$ is an orthogonal matrix. Here $\det(B) = 0.36 + 0.64 = 1$ and, by (i) and (ii), the columns of $B$ are orthonormal, so $B$ is orthogonal. Therefore, we conclude that $B$ is a rotation matrix.
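A quick NumPy check of these claims (an optional numerical sanity check, not required by the task):

import numpy as np

B = np.array([[0.6, -0.8],
              [0.8,  0.6]])

# Columns are orthonormal iff B^T B equals the identity matrix
print(np.allclose(B.T @ B, np.eye(2)))      # True -> orthonormal columns
# An orthogonal matrix with determinant +1 is a rotation matrix
print(np.isclose(np.linalg.det(B), 1.0))    # True -> det(B) = 1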

b.

import numpy as np

A = [[2, 4],
     [4, 2],
     [2, 0]]
B = [[0.6, -0.8],
     [0.8,  0.6]]
AB = np.matmul(A, B)
AB
array([[ 4.4,  0.8],
       [ 4. , -2. ],
       [ 1.2, -1.6]])

c.

Given a rotation matrix $R \in \mathbb{R}^{n \times n}$ and some vector $x \in \mathbb{R}^n$, the product $Rx$ is a counter-clockwise rotation of $x$ by $R$. Since $R^T = R^{-1}$, $R^T$ must be a clockwise rotation by the same angle.

We note that $(AB)^T = B^T A^T$, and so we conclude that the row vectors of $A$ are rotated clockwise by the operation $AB$.
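To illustrate this interpretation numerically (a small optional check):

import numpy as np

A = np.array([[2, 4], [4, 2], [2, 0]])
B = np.array([[0.6, -0.8], [0.8, 0.6]])

# Row i of A @ B equals B^T applied to row i of A, i.e. each row vector of A
# rotated clockwise by the angle of the rotation B.
print(np.allclose((A @ B)[0], B.T @ A[0]))  # True (and likewise for the other rows)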

Answer - 3

$$\begin{aligned} v^T C v = v^T X^T X v &= (X v)^{T}(X v) = \|Xv\|_2^2 \\ &\geq 0 \quad \text{by the property of norms} \end{aligned}$$
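A small numerical illustration of this argument (optional; the random design matrix below is just an assumed example):

import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)          # an arbitrary design matrix
C = X.T @ X                        # C = X^T X

v = np.random.randn(3)             # an arbitrary non-zero vector
print(np.isclose(v @ C @ v, np.linalg.norm(X @ v) ** 2))  # v^T C v equals ||Xv||^2
print(v @ C @ v >= 0)              # and is therefore non-negative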

Answer - 4

Have:

$$M=\begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

a.

$M^T = M \; \therefore \; M$ is symmetric.

Symmetric $n \times n$ matrices have real eigenvalues and $n$ linearly independent eigenvectors, and their normalised eigenvectors form an orthonormal set, so $M$ is orthogonally diagonalizable: $M = P \Lambda P^T$ with orthogonal $P$.

b.

We find that $\det(M) = 0$, therefore $M$ is singular. This implies that $0$ is an eigenvalue of $M$.

c.

We look for an eigendecomposition of the form $M = P \Lambda P^{-1}$, where the eigenvectors make up the columns of $P$, with each column $i$ corresponding to the eigenvalue found in the diagonal eigenvalue matrix, i.e. $\Lambda_{ii}$.

Solving $\det(M - \lambda I) = 0$ yields the characteristic polynomial $-\lambda^3 + 4\lambda^2 - 3\lambda = 0$. Now solve for $\lambda$ to find the roots, i.e. the eigenvalues.

We find the following eigenvalues: $0, 1, 3$. These make up the diagonal matrix $\Lambda$.

Now solve $(M - \lambda_i I)\, v_{\lambda_i} = 0$ to obtain each eigenvector $v_{\lambda_i}$ corresponding to each eigenvalue $\lambda_i$. We find three eigenvectors:

$$\begin{align} v_1^T &= [1, 1, 1] \\ v_2^T &= [1, 0, -1] \\ v_3^T &= [1, -2, 1] \end{align}$$

Next we normalize each of the eigenvectors and form the matrix $P$, obtaining $MP = P \Lambda$.

Finally we use the property of symmetric matrices that $P^{-1} = P^T$, so $M = P \Lambda P^{-1} = P \Lambda P^{T}$, which is our eigendecomposition of $M$.

We include an attachment of hand calculation as well as numeric computation for confirmation:

M = [[ 1, -1,  0],
     [-1,  2, -1],
     [ 0, -1,  1]]
np.linalg.eigh(M)
(array([  9.99658224e-17,   1.00000000e+00,   3.00000000e+00]),
 array([[ -5.77350269e-01,  -7.07106781e-01,   4.08248290e-01],
        [ -5.77350269e-01,   9.71445147e-17,  -8.16496581e-01],
        [ -5.77350269e-01,   7.07106781e-01,   4.08248290e-01]]))

$$\therefore \; M = \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{bmatrix}^{T}$$
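A short numerical confirmation (optional) that the normalised eigenvectors above indeed reconstruct M:

import numpy as np

P = np.array([[1/np.sqrt(3),  1/np.sqrt(2),  1/np.sqrt(6)],
              [1/np.sqrt(3),  0,            -2/np.sqrt(6)],
              [1/np.sqrt(3), -1/np.sqrt(2),  1/np.sqrt(6)]])
L = np.diag([0.0, 1.0, 3.0])
M = np.array([[1, -1, 0], [-1, 2, -1], [0, -1, 1]])

print(np.allclose(P.T @ P, np.eye(3)))   # P is orthogonal, so P^{-1} = P^T
print(np.allclose(P @ L @ P.T, M))       # P Lambda P^T reconstructs M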

Answer - 5

Let $C \in \mathbb{R}^{n \times n}$ be symmetric, and let $v \in \mathbb{R}^n$ with $\|v\|_2 = 1$.

We have the eigendecomposition $C = P \Lambda P^T$, where $P$ is orthogonal and $\Lambda$ is a diagonal matrix with the eigenvalues $\Lambda_{ii} = \lambda_i$ on the diagonal. WLOG, we order the eigenvalues in ascending order: $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$. Write $w = P^T v$; since $P$ is orthogonal, $\|w\|_2 = \|P^T v\|_2 = \|v\|_2 = 1$.

Then,

$$\begin{align} v^T C v &= v^T P \Lambda P^T v \\ &= (P^T v)^{T} \Lambda (P^T v) = w^T \Lambda w \\ &= \sum_{i=1}^{n} \lambda_i w_i^{2} \\ &\geq \lambda_1 \sum_{i=1}^{n} w_i^{2} && \text{using that } \lambda_1 \text{ is the smallest eigenvalue} \\ &= \lambda_1 \|w\|_2^{2} = \lambda_1 && \text{using } \|w\|_2 = 1 \\ \implies \min_{\|v\|_2 = 1} v^T C v &= \lambda_1 \end{align}$$

(The bound holds for every unit vector $v$ and, as shown below, it is attained, so the minimum indeed equals $\lambda_1$.)

Using the same logic we can show that $\max_{\|v\|_2 = 1} v^T C v = \lambda_n$.

Moreover, let $v_i$ be the eigenvector corresponding to the eigenvalue $\lambda_i$; since the columns of $P$ form an orthonormal set of eigenvectors, it is easy to show that $v_i^T C v_i = \lambda_i$. In particular, the minimum is attained at $v_1$, the unit eigenvector corresponding to $\lambda_{min}$.
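A small numerical illustration of the Rayleigh-Ritz principle (optional; the random matrix below is just an assumed example):

import numpy as np

np.random.seed(0)
X = np.random.randn(10, 4)
C = X.T @ X                                   # a symmetric positive semidefinite matrix

eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
v1 = eigvecs[:, 0]                            # unit eigenvector for lambda_min
print(np.isclose(v1 @ C @ v1, eigvals[0]))    # the value lambda_min is attained at v1

# Random unit vectors never produce a value below lambda_min
vs = np.random.randn(1000, 4)
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', vs, C, vs) >= eigvals[0] - 1e-9))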

TensorFlow (9 points)

1. A simple computational graph in Tensorflow (2 + 1 = 3 points)

To get started with TensorFlow we ask you to specify a very simple computational graph of the function $f(x) = x^2 + 2x + 5$ and also to take its derivative. If you don't know how to proceed, please study the example given in the lecture and also the official TensorFlow getting started guide: https://www.tensorflow.org/get_started/get_started.

This exercise requires around 10 lines of code, so it is simple enough. However, it is very important to understand the notion of a computational graph.

Actually, this simple example will be a backbone for all neural networks in the future. All we will need to do is first specify the forward pass (f), and then take derivatives with respect to the parameters (x) in order to update them so that they minimize the loss function f. But we will come back to this in several weeks.

%matplotlib inline
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import math

### Define a simple comp. graph (note that all its components are special TensorFlow objects)
# Specify an input x to the comp. graph
x = tf.placeholder(tf.float32)
# Specify f(x) = x^2 + 2x + 5
fx = x * x + 2 * x + 5
# Take derivative of f with respect to x. That will be very important in the future, since you will only
# need to specify the forward pass of your neural network, and all derivatives will be determined
# automatically by TensorFlow.
df = tf.gradients(fx, x)

### Execute the comp. graph
# Create a TensorFlow session
sess = tf.Session()
# Evaluate the function value and its derivative for values from -10 to 10 with step 0.1
res1 = sess.run(fx, {x: np.arange(-10, 10, 0.1)})
res2 = sess.run(df, {x: np.arange(-10, 10, 0.1)})
# Plot the function f (use matplotlib library)
plt.plot(np.arange(-10, 10, 0.1), res1, color='b')
# Plot the derivative of the function f
plt.plot(np.arange(-10, 10, 0.1), res2[0], color='r')
[Output: plot of f(x) = x^2 + 2x + 5 in blue and its derivative in red]

TensorFlow Question:

1.) In the exercise above x should be a TF placeholder, but f is a TF tensor. What is the difference between a placeholder and other tensors in the computational graph? And how does a TF Variable differ from tensors? (0.5 point)

Answer:

  • A TF placeholder is a structure that makes a promise to provide data at some later point. The value of the placeholder must be fed at runtime. It is often used to feed datasets and input values.

  • A tensor is a multi-dimensional array-like structure. Tensors are the basic units of TensorFlow.

  • A TF Variable is a structure that initializes and stores a value, with the ability to change it later. The initial value must be specified in the declaration. It is often used to store a model's weights and to update them during training.
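A minimal sketch illustrating the three notions with the TF 1.x API used in this notebook (the concrete numbers are only an assumed example):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32)     # placeholder: the value is promised and fed only at run time
w = tf.Variable(3.0)               # Variable: initial value given at declaration, can be updated later
y = w * x                          # an ordinary tensor produced by an operation in the graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # Variables must be initialized explicitly
    print(sess.run(y, {x: np.arange(3.0)}))      # placeholder fed here -> [0. 3. 6.]
    sess.run(w.assign(5.0))                      # the Variable's value can be changed
    print(sess.run(y, {x: np.arange(3.0)}))      # -> [0. 5. 10.]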

TensorFlow Question:

2.) What does tf.Session().run(...) accept as an argument and what does it return? What is the fundamental difference between these 2 types of objects? (0.5 point)

Answer:

  • tf.Session().run takes as arguments the operation(s) to be carried out (graph element(s)) as well as the assigned values for the placeholder(s) in the operation(s). It returns the results of carrying out the given operation(s) on the given placeholder value(s).

  • The difference is that the argument is a graph element, while the returned object is the concrete result of evaluating that element. For example, passing the argument tf.constant([10, 20]) will result in the 1-D array [10, 20].
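A tiny example of this (using the constant mentioned above):

import tensorflow as tf

c = tf.constant([10, 20])       # the argument to run(): a graph element, no value yet
with tf.Session() as sess:
    result = sess.run(c)        # the return value: an ordinary NumPy array
    print(result)               # [10 20]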

2. A pure Python implementation (1 point)

Please do the same, but in pure Python, calculating the expression for the derivative by hand. There is no need to use the NumPy library, since we deal with a 1-dimensional x (0.5 point).

# Evaluate the function value and its derivative for values from -10 to 10 with step 0.1
# Define f
pyFx = lambda x: x * x + 2 * x + 5
# Define f_derivative_val
pyDf = lambda x: 2 * x + 2
# Plot the function f
plt.plot(np.arange(-10, 10, 0.1), list(map(pyFx, np.arange(-10, 10, 0.1))), color='b')
# Plot the derivative of the function f
plt.plot(np.arange(-10, 10, 0.1), list(map(pyDf, np.arange(-10, 10, 0.1))), color='r')
[Output: plot of f(x) in blue and its hand-derived derivative 2x + 2 in red]

3. A more complicated computational graph (5 points)

Okay, you could easily take the derivative of $f(x) = x^2 + 2x + 5$ by hand. But what if you need to take the derivative of the function $f(x) = \sigma(-6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2}))$ (where $\sigma = \frac{1}{1 + \exp(-x)}$) with respect to x? It resembles a neural network much more than the previous example. Please implement it in TensorFlow (0.5 point) and in pure Python (0.5 point) in the same way as before (but using new cells with the code below).

Note that in pure Python implementation you will need to somehow overcome the numerical errors for small values of x (roughly from -0.1 to 0.1). You can just skip these values.

For pure Python implementation you will need to derive the derivative by hand (3 points). Please write down your derivation for the new f(x)f(x) (you are encouraged to write LaTeX for them):

$\frac{df}{dx} = \ldots$

Hint 1: you may want to use a new variable to simplify your derivations, for example setting $a = -6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2})$ can help. Don't worry if the final expression is quite ugly, it is supposed to be so.

Hint 2: your derived expression for $\frac{df}{dx}$ should produce exactly the same values (up to the float type precision) as the TensorFlow derivative.

Answer:

$$\begin{align} f(x) &= \sigma(-6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2})) \\ \sigma &= \frac{1}{1 + \exp(-x)} \\ \alpha &= -6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2}) \\ \beta &= x + 5 x^{-2} + 3 x^{-3} \\ \gamma &= -x + 4 x^{-2} \\ f'(x) &= \sigma'(\alpha)\, \alpha' \\ \alpha' &= -6 \sigma'(\beta) (1 - 10x^{-3} - 9x^{-4}) + 2 \sigma'(\gamma) (-1 - 8x^{-3}) \\ \sigma' &= \frac{\exp(-x)}{(1 + \exp(-x))^2} \end{align}$$

This gives us our result.
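As an optional sanity check of the chain-rule computation above (not part of the submission, and assuming SymPy is available in this kernel), the derived expression can be compared against symbolic differentiation at a sample point:

import sympy as sp

x_sym = sp.symbols('x')
sigma = lambda t: 1 / (1 + sp.exp(-t))
sigma_p = lambda t: sp.exp(-t) / (1 + sp.exp(-t))**2

beta = x_sym + 5*x_sym**-2 + 3*x_sym**-3
gamma = -x_sym + 4*x_sym**-2
alpha = -6*sigma(beta) + 2*sigma(gamma)

derived = sigma_p(alpha) * (-6*sigma_p(beta)*(1 - 10*x_sym**-3 - 9*x_sym**-4)
                            + 2*sigma_p(gamma)*(-1 - 8*x_sym**-3))

# Both values should coincide at any test point away from 0
print(sp.diff(sigma(alpha), x_sym).subs(x_sym, 1.7).evalf())
print(derived.subs(x_sym, 1.7).evalf())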

There also exists another interpretation of the problem, in which $\sigma = \sigma(x)$ is treated as a fixed factor: $f_2(x) = \sigma \times (-6 \sigma \times (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma \times (-x + 4 x^{-2}))$. For this the derivative is: $f_2'(x) = -\frac{2 e^{2 x} \left(8 x^5+4 e^x x^4+4 x^4+22 x^2-22 e^x x-4 x-27 e^x-27\right)}{\left(e^x+1\right)^3 x^4}$.

We include both interpretations in the analysis that follows.

# TensorFlow
x = tf.placeholder(tf.float32)

## NOTE: We examine two functions, one per interpretation of the problem
# Interpreting σ as a function of x
longFx = (1)/(1+tf.exp(-1 * ((-6 * ((1)/(1+tf.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + tf.exp(-1 * (-x+(4 * (x ** -2))))))))))
longFd = tf.gradients(longFx, x)
# Interpreting σ as a fixed term
longFx2 = 1/(1+tf.exp(-x)) * (-6*(1/(1+tf.exp(-x))) * (x + 5*x**(-2) + 3*x**(-3)) + 2 * 1/(1+tf.exp(-x)) * (-x + 4*x**(-2)))
longFd2 = tf.gradients(longFx2, x)

sess = tf.Session()
res1 = sess.run(longFx, {x: np.arange(-10, 10, 0.1)})
res2 = sess.run(longFd, {x: np.arange(-10, 10, 0.1)})
res3 = sess.run(longFx2, {x: np.arange(-10, 10, 0.1)})
res4 = sess.run(longFd2, {x: np.arange(-10, 10, 0.1)})

# We plot both functions
plt.plot(np.arange(-10, 10, 0.1), res1, color="b")
plt.plot(np.arange(-10, 10, 0.1), res2[0], color="r")
plt.title('fx1')
plt.show()
plt.plot(np.arange(-10, 10, 0.1), res3, color="b")
plt.plot(np.arange(-10, 10, 0.1), res4[0], color="r")
plt.title('fx2')
plt.show()
[Output: two plots, fx1 and fx2, each showing the function in blue and its TensorFlow gradient in red]
# Pure python implementation
# first interpretation
pyLongFx = lambda x: (1)/(1+np.exp(-1 * ((-6 * ((1)/(1+np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + np.exp(-1 * (-x+(4 * (x ** -2))))))))))
pyLongFd = lambda x: ((np.exp(-1 * ((-6 / (1 + np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3)))))) + (2 / (1 + np.exp(x - (4 * (x ** -2)))))))/ ((1+np.exp(-1 * ((-6 * ((1)/(1+np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + np.exp(-1 * ((-1 * x)+(4 * (x ** -2)))))))))) ** 2)) * ((((-6 * np.exp((-1 * x)-(5 * (x ** -2))-(3 * (x ** -3)))) * (1 - (10 * (x ** -3)) - (9 * (x ** -4)))) / ((1 + np.exp((-1 * x)-(5 * (x ** -2))-(3 * (x ** -3)))) ** 2)) + (((2 * np.exp(x - (4*(x**-2)))) * (-1 - (8 * (x**-3)))) / ((1 + np.exp(x - (4*(x**-2)))) ** 2)))
# second interpretation
pyLongFx2 = lambda x: 1/(1+np.exp(-x)) * (-6*(1/(1+np.exp(-x))) * (x + 5*x**(-2) + 3*x**(-3)) + 2 * 1/(1+np.exp(-x)) * (-x + 4*x**(-2)))
# note the parentheses around the whole denominator (e^x + 1)^3 * x^4
pyLongFd2 = lambda x: - (2*np.exp(2*x)*(8*x**5 + 4*np.exp(x)*x**4 + 4*x**4 + 22*x**2 - 22*np.exp(x)*x - 4*x - 27*np.exp(x) - 27))/((np.exp(x) + 1)**3 * x**4)
# round the two outputs to 10 decimal places, to avoid floating point precision problems
# TODO: compare the result of both methods.
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFx, np.arange(-10, 10, 0.1))), color='b')
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFd, np.arange(-10, 10, 0.1))), color='r')
plt.title('fx1')
plt.show()
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFx2, np.arange(-10, 10, 0.1))), color='b')
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFd2, np.arange(-10, 10, 0.1))), color='r')
plt.title('fx2')
plt.show()
/usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:4: RuntimeWarning: overflow encountered in exp /usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:5: RuntimeWarning: overflow encountered in exp /usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:5: RuntimeWarning: invalid value encountered in double_scalars
[Output: two plots, fx1 and fx2, each showing the function in blue and its hand-derived derivative in red]
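One possible way to address the TODO above (assuming pyLongFd and res2 from the two previous cells are still in scope):

xs = np.arange(-10, 10, 0.1)
py_vals = np.array(list(map(pyLongFd, xs)))   # hand-derived derivative (emits the same overflow warnings near 0)
tf_vals = res2[0]                             # TensorFlow gradient of the first interpretation

# Compare only where both are numerically well-behaved; the region around 0
# overflows, as noted in the task description.
mask = np.isfinite(py_vals) & np.isfinite(tf_vals) & (np.abs(xs) > 0.25)
print(np.max(np.abs(py_vals[mask] - tf_vals[mask])))  # expected to be on the order of float32 precision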

Questions (1 point)

1.) What is the main difference in the program structure between TensorFlow and plain Python? (0.5 point)

The way variables are defined is different: in TF we use TF objects instead of plain Python objects. In addition, calculations in TF are not carried out until tf.Session().run() is executed, while in pure Python the statements are executed as they are interpreted.
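A minimal sketch of this difference (assuming the imports from the earlier cells):

a = 2 + 3                              # plain Python: evaluated immediately, a == 5
b = tf.constant(2) + tf.constant(3)    # TensorFlow: this only adds nodes to the graph
print(b)                               # prints a Tensor object, not a value
print(tf.Session().run(b))             # 5 -- the value is computed only when the graph is run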

2.) Does TensorFlow provide numerical or automatic differentiation? What are the advantages of this way of differentiation? (0.5 point)

TF provides automatic differentiation. It applies the chain rule to the exact operations of the graph, so the derivatives it produces are exact up to floating-point precision; this avoids the truncation and round-off errors of numerical (finite-difference) differentiation and saves us from deriving the expressions by hand. We also noted this when comparing the '2nd interpretation' plots.

Submission instructions

You should provide a single Jupyter notebook as a solution. The naming should include the assignment number and matriculation IDs of all team members in the following format: assignment-1_matriculation1_matriculation_2_matriculation3.ipynb (in case of 3 team members). Make sure to keep the order matriculation1_matriculation_2_matriculation3 the same for all assignments.

Please, submit your solution to your tutor (with [NNIA][assignment-2] in email subject):

  1. Maksym Andriushchenko [email protected]

  2. Marius Mosbach [email protected]

  3. Rajarshi Biswas [email protected]

  4. Marimuthu Kalimuthu [email protected]

If you are in a team, please submit only 1 solution to only 1 tutor.