
nnia2nd_assignment

Kernel: Python 3 (Ubuntu Linux)

Exercise Sheet 2: Linear Algebra and TensorFlow Basics (Deadline: 14 Nov 14:00)

Linear Algebra (11 points)

For theoretical tasks you are encouraged to write LaTeX. Jupyter notebooks support it: simply type an expression between dollar signs or inside blocks like \begin{equation} ... \end{equation}.

Alternatively, you can upload the solutions in written form as images and paste them inside cells. But if you do this, make sure that the images have high quality, so we can read them without any problems.

1. Orthogonal vectors (1 point)

Let $a, b$ be non-zero vectors in $\mathbb{R}^n$. Show that $a^T b = 0$ if and only if $a$ and $b$ are orthogonal vectors (i.e. the angle between them equals $\frac{\pi}{2} + k \pi$ radians).

Hint: be careful to prove 2 directions: 1.) $a^T b = 0 \implies$ orthogonality and 2.) orthogonality $\implies a^T b = 0$.

2. Interpretation of matrix multiplication (2 points)

Let $$A=\begin{bmatrix} 2 & 4 \\ 4 & 2 \\ 2 & 0 \end{bmatrix} \qquad B=\begin{bmatrix} 0.6 & -0.8 \\ 0.8 & 0.6 \end{bmatrix}$$

a. Are the column vectors of B orthogonal? Orthonormal? Is B a rotation matrix? Justify your answers. (0.5 points)

b. Calculate $A \cdot B$. (0.5 points)

c. What is the interpretation of a matrix multiplication when the column vectors of B are orthonormal (you may want to draw a simple picture visualizing the row vectors of A and the column vectors of B, but you don't need to submit it)? (1 point)

3. Property of the $X^T X$ matrix (1 point)

Let $X \in \mathbb{R}^{m \times n}$ be a matrix obtained by stacking all training vectors (also called the "design matrix"), like in the lecture. Obviously, it can be an arbitrary matrix. However, the covariance matrix $C = X^T X$ has some interesting properties, namely it is always a positive semidefinite matrix ($v^T C v \geq 0$ for all non-zero $v \in \mathbb{R}^n$). Prove this statement.

Hint: $v^T C v$ can be represented as a squared norm of some vector.

4. Eigendecomposition (4 points)

Consider the following matrix: $$M=\begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$ Is the matrix M:

a. Symmetric? What does it imply for its eigendecomposition? (0.5 points)

b. Singular? What does it imply for its eigendecomposition? (0.5 points)

c. Find the eigendecomposition of M (3 points).

5. Rayleigh-Ritz principle (3 points)

In the lecture we saw that the exact solution for the decoding matrix D in the PCA problem is a matrix that contains the $l$ eigenvectors corresponding to the largest eigenvalues of $X^T X$, where X is a design matrix. But where does it come from?

If you want to see the full derivation of PCA, you can find it in the Deep Learning Book, chapter 2.12. In this exercise we would like to focus only on the Rayleigh-Ritz principle, which allows us to solve optimization problems of the type $\min_{\|v\|_2 = 1} v^T C v$ or $\max_{\|v\|_2 = 1} v^T C v$ using the eigendecomposition of C.

So prove that $\min_{\|v\|_2 = 1} v^T C v = \lambda_{min}$, where $\lambda_{min}$ is the smallest eigenvalue of C. For which vector $v$ is this minimum attained? (The proof for $\max_{\|v\|_2 = 1} v^T C v = \lambda_{max}$ is analogous.)

Answer - 1

Given $a, b \in \mathbb{R}^n$, $a^T b$ is just the inner product $\langle a, b \rangle$.

To prove both directions it is sufficient to show that for $a, b \in \mathbb{R}^n$, $\langle a, b \rangle = \|a\| \, \|b\| \cos\theta$.

Using the Law of Cosines and the bilinearity of the inner product, we have:

$$\begin{aligned} \|a-b\|^{2} &= \|a\|^2 + \|b\|^2 - 2 \|a\| \|b\| \cos\theta \\ &= \langle a,a \rangle + \langle b,b \rangle - 2 \langle a,b \rangle \\ &= \langle a-b, a-b \rangle \end{aligned}$$

Equating the two expansions, and noting that $\cos\theta = 0$ for all $\theta \in \{\frac{\pi}{2} + k\pi : k \in \mathbb{Z}\}$, we have our result.
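Spelling out the final step for clarity:

$$\|a\|^2 + \|b\|^2 - 2\|a\|\|b\|\cos\theta = \langle a,a \rangle + \langle b,b \rangle - 2\langle a,b \rangle \implies a^T b = \langle a,b \rangle = \|a\| \, \|b\| \cos\theta.$$

Since $a$ and $b$ are non-zero, $\|a\| \, \|b\| > 0$, so $a^T b = 0 \iff \cos\theta = 0 \iff \theta = \frac{\pi}{2} + k\pi$, which gives both directions at once.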

Answer - 2

a.

(i) $B_{:,1}^T B_{:,2} = 0.6 \cdot (-0.8) + 0.8 \cdot 0.6 = -0.48 + 0.48 = 0$

(ii) They are orthogonal since their dot product equals 0. They are orthonormal since, in addition, each column has unit norm: $\sqrt{0.6^2 + 0.8^2} = \sqrt{0.8^2 + 0.6^2} = 1$.

(iii) A square matrix $X$ is a rotation matrix iff $\det(X) = 1$ and $X$ is an orthogonal matrix. Here $\det(B) = 0.36 + 0.64 = 1$ and, by (i) and (ii), the columns of $B$ are orthonormal, so $B$ is orthogonal. Therefore, we conclude that $B$ is a rotation matrix.
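A quick NumPy check of these claims (an optional numerical sanity check, not required by the task):

import numpy as np

B = np.array([[0.6, -0.8],
              [0.8,  0.6]])

# Columns are orthonormal iff B^T B equals the identity matrix
print(np.allclose(B.T @ B, np.eye(2)))      # True -> orthonormal columns
# An orthogonal matrix with determinant +1 is a rotation matrix
print(np.isclose(np.linalg.det(B), 1.0))    # True -> det(B) = 1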

b.

import numpy as np

A = [[2, 4],
     [4, 2],
     [2, 0]]
B = [[0.6, -0.8],
     [0.8,  0.6]]
AB = np.matmul(A, B)
AB
array([[ 4.4,  0.8],
       [ 4. , -2. ],
       [ 1.2, -1.6]])

c.

Given a rotation matrix $R \in \mathbb{R}^{n \times n}$ and some vector $x \in \mathbb{R}^n$, the product $Rx$ is a counter-clockwise rotation of $x$ by $R$. Since $R^T = R^{-1}$, $R^T$ must be a clockwise rotation by the same angle.

We note that $(AB)^T = B^T A^T$, and so we conclude that the row vectors of $A$ are rotated clockwise by the operation $AB$.
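To illustrate this interpretation numerically (a small optional check):

import numpy as np

A = np.array([[2, 4], [4, 2], [2, 0]])
B = np.array([[0.6, -0.8], [0.8, 0.6]])

# Row i of A @ B equals B^T applied to row i of A, i.e. each row vector of A
# rotated clockwise by the angle of the rotation B.
print(np.allclose((A @ B)[0], B.T @ A[0]))  # True (and likewise for the other rows)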

Answer - 3

$$\begin{aligned} v^T C v = v^T X^T X v &= (X v)^{T}(X v) = \|Xv\|_2^2 \\ &\geq 0 \quad \text{by the property of norms} \end{aligned}$$
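A small numerical illustration of this argument (optional; the random design matrix below is just an assumed example):

import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)          # an arbitrary design matrix
C = X.T @ X                        # C = X^T X

v = np.random.randn(3)             # an arbitrary non-zero vector
print(np.isclose(v @ C @ v, np.linalg.norm(X @ v) ** 2))  # v^T C v equals ||Xv||^2
print(v @ C @ v >= 0)              # and is therefore non-negative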

Answer - 4

Have:

$$M=\begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

a.

$M^T = M \; \therefore \; M$ is symmetric.

Symmetric $n \times n$ matrices have real eigenvalues and $n$ linearly independent eigenvectors, and their normalised eigenvectors form an orthonormal set, so $M$ is orthogonally diagonalizable: $M = P \Lambda P^T$ with orthogonal $P$.

b.

We find that $\det(M) = 0$, therefore $M$ is singular. This implies that $0$ is an eigenvalue of $M$.

c.

We look for an eigendecomposition of the form $M = P \Lambda P^{-1}$, where the eigenvectors make up the columns of $P$, with each column $i$ corresponding to the eigenvalue found in the diagonal eigenvalue matrix, i.e. $\Lambda_{ii}$.

Solving $\det(M - \lambda I) = 0$ yields the characteristic polynomial $-\lambda^3 + 4\lambda^2 - 3\lambda = 0$. Now solve for $\lambda$ to find the roots, i.e. the eigenvalues.

We find the following eigenvalues: $0, 1, 3$. These make up the diagonal matrix $\Lambda$.

Now solve $(M - \lambda_i I)\, v_{\lambda_i} = 0$ to obtain each eigenvector $v_{\lambda_i}$ corresponding to each eigenvalue $\lambda_i$. We find three eigenvectors:

$$\begin{align} v_1^T &= [1, 1, 1] \\ v_2^T &= [1, 0, -1] \\ v_3^T &= [1, -2, 1] \end{align}$$

Next we normalize each of the eigenvectors and form the matrix $P$, obtaining $MP = P \Lambda$.

Finally we use the property of symmetric matrices that $P^{-1} = P^T$, so $M = P \Lambda P^{-1} = P \Lambda P^{T}$, which is our eigendecomposition of $M$.

We include an attachment of hand calculation as well as numeric computation for confirmation:

M = [[ 1, -1,  0],
     [-1,  2, -1],
     [ 0, -1,  1]]
np.linalg.eigh(M)
(array([  9.99658224e-17,   1.00000000e+00,   3.00000000e+00]),
 array([[ -5.77350269e-01,  -7.07106781e-01,   4.08248290e-01],
        [ -5.77350269e-01,   9.71445147e-17,  -8.16496581e-01],
        [ -5.77350269e-01,   7.07106781e-01,   4.08248290e-01]]))

$$\therefore \; M = \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{bmatrix}^{T}$$
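A short numerical confirmation (optional) that the normalised eigenvectors above indeed reconstruct M:

import numpy as np

P = np.array([[1/np.sqrt(3),  1/np.sqrt(2),  1/np.sqrt(6)],
              [1/np.sqrt(3),  0,            -2/np.sqrt(6)],
              [1/np.sqrt(3), -1/np.sqrt(2),  1/np.sqrt(6)]])
L = np.diag([0.0, 1.0, 3.0])
M = np.array([[1, -1, 0], [-1, 2, -1], [0, -1, 1]])

print(np.allclose(P.T @ P, np.eye(3)))   # P is orthogonal, so P^{-1} = P^T
print(np.allclose(P @ L @ P.T, M))       # P Lambda P^T reconstructs M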

Answer - 5

Let $C \in \mathbb{R}^{n \times n}$ be symmetric, and let $v \in \mathbb{R}^n$ with $\|v\|_2 = 1$.

We have the eigendecomposition $C = P \Lambda P^T$, where $P$ is orthogonal and $\Lambda$ is a diagonal matrix with the eigenvalues $\Lambda_{ii} = \lambda_i$ on the diagonal. WLOG, we order the eigenvalues in ascending order: $\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$. Write $w = P^T v$; since $P$ is orthogonal, $\|w\|_2 = \|P^T v\|_2 = \|v\|_2 = 1$.

Then,

$$\begin{align} v^T C v &= v^T P \Lambda P^T v \\ &= (P^T v)^{T} \Lambda (P^T v) = w^T \Lambda w \\ &= \sum_{i=1}^{n} \lambda_i w_i^{2} \\ &\geq \lambda_1 \sum_{i=1}^{n} w_i^{2} && \text{using that } \lambda_1 \text{ is the smallest eigenvalue} \\ &= \lambda_1 \|w\|_2^{2} = \lambda_1 && \text{using } \|w\|_2 = 1 \\ \implies \min_{\|v\|_2 = 1} v^T C v &= \lambda_1 \end{align}$$

(The bound holds for every unit vector $v$ and, as shown below, it is attained, so the minimum indeed equals $\lambda_1$.)

Using the same logic we can show that $\max_{\|v\|_2 = 1} v^T C v = \lambda_n$.

Moreover, let $v_i$ be the eigenvector corresponding to the eigenvalue $\lambda_i$; since the columns of $P$ form an orthonormal set of eigenvectors, it is easy to show that $v_i^T C v_i = \lambda_i$. In particular, the minimum is attained at $v_1$, the unit eigenvector corresponding to $\lambda_{min}$.
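A small numerical illustration of the Rayleigh-Ritz principle (optional; the random matrix below is just an assumed example):

import numpy as np

np.random.seed(0)
X = np.random.randn(10, 4)
C = X.T @ X                                   # a symmetric positive semidefinite matrix

eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
v1 = eigvecs[:, 0]                            # unit eigenvector for lambda_min
print(np.isclose(v1 @ C @ v1, eigvals[0]))    # the value lambda_min is attained at v1

# Random unit vectors never produce a value below lambda_min
vs = np.random.randn(1000, 4)
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', vs, C, vs) >= eigvals[0] - 1e-9))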

TensorFlow (9 points)

1. A simple computational graph in Tensorflow (2 + 1 = 3 points)

To get started with TensorFlow we ask you to specify a very simple computational graph of the function $f(x) = x^2 + 2x + 5$ and also to take its derivative. If you don't know how to proceed, please study the example given in the lecture and also the official TensorFlow getting started guide: https://www.tensorflow.org/get_started/get_started.

This exercise requires around 10 lines of code, so it is simple enough. However, it is very important to understand the notion of a computational graph.

Actually, this simple example will be a backbone for all neural networks in the future. All we will need to do is first specify the forward pass (f), and then take derivatives with respect to the parameters (x) in order to update them so that they minimize the loss function f. But we will come back to this in several weeks.

%matplotlib inline
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import math

### Define a simple comp. graph (note that all its components are special TensorFlow objects)
# Specify an input x to the comp. graph
x = tf.placeholder(tf.float32)
# Specify f(x) = x^2 + 2x + 5
fx = x * x + 2 * x + 5
# Take derivative of f with respect to x. That will be very important in the future, since you will only
# need to specify the forward pass of your neural network, and all derivatives will be determined
# automatically by TensorFlow.
df = tf.gradients(fx, x)

### Execute the comp. graph
# Create a TensorFlow session
sess = tf.Session()
# Evaluate the function value and its derivative for values from -10 to 10 with step 0.1
res1 = sess.run(fx, {x: np.arange(-10, 10, 0.1)})
res2 = sess.run(df, {x: np.arange(-10, 10, 0.1)})
# Plot the function f (use matplotlib library)
plt.plot(np.arange(-10, 10, 0.1), res1, color='b')
# Plot the derivative of the function f
plt.plot(np.arange(-10, 10, 0.1), res2[0], color='r')
[Output: plot of f(x) = x^2 + 2x + 5 in blue and its derivative in red]

TensorFlow Question:

1.) In the exercise above x should be a TF placeholder, but f is a TF tensor. What is the difference between a placeholder and other tensors in the computational graph? And how does a TF Variable differ from tensors? (0.5 point)

Answer:

  • A TF placeholder is a structure that makes a promise to provide data at some later point. The value of the placeholder must be fed at runtime. It is often used to feed datasets and input values.

  • A tensor is a multi-dimensional array-like structure. Tensors are the basic units of TensorFlow.

  • A TF Variable is a structure that initializes and stores a value, with the ability to change it later. The initial value must be specified in the declaration. It is often used to store a model's weights and to update them during training.
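A minimal sketch illustrating the three notions with the TF 1.x API used in this notebook (the concrete numbers are only an assumed example):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32)     # placeholder: the value is promised and fed only at run time
w = tf.Variable(3.0)               # Variable: initial value given at declaration, can be updated later
y = w * x                          # an ordinary tensor produced by an operation in the graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # Variables must be initialized explicitly
    print(sess.run(y, {x: np.arange(3.0)}))      # placeholder fed here -> [0. 3. 6.]
    sess.run(w.assign(5.0))                      # the Variable's value can be changed
    print(sess.run(y, {x: np.arange(3.0)}))      # -> [0. 5. 10.]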

TensorFlow Question:

2.) What does tf.Session().run(...) accept as an argument and what does it return? What is the fundamental difference between these 2 types of objects? (0.5 point)

Answer:

  • tf.Session().run takes as arguments the operation(s) to be carried out (graph element(s)) as well as the assigned values for the placeholder(s) in the operation(s). It returns the results of carrying out the given operation(s) on the given placeholder value(s).

  • The difference is that the argument is a graph element, while the returned object is the concrete result of evaluating that element. For example, passing the argument tf.constant([10, 20]) will result in the 1-D array [10, 20].
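A tiny example of this (using the constant mentioned above):

import tensorflow as tf

c = tf.constant([10, 20])       # the argument to run(): a graph element, no value yet
with tf.Session() as sess:
    result = sess.run(c)        # the return value: an ordinary NumPy array
    print(result)               # [10 20]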

2. A pure Python implementation (1 point)

Please do the same, but in pure Python, calculating the expression for the derivative by hand. There is no need to use the NumPy library, since we deal with a 1-dimensional x (0.5 point).

# Evaluate the function value and its derivative for values from -10 to 10 with step 0.1
# Define f
pyFx = lambda x: x * x + 2 * x + 5
# Define f_derivative_val
pyDf = lambda x: 2 * x + 2
# Plot the function f
plt.plot(np.arange(-10, 10, 0.1), list(map(pyFx, np.arange(-10, 10, 0.1))), color='b')
# Plot the derivative of the function f
plt.plot(np.arange(-10, 10, 0.1), list(map(pyDf, np.arange(-10, 10, 0.1))), color='r')
[Output: plot of f(x) in blue and its hand-derived derivative 2x + 2 in red]

3. A more complicated computational graph (5 points)

Okay, you could easily take the derivative of $f(x) = x^2 + 2x + 5$ by hand. But what if you need to take the derivative of the function $f(x) = \sigma(-6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2}))$ (where $\sigma = \frac{1}{1 + \exp(-x)}$) with respect to x? It resembles a neural network much more than the previous example. Please implement it in TensorFlow (0.5 point) and in pure Python (0.5 point) in the same way as before (but using new cells with the code below).

Note that in pure Python implementation you will need to somehow overcome the numerical errors for small values of x (roughly from -0.1 to 0.1). You can just skip these values.

For pure Python implementation you will need to derive the derivative by hand (3 points). Please write down your derivation for the new f(x)f(x) (you are encouraged to write LaTeX for them):

$\frac{df}{dx} = \ldots$

Hint 1: you may want to use a new variable to simplify your derivations, for example setting $a = -6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2})$ can help. Don't worry if the final expression is quite ugly, it is supposed to be so.

Hint 2: your derived expression for $\frac{df}{dx}$ should produce exactly the same values (up to the float type precision) as the TensorFlow derivative.

Answer:

$$\begin{align} f(x) &= \sigma(-6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2})) \\ \sigma &= \frac{1}{1 + \exp(-x)} \\ \alpha &= -6 \sigma (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma (-x + 4 x^{-2}) \\ \beta &= x + 5 x^{-2} + 3 x^{-3} \\ \gamma &= -x + 4 x^{-2} \\ f'(x) &= \sigma'(\alpha)\, \alpha' \\ \alpha' &= -6 \sigma'(\beta) (1 - 10x^{-3} - 9x^{-4}) + 2 \sigma'(\gamma) (-1 - 8x^{-3}) \\ \sigma' &= \frac{\exp(-x)}{(1 + \exp(-x))^2} \end{align}$$

This gives us our result.
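As an optional sanity check of the chain-rule computation above (not part of the submission, and assuming SymPy is available in this kernel), the derived expression can be compared against symbolic differentiation at a sample point:

import sympy as sp

x_sym = sp.symbols('x')
sigma = lambda t: 1 / (1 + sp.exp(-t))
sigma_p = lambda t: sp.exp(-t) / (1 + sp.exp(-t))**2

beta = x_sym + 5*x_sym**-2 + 3*x_sym**-3
gamma = -x_sym + 4*x_sym**-2
alpha = -6*sigma(beta) + 2*sigma(gamma)

derived = sigma_p(alpha) * (-6*sigma_p(beta)*(1 - 10*x_sym**-3 - 9*x_sym**-4)
                            + 2*sigma_p(gamma)*(-1 - 8*x_sym**-3))

# Both values should coincide at any test point away from 0
print(sp.diff(sigma(alpha), x_sym).subs(x_sym, 1.7).evalf())
print(derived.subs(x_sym, 1.7).evalf())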

There also exists another interpretation of the problem, in which $\sigma = \sigma(x)$ is treated as a fixed factor: $f_2(x) = \sigma \times (-6 \sigma \times (x + 5 x^{-2} + 3 x^{-3}) + 2 \sigma \times (-x + 4 x^{-2}))$. For this the derivative is: $f_2'(x) = -\frac{2 e^{2 x} \left(8 x^5+4 e^x x^4+4 x^4+22 x^2-22 e^x x-4 x-27 e^x-27\right)}{\left(e^x+1\right)^3 x^4}$.

We include both interpretations in the analysis that follows.

# TensorFlow
x = tf.placeholder(tf.float32)

## NOTE: We examine two functions, one per interpretation of the problem
# Interpreting σ as a function of x
longFx = (1)/(1+tf.exp(-1 * ((-6 * ((1)/(1+tf.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + tf.exp(-1 * (-x+(4 * (x ** -2))))))))))
longFd = tf.gradients(longFx, x)
# Interpreting σ as a fixed term
longFx2 = 1/(1+tf.exp(-x)) * (-6*(1/(1+tf.exp(-x))) * (x + 5*x**(-2) + 3*x**(-3)) + 2 * 1/(1+tf.exp(-x)) * (-x + 4*x**(-2)))
longFd2 = tf.gradients(longFx2, x)

sess = tf.Session()
res1 = sess.run(longFx, {x: np.arange(-10, 10, 0.1)})
res2 = sess.run(longFd, {x: np.arange(-10, 10, 0.1)})
res3 = sess.run(longFx2, {x: np.arange(-10, 10, 0.1)})
res4 = sess.run(longFd2, {x: np.arange(-10, 10, 0.1)})

# We plot both functions
plt.plot(np.arange(-10, 10, 0.1), res1, color="b")
plt.plot(np.arange(-10, 10, 0.1), res2[0], color="r")
plt.title('fx1')
plt.show()
plt.plot(np.arange(-10, 10, 0.1), res3, color="b")
plt.plot(np.arange(-10, 10, 0.1), res4[0], color="r")
plt.title('fx2')
plt.show()
[Output: two plots, fx1 and fx2, each showing the function in blue and its TensorFlow gradient in red]
# Pure python implementation
# first interpretation
pyLongFx = lambda x: (1)/(1+np.exp(-1 * ((-6 * ((1)/(1+np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + np.exp(-1 * (-x+(4 * (x ** -2))))))))))
pyLongFd = lambda x: ((np.exp(-1 * ((-6 / (1 + np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3)))))) + (2 / (1 + np.exp(x - (4 * (x ** -2)))))))/ ((1+np.exp(-1 * ((-6 * ((1)/(1+np.exp(-1 * (x + (5 * (x ** -2)) + (3 * (x ** -3))))))) + (2 * (1/(1 + np.exp(-1 * ((-1 * x)+(4 * (x ** -2)))))))))) ** 2)) * ((((-6 * np.exp((-1 * x)-(5 * (x ** -2))-(3 * (x ** -3)))) * (1 - (10 * (x ** -3)) - (9 * (x ** -4)))) / ((1 + np.exp((-1 * x)-(5 * (x ** -2))-(3 * (x ** -3)))) ** 2)) + (((2 * np.exp(x - (4*(x**-2)))) * (-1 - (8 * (x**-3)))) / ((1 + np.exp(x - (4*(x**-2)))) ** 2)))
# second interpretation
pyLongFx2 = lambda x: 1/(1+np.exp(-x)) * (-6*(1/(1+np.exp(-x))) * (x + 5*x**(-2) + 3*x**(-3)) + 2 * 1/(1+np.exp(-x)) * (-x + 4*x**(-2)))
# note the parentheses around the whole denominator (e^x + 1)^3 * x^4
pyLongFd2 = lambda x: - (2*np.exp(2*x)*(8*x**5 + 4*np.exp(x)*x**4 + 4*x**4 + 22*x**2 - 22*np.exp(x)*x - 4*x - 27*np.exp(x) - 27))/((np.exp(x) + 1)**3 * x**4)
# round the two outputs to 10 decimal places, to avoid floating point precision problems
# TODO: compare the result of both methods.
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFx, np.arange(-10, 10, 0.1))), color='b')
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFd, np.arange(-10, 10, 0.1))), color='r')
plt.title('fx1')
plt.show()
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFx2, np.arange(-10, 10, 0.1))), color='b')
plt.plot(np.arange(-10, 10, 0.1), list(map(pyLongFd2, np.arange(-10, 10, 0.1))), color='r')
plt.title('fx2')
plt.show()
/usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:4: RuntimeWarning: overflow encountered in exp /usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:5: RuntimeWarning: overflow encountered in exp /usr/local/lib/python3.5/dist-packages/ipykernel/__main__.py:5: RuntimeWarning: invalid value encountered in double_scalars
[Output: two plots, fx1 and fx2, each showing the function in blue and its hand-derived derivative in red]
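One possible way to address the TODO above (assuming pyLongFd and res2 from the two previous cells are still in scope):

xs = np.arange(-10, 10, 0.1)
py_vals = np.array(list(map(pyLongFd, xs)))   # hand-derived derivative (emits the same overflow warnings near 0)
tf_vals = res2[0]                             # TensorFlow gradient of the first interpretation

# Compare only where both are numerically well-behaved; the region around 0
# overflows, as noted in the task description.
mask = np.isfinite(py_vals) & np.isfinite(tf_vals) & (np.abs(xs) > 0.25)
print(np.max(np.abs(py_vals[mask] - tf_vals[mask])))  # expected to be on the order of float32 precision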

Questions (1 point)

1.) What is the main difference in the program structure between TensorFlow and plain Python? (0.5 point)

The way variables are defined is different: in TF we use TF objects instead of plain Python objects. In addition, calculations in TF are not carried out until tf.Session().run() is executed, while in pure Python the statements are executed as they are interpreted.
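A minimal sketch of this difference (assuming the imports from the earlier cells):

a = 2 + 3                              # plain Python: evaluated immediately, a == 5
b = tf.constant(2) + tf.constant(3)    # TensorFlow: this only adds nodes to the graph
print(b)                               # prints a Tensor object, not a value
print(tf.Session().run(b))             # 5 -- the value is computed only when the graph is run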

2.) Does TensorFlow provide numerical or automatic differentiation? What are the advantages of this way of differentiation? (0.5 point)

TF provides automatic differentiation. It applies the chain rule to the exact operations of the graph, so the derivatives it produces are exact up to floating-point precision; this avoids the truncation and round-off errors of numerical (finite-difference) differentiation and saves us from deriving the expressions by hand. We also noted this when comparing the '2nd interpretation' plots.

Submission instructions

You should provide a single Jupyter notebook as a solution. The naming should include the assignment number and matriculation IDs of all team members in the following format: assignment-1_matriculation1_matriculation_2_matriculation3.ipynb (in case of 3 team members). Make sure to keep the order matriculation1_matriculation_2_matriculation3 the same for all assignments.

Please, submit your solution to your tutor (with [NNIA][assignment-2] in email subject):

  1. Maksym Andriushchenko [email protected]

  2. Marius Mosbach [email protected]

  3. Rajarshi Biswas [email protected]

  4. Marimuthu Kalimuthu [email protected]

If you are in a team, please submit only 1 solution to only 1 tutor.