# Problem Sheet 2

In Tasks (a)-(c), consider the simplified case $n=1$ with $h_{\theta}(x) = \theta x$. Assume the quadratic pointwise loss $\ell(y,\hat y) = (y-\hat y)^2$.
Assume that data are samples of a random variable pair $(X,Y) \sim \mathbb{P}$.

## Task (a) (Warm-up)

- Find the expected risk $\mathbb{E}[\ell(h_{\theta}(X),Y)]$ in terms of $\theta$ and suitable expectations.
- Find the best parameter $\theta^{best} = \arg\min_{\theta \in \mathbb{R}} \mathbb{E}[\ell(h_{\theta}(X),Y)].$

## Task (b)
- Assume $X$ is uniformly distributed on $[-1,1]$, that is, $X \sim \mathcal{U}(-1,1)$, and consider two ways of generating the labels:
  - $Y = X$, and
  - $Y = X - X^3$.
  
For each of those options for $Y$, compute $\theta^{best}$ and the corresponding expected risk $\mathbb{E}[\ell(h_{\theta^{best}}(X),Y)]$. What can you say about these $\theta^{best}$ and the accuracy of the prediction rule $h_{\theta^{best}}$ in each case?
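
The analytical answers can be sanity-checked numerically. The sketch below is one possible check, not part of the required solution: it estimates the expected risk by Monte Carlo on a grid of $\theta$ values. The helper name `estimate_risk`, the grid `thetas`, the sample size and the seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

def estimate_risk(theta, x, y):
    """Monte Carlo estimate of E[(Y - theta * X)^2] from samples x, y."""
    return np.mean((y - theta * x) ** 2)

# Draw X ~ U(-1, 1) once and reuse the same samples for every theta.
x = rng.uniform(-1.0, 1.0, size=200_000)
thetas = np.linspace(0.0, 1.5, 31)  # hypothetical search grid with step 0.05

results = {}
for name, y in [("Y = X", x), ("Y = X - X^3", x - x**3)]:
    risks = np.array([estimate_risk(t, x, y) for t in thetas])
    i = int(np.argmin(risks))
    results[name] = (thetas[i], risks[i])
    print(f"{name}: grid minimiser ~ {thetas[i]:.2f}, estimated risk ~ {risks[i]:.4f}")
```

The grid minimiser is only accurate up to the grid spacing, so it should agree with the closed-form $\theta^{best}$ approximately, not exactly.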

## Task (c) 
Assume $X \sim \mathcal{U}(-1,1)$ as before, but the data $Y = X + \xi$ is **noisy**, that is, a function of $X$ perturbed by another random variable $\xi$ which is **independent of** $X$. Assume $\mathbb{E}[\xi] = 0$ and $\mathrm{Var}[\xi] < \infty$, where $\mathrm{Var}[\xi]$ is the variance.
- Find $\theta^{best} = \arg\min_{\theta \in \mathbb{R}} \mathbb{E}[\ell(h_{\theta}(X),Y)]$ under these new assumptions. Compare it with $\theta^{best}$ from Task (b).

---

## Task 1: simulating the "true" distribution

To test the convergence of the empirical risk, we cannot use the temperature data (or any real data, in fact): such a dataset $D$ is finite, so we cannot let $|D| \rightarrow \infty$.

To select a _class_ of models, or to test numerical algorithms, one often designs a **synthetic** distribution with known properties. One can then sample as many realisations from this distribution as desired.

- **Write** a Python **function** `TrueSample()` which returns a pair x,y which is a random sample from the following distribution:
   - $X \sim \mathcal{U}(-1,1)$ is a random value uniformly distributed between -1 and 1;
   - for a given $X$, compute the label as $Y = X - X^3 + \frac{1}{10} \xi$, where $\xi \sim \mathcal{N}(0,1)$ is a random variable with the standard normal distribution.   
   
   _Hint: read up on the [numpy.random](https://numpy.org/doc/1.16/reference/routines.random.html) module._
 
- **Compute** 30 samples of x and y from `TrueSample()` and store them in _numpy_ arrays X and Y, respectively.
- **Plot** array Y vs array X using `plot()` from _matplotlib_'s _pyplot_ module. Make sure you plot **only points** (x,y) (not lines connecting them).
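
One way the sampling steps above might look is sketched below. It assumes NumPy 1.17 or later for `np.random.default_rng`; the legacy functions `np.random.uniform` and `np.random.randn` would work equally well. The seed and the module-level generator `rng` are illustrative choices, not requirements of the task.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, used only for reproducibility

def TrueSample():
    """Return one pair (x, y) with x ~ U(-1, 1) and y = x - x^3 + xi / 10, xi ~ N(0, 1)."""
    x = rng.uniform(-1.0, 1.0)
    xi = rng.standard_normal()
    y = x - x**3 + 0.1 * xi
    return x, y

# Collect 30 independent samples into two numpy arrays.
samples = [TrueSample() for _ in range(30)]
X = np.array([s[0] for s in samples])
Y = np.array([s[1] for s in samples])
```

For the plot, after `import matplotlib.pyplot as plt`, a call such as `plt.plot(X, Y, 'o')` draws markers only, because the format string `'o'` specifies a marker and no line style.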

## Task 2: convergence of the parameter
- **Copy** over the functions `features` and `optimise_loss` from Problem Sheet 1 (your implementation or my model solution).
- Take $n=1$ and **compute** $\boldsymbol\theta^* \in \mathbb{R}^2$ using `optimise_loss` applied to the data X and Y prepared in Task 1.
- **Vary** the number of samples in Task 1 over the values 30, 100, 300, 1000, 3000, and for each resulting realisation of X and Y compute the corresponding $\boldsymbol\theta^*$.
- **Compare** $\theta^*_1$ with $\theta^{best}$ found in Task (b) for the option $Y = X - X^3$ as the number of samples increases.
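
The experiment above can be sketched as follows. Since `features` and `optimise_loss` from Problem Sheet 1 are not reproduced here, this sketch swaps in `np.linalg.lstsq` on a design matrix with columns $1$ and $x$. That is an assumption about what those functions compute for $n=1$, as is the ordering $\boldsymbol\theta = (\theta_0, \theta_1)$ with the intercept first. The helpers `sample_batch` and `fit_linear` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, used only for reproducibility

def sample_batch(m):
    """Draw m samples with x ~ U(-1, 1) and y = x - x^3 + xi / 10, xi ~ N(0, 1)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = x - x**3 + 0.1 * rng.standard_normal(m)
    return x, y

def fit_linear(x, y):
    """Least-squares fit of y ~ theta_0 + theta_1 * x (stand-in for optimise_loss)."""
    V = np.column_stack([np.ones_like(x), x])  # assumed feature matrix for n = 1
    theta, *_ = np.linalg.lstsq(V, y, rcond=None)
    return theta

fitted = {}
for m in [30, 100, 300, 1000, 3000]:
    x, y = sample_batch(m)
    fitted[m] = fit_linear(x, y)
    print(f"m = {m:5d}: theta_0 = {fitted[m][0]:+.4f}, theta_1 = {fitted[m][1]:+.4f}")
```

Each sample size uses a fresh realisation, so the fitted parameters fluctuate from run to run; the fluctuation should shrink as the number of samples grows.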

## Task 3 (Warm-up): data splitting

- **Write a Python function** `split_data(X,Y,K,k)` that takes as input _numpy_ arrays X and Y as defined previously, a positive integer K (the number of chunks the data should be split into), and an index k with $0 \le k < K$. The function should split X and Y into K subsets of equal size (you can assume that K divides the size of X and Y) and **return** 4 outputs: Xtrain, Ytrain, Xtest, Ytest, where Xtest and Ytest are the k-th subsets of X and Y, respectively, and Xtrain and Ytrain contain the rest of X and Y, respectively.
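
A minimal sketch of such a function is given below. It assumes the K subsets are contiguous chunks taken in the original order of X and Y, which the task statement does not specify; a shuffled split would be an equally valid reading.

```python
import numpy as np

def split_data(X, Y, K, k):
    """Split X and Y into K equal chunks; chunk k is the test set, the rest is training data."""
    X_chunks = np.split(X, K)  # raises ValueError if K does not divide len(X)
    Y_chunks = np.split(Y, K)
    Xtest, Ytest = X_chunks[k], Y_chunks[k]
    # Join all chunks except the k-th one, preserving their order.
    Xtrain = np.concatenate(X_chunks[:k] + X_chunks[k + 1:])
    Ytrain = np.concatenate(Y_chunks[:k] + Y_chunks[k + 1:])
    return Xtrain, Ytrain, Xtest, Ytest

# Small usage example with 6 points split into K = 3 chunks.
X = np.arange(6.0)
Y = 10 * X
Xtrain, Ytrain, Xtest, Ytest = split_data(X, Y, K=3, k=1)
```

Note that with K = 1 there is no training data left, and this sketch would raise an error in `np.concatenate`.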