These are the problems and my solutions for Homework 2 of Professor 손진희's Machine Learning and Deep Learning course, 2022 Spring semester at 광주과학기술원 (GIST). There may be mistakes, so please bear with me. If this post causes any problems, I will take it down immediately.
Problem 1: Decision Tree
Consider the following table of observations.
1. What is the entropy of Play Golf?
\begin{align} & \text{Counts of \{Play Golf=T\} = 5, Counts of \{Play Golf=F\} = 5}\label{1} \\
& P(\text{Play Golf=T})= \frac{1}{2}, P(\text{Play Golf=F}) = \frac{1}{2} \label{2}
\end{align}
From the two equations above,
\begin{align*}H(\text{Play Golf}) & = - P(\text{Play Golf=T}) \log_2 P(\text{Play Golf=T}) - P(\text{Play Golf=F}) \log_2 P(\text{Play Golf=F})\\ &= -\frac{1}{2} \log_2 \frac{1}{2}-\frac{1}{2} \log_2 \frac{1}{2}\\ &= 1\end{align*}
$$\text{Therefore, } H(\text{Play Golf}) = 1.$$
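As a quick numerical check (a minimal sketch in Python; the counts 5/5 are taken from the table above), the entropy can be computed directly from the class counts:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a distribution given by class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Play Golf: 5 rows with T and 5 rows with F
print(entropy([5, 5]))  # 1.0
```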
2. Assume that \emph{Temperature} is chosen for the root of the decision tree. What is the information gain associated with this attribute?
\begin{align*}
&P(\text{Temperature = cool}) = 3/10
\\ &P(\text{Play Golf=T},\text{Temperature = cool}) = 0
\\ &P(\text{Play Golf=F},\text{Temperature = cool}) = 3/10\\
& P(\text{Temperature = hot}) = 4/10
\\& P(\text{Play Golf=T},\text{Temperature = hot}) = 2/10
\\ &P(\text{Play Golf=F},\text{Temperature = hot}) = 2/10\\
& P(\text{Temperature = mild}) = 3/10
\\ &P(\text{Play Golf=T},\text{Temperature = mild})=3/10
\\ &P(\text{Play Golf=F},\text{Temperature = mild})=0
\end{align*}
\begin{align*}
&P(\text{Play Golf=T}|\text{Temperature = cool}) = 0
, P(\text{Play Golf=F}|\text{Temperature = cool}) = 1\\
&P(\text{Play Golf=T}|\text{Temperature = hot})= 1/2
, P(\text{Play Golf=F}|\text{Temperature = hot})= 1/2\\
&P(\text{Play Golf=T}|\text{Temperature = mild})=1
, P(\text{Play Golf=F}|\text{Temperature = mild})=0
\end{align*}
\begin{align*}
H(\text{Play Golf}|\text{Temperature=cool})&=-P(\text{Play Golf=F}|\text{Temperature = cool})\log_2 {P(\text{Play Golf=F}|\text{Temperature = cool})}\\
&=0\\
H(\text{Play Golf}|\text{Temperature=mild})&=-P(\text{Play Golf=T}|\text{Temperature = mild})\log_2 {P(\text{Play Golf=T}|\text{Temperature = mild})}\\&=0\\
H(\text{Play Golf}|\text{Temperature=hot})=&-P(\text{Play Golf=T}|\text{Temperature = hot})\log_2 {P(\text{Play Golf=T}|\text{Temperature = hot})}\\
&-P(\text{Play Golf=F}|\text{Temperature = hot})\log_2 {P(\text{Play Golf=F}|\text{Temperature = hot})}\\
=&1
\end{align*}
\begin{align*}
H(\text{Play Golf}|\text{Temperature})=&H(\text{Play Golf}|\text{Temperature=cool})P(\text{Temperature = cool})\\
&+H(\text{Play Golf}|\text{Temperature=mild})P(\text{Temperature = mild})\\
&+H(\text{Play Golf}|\text{Temperature=hot})P(\text{Temperature = hot})\\
=& \frac{4}{10}
\end{align*}
$$H(\text{Play Golf}|\text{Temperature})=\frac{4}{10}$$
$$\text{Information gain } IG(\text{Temperature}) = H(\text{Play Golf})-H(\text{Play Golf}|\text{Temperature})= \frac{6}{10}$$
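The conditional entropy and information gain above can be checked with a short sketch (helper names are mine; the per-value class counts cool = 0 T / 3 F, hot = 2 T / 2 F, mild = 3 T / 0 F are read off the probabilities listed above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    """IG = H(parent) - sum_v P(v) * H(child_v); splits are per-value class counts."""
    n = sum(parent_counts)
    cond = sum(sum(s) / n * entropy(s) for s in splits)
    return entropy(parent_counts) - cond

# Temperature: cool -> [T, F] = [0, 3], hot -> [2, 2], mild -> [3, 0]
print(information_gain([5, 5], [[0, 3], [2, 2], [3, 0]]))  # 0.6
```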
3. Draw the full decision tree learned from this data by computing and comparing information gain (without any pruning)
Let PG, O, Te, W be random variables representing Play Golf, Outlook, Temperature, Windy respectively.
\begin{align*}
&P(O=rain)=1/2, P(PG=T|O=rain)=2/5, P(PG=F|O=rain)=3/5\\
&P(O=sunny)=1/2, P(PG=T|O=sunny)=3/5, P(PG=F|O=sunny)=2/5\\
&H(PG|O=rain) = -\sum_{pg \in \{T,F\}} P(PG=pg|O=rain)\log_2 P(PG=pg|O=rain) = 0.971 \\
&H(PG|O=sunny) = -\sum_{pg \in \{T,F\}} P(PG=pg|O=sunny)\log_2 P(PG=pg|O=sunny) = 0.971\\
&H(PG|O) = \sum_{o \in \{sunny,rain\}}H(PG | O=o)P(O=o) = 0.971 \\
&IG(O)=H(PG) - H(PG|O)=0.029\\
&P(W=Y)=4/10, P(PG=T|W=Y)=1/4, P(PG=F|W=Y)=3/4\\
&P(W=N)=6/10, P(PG=T|W=N)=4/6, P(PG=F|W=N)=2/6\\
&H(PG|W=Y) = -\sum_{pg \in \{T,F\}} P(PG=pg|W=Y)\log_2 P(PG=pg|W=Y) = 0.811 \\
&H(PG|W=N) = -\sum_{pg \in \{T,F\}} P(PG=pg|W=N)\log_2 P(PG=pg|W=N) = 0.918\\
&H(PG|W) = \sum_{w \in \{Y,N\}}H(PG | W=w)P(W=w) = 0.875 \\
&IG(W)=H(PG) - H(PG|W)=0.125
\end{align*}
Since $IG(Te) = 0.6$ is the largest of the three information gains, attribute $Te$ (Temperature) is chosen for the root.
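The information gains for Outlook and Windy can be verified the same way (a self-contained sketch; the class counts per attribute value come from the probabilities above):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(s) / n * entropy(s) for s in splits)

# Outlook: rain -> [T, F] = [2, 3], sunny -> [3, 2]
print(round(information_gain([5, 5], [[2, 3], [3, 2]]), 3))  # 0.029
# Windy: Y -> [1, 3], N -> [4, 2]
print(round(information_gain([5, 5], [[1, 3], [4, 2]]), 3))  # 0.125
```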
The data remaining on each branch is shown in the table below. Since the Temperature = cool and Temperature = mild branches are already pure, only the Temperature = hot branch needs a further split. Now suppose that Temperature = hot is given.
Since $P(O = sunny | Te=hot) = 1$, we have $P(PG | Te=hot, O=sunny) = P(PG|Te=hot)$, so the entropy of $PG$ does not change when $O$ is also given. Therefore $IG(O)=0$ on the Temperature = hot branch.
\begin{align*}
&P(PG=F | Te=hot, W=Y) = 1, \quad P(PG=T | Te=hot, W=N) = 1\\
&H(PG | Te=hot, W=Y) = H(PG | Te=hot, W=N) = 0, \quad\text{so } H(PG|W,Te=hot)=0
\end{align*}
Therefore, Windy is selected at this node, and the full decision tree is shown below.
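A short check of this last split (under Temperature = hot, the counts implied by the probabilities above are W=Y: 0 T / 2 F and W=N: 2 T / 0 F):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Temperature = hot branch: 2 T / 2 F in total, split by Windy
splits = {"Y": [0, 2], "N": [2, 0]}
cond = sum(sum(c) / 4 * entropy(c) for c in splits.values())
print(cond)  # 0.0, so the split on Windy makes both children pure
```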
Problem 2: K-Nearest Neighbor / VC-dimension
1. You are given the above dataset, consisting of two classes, A and B. Classify the following points using the $k$-nearest neighbor ($k$-NN) algorithm for $k=1, 3, 5$. (If the point is on the decision boundary, just write 'decision boundary'. If a point has multiple nearest neighbors at the same distance, you can choose the majority. You don't have to write a justification for your answer.)
For each query point, the nearest points are listed from closest to farthest as follows.
$(-1, -1) \to
(-2,-1), (-2,0), (0,-2),(-1,1),(1,-1)$
$(-1, 0)\to(-2,0), (-1,1),(-2,-1),(-1,2),(1,0)$
$(1, 0)\to(1,0),(1,-1),(1,1),(2,-1),(3,0)$
2. What factors affect the number of computations required for the $k$-NN algorithm? Briefly explain how those factors affect the number of computations.
1. Dimensionality of the data: in a high-dimensional space each distance computation is more expensive, because every coordinate of a point enters the computation.
2. The value of k: once all distances are computed, the k smallest must be selected, so a larger k increases the selection cost (and k also determines the decision boundary).
3. Number of data points and memory: the distances from a query to every stored training point must be computed and kept, so both computation and memory grow with the size of the training set.
4. Distance metric: the distance values depend on the chosen metric, so the set of nearest points can change if another metric is selected, and some metrics are more expensive to evaluate than others (see the brute-force sketch after this list).
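The brute-force sketch below makes these costs concrete (illustrative only; the toy points and labels are assumptions, not the dataset from the figure): every query touches all $n$ training points and all $d$ coordinates, the $k$ smallest distances must then be selected, and the chosen metric determines both the per-distance cost and which points end up nearest.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k, metric):
    """Brute-force k-NN: one distance per training point (n * d work), then select the k smallest."""
    dists = [(metric(x, query), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda t: t[0])                     # O(n log n); a partial selection of k would also do
    votes = Counter(label for _, label in dists[:k])   # majority vote among the k nearest
    return votes.most_common(1)[0][0]

euclidean = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
manhattan = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))

# Toy data (assumed for illustration, not the homework's figure).
X = [(-2, -1), (-2, 0), (0, -2), (1, 0), (1, 1), (2, -1)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (-1, -1), 3, euclidean))  # both metrics happen to agree here,
print(knn_predict(X, y, (-1, -1), 3, manhattan))  # but a different metric can change the neighbor set
```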
3. Show that the VC-dimension of an axis-aligned rectangle classifier in the 2D plane is 4.
Consider the 4-point data set in Figure 3 and suppose there are two classes (red and blue). Figure 4 shows axis-aligned rectangles that realize four of the possible labelings; by the symmetry of the configuration, rectangles realizing all remaining labelings exist as well. Hence axis-aligned rectangles shatter this 4-point set, and the VC-dimension is $\geq 4$.

Now suppose there are 5 points in the 2D plane and consider the smallest axis-aligned rectangle containing all of them. If some point lies strictly inside this rectangle, label that interior point blue and the others red; any rectangle containing all the red points must contain their bounding box and therefore the blue point as well, so no rectangle realizes this labeling. If all 5 points lie on the boundary of the smallest rectangle, pick one point attaining each extreme coordinate (leftmost, rightmost, topmost, bottommost); at most 4 points are picked, so label one remaining point blue and the picked points red. Again, any rectangle containing the red points contains the whole bounding box and hence the blue point. Therefore no set of 5 points is shattered, so the VC-dimension is less than 5. That is, the VC-dimension is 4.
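The lower-bound part can also be checked mechanically (a sketch; the diamond-shaped 4-point configuration is an assumption standing in for Figure 3, which is not reproduced here). For each of the $2^4$ labelings it tests whether the bounding box of the positive points excludes every negative point, which is exactly the condition for some axis-aligned rectangle to realize the labeling:

```python
from itertools import product

def shattered(points):
    """Check whether axis-aligned rectangles shatter the given point set."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:           # empty positive set: a tiny rectangle away from all points works
            continue
        # Smallest candidate rectangle: the bounding box of the positive points.
        xmin, xmax = min(p[0] for p in pos), max(p[0] for p in pos)
        ymin, ymax = min(p[1] for p in pos), max(p[1] for p in pos)
        if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in neg):
            return False      # this labeling cannot be realized by any rectangle
    return True

diamond = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # assumed 4-point configuration
print(shattered(diamond))                      # True: VC-dimension >= 4
```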
Problem 3: Naïve Bayes
Consider a two-class (category) classification problem (class 1 and class 2) with jointly Gaussian distributed feature vectors, with means $\mathbf{\mu}_1$ and $\mathbf{\mu}_2$ respectively and the same covariance matrix $\mathbf{\Sigma}$ in both classes.
For simplicity, assume that the prior probabilities of the two classes are equal.
1. Derive the Naïve Bayes classifier (maximum a posteriori) for this problem.
- Hint: draw the decision boundary where $\mathbf{\mu}_1=[3,1]^T$, $\mathbf{\mu}_2=[1,0]^T$, and $\mathbf{\Sigma}={\sigma}^2\boldsymbol{I}$ is a diagonal matrix which is simply $\sigma^2$ times the identity matrix $\boldsymbol{I}$, and explain.
$p(y|\mathbf{x})=\frac{p(\mathbf{x}, y)}{p(\mathbf{x})}=\frac{p(\mathbf{x}|y)p(y)}{p(\mathbf{x})}=\frac{1}{2}\frac{p(\mathbf{x} | y)}{p(\mathbf{x})}$. Let $N(\mathbf{x}) = \frac{p(y=\text{class1}|\mathbf{x})}{p(y=\text{class2}|\mathbf{x})}$. Then $\mathbf{x}$ is labeled class1 if $N(\mathbf{x}) > 1$ and class2 if $N(\mathbf{x}) \leq 1$. Let $LN(\mathbf{x}) = \ln N(\mathbf{x})$, then
$$\text{class of } \mathbf{x} \text{ is} \begin{cases} \text{class1 if } LN(\mathbf{x}) > 0 \\ \text{class2 if } LN(\mathbf{x}) \leq 0 \end{cases} $$
\begin{align*}
LN(\mathbf{x})&=-\frac{1}{2}\{(\mathbf{x} - \mathbf{\mu_1})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{\mu_1})-(\mathbf{x} - \mathbf{\mu_2})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{\mu_2}) \}\\
&=-\frac{1}{2}\left(\mathbf{x}^\top \Sigma^{-1} \mathbf{x} - \mathbf{x}^\top \Sigma^{-1} \mathbf{\mu_1}-\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{x}+\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{\mu_1}
-\mathbf{x}^\top \Sigma^{-1} \mathbf{x}
+ \mathbf{x}^\top \Sigma^{-1} \mathbf{\mu_2}
+\mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{x}
-\mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{\mu_2}\right)\\
&=-\frac{1}{2} \left(
-2 \mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{x}
+\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{\mu_1}
+2\mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{x}
-\mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{\mu_2}
\right)\\
&=\left(\mathbf{\mu_1} - \mathbf{\mu_2} \right)^\top \Sigma^{-1} \mathbf{x} - \frac{1}{2} \left(\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{\mu_1} - \mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{\mu_2}\right)
\end{align*}
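Since $LN(\mathbf{x})$ is linear in $\mathbf{x}$, the classifier reduces to the sign of $\mathbf{w}^\top\mathbf{x} + b$ with $\mathbf{w} = \Sigma^{-1}(\mathbf{\mu_1}-\mathbf{\mu_2})$ and $b = -\frac{1}{2}(\mathbf{\mu_1}^\top \Sigma^{-1} \mathbf{\mu_1} - \mathbf{\mu_2}^\top \Sigma^{-1} \mathbf{\mu_2})$. A minimal sketch with the values from the hint (function and variable names are mine):

```python
import numpy as np

mu1 = np.array([3.0, 1.0])
mu2 = np.array([1.0, 0.0])
sigma2 = 1.0                        # any sigma^2 > 0 gives the same decision boundary
Sigma_inv = np.eye(2) / sigma2

w = Sigma_inv @ (mu1 - mu2)                                   # weight vector
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2)    # offset

def ln_ratio(x):
    """LN(x) = (mu1 - mu2)^T Sigma^{-1} x - (mu1^T Sigma^{-1} mu1 - mu2^T Sigma^{-1} mu2) / 2."""
    return w @ np.asarray(x, dtype=float) + b

def classify(x):
    return "class1" if ln_ratio(x) > 0 else "class2"

print(classify([3, 1]), classify([1, 0]))   # class1 class2: each mean is assigned to its own class
```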
2. Explain the decision boundary of this Naïve Bayes classifier from a geometric point of view.
From Problem 3.1,
\begin{align*}LN(\mathbf{x}) > 0 \iff (\mathbf{x} - \mathbf{\mu_1})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{\mu_1})<(\mathbf{x} - \mathbf{\mu_2})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{\mu_2}) \end{align*}
Since $\mathbf{\Sigma}^{-1} = \sigma^{-2} \boldsymbol{I}$, $(\mathbf{x} - \mathbf{\mu_i})^\top \Sigma^{-1} (\mathbf{x} - \mathbf{\mu_i})= \sigma^{-2}\lVert \mathbf{x} - \mathbf{\mu_i} \rVert ^2$, where $\lVert\cdot\rVert$ denotes the Euclidean norm. Therefore,
$$\text{class 1} \iff LN(\mathbf{x}) > 0 \iff \lVert \mathbf{x} - \mathbf{\mu_1} \rVert <\lVert \mathbf{x} - \mathbf{\mu_2} \rVert $$
$$\text{class 2} \iff LN(\mathbf{x}) \leq 0 \iff \lVert \mathbf{x} - \mathbf{\mu_1} \rVert \geq \lVert \mathbf{x} - \mathbf{\mu_2} \rVert $$
That is, $\mathbf{x}$ is assigned to the class whose mean is closer, and the decision boundary is the perpendicular bisector of the segment joining $\mathbf{\mu_1}$ and $\mathbf{\mu_2}$. For $\mathbf{\mu_1}=[3,1]^\top$ and $\mathbf{\mu_2}=[1,0]^\top$, expanding $\lVert \mathbf{x} - \mathbf{\mu_1} \rVert = \lVert \mathbf{x} - \mathbf{\mu_2} \rVert$ gives $4x_1 + 2x_2 = 9$, so
$$\text{Decision boundary} = \{\mathbf{x} \mid LN(\mathbf{x}) = 0\}=\{\mathbf{x} \mid \lVert \mathbf{x} - \mathbf{\mu_1} \rVert = \lVert \mathbf{x} - \mathbf{\mu_2} \rVert\}=\left\{(x_1, x_2) \,\middle|\, x_2 = \frac{9}{2} - 2x_1\right\}$$
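As a numerical sanity check (sketch only, using the same assumed setup as above with $\sigma^2 = 1$), points on the line $x_2 = \frac{9}{2} - 2x_1$ indeed satisfy $LN(\mathbf{x}) = 0$ and are equidistant from the two means:

```python
import numpy as np

mu1, mu2 = np.array([3.0, 1.0]), np.array([1.0, 0.0])
w = mu1 - mu2                                 # Sigma^{-1} = I / sigma^2 only rescales LN(x)
b = -0.5 * (mu1 @ mu1 - mu2 @ mu2)

for x1 in [-1.0, 0.0, 2.0, 5.0]:
    x = np.array([x1, 4.5 - 2 * x1])          # a point on the line x2 = 9/2 - 2*x1
    print(round(w @ x + b, 10),               # LN(x) = 0 on the boundary
          round(np.linalg.norm(x - mu1) - np.linalg.norm(x - mu2), 10))  # equidistant from the means
```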