let's look at a question first. how can we derive (2.10) from (2.9)?

Give the definition of integration by parts in higher dimensions, from Wikipedia, first.
{[from https://en.wikipedia.org/wiki/Integration_by_parts]

Higher dimensions

The formula for integration by parts can be extended to functions of several variables. Instead of an interval one needs to integrate over an n-dimensional set. Also, one replaces the derivative with a partial
derivative. $\int_\Omega \varphi\, \operatorname{div}\, \vec v \; \mathrm d V = \int_{\partial \Omega} \varphi\, \vec v \cdot \mathrm d \vec S - \int_\Omega \vec v\cdot \operatorname{grad}\, \varphi \; \mathrm dV$.
More specifically, suppose Ω is an open bounded
subset of ℝ^n with a piecewise
smooth boundary Γ. If u and v are two continuously
differentiable functions on the closure of Ω, then the formula for integration
by parts is $\int_{\Omega} \frac{\partial u}{\partial x_i} v \,d\Omega = \int_{\Gamma} u v \, \hat\nu_i \,d\Gamma - \int_{\Omega} u \frac{\partial v}{\partial x_i} \, d\Omega,$
where $\hat{\mathbf{\nu}}$ is the outward unit surface
normal to Γ, $\hat\nu_i$ is its i-th
component, and i ranges from 1 to n.

Replacing v in the above formula with vi and summing over i gives the vector formula $\int_{\Omega} \nabla u \cdot \mathbf{v}\, d\Omega = \int_{\Gamma} u (\mathbf{v}\cdot \hat{\nu})\, d\Gamma - \int_\Omega u\, \nabla\cdot\mathbf{v}\, d\Omega,$
where v is a vector-valued function with components v1, ..., vn.

Setting u equal to the constant function 1 in the above formula gives the divergence
theorem $\int_{\Gamma} \mathbf{v} \cdot \hat{\nu}\, d\Gamma = \int_\Omega \nabla\cdot\mathbf{v}\, d\Omega.$
For $\mathbf{v}=\nabla v$ where $v\in C^2(\bar{\Omega})$,
one gets $\int_{\Omega} \nabla u \cdot \nabla v\, d\Omega = \int_{\Gamma} u\, \nabla v\cdot\hat{\nu}\, d\Gamma - \int_\Omega u\, \nabla^2 v\, d\Omega,$
which is the first Green's identity.
}
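As a quick sanity check of the divergence theorem quoted above (not part of the Wikipedia excerpt), the identity can be verified symbolically for a simple made-up field $\vec v = (x^2, y^2)$ on the unit square, assuming sympy is available:

```python
# Symbolic check of the divergence theorem on the unit square [0,1]^2
# for the hypothetical test field v = (x^2, y^2).
import sympy as sp

x, y = sp.symbols('x y')
v = (x**2, y**2)

# Left-hand side: volume integral of div v over the square.
div_v = sp.diff(v[0], x) + sp.diff(v[1], y)          # 2x + 2y
lhs = sp.integrate(sp.integrate(div_v, (x, 0, 1)), (y, 0, 1))

# Right-hand side: flux of v through the four edges (outward normals).
rhs = (sp.integrate(v[0].subs(x, 1), (y, 0, 1))      # right edge,  nu = (1, 0)
       - sp.integrate(v[0].subs(x, 0), (y, 0, 1))    # left edge,   nu = (-1, 0)
       + sp.integrate(v[1].subs(y, 1), (x, 0, 1))    # top edge,    nu = (0, 1)
       - sp.integrate(v[1].subs(y, 0), (x, 0, 1)))   # bottom edge, nu = (0, -1)

print(lhs, rhs)  # both equal 2
```

Both sides come out to 2, as the theorem requires.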

Then we give the relationship between the gradient and the directional derivative:
{[from math guidebook for graduate entrance examination] }
At the end, the whole derivation process will be shown.
Deep Neural Networks for High Dimension, Low Sample Size Data
Publication: IJCAI’17: Proceedings of the 26th International Joint Conference on Artificial Intelligence, August 2017
code
GBFS code: http://www.cse.wustl.edu/˜xuzx/research/code/GBFS.zip (link no longer reachable)
HSIC-Lasso code: http://www.makotoyamada-ml.com/software.html (marked as expired on the page)
dataset
Biological datasets: http://featureselection.asu.edu/datasets.php
Introduction
In bioinformatics, gene expression data suffers from the growing challenges of high dimensionality and low sample size. This kind of high dimension, low sample size (HDLSS) data is also vital for scientific discoveries in other areas such as chemistry and financial engineering [Fan and Li, 2006]. When processing this kind of data, severe overfitting and high-variance gradients are the major challenges for the majority of machine learning algorithms [Friedman et al., 2000].
Feature selection has been widely regarded as one of the most powerful tools for analyzing HDLSS data. However, selecting the optimal subset of features is known to be NP-hard [Amaldi and Kann, 1998], so a large body of compromise methods for feature selection has been proposed instead.

Lasso [Tibshirani, 1996] pursues sparse linear models: sparse linear models ignore the nonlinear input-output relations and the interactions among features.
Nonlinear feature selection via kernel methods [Li et al., 2005; Yamada et al., 2014] or gradient boosted trees: these address the curse of dimensionality only under the blessing of a large sample size.

Deep neural network (DNN) methods light up new scientific discoveries. DNNs have achieved breakthroughs in modeling nonlinearity across a wide range of applications: the deeper the architecture of a DNN is, the more complex the relations it can model. DNNs have harvested initial successes in bioinformatics for modeling splicing [Xiong et al., 2015] and sequence specificity [Alipanahi et al., 2015]. However, estimating the huge number of parameters of a DNN may suffer from severe overfitting even with abundant samples, not to mention in the HDLSS setting.
To address the challenges of HDLSS data, we propose an end-to-end DNN model called Deep Neural Pursuit (DNP). DNP simultaneously selects features and learns a classifier to alleviate the severe overfitting caused by high dimensionality. By averaging over multiple dropouts, DNP is robust to the high-variance gradients resulting from the small sample size. From the perspective of feature selection, the DNP model selects features greedily and incrementally, similar to matching pursuit [Pati et al., 1993]. More concretely, starting from an empty subset of features and a bias, the proposed DNP method incrementally selects individual features according to the backpropagated gradients. Meanwhile, once more features are selected, DNP is updated using the backpropagation algorithm.
The main contribution of this paper is to tailor the DNN for the HDLSS setting using feature selection and multiple dropouts.
Related Work
We discuss feature selection methods that are used to analyze HDLSS data, including linear, nonlinear, and incremental methods.

Sparsity-inducing regularizers are among the dominant feature selection methods for HDLSS data.
Lasso [Tibshirani, 1996] minimizes the objective function penalized by the l_1 norm of feature weights, leading to a sparse model. Unfortunately, Lasso ignores the nonlinearity and interactions among features.
(1) Kernel methods are often used for nonlinear feature selection.
Feature Vector Machine (FVM) [Li et al., 2005];
HSIC-Lasso [Yamada et al., 2014] improves FVM by allowing different kernel functions for features and labels;
LAND [Yamada et al., 2016] further accelerates HSIC-Lasso for data with large sample sizes via kernel approximation and distributed computation.
(2) Decision tree models are also qualified for modeling nonlinear input-output relations.
random forests [Breiman, 2001]
Gradient boosted feature selection (GBFS) [Xu et al., 2014]
The aforementioned nonlinear methods, including FVM, random forests and GBFS, require training data with large sample size.
HSIC-Lasso and LAND fit the HDLSS setting. However, compared to the proposed DNP model, which is end-to-end, HSIC-Lasso and LAND are two-stage algorithms that separate feature selection from classification.
Besides DNP method, there exist other greedy and incremental feature selection algorithms.
**SpAM:** sequentially selects an individual feature in an additive manner, thereby missing important interactions among features.
Grafting method & convex neural network: only consider a single hidden layer, and differ from DNP in motivation (Grafting focuses on accelerating the algorithm, while the convex neural network focuses on the theoretical understanding of neural networks).
Deep feature selection (DFS): selects features in the context of DNNs. However, according to our experiments, DFS fails to achieve sparse connections when facing HDLSS data.

DNP Model
We introduce the notation:
$F \in R^{d}$: the d-dimensional input feature space;
$X=(X_{1},X_{2},\dots,X_{n})$, $y=(y_{1},y_{2},\dots,y_{n})^{T}$: the data matrix of n samples and their corresponding labels ($d \gg n$);
$f(X|W)$: a feed-forward neural network whose connection weights are denoted by $W$;
$W_{F}$: the input weights, i.e., the weights of the connections between the input layer and the first hidden layer;
$G_{F}$: the corresponding gradients.

Figure 1: (1) The selected features and the corresponding sub-network. (2) The selection of a single feature. (3) Calculating gradients with lower variance via multiple dropouts.
DNP for High Dimensionality
We detail the DNP model for feature selection, which alleviates the overfitting caused by high dimensionality.
For a feed-forward neural network, we select a specific input feature if at least one of the connections associated with that feature has non-zero weight.
To achieve this goal, we place an $l_{p,1}$-norm constraint on the input weights, i.e., on $||W_{F}||_{(p,1)}$.
$W_{F_{j}}$: the weights in $W_{F}$ associated with the j-th input node.
We define the $l_{p,1}$ norm of the input weights as $||W_{F}||_{(p,1)}={\sum}_{j}||W_{F_{j}}||_{p}$, where $||·||_{p}$ is the $l_{p}$ norm of a vector.
We assume that the weights in $W_{F_{j}}$ form a group.
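As a small illustration (the shapes and values here are made up), the group structure and the $l_{2,1}$ norm can be computed directly: row j of the matrix plays the role of $W_{F_j}$, the weights from input feature j to all first-hidden-layer units.

```python
# Illustrative computation of the l_{2,1} group norm of input weights.
import numpy as np

W_F = np.array([[0.0, 0.0, 0.0],    # feature 0: all-zero group -> not selected
                [1.0, 2.0, 2.0],    # feature 1: l_2 norm = 3
                [0.0, 3.0, 4.0]])   # feature 2: l_2 norm = 5

group_norms = np.linalg.norm(W_F, ord=2, axis=1)  # one l_2 norm per feature
l21_norm = group_norms.sum()                      # ||W_F||_{2,1} = 0 + 3 + 5
selected = group_norms > 0                        # selected iff any weight is non-zero
print(l21_norm)  # 8.0
```

Driving whole rows to zero is exactly what makes the constraint act as a feature selector.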
A general form of the objective function for training the feed-forward network is formulated as problem (1). We only consider the binary classification problem and use the logistic loss in problem (1). (Extensions to multi-class classification, regression, or unsupervised reconstruction are straightforward.)
Directly optimizing problem (1) over HDLSS data is tricky for two reasons: (1) directly minimizing the $l_{p,1}$-constrained problem is difficult for the backpropagation algorithm; (2) direct optimization using all features easily gets stuck in a local optimum that suffers from severe overfitting.
Instead, we optimize problem (1) in a greedy and incremental manner.
The main idea of the proposed DNP: we optimize problem (1) over a small sub-network containing a small subset of features, which is much less difficult. The information obtained during training, in turn, guides us to incorporate more features, and the current sub-network serves as the initialization for a larger sub-network with more features involved.
The DNP method enjoys two advantages: (1) optimization over the small sub-network is much easier than over the full network;
(2) DNP simultaneously selects features and minimizes the training loss over the labeled data in an end-to-end manner; the selection process is not independent of the learning process.
The whole process of the feature selection in the DNP:
We maintain two sets, a selected set $S$ and a candidate set $C$, with $S ∪ C = F$. (Step 7) How are features selected using $G_{F}$?
The magnitude of a gradient implies how much the objective function may decrease by updating the corresponding weight; the norm of a group of gradients indicates how much the loss may decrease by updating that group of weights together.
We therefore assume that the larger $||G_{F_{j}}||_{q}$ is, the more the j-th feature contributes to minimizing problem (1). Consequently, we select the feature with the maximum $||G_{F_{j}}||_{q}$.
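The selection rule can be sketched in a few lines (the helper name and toy values are hypothetical; row j of G_F stands for the gradient group $G_{F_j}$, and q = 2):

```python
# Sketch of DNP's selection step: among the candidate features, pick the
# one whose backpropagated gradient group has the largest l_q norm.
import numpy as np

def select_feature(G_F, candidates, q=2):
    """Return the candidate index j maximizing ||G_{F_j}||_q."""
    norms = np.linalg.norm(G_F[candidates], ord=q, axis=1)
    return candidates[int(np.argmax(norms))]

G_F = np.array([[0.1, 0.0],
                [2.0, 1.0],   # largest gradient group -> most promising feature
                [0.5, 0.5]])
j = select_feature(G_F, candidates=[0, 1, 2])
print(j)  # 1
```

In the full algorithm the chosen index would then be moved from $C$ to $S$ before the next round of training.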
DNP for Small Sample Size
We present the use of multiple dropouts to handle the high-variance gradients caused by the small sample size.
Multiple dropouts improve DNP in two algorithmic aspects: (1) in step 6, DNP randomly drops neurons multiple times, computes $G_{F_{c}}$ based on the remaining neurons and connections, and averages the multiple $G_{F_{c}}$. This yields averaged gradients with lower variance.
(2) Multiple dropouts give DNP stable feature selection: combining the features selected over many random sub-networks makes the DNP method more stable and powerful.
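A toy numeric illustration of the averaging step (this is not the authors' code; the model here is a bare logistic unit with input dropout, and all sizes are made up): gradients are computed under several random dropout masks and then averaged before being used for selection.

```python
# Average gradient estimates over multiple random dropout masks to
# reduce their variance, as DNP does before selecting a feature.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                  # tiny made-up HDLSS-style batch
y = rng.integers(0, 2, size=8).astype(float)
w = np.zeros(4)

def grad_with_mask(X, y, w, mask):
    """Logistic-loss gradient with some inputs dropped for this pass."""
    Xm = X * mask
    p = 1.0 / (1.0 + np.exp(-(Xm @ w)))
    return Xm.T @ (p - y) / len(y)

K = 10                                       # number of dropout passes
grads = [grad_with_mask(X, y, w, rng.random(4) < 0.5) for _ in range(K)]
g_avg = np.mean(grads, axis=0)               # averaged, lower-variance gradient
```

Averaging K independent noisy estimates shrinks the variance of the estimate, which is what makes the subsequent argmax over gradient norms reliable.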
Stagewise vs Stepwise
Updating input weights $W_{S}$ in step 5 of Algorithm 1 has two choices, i.e., the stagewise and stepwise approaches.
We combine both approaches by dynamically adapting the learning rate of each weight with Adagrad. As a result, like the stepwise approach, all selected weights $W_{S}$ receive updates, but, like the stagewise approach, the weights of newly selected features receive larger updates.
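A minimal Adagrad-style step makes the blend concrete (a generic sketch, not the authors' code): each weight's learning rate shrinks with its own accumulated squared gradients, so a weight with a short gradient history, i.e. a newly selected feature, takes larger steps than long-selected ones.

```python
# Adagrad-style per-weight learning rate.
import numpy as np

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    accum = accum + g ** 2                    # per-weight gradient history
    w = w - lr * g / (np.sqrt(accum) + eps)   # adaptive per-weight step
    return w, accum

w, accum = np.zeros(3), np.zeros(3)
w, accum = adagrad_step(w, np.array([1.0, 1.0, 0.0]), accum)  # weight 2 not yet selected
w, accum = adagrad_step(w, np.array([1.0, 1.0, 1.0]), accum)  # weight 2 just selected
# On the second step, weight 2 moves by ~0.1 while weights 0 and 1 move
# by only ~0.1/sqrt(2): the newcomer receives the larger update.
```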
Time Complexity
The time complexity of DNP is dominated by the backpropagation which is O(hknd), where h is a constant decided by the network structure of DNP. The time complexity grows linearly with respect to the number of selected features k, the sample size n, and the feature dimension d.
Experiments
General experimental part
We compare the proposed DNP method with three representative feature selection algorithms, including $l_{1}$-penalized logistic regression (LogR-$l_{1}$), gradient boosted feature selection (GBFS) [Xu et al., 2014], and HSIC-Lasso [Yamada et al., 2014].
Evaluation criteria: the F1 score for correct selection of the true features (can the method identify the features that the labels truly depend on?) and the test AUC score (can the method learn an accurate classifier based on the selected features?).
Experiments on Synthetic Data
We first synthesize highly complex and nonlinear data to investigate the performance of different algorithms.
To generate the synthetic data, we first draw input samples X from the uniform distribution U(-1,1) (the feature dimension d is fixed at 10,000). We then obtain the corresponding labels by passing X through a feed-forward neural network with {50, 30, 15, 10} ReLU hidden units in four hidden layers. The input weights connected to the first m dimensions, i.e., $W_{F_{1\dots m}}$, are randomly sampled from a Gaussian distribution N(0, 0.5); the remaining connections are kept at zero, so the first m features are the true features that determine the label. To add noise, we randomly flip 5% of the labels. For each setting of m, we generate 800 training samples, 200 validation samples, and 7,500 test samples (all sample sizes ≪ d).
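The recipe can be sketched at a reduced scale (d and the sample counts are shrunk so the sketch runs fast; the paper uses d = 10,000 and 800/200/7,500 samples). The output layer and the thresholding of the network's score into a binary label are assumptions, since the text does not specify them:

```python
# Reduced-scale sketch of the synthetic HDLSS data generation.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 200, 5, 100                        # shrunk from the paper's sizes
layers = [50, 30, 15, 10]                    # ReLU hidden units per layer

X = rng.uniform(-1, 1, size=(n, d))

# Input weights: only the first m (true) features carry non-zero weight.
W_in = np.zeros((d, layers[0]))
W_in[:m] = rng.normal(0, 0.5, size=(m, layers[0]))
h = np.maximum(X @ W_in, 0)
for k in range(1, len(layers)):              # remaining random ReLU layers
    h = np.maximum(h @ rng.normal(0, 0.5, size=(layers[k - 1], layers[k])), 0)
scores = h @ rng.normal(0, 0.5, size=layers[-1])   # assumed scalar output
y = (scores > np.median(scores)).astype(int)       # assumed label threshold

flip = rng.random(n) < 0.05                  # flip 5% of labels as noise
y[flip] = 1 - y[flip]
```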
When m = 2, we visualize the decision boundaries learned by the different algorithms.

Figure 2: Decision boundaries learned by different algorithms on 10,000-dimensional synthetic data with two true features. The x-axis and y-axis denote the two true features.
Figures (a) and (b) plot the positive samples in black, and (c)-(f) plot the predicted positive samples in black.
LogR-$l_{1}$ only learns a linear decision boundary, which is insufficient for highly complex and nonlinear data. GBFS uses a regression tree as its base learner, thereby achieving an axis-parallel decision boundary. HSIC-Lasso and the proposed DNP not only model nonlinear decision boundaries but also exactly identify the two true features.

Table 1: Performance of classification and feature selection on synthetic datasets with different numbers of true features. The statistically best performance is shown in bold.
In terms of the test AUC score, DNP and HSIC-Lasso both show superior performance over the others. DNP performs best on all the datasets and significantly outperforms HSIC-Lasso when m = 10 (t-test, p-value < 0.05).
In terms of the F1 score for feature selection, DNP performs the best on all datasets and it even outperforms the most competitive baseline, HSIC-Lasso, by 8.65% on average.
GBFS consistently performs worst in terms of both classification and feature selection.
Experiments on Real-World Biological Datasets
To investigate the performance of DNP on real-world datasets, we use six public biological datasets, all of which suffer from the HDLSS problem. We report the average results over 10 random splits with 80% of the data for training, 10% for validation, and 10% for testing.
In Fig. 3, we investigate the average test AUC scores with respect to the number of selected features. We use a circle as an indicator when DNP is outperformed by the best baseline and a star when DNP outperforms the best baseline significantly (t-test, p-value < 0.05).
On all six datasets, test AUC scores of DNP converge quickly within fewer than 10 iterations.
On the leukemia dataset, the proposed DNP method significantly outperforms the best baseline no matter how many features are selected.
For the ALLAML and Prostate GE datasets, LogR-$l_1$ serves as a competitive baseline, as it outperforms the other methods when few features are selected. However, DNP achieves a comparable test AUC score once more features are involved.
For the other three datasets, DNP outperforms GBFS significantly and performs comparably to LogR-$l_1$ and HSIC-Lasso.
On average across six real-world datasets, DNP outperforms the most competitive baseline, HSIC-Lasso, by 2.53% in terms of the average test AUC score.
In summary, DNP can achieve comparable or improved performance over baselines on the six real-world datasets.
Unique experiments in the improved model
the effect of the size of the training data
We compare DNP with the baselines by varying the training sample size, while the sample sizes for validation and testing are kept fixed.
Fig. 4 shows the average test AUC scores across the six real-world datasets. All the methods in comparison suffer as the training sample size decreases. GBFS, designed for large sample sizes, suffers the most. LogR-$l_{1}$, HSIC-Lasso, and DNP perform similarly at small sample sizes. However, when only 10% or 30% of the training samples are used, DNP slightly outperforms the other baselines.
the role of multiple dropouts
We compare the performance of DNP with and without multiple dropouts: multiple dropouts improve the test AUC score on five out of six datasets. We measure the stability of the algorithms with the Tanimoto distance [Kalousis et al., 2007].
We measure the stability of DNP by averaging the similarities calculated from all pairs of training sets generated by 10-fold cross-validation. A higher stability score implies a more stable algorithm.
DNP with multiple dropouts is clearly more stable than DNP without dropout on all six datasets.
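The similarity underlying this stability measure is the Tanimoto (Jaccard) overlap between the feature sets selected on two different training sets (one common convention takes the distance to be one minus this similarity):

```python
# Tanimoto (Jaccard) similarity between two selected feature sets:
# the shared features as a fraction of all features selected by either run.
def tanimoto(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5: two shared features out of four total
```

A similarity of 1 means the two runs selected exactly the same features; averaging this over all pairs of folds gives the stability score reported above.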
how the hyper-parameters influence the performance of DNP
For a DNP model with a specific number of hidden layers, we calculate the average test AUC score over 10 random splits.

Table 3: The best number of hidden layers for DNP.

On five real-world datasets, DNP with three or four hidden layers outperforms DNP with one, two, or five hidden layers. The results coincide with our motivation that deeper neural networks are better suited to complex datasets. Meanwhile, due to the small sample size, training DNNs with more hidden layers is extremely challenging, which incurs inferior performance.
Conclusions
We propose a DNP model tailored for the high dimension, low sample size data.
DNP can select features in a nonlinear way. By selecting features incrementally, DNP is robust to high dimensionality.
By using the multiple dropouts technique, DNP can learn from a small number of samples and is stable for feature selection.
Moreover, the training of DNP is end-to-end. Empirical results verify its good performance in both classification and feature selection.
In the future, we plan to use more sophisticated network architectures in place of the simple multi-layer perceptron and to apply DNP to more domains that suffer from the HDLSS problem.


Data preprocessing such as standardization or feature scaling:

https://en.wikipedia.org/wiki/Feature_scaling

Before we implement PCA, we will need to do some data preprocessing. In this assessment, some of the steps will be implemented by you; others we will take care of. However, when you are working on real-world problems, you will need to do all of these steps by yourself!

The preprocessing steps we will do are

Convert the unsigned 8-bit integer (uint8) encoding of pixels to floating point numbers between 0 and 1.
Subtract from each image the mean μ.
Scale each dimension of each image by 1/σ, where σ is the standard deviation.
1. PCA

Now we will implement PCA. Before we do that, let's pause for a moment and think about the steps for performing PCA. Assume that we are performing PCA on some dataset X for M principal components. We then need to perform the following steps, which we break into parts:

Data normalization (normalize).
Find the eigenvalues and corresponding eigenvectors of the covariance matrix S. Sort by the largest eigenvalues and take the corresponding eigenvectors (eig).
After these steps, we can then compute the projection and reconstruction of the data onto the space spanned by the top M eigenvectors.

Recall that the principal basis vector is the eigenvector associated with the largest eigenvalue of the covariance matrix.

Code:

# PACKAGE: DO NOT EDIT THIS CELL
import numpy as np
import timeit

# PACKAGE: DO NOT EDIT THIS CELL
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact

from load_data import load_mnist

MNIST = load_mnist()
images, labels = MNIST['data'], MNIST['target']

# GRADED FUNCTION: DO NOT EDIT THIS LINE

def normalize(X):
    """Normalize the given dataset X
    Args:
        X: ndarray, dataset

    Returns:
        (Xbar, mean, std): tuple of ndarray, Xbar is the normalized dataset
        with mean 0 and standard deviation 1; mean and std are the
        mean and standard deviation respectively.

    Note:
        You will encounter dimensions where the standard deviation is
        zero; for those, the normalized data would be NaN. Handle this
        by using std = 1 for those dimensions when normalizing.
    """
    mu = np.mean(X, axis=0, keepdims=True)
    std = np.std(X, axis=0, keepdims=True)
    std_filled = std.copy()
    std_filled[std == 0] = 1.  # avoid division by zero on constant dimensions
    Xbar = (X - mu) / std_filled
    return Xbar, mu, std

def eig(S):
    """Compute the eigenvalues and corresponding eigenvectors
    for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix

    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors,
        sorted in descending order of the eigenvalues
    """
    eVals, eVecs = np.linalg.eig(S)
    order = np.argsort(eVals)[::-1]  # indices that sort eigenvalues descending
    eVals = eVals[order]
    eVecs = eVecs[:, order]
    return (eVals, eVecs)

def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by B
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace

    Returns:
        P: the projection matrix
    """
    P = B @ np.linalg.inv(B.T @ B) @ B.T
    return P

def PCA(X, num_components):
    """
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: ndarray of the reconstruction
        of X from the first num_components principal components.
    """
    # This solution takes advantage of the functions implemented above.
    N, D = X.shape
    X_normalized, mu, std = normalize(X)
    # covariance matrix of the mean-0 data
    S = (X_normalized.T @ X_normalized) / N
    eig_vals, onb = eig(S)
    onb = onb[:, :num_components]  # basis: the top num_components eigenvectors
    # projection matrix of dimension (D, D)
    P = projection_matrix(onb)
    X_projection = P @ X.T
    return X_projection.T

## Some preprocessing of the data
NUM_DATAPOINTS = 1000
X = (images.reshape(-1, 28 * 28)[:NUM_DATAPOINTS]) / 255.
Xbar, mu, std = normalize(X)

from sklearn.decomposition import PCA as SKPCA

for num_component in range(1, 20):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver='full')
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(Xbar))
    reconst = PCA(Xbar, num_component)
    np.testing.assert_almost_equal(reconst, sklearn_reconst)
    print(np.square(reconst - sklearn_reconst).sum())

Result:

(8.5153870005e-24+0j)
(8.09790151532e-24+0j)
(9.61487939311e-24+0j)
(6.39164394758e-24+0j)
(1.19817697147e-23+0j)
(9.18939009489e-24+0j)
(2.46356799263e-23+0j)
(2.04450491509e-23+0j)
(2.35281327024e-23+0j)
(2.33297802189e-22+0j)
(9.45193136857e-23+0j)
(9.82734807213e-23+0j)
(1.596514124e-22+0j)
(7.20916435378e-23+0j)
(2.9098190907e-23+0j)
(3.7462168164e-23+0j)
(3.22053322424e-23+0j)
(2.71427239921e-23+0j)
(1.11240190546e-22+0j)

Calculate the MSE for the dataset:

def mse(predict, actual):
    """Helper function for computing the mean squared error (MSE)"""
    return np.square(predict - actual).sum(axis=1).mean()

loss = []
reconstructions = []
# iterate over different numbers of principal components, and compute the MSE
for num_component in range(1, 100):
    reconst = PCA(Xbar, num_component)
    error = mse(reconst, Xbar)
    reconstructions.append(reconst)
    print('n = {:d}, reconstruction_error = {:f}'.format(num_component, error))
    loss.append((num_component, error))

reconstructions = np.asarray(reconstructions)
reconstructions = reconstructions * std + mu  # "unnormalize" the reconstructed image
loss = np.asarray(loss)

import pandas as pd
# create a table showing the number of principal components and MSE
pd.DataFrame(loss).head()

fig, ax = plt.subplots()
ax.plot(loss[:, 0], loss[:, 1])
ax.axhline(100, linestyle='--', color='r', linewidth=2)
ax.xaxis.set_ticks(np.arange(1, 100, 5))
ax.set(xlabel='num_components', ylabel='MSE', title='MSE vs number of principal components')

@interact(image_idx=(0, 1000))
def show_num_components_reconst(image_idx):
    fig, ax = plt.subplots(figsize=(20., 20.))
    actual = X[image_idx]
    # concatenate the actual and reconstructed images into one large image before plotting
    x = np.concatenate([actual[np.newaxis, :], reconstructions[:, image_idx]])
    ax.imshow(np.hstack(x.reshape(-1, 28, 28)[np.arange(10)]),
              cmap='gray')
    ax.axvline(28, color='orange', linewidth=2)

@interact(i=(0, 10))
def show_pca_digits(i=1):
    """Show the i-th digit and its reconstruction"""
    plt.figure(figsize=(4, 4))
    actual_sample = X[i].reshape(28, 28)
    reconst_sample = (reconst[i, :] * std + mu).reshape(28, 28)
    plt.imshow(np.hstack([actual_sample, reconst_sample]), cmap='gray')
    plt.show()

2. PCA for high-dimensional datasets

Sometimes, the dimensionality of our dataset may be larger than the number of samples we have. Then it might be inefficient to perform PCA with your implementation above. Instead, as mentioned in the lectures, you can implement PCA in a more efficient manner, which we call "PCA for high dimensional data" (PCA_high_dim).
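In case the equivalence between the two formulations is not obvious, here is the standard argument (using the convention of these notes that X is an N × D data matrix): if $\frac{1}{N} X X^T \mathbf{u} = \lambda \mathbf{u}$ for an eigenvector $\mathbf{u}$ of the small $N \times N$ matrix, then left-multiplying both sides by $X^T$ gives $\frac{1}{N} X^T X (X^T \mathbf{u}) = \lambda (X^T \mathbf{u})$, so $X^T \mathbf{u}$ is an eigenvector of the $D \times D$ covariance matrix $S = \frac{1}{N} X^T X$ with the same eigenvalue $\lambda$. Since at most N eigenvalues are non-zero, the expensive $D \times D$ eigenproblem can be replaced by the cheap $N \times N$ one whenever $N \ll D$.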

Below are the steps for performing PCA for a high dimensional dataset.

# GRADED FUNCTION: DO NOT EDIT THIS LINE
### PCA for high dimensional datasets

def PCA_high_dim(X, n_components):
    """Compute PCA for small sample size but high-dimensional features.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the sample,
           and N is the number of samples
        n_components: the number of principal components to use.
    Returns:
        X_reconstruct: (N, D) ndarray, the reconstruction
        of X from the first n_components principal components.
    """
    N, D = X.shape
    # N x N "small" Gram matrix instead of the D x D covariance matrix
    S_prime = (X @ X.T) / N
    eig_vals_prime, onb_prime = eig(S_prime)
    onb_prime = onb_prime[:, :n_components]
    # recover the principal subspace U in the original D-dimensional space
    U = X.T @ onb_prime  # (D, N) @ (N, n_components)
    P = projection_matrix(U)
    X_projection = P @ X.T
    return X_projection.T

Test case:

np.testing.assert_almost_equal(PCA(Xbar, 2), PCA_high_dim(Xbar, 2))

Time Complexity Analysis:

Now let's compare the running time between PCA and PCA_high_dim.

Tips for running benchmarks or computationally expensive code:

When you have a computation that takes a non-negligible amount of time, try separating the code that produces the output from the code that analyzes the results (e.g., plotting the results or computing statistics of the results). In this way, you don't have to recompute everything when you want to produce more analysis.

The next cell includes a function that records the time taken to execute a function f by repeating it repeat times. You do not need to modify the function, but you can use it to compare the running times of the functions you are interested in.

def time(f, repeat=10):
    """Return the mean and std of the running time of f over repeat runs."""
    times = []
    for _ in range(repeat):
        start = timeit.default_timer()
        f()
        stop = timeit.default_timer()
        times.append(stop - start)
    return np.mean(times), np.std(times)

times_mm0 = []
times_mm1 = []

# iterate over datasets of different sizes
for datasetsize in np.arange(4, 784, step=20):
    XX = Xbar[:datasetsize]  # select the first datasetsize samples in the dataset
    # record the running time for computing X.T @ X
    mu, sigma = time(lambda: XX.T @ XX)
    times_mm0.append((datasetsize, mu, sigma))

    # record the running time for computing X @ X.T
    mu, sigma = time(lambda: XX @ XX.T)
    times_mm1.append((datasetsize, mu, sigma))

times_mm0 = np.asarray(times_mm0)
times_mm1 = np.asarray(times_mm1)

fig, ax = plt.subplots()
ax.set(xlabel='size of dataset', ylabel='running time')
bar = ax.errorbar(times_mm0[:, 0], times_mm0[:, 1], times_mm0[:, 2], label="$X^T X$ (PCA)", linewidth=2)
ax.errorbar(times_mm1[:, 0], times_mm1[:, 1], times_mm1[:, 2], label="$X X^T$ (PCA_high_dim)", linewidth=2)
ax.legend()

Benchmark for PCA and PCA_high_dim:

times0 = []
times1 = []

# iterate over datasets of different sizes
for datasetsize in np.arange(4, 784, step=100):
    XX = Xbar[:datasetsize]
    npc = 2
    mu, sigma = time(lambda: PCA(XX, npc), repeat=10)
    times0.append((datasetsize, mu, sigma))

    mu, sigma = time(lambda: PCA_high_dim(XX, npc), repeat=10)
    times1.append((datasetsize, mu, sigma))

times0 = np.asarray(times0)
times1 = np.asarray(times1)

fig, ax = plt.subplots()
ax.set(xlabel='number of datapoints', ylabel='run time')
ax.errorbar(times0[:, 0], times0[:, 1], times0[:, 2], label="PCA", linewidth=2)
ax.errorbar(times1[:, 0], times1[:, 1], times1[:, 2], label="PCA_high_dim", linewidth=2)
ax.legend()