Using PySEF
How to get started?
After installing PySEF (see Installation), simply import sef_dr, create a SEF object, fit it and transform your data:
import sef_dr
proj = sef_dr.LinearSEF(input_dimensionality=784, output_dimensionality=9)
proj.fit(data=data, target_labels=labels, target='supervised', epochs=10)
transformed_data = proj.transform(data)
The input_dimensionality parameter defines the dimensionality of the input data, while output_dimensionality sets the desired dimensionality of the projected data. The projection is then learned with the .fit() method, whose target parameter selects the dimensionality reduction method. PySEF provides many predefined targets/methods, and new ones can be implemented easily, as shown in Extending PySEF. Several dimensionality reduction scenarios are discussed in the following sections; all experiments use the well-known MNIST dataset.
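To make the two dimensionality parameters concrete, the sketch below pushes a small batch of vectors through a fixed linear map. It is not PySEF's implementation: `project` and the weight matrix `W` are hypothetical names, and the only assumption is that a linear projection amounts to a matrix product with one row of `W` per input dimension and one column per output dimension.

```python
def project(data, W):
    """Map each row of `data` (n_samples x d_in) through W (d_in x d_out)."""
    d_in = len(W)
    for row in data:
        if len(row) != d_in:
            raise ValueError("each sample must have %d features" % d_in)
    # zip(*W) iterates over the columns of W
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in data]


# Hypothetical toy values: 4 input dimensions, 2 output dimensions
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [0.0, 0.0]]
data = [[1.0, 2.0, 3.0, 4.0],
        [0.0, 1.0, 0.0, 1.0]]
projected = project(data, W)  # 2 samples, each now with 2 features
```

In the snippet at the top of this section the same roles are played by 784 input and 9 output dimensions; the difference is that `.fit()` learns the weights rather than taking them as given.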
Using GPU acceleration
Following the PyTorch calling conventions, call the .cuda() method to run the optimization/projection on the GPU:
proj.cuda()
To move the model back to the CPU, call the .cpu() method:
proj.cpu()
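On machines without a CUDA device, calling .cuda() unconditionally will typically fail, so scripts meant to run in both settings may want a guard. The helper below is a minimal sketch, assuming PyTorch is the backend and that the object exposes the .cuda() and .cpu() methods shown above; `place_on_best_device` is a hypothetical name, not part of PySEF.

```python
def place_on_best_device(model):
    """Call model.cuda() when a CUDA device is usable, otherwise model.cpu().

    Returns True if the model was moved to the GPU.
    """
    try:
        import torch  # assumption: PyTorch is installed alongside PySEF
        use_gpu = torch.cuda.is_available()
    except ImportError:
        use_gpu = False
    if use_gpu:
        model.cuda()
    else:
        model.cpu()
    return use_gpu
```

With this in place, `place_on_best_device(proj)` could replace the bare `proj.cuda()` calls in the examples that follow.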
Data loading
To make it easy to evaluate and compare different dimensionality reduction techniques, we have included data loading capabilities in PySEF. Before running any of the following examples, please download the pre-extracted feature vectors from the following Dropbox folder. After downloading them into a folder, e.g., /home/nick/my_data, you can load any of the six supported datasets as follows:
from sef_dr.datasets import dataset_loader
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='mnist', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='20ng', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='15scene', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='corel', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='yale', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='kth', dataset_path='/home/nick/my_data')
The MNIST dataset and the 20NG dataset are downloaded automatically into the specified folder the first time the dataset_loader() function is called. Please refer to this paper for a detailed description of the evaluation setup and feature extraction process.
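Because a wrong dataset name or an incomplete download is easy to miss, a thin wrapper around the loader can catch such problems early. The sketch below is illustrative only: `load_checked` and `SUPPORTED_DATASETS` are hypothetical names, the loader is passed in as an argument, and the only things assumed about it are the keyword arguments and the four return values shown above.

```python
SUPPORTED_DATASETS = ('mnist', '20ng', '15scene', 'corel', 'yale', 'kth')


def load_checked(loader, dataset, dataset_path):
    """Call `loader` and verify that samples and labels line up."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError("unknown dataset %r; expected one of: %s"
                         % (dataset, ", ".join(SUPPORTED_DATASETS)))
    train_data, train_labels, test_data, test_labels = loader(
        dataset=dataset, dataset_path=dataset_path)
    if len(train_data) != len(train_labels):
        raise ValueError("train data and labels differ in length")
    if len(test_data) != len(test_labels):
        raise ValueError("test data and labels differ in length")
    return train_data, train_labels, test_data, test_labels
```

A call would then look like `load_checked(dataset_loader, 'mnist', '/home/nick/my_data')`.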
Recreating the geometry of a high dimensional space in a space with fewer dimensions
In unsupervised_approximation.py we demonstrate how to recreate the 50-d PCA using just 10 dimensions:
# Learn a high dimensional projection
proj_to_copy = PCA(n_components=50)
proj_to_copy.fit(train_data[:n_train_samples, :])
target_data = np.float32(proj_to_copy.transform(train_data[:n_train_samples, :]))
# Approximate it using the SEF and 10 dimensions
proj = LinearSEF(train_data.shape[1], output_dimensionality=10)
proj.cuda()
loss = proj.fit(data=train_data[:n_train_samples, :], target_data=target_data, target='copy', epochs=50, batch_size=128, verbose=True, learning_rate=0.001, regularizer_weight=0.001)
# Evaluate the method
acc = evaluate_svm(proj.transform(train_data[:n_train_samples, :]), train_labels[:n_train_samples], proj.transform(test_data), test_labels)
The experimental results demonstrate the ability of the proposed method to efficiently recreate the geometry of a high dimensional space in a space with fewer dimensions:
Method | Accuracy |
---|---|
PCA 10-d | 82.88% |
Linear SEF mimics PCA-20d | 84.87% |
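One way to see why a low-dimensional space can stand in for a higher-dimensional one: a pairwise similarity matrix has one entry per pair of samples, so its shape does not depend on the number of features. The sketch below assumes a Gaussian similarity with a hypothetical bandwidth `sigma`; the similarity function and loss that PySEF actually uses are not described in this section, so `gaussian_similarity` and `mean_abs_difference` are illustrative names only.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Return the n x n matrix S[i][j] = exp(-||p_i - p_j||^2 / sigma^2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [[math.exp(-sq_dist(p, q) / sigma ** 2) for q in points]
            for p in points]


def mean_abs_difference(S, T):
    """Average absolute gap between two similarity matrices of equal shape."""
    n = len(S)
    return sum(abs(S[i][j] - T[i][j])
               for i in range(n) for j in range(n)) / (n * n)


# Three samples in a 3-d "target" space and in a 1-d "embedding" space
target = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
embedding = [[0.0], [1.0], [3.0]]

S_target = gaussian_similarity(target)
S_embedding = gaussian_similarity(embedding)
gap = mean_abs_difference(S_target, S_embedding)
```

Both matrices are 3 x 3 even though the two spaces differ in dimensionality, and in this toy case the pairwise distances are identical, so the gap is zero. A learned embedding would aim to make such a gap small rather than exactly zero.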
Re-deriving similarity-based versions of well-known techniques
In supervised_reduction.py we demonstrate how to re-derive similarity-based versions of well-known techniques. More specifically, a similarity-based LDA-like technique is derived:
proj = LinearSEF(train_data.shape[1], output_dimensionality=(n_classes - 1))
proj.cuda()
loss = proj.fit(data=train_data[:n_train, :], target_labels=train_labels[:n_train], epochs=50, target='supervised', batch_size=128, regularizer_weight=0.001, verbose=True)
The SEF-based method outperforms LDA at the same dimensionality (88.89% vs. 85.66% at 9d), and a further small gain is obtained at 18d:
Method | Dimensionality | Accuracy |
---|---|---|
LDA | 9d | 85.66% |
Linear SEF | 9d | 88.89% |
Linear SEF | 18d | 89.48% |
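LDA can produce at most n_classes - 1 discriminant directions (and never more than the number of input features), which is why the LDA row stops at 9d for the ten MNIST digit classes while the Linear SEF is also reported at 18d. A small helper, with the hypothetical name `lda_max_dims`, makes the arithmetic explicit:

```python
def lda_max_dims(labels, n_features):
    """Upper bound on the number of LDA components for the given labels."""
    n_classes = len(set(labels))
    return min(n_classes - 1, n_features)


# Ten digit classes and 784 input features, as in the MNIST examples
digit_labels = [i % 10 for i in range(100)]
max_dims = lda_max_dims(digit_labels, n_features=784)
```

The `output_dimensionality` passed to LinearSEF is a free parameter and is not tied to this bound.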
Providing out-of-sample extensions
In linear_outofsample.py and kernel_outofsample.py we use the SEF to provide (linear and kernel) out-of-sample extensions for the ISOMAP technique. Note that the SEF, unlike the regression-based method, is not limited by the number of dimensions of the original technique:
isomap = Isomap(n_components=10, n_neighbors=20)
train_data_isomap = np.float32(isomap.fit_transform(train_data[:n_train_samples, :]))
proj = LinearSEF(train_data.shape[1], output_dimensionality=10)
proj.cuda()
loss = proj.fit(data=train_data[:n_train_samples, :], target_data=train_data_isomap, target='copy', epochs=50, batch_size=128, verbose=True, learning_rate=0.001, regularizer_weight=0.001)
The results for the linear and the kernel extensions are shown in the first and the second table, respectively:
Method | Dimensionality | Accuracy |
---|---|---|
Linear Regression | 10d | 85.25% |
Linear SEF | 10d | 85.76% |
Linear SEF | 20d | 89.48% |
Method | Dimensionality | Accuracy |
---|---|---|
Kernel Regression | 10d | 89.48% |
Kernel SEF | 10d | 88.60% |
Kernel SEF | 20d | 90.88% |
Performing SVM-based analysis
Finally, in svm_approximation.py an SVM-based analysis technique that mimics the similarity induced by the hyperplanes of the 1-vs-1 SVMs is used to perform dimensionality reduction (DR). This allows a lightweight classifier, such as the nearest centroid classifier (NCC), to be used for fast classification:
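from sklearn import svm
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.preprocessing import MinMaxScaler
# Learn an SVM
# Scale the features to [0, 1] using statistics from the training data only
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
# Select the regularization strength C by 3-fold cross-validation
parameters = {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
model = GridSearchCV(svm.SVC(max_iter=10000, decision_function_shape='ovo'), parameters, n_jobs=-1, cv=3)
model.fit(train_data[:n_train], train_labels[:n_train])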
# Learn a similarity embedding
params = {'model': model, 'n_labels': np.unique(train_labels).shape[0], 'scaler': scaler}
proj = LinearSEF(train_data.shape[1], output_dimensionality=dims)
proj.cuda()
loss = proj.fit(data=train_data[:n_train, :], target_data=train_data[:n_train, :], target_labels=train_labels[:n_train], target='svm', target_params=params, epochs=50, learning_rate=0.001, batch_size=128, verbose=True, regularizer_weight=0.001)
This code repeatedly calls the SVM to calculate the similarity matrix for the samples in each batch. If the whole similarity matrix fits in memory, we can speed up this process by using a precomputed similarity matrix as follows:
from sef_dr.targets import generate_svm_similarity_matrix, sim_target_svm_precomputed
# Precompute the similarity matrix
Gt = generate_svm_similarity_matrix(train_data, train_labels, len(np.unique(train_labels)), model, scaler)
params = {'Gt': Gt}
proj = LinearSEF(train_data.shape[1], output_dimensionality=dims)
proj.cuda()
loss = proj.fit(data=train_data, target_data=train_data, target_labels=train_labels, target=sim_target_svm_precomputed, target_params=params, epochs=50, learning_rate=0.001, batch_size=128, verbose=True, regularizer_weight=0.001)
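The benefit of precomputing can be sketched independently of PySEF: once the full matrix is available, the similarities for a batch are just the sub-block selected by the batch indices, so no SVM call is needed per batch. How `sim_target_svm_precomputed` indexes `Gt` internally is an assumption here; `batch_block` and `similarity_matrix_bytes` are hypothetical helpers, and the 4-byte entry size assumes float32 storage.

```python
def batch_block(Gt, batch_idx):
    """Return the block of Gt restricted to the given sample indices."""
    return [[Gt[i][j] for j in batch_idx] for i in batch_idx]


def similarity_matrix_bytes(n_samples, itemsize=4):
    """Memory for a dense n x n matrix; itemsize=4 assumes float32 entries."""
    return n_samples * n_samples * itemsize


# Toy 4 x 4 similarity matrix (symmetric, ones on the diagonal)
Gt = [[1.0, 0.2, 0.5, 0.1],
      [0.2, 1.0, 0.3, 0.7],
      [0.5, 0.3, 1.0, 0.4],
      [0.1, 0.7, 0.4, 1.0]]
block = batch_block(Gt, [0, 2])  # similarities among samples 0 and 2
```

With a NumPy array the same block could be taken as `Gt[np.ix_(batch_idx, batch_idx)]`. Since the memory cost grows quadratically with the number of samples, `similarity_matrix_bytes` gives a quick check of whether precomputing is feasible for a given training set size.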
The results are shown in the following table:
Method | Dimensionality | Accuracy |
---|---|---|
NCC - Original | 784d | 80.84% |
NCC - Linear SEF | 10d | 86.50% |
NCC - Linear SEF | 20d | 86.67% |
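For readers unfamiliar with the NCC, the sketch below shows why it is so cheap: training reduces to one mean vector per class, and prediction to one distance computation per class. This is a dependency-free illustration with hypothetical function names, not the evaluation code behind the table above; scikit-learn's `NearestCentroid` offers a ready-made equivalent.

```python
def fit_centroids(data, labels):
    """Return {label: mean vector} computed from the training samples."""
    sums, counts = {}, {}
    for row, label in zip(data, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, data):
    """Assign each sample the label of its closest centroid (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda label: sq_dist(row, centroids[label]))
            for row in data]


# Toy 2-d data with two well-separated classes
train = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
train_labels = [0, 0, 1, 1]
centroids = fit_centroids(train, train_labels)
predictions = predict(centroids, [[0.5, 0.5], [4.0, 5.0]])
```

In the setting above, `train` would be the output of `proj.transform(...)`, so each distance is computed in 10 or 20 dimensions instead of 784.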