Using PySEF

How to get started?

After installing PySEF (see Installation), simply import sef_dr, create a SEF object, fit it and transform your data:

import sef_dr
proj = sef_dr.LinearSEF(input_dimensionality=784, output_dimensionality=9)
proj.fit(data=data, target_labels=labels, target='supervised', epochs=10)
transformed_data = proj.transform(data)

The input_dimensionality parameter defines the dimensionality of the input data, while output_dimensionality specifies the desired dimensionality of the projected data. The projection is then learned using the .fit() method. The dimensionality reduction method to be used is selected through the target parameter of .fit() (PySEF provides many predefined targets/methods, and new methods can also be easily implemented, as shown in Extending PySEF). Several different dimensionality reduction scenarios are discussed in the following sections (the well-known MNIST dataset is used for all the conducted experiments).
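
For instance, a minimal end-to-end sketch on synthetic data (the random arrays and their sizes below are purely illustrative) looks as follows:

import numpy as np
import sef_dr

# Illustrative random data: 1000 samples with 784 features and 10 classes
data = np.float32(np.random.randn(1000, 784))
labels = np.random.randint(0, 10, 1000)

proj = sef_dr.LinearSEF(input_dimensionality=784, output_dimensionality=9)
proj.fit(data=data, target_labels=labels, target='supervised', epochs=10)
transformed_data = proj.transform(data)
print(transformed_data.shape)  # expected: (1000, 9)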

Using GPU acceleration

Following the PyTorch calling conventions, the .cuda() method can be used to run the optimization/projection on the GPU:

proj.cuda()

To move the model back to the CPU, call the .cpu() method:

proj.cpu()
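
A typical workflow (reusing the proj, data and labels objects from the previous example) moves the model to the GPU before fitting and back to the CPU afterwards:

proj.cuda()   # run the optimization on the GPU (if one is available)
proj.fit(data=data, target_labels=labels, target='supervised', epochs=10)
proj.cpu()    # move the learned model back to the CPU
transformed_data = proj.transform(data)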

Data loading

To allow for easily evaluating and comparing different dimensionality reduction techniques, data loading capabilities are included in PySEF. Before running any of the following examples, please download the pre-extracted feature vectors from the following Dropbox folder. After downloading them into a folder, e.g., /home/nick/my_data, you can easily load any of the six supported datasets as follows:

from sef_dr.datasets import dataset_loader
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='mnist', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='20ng', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='15scene', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='corel', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='yale', dataset_path='/home/nick/my_data')
train_data, train_labels, test_data, test_labels = dataset_loader(dataset='kth', dataset_path='/home/nick/my_data')

The MNIST and 20NG datasets will be automatically downloaded into the specified folder the first time the dataset_loader() function is called. Please refer to this paper for a detailed description of the evaluation setup and the feature extraction process.
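
For instance, a quick sanity check after loading a dataset can be performed as follows (the shapes in the comments are only indicative; for MNIST the feature vectors are the 784 raw pixel values):

train_data, train_labels, test_data, test_labels = dataset_loader(dataset='mnist', dataset_path='/home/nick/my_data')
print(train_data.shape, train_labels.shape)  # e.g., (60000, 784) and (60000,) for MNIST
print(test_data.shape, test_labels.shape)    # e.g., (10000, 784) and (10000,)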

Recreating the geometry of a high-dimensional space in a space with fewer dimensions

In unsupervised_approximation.py we demonstrate how to recreate the 50-d PCA projection using just 10 dimensions:

# Assumed imports (PCA from scikit-learn; evaluate_svm is assumed to be available, as in the PySEF example scripts)
import numpy as np
from sklearn.decomposition import PCA
from sef_dr import LinearSEF

# Learn a high-dimensional projection
proj_to_copy = PCA(n_components=50)
proj_to_copy.fit(train_data[:n_train_samples, :])
target_data = np.float32(proj_to_copy.transform(train_data[:n_train_samples, :]))

# Approximate it using the SEF and 10 dimensions
proj = LinearSEF(train_data.shape[1], output_dimensionality=10)
proj.cuda()
loss = proj.fit(data=train_data[:n_train_samples, :], target_data=target_data, target='copy', epochs=50, batch_size=128, verbose=True, learning_rate=0.001, regularizer_weight=0.001)

# Evaluate the learned projection with a linear SVM
acc = evaluate_svm(proj.transform(train_data[:n_train_samples, :]), train_labels[:n_train_samples], proj.transform(test_data), test_labels)

The experimental results demonstrate the ability of the proposed method to efficiently recreate the geometry of a high-dimensional space in a space with fewer dimensions:

Method                        Accuracy
PCA 10-d                      82.88%
Linear SEF mimics PCA 50-d    84.87%
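
For reference, the PCA 10-d baseline in the table above can be computed along the following lines (a sketch reusing the names from the previous snippet; evaluate_svm as before):

# Plain 10-d PCA, evaluated in the same way as the SEF projection
baseline = PCA(n_components=10)
baseline.fit(train_data[:n_train_samples, :])
acc_pca = evaluate_svm(baseline.transform(train_data[:n_train_samples, :]), train_labels[:n_train_samples], baseline.transform(test_data), test_labels)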

Re-deriving similarity-based versions of well-known techniques

In supervised_reduction.py we demonstrate how to re-derive similarity-based versions of well-known techniques. More specifically, a similarity-based LDA-like technique is derived:

# Learn a supervised (LDA-like) similarity embedding with n_classes - 1 dimensions
proj = LinearSEF(train_data.shape[1], output_dimensionality=(n_classes - 1))
proj.cuda()
loss = proj.fit(data=train_data[:n_train, :], target_labels=train_labels[:n_train], target='supervised', epochs=50, batch_size=128, regularizer_weight=0.001, verbose=True)

The SEF-based method leads to superior results:

Method Dimensionality Accuracy
LDA 9d 85.66%
Linear SEF 9d 88.89%
Linear SEF 18d 89.48%
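
For reference, the LDA 9d baseline in the table above can be computed along the following lines (a sketch using scikit-learn's LinearDiscriminantAnalysis; evaluate_svm as in the previous examples):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Plain LDA projects to at most n_classes - 1 dimensions
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
train_lda = lda.fit_transform(train_data[:n_train, :], train_labels[:n_train])
acc_lda = evaluate_svm(train_lda, train_labels[:n_train], lda.transform(test_data), test_labels)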

Providing out-of-sample extensions

In linear_outofsample.py and kernel_outofsample.py we use the SEF to provide (linear and kernel) out-of-sample extensions for the ISOMAP technique. Note that, unlike the regression-based method, the SEF is not limited to the dimensionality of the original technique:

from sklearn.manifold import Isomap  # assumed import; np and LinearSEF as in the previous example

# Learn the in-sample ISOMAP embedding
isomap = Isomap(n_components=10, n_neighbors=20)
train_data_isomap = np.float32(isomap.fit_transform(train_data[:n_train_samples, :]))

# Train a linear SEF to mimic the ISOMAP embedding
proj = LinearSEF(train_data.shape[1], output_dimensionality=10)
proj.cuda()
loss = proj.fit(data=train_data[:n_train_samples, :], target_data=train_data_isomap, target='copy', epochs=50, batch_size=128, verbose=True, learning_rate=0.001, regularizer_weight=0.001)
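
After fitting, the learned linear map readily projects samples that ISOMAP itself never saw, which is exactly the out-of-sample extension; a minimal sketch of its use (evaluate_svm as in the previous examples) is:

# Project the training data and the unseen test data with the learned mapping
train_embedded = proj.transform(train_data[:n_train_samples, :])
test_embedded = proj.transform(test_data)  # out-of-sample points
acc = evaluate_svm(train_embedded, train_labels[:n_train_samples], test_embedded, test_labels)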

The results are shown in the following tables:

Method             Dimensionality   Accuracy
Linear Regression  10d              85.25%
Linear SEF         10d              85.76%
Linear SEF         20d              89.48%

Method             Dimensionality   Accuracy
Kernel Regression  10d              89.48%
Kernel SEF         10d              88.60%
Kernel SEF         20d              90.88%

Performing SVM-based analysis

Finally, in svm_approximation.py an SVM-based analysis technique that mimics the similarity induced by the hyperplanes of the 1-vs-1 SVMs is used to perform dimensionality reduction. This method allows a lightweight classifier, such as the Nearest Class Centroid (NCC) classifier, to perform fast classification:

# Assumed imports (scikit-learn; np and LinearSEF as in the previous examples)
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# Learn an SVM
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

parameters = {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
model = GridSearchCV(svm.SVC(max_iter=10000, decision_function_shape='ovo'), parameters, n_jobs=-1, cv=3)
model.fit(train_data[:n_train], train_labels[:n_train])

# Learn a similarity embedding that mimics the 1-vs-1 SVM similarity
params = {'model': model, 'n_labels': np.unique(train_labels).shape[0], 'scaler': scaler}
proj = LinearSEF(train_data.shape[1], output_dimensionality=dims)
proj.cuda()
loss = proj.fit(data=train_data[:n_train, :], target_data=train_data[:n_train, :], target_labels=train_labels[:n_train], target='svm', target_params=params, epochs=50, learning_rate=0.001, batch_size=128, verbose=True, regularizer_weight=0.001)

This code repeatedly calls the SVM to calculate the similarity matrix for the samples in each batch. If the whole similarity matrix fits into memory, this process can be sped up by using a precomputed similarity matrix as follows:

from sef_dr.targets import generate_svm_similarity_matrix, sim_target_svm_precomputed

# Precompute the similarity matrix
Gt = generate_svm_similarity_matrix(train_data, train_labels, len(np.unique(train_labels)), model, scaler)
params = {'Gt': Gt}

proj = LinearSEF(train_data.shape[1], output_dimensionality=dims)
proj.cuda()
loss = proj.fit(data=train_data, target_data=train_data, target_labels=train_labels, target=sim_target_svm_precomputed, target_params=params, epochs=50, learning_rate=0.001, batch_size=128, verbose=True, regularizer_weight=0.001)
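
Since the goal of this analysis is fast classification with a lightweight classifier, the learned projection can then be evaluated with an NCC; a minimal sketch using scikit-learn's NearestCentroid (the exact evaluation protocol behind the table below is assumed) is:

from sklearn.neighbors import NearestCentroid

# Fit a Nearest Class Centroid classifier on the low-dimensional projections
ncc = NearestCentroid()
ncc.fit(proj.transform(train_data), train_labels)
acc_ncc = ncc.score(proj.transform(test_data), test_labels)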

The results are shown in the following table:

Method Dimensionality Accuracy
NCC - Original 784d 80.84%
NCC - Linear SEF 10d 86.50%
NCC - Linear SEF 20d 86.67%