Machine Learning: Feature Selection and Extraction with Examples (2024)

Introduction

It is always worth putting more time and effort into understanding the dataset you are dealing with. Selecting a machine learning algorithm without deeply understanding the dataset is working blindfolded, and very likely ends in frustration and wasted time.

Dataset cleansing, feature selection, and feature extraction are the steps to achieving this understanding.

Feature Selection

Machine learning is about extracting target-related information from the given feature sets. Given a feature dataset and a target, only the features that contribute to the target are relevant in the machine learning process. Irrelevant features not only waste computing resources but also introduce unnecessary noise. An example of feature selection is described in this article.

Correlation analysis is key to eliminating irrelevant features. Here are the criteria (a sketch of all three checks follows the list):

  • A feature should not be a constant; it should have a certain level of variance.
  • A feature should be correlated with the target, or it does not contribute anything to the target estimation.
  • Features should not be highly correlated with each other, or one of them offers no additional information over the others and only adds sampling noise.
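
Below is a minimal sketch of all three checks, assuming a pandas DataFrame X and a target Series y (hypothetical names) and illustrative threshold values:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# 1. Drop (near-)constant features.
selector = VarianceThreshold(threshold=0.01)
X_var = pd.DataFrame(selector.fit_transform(X), columns=X.columns[selector.get_support()])

# 2. Keep only features with some correlation to the target.
target_corr = X_var.corrwith(y).abs()
X_rel = X_var[target_corr[target_corr > 0.1].index]

# 3. For each highly correlated pair of features, drop one of the two.
corr = X_rel.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_final = X_rel.drop(columns=to_drop)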

There are a couple of tools in the sklearn module for this; please refer to this page for more details:

https://scikit-learn.org/stable/modules/feature_selection.html

In addition, Lasso regression can also eliminate irrelevant features during the model training process, but it is limited to linear estimators.
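
As a minimal sketch of this idea (assuming numeric X and y as above, with an illustrative alpha):

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# L1 regularization drives the coefficients of irrelevant features to zero.
lasso = Lasso(alpha=0.1)
selector = SelectFromModel(lasso).fit(X, y)
X_selected = selector.transform(X)  # keeps only features with nonzero coefficients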

Feature Extraction

This is one step beyond feature selection. To make machine learning effective and responsive, we want a smaller feature dimension, with each feature contributing more to the estimation target. Feature extraction is a transformation that produces a new set of features which

  • Have a smaller dimension
  • Have maximum correlation with the target

For linear systems, PCA and ICA are typical algorithms; for nonlinear systems, there are a variety of manifold-based algorithms and kernelized ICA. Text and image datasets often have large feature dimensions with highly correlated features, where deep-learning-based embeddings or CNN- and RNN-based algorithms fit well.
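
For instance, here is a minimal PCA sketch, assuming a numeric feature matrix X (hypothetical name) and an illustrative target dimension:

from sklearn.decomposition import PCA

pca = PCA(n_components=10)                   # target dimension, chosen for illustration
X_reduced = pca.fit_transform(X)             # new features = top 10 principal components
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained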

One thing worth mentioning: in most cases, feature extraction is part of the core machine learning model itself. Whether to run feature extraction as a separate pipeline stage depends on the data collection, storage, and processing infrastructure, and also on engineering and business requirements.

Manifold Example

We use the famous digits dataset from the sklearn module in this example.

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Class distribution of the 10 digit labels
plt.hist(digits.target, histtype='barstacked', rwidth=0.8)

for key in digits.keys():
    print(key)
print(digits.data.shape)     # (1797, 64): 1797 samples, 64 pixel features
print(digits.feature_names)

Let's normalize it and apply a GaussianNB estimator.

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Scale each pixel feature into [0, 1]
norm_model = MinMaxScaler()
data_norm = norm_model.fit_transform(digits.data)

X_train, X_test, y_train, y_test = train_test_split(
    data_norm, digits.target, train_size=0.7, random_state=41)

bay = GaussianNB()
bay.fit(X_train, y_train)
y_model1 = bay.predict(X_test)
print(accuracy_score(y_test, y_model1))

mat1 = confusion_matrix(y_test, y_model1)
sns.heatmap(mat1, annot=True, cbar=False)
[Figure: confusion matrix of the GaussianNB predictions]

We get an accuracy rate of about 80%, which is not bad for this simple and fast GaussianNB.

Here we want to add feature extraction to reduce the feature dimension and improve accuracy. Let's visualize the dataset first.

# Show the first 100 digit images with their labels
fig, axes = plt.subplots(10, 10, figsize=(5, 5),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
[Figure: the first 100 digit images with their labels]

As we can see from the digit images, dots cluster together to form a number, so manifold learning seems a good fit for this situation. Let's use Isomap to reduce the feature dimension to 15.

from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# Isomap reduces the 64 pixel features to 15 components before GaussianNB
model = make_pipeline(Isomap(n_components=15), GaussianNB())
model.fit(X_train, y_train)
y_model = model.predict(X_test)
print(accuracy_score(y_test, y_model))

mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, annot=True, cbar=False)
[Figure: confusion matrix of the Isomap + GaussianNB pipeline]

As we can see, the feature dimension goes down from 64 to 15, and at the same time the estimation accuracy improves from 80% to 97%.

As a note, the hyperparameter n_components = 15 was selected with GridSearchCV(). The code is as follows:

from sklearn.model_selection import GridSearchCV

model = make_pipeline(Isomap(), GaussianNB())
grid = GridSearchCV(model,
                    param_grid={'isomap__n_components': [2, 5, 7, 9, 10, 12, 15, 20, 30]},
                    cv=7)
grid.fit(X_train, y_train)
print(grid.best_params_)   # {'isomap__n_components': 15}
[Figure: cross-validation score versus n_components]

The chart shows 15 is the best number before the model starts to overfit.

VAE Example

Deep learning models work on both linear and nonlinear data. For highly correlated feature sets (like text and images), CNNs and RNNs can dramatically reduce the feature dimension by learning patterns among features.

In this example, 3 portrait photos are compressed into 4-D vectors, and later recovered back into images accurately. To make it more interesting, I am using a VAE model instead. Instead of outputting vectors as a plain autoencoder (AE) does, a VAE outputs a Gaussian distribution from which vectors can be sampled. By sampling vectors from the output distributions and decoding them back into images, we can see images transition from one to another.
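
The core of this is the VAE sampling step (the reparameterization trick). Here is a minimal sketch assuming TensorFlow; the names z_mean and z_log_var follow the article, while the function itself is illustrative:

import tensorflow as tf

def sample_z(z_mean, z_log_var):
    # Sample a latent vector from N(z_mean, exp(z_log_var)) in a
    # differentiable way: z = mean + sigma * epsilon
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon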

Here are the TensorBoard scalar training curves:

[Figures: TensorBoard training-loss curves]

After training, we have the following vectors decoded into images.

original images → vectors → decoded images:

[Figure: original images, their 4-D vectors, and the decoded images]

Since the model is trained on the same 3 images, this extreme result doesn't have practical value; it merely showcases how far deep learning can go in the area of feature reduction.

Interestingly, we can sample vectors from the (z_mean, z_log_var) distribution to get some blended images:

[Figure: blended images sampled from the latent distribution]
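
Here is a minimal sketch of how such blending can be done, assuming a trained Keras encoder that returns (z_mean, z_log_var) and a decoder that maps a 4-D latent vector back to an image (both hypothetical names):

import numpy as np

z_a, _ = encoder(img_a[None, ...])   # latent mean of the first portrait
z_b, _ = encoder(img_b[None, ...])   # latent mean of the second portrait

# Decode points along the line between the two latent means.
for t in np.linspace(0.0, 1.0, 8):
    z = (1.0 - t) * z_a + t * z_b
    blended = decoder(z)             # morphs gradually from image A to image B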

The project notebook is available here.

