Feature Selection Using SelectFromModel

SelectFromModel

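scikit-learn's feature_selection module includes SelectFromModel, which selects features based on the importance scores reported by a model. The name describes the job exactly: it selects (features) from a model.
SelectFromModel is a meta-transformer that works with any estimator exposing a coef_ or feature_importances_ attribute after fitting. Features whose coef_ or feature_importances_ value falls below the given threshold are considered unimportant and are removed. Besides a numeric threshold, you can pass a string to use a built-in heuristic for finding a suitable threshold. The available heuristics are "mean", "median", and either of these multiplied by a float (for example, "0.1*mean").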
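The threshold heuristics described above can be tried out with a small sketch. The choice of the iris data, a RandomForestClassifier, and the specific threshold values are assumptions made for illustration only; the sketch refits the selector for each threshold and reads back the resolved value from threshold_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

resolved = {}  # threshold argument -> numeric threshold actually used
kept = {}      # threshold argument -> number of features kept
for threshold in ("mean", "median", "0.1*mean", 0.3):
    # A fixed random_state gives every selector the same importances.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold,
    ).fit(X, y)
    importances = selector.estimator_.feature_importances_
    resolved[threshold] = float(selector.threshold_)
    kept[threshold] = int(selector.get_support().sum())
    print(f"{threshold!s:>9}: threshold_={resolved[threshold]:.4f}, "
          f"kept {kept[threshold]} of {X.shape[1]} features")
```

A string threshold is resolved against the fitted importances, so the same string can yield different numeric thresholds on different data, while a float is used as given.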

Depending on the base learner, there are two main choices for the estimator:

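The first is L1-based feature selection. Linear models with L1 regularization yield sparse solutions, meaning many coefficients are exactly zero. When the goal is dimensionality reduction, you can use any scikit-learn linear model that supports L1 regularization, such as LinearSVC, LogisticRegression, or Lasso. Note that for SVMs and logistic regression the parameter C controls sparsity: the smaller C is, the fewer features are selected. For Lasso, the larger alpha is, the fewer features are selected.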
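To see how C and alpha steer sparsity, here is a minimal sketch. The datasets (iris and a synthetic make_regression problem), the value grids, and the explicit threshold of 1e-5 are assumptions chosen for illustration; the sketch only counts how many features survive at each setting.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC

# Classification: a smaller C means stronger L1 regularization.
X_clf, y_clf = load_iris(return_X_y=True)
kept_by_C = {}
for C in (0.001, 0.01, 1.0):
    svc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000, random_state=0)
    selector = SelectFromModel(svc, threshold=1e-5).fit(X_clf, y_clf)
    kept_by_C[C] = int(selector.get_support().sum())

# Regression: a larger alpha means stronger L1 regularization.
# Synthetic data (illustrative): 20 features, 5 of them informative.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)
kept_by_alpha = {}
for alpha in (0.01, 1.0, 10.0, 100.0):
    lasso = Lasso(alpha=alpha, max_iter=10000)
    selector = SelectFromModel(lasso, threshold=1e-5).fit(X_reg, y_reg)
    kept_by_alpha[alpha] = int(selector.get_support().sum())

print("features kept per C:    ", kept_by_C)
print("features kept per alpha:", kept_by_alpha)
```

The counts should shrink as C decreases or alpha increases, although the exact numbers depend on the data and solver.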

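The second is tree-based feature selection. Tree algorithms, including decision trees and random forests, compute an importance for each feature during training. This feature_importances_ attribute can also be used to select features.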

Whichever learner you choose, keep in mind that the final goal of feature selection is to end up with the best features, not to follow one particular method.

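Some important parameters, attributes, and methods

threshold: string or float, optional, default None
- Accepts "median" or "mean", or a scaled form such as "1.25*mean".
- If left as None and the estimator's penalty parameter is set to l1, the threshold used is 1e-5; otherwise "mean" is used by default.
prefit: bool, default False. Whether the estimator passed in is already fitted. If True, call transform directly; the selector then cannot be used with cross_val_score, GridSearchCV, or similar utilities that clone the estimator. If False, the estimator is fitted first and then used to transform.
threshold_: the threshold value actually used.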
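The prefit=False case is the one that fits inside cross-validation. The following sketch shows one possible setup; the pipeline layout, the LogisticRegression final step, and the iris data are assumptions for illustration, not something prescribed by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# prefit=False (the default): the selector's estimator is cloned and refit
# on each training fold, so selection never sees the held-out fold.
pipeline = make_pipeline(
    SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0)),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print("CV accuracy per fold:", scores)

# Fit once on all the data to inspect the fitted selector.
pipeline.fit(X, y)
selector = pipeline.named_steps["selectfrommodel"]
print("threshold_ actually used:", selector.threshold_)
print("mask of kept features:   ", selector.get_support())
```

Because threshold is left as None and the LinearSVC uses penalty="l1", threshold_ resolves to the 1e-5 default described in the list above.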

A simple example:

Using L1 for feature selection

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.svm import LinearSVC

    # Load the iris dataset.
    iris = load_iris()
    X, y = iris['data'], iris['target']
    print('X has %s features' % X.shape[1])

    # A small C with an L1 penalty drives some coefficients to zero.
    lsvc = LinearSVC(C=0.01, penalty='l1', dual=False).fit(X, y)
    # prefit=True: lsvc is already fitted, so call transform directly.
    model = SelectFromModel(lsvc, prefit=True)
    X_new = model.transform(X)
    print('X_new has %s features' % X_new.shape[1])

    X has 4 features
    X_new has 3 features

Tree-based feature selection

    from sklearn.ensemble import ExtraTreesClassifier

    # X and y are the iris data loaded in the previous example.
    clf = ExtraTreesClassifier().fit(X, y)
    print('clf.feature_importances_ :', clf.feature_importances_)

    # No threshold given: the mean of the feature importances is used.
    model_2 = SelectFromModel(clf, prefit=True)
    X_new_2 = model_2.transform(X)
    print('X_new_2 has %s features' % X_new_2.shape[1])

    # Explicit numeric threshold.
    model_3 = SelectFromModel(clf, prefit=True, threshold=0.15)
    X_new_3 = model_3.transform(X)
    print('The threshold of the model is: %s' % model_3.threshold)
    print('X_new_3 has %s features' % X_new_3.shape[1])

    clf.feature_importances_ : [0.14016636 0.06062787 0.47708914 0.32211664]
    X_new_2 has 2 features
    The threshold of the model is: 0.15
    X_new_3 has 2 features

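More examples

Feature selection does not necessarily improve performance, and this holds for every feature selection method.

The example below is scikit-learn's own example (Feature selection using SelectFromModel and LassoCV) with a few changes of mine; it makes the point clear at a glance.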

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import cross_val_score

    # Load the boston dataset.
    # Note: load_boston was removed in scikit-learn 1.2, so this runs only on older releases.
    boston = load_boston()
    X, y = boston['data'], boston['target']

    # We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
    clf = LassoCV()

    # Start from a threshold of 0.0, which keeps every feature.
    sfm = SelectFromModel(clf, threshold=0.0)
    sfm.fit(X, y)
    n_features = sfm.transform(X).shape[1]


    def GetCVScore(estimator, X, y):
        # Mean 5-fold cross-validation score of the given estimator on (X, y).
        return cross_val_score(estimator, X=X, y=y, cv=5).mean()


    # Raise the threshold until the number of features equals two.
    # Note that the attribute can be set directly instead of repeatedly
    # fitting the metatransformer.
    nested_scores = []
    n_features_list = []
    while n_features > 2:
        sfm.threshold += 0.01
        X_transform = sfm.transform(X)
        n_features = X_transform.shape[1]
        nested_score = GetCVScore(estimator=clf, X=X_transform, y=y)
        nested_scores.append(nested_score)
        n_features_list.append(n_features)
        # print('nested_score: %s' % nested_score)
        # print('n_features: %s' % n_features)
        # print('threshold: %s' % sfm.threshold)

    # Plot the selected two features from X.
    plt.title(
        'Features selected from Boston using SelectFromModel with '
        'threshold %0.3f.' % sfm.threshold)
    feature1 = X_transform[:, 0]
    feature2 = X_transform[:, 1]
    plt.plot(feature1, feature2, 'r.')
    plt.xlabel('Feature number 1')
    plt.ylabel('Feature number 2')
    plt.ylim([np.min(feature2), np.max(feature2)])
    plt.show()

    # Plot the CV score against the number of selected features.
    plt.scatter(n_features_list, nested_scores, c='b', marker='.', label='Selected')
    plt.scatter(X.shape[1], GetCVScore(estimator=clf, X=X, y=y), c='r', marker='*', label='old feature')
    plt.title('The reduction of features does not necessarily bring up the performance of the model')
    plt.xlabel('number of features')
    plt.ylabel('score of model')
    plt.legend()
    plt.show()

Lasso with SelectFromModel

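The first part of the example above shows how to use SelectFromModel together with Lasso. The part I added afterwards shows that reducing the number of features does not necessarily improve model performance.

Feature selection does not always raise the score. If every feature is useful, removing features can only lower model performance. Even when not all of them are useful, removing the features that score as less important does not reliably improve the model, because a particular importance measure does not capture a feature's final effect on the model. In many cases our importance measures are only a reference.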
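Since load_boston is no longer shipped with recent scikit-learn releases, the same comparison can be sketched on another bundled dataset. The use of load_diabetes, a plain LinearRegression as the final model, and the particular threshold strings are all assumptions made for illustration; the point is only to compare cross-validated scores with and without selection.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

results = {}
# Baseline: a linear model on all features, no selection.
results["all features"] = cross_val_score(LinearRegression(), X, y, cv=5).mean()

# Same model after selecting features with increasingly strict thresholds.
for threshold in ("median", "mean", "1.5*mean"):
    model = make_pipeline(
        SelectFromModel(LassoCV(cv=5), threshold=threshold),
        LinearRegression(),
    )
    results[threshold] = cross_val_score(model, X, y, cv=5).mean()

for name, score in results.items():
    print(f"{name:>12}: mean R^2 = {score:.3f}")
```

Whether any of the reduced feature sets beats the baseline depends on the data, which is why the scores should be compared rather than assumed.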
