Feature Extraction
<aside>
💡 sklearn.feature_extraction
</aside>
DictVectorizer
Converts a list of mappings (dict-like objects) from feature names to feature values into a matrix.
FeatureHasher
- High-speed, low-memory vectorizer that uses the feature hashing technique (also known as the hashing trick).
- Instead of building a hash table of the features, as the vectorizers do, it applies a hash function to the features to determine their column index in sample matrices directly.
- This results in increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
- The output is a scipy.sparse matrix.
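The points above can be sketched as follows. The `records` data and the small `n_features=8` are assumptions made for illustration; in practice a much larger number of columns is typically used to keep hash collisions rare.

```python
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher

# Hypothetical sample data: feature name -> value, one dict per example.
records = [
    {"dog": 1, "cat": 2, "elephant": 4},
    {"dog": 2, "run": 5},
]

# n_features fixes the number of output columns up front;
# no vocabulary is learned, so transform works without a prior fit.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(records)

print(type(X))      # a scipy.sparse matrix type
print(X.shape)      # (2, 8)
print(X.toarray())  # dense view, for inspection only

# The hasher keeps no record of the original feature names.
print(hasattr(hasher, "inverse_transform"))  # False
```

Since the column index comes straight from the hash function, two different features can land in the same column. That is the price paid for the lower memory usage.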
Data Cleaning
Handling missing values
<aside>
💡 sklearn.impute
</aside>
SimpleImputer
Fills missing values column by column using one of the following strategies:
'mean', 'median', 'most_frequent' or 'constant'.
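A short sketch of two of these strategies on a made-up array. The data and the `fill_value=0.0` choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# 'mean' replaces each NaN with the mean of the observed values in its column.
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# 'constant' replaces every NaN with the given fill_value.
const_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_const = const_imputer.fit_transform(X)

print(X_mean)
print(X_const)
```

The `'median'` and `'most_frequent'` strategies are selected the same way, through the `strategy` argument.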
KNNImputer
- Uses the k-nearest neighbours approach to fill missing values in a dataset.
- A missing value of an attribute in a given example is filled with the mean value of that attribute over the n_neighbors closest neighbours that have a value for it.
- By default, the nearest neighbours are chosen using a Euclidean distance that skips missing coordinates (metric='nan_euclidean').
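The behaviour described above can be sketched on a small made-up array. The data and `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: rows are examples, columns are attributes.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is replaced by the mean of that column over the
# 2 nearest rows that have a value in that column.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

For the first row, the two nearest neighbours with a value in the third column are the second and third rows, so the gap is filled with (3 + 5) / 2 = 4. For the third row, the nearest donors for the first column are the second and fourth rows, giving (3 + 8) / 2 = 5.5.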