scikit-learn builds on NumPy and SciPy: it relies on SciPy for sparse matrices, distance computations, and (in some estimators) optimization routines. Before fitting models on the AI track, export the feature matrix with X = df.to_numpy(); X should have shape (n_samples, n_features).
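A minimal sketch of the export step described above. The DataFrame and its column names are hypothetical and exist only to make the shape convention concrete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [170, 182, 165, 174],
    "weight_kg": [68.5, 90.0, 55.2, 72.1],
})

# Export to a float matrix: rows are samples, columns are features.
X = df.to_numpy(dtype=float)
n_samples, n_features = X.shape
print(X.shape, X.dtype)
```

Passing dtype=float here is a choice made for the sketch: it forces the integer column into the same float matrix as the other one.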
Shared foundations
- scikit-learn and SciPy both work on numeric arrays, usually float
- Text vectorizers return SciPy sparse matrices (CSR format)
- Split into train/test before fitting scalers; the same leakage rules apply as in Pandas pipelines
- Standardize with sklearn; run hypothesis tests on residuals with scipy.stats
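To illustrate the sparse-format point in the list above, here is a small sketch using CountVectorizer as one example of a text vectorizer. The three documents are made up for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only.
docs = ["numpy arrays", "scipy sparse arrays", "sklearn uses scipy"]

vec = CountVectorizer()
X_text = vec.fit_transform(docs)  # SciPy sparse matrix, not a dense ndarray

print(sparse.issparse(X_text), X_text.format, X_text.shape)
```

The result still follows the (n_samples, n_features) convention: one row per document, one column per vocabulary term.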
Workflow
Pandas clean → NumPy feature matrix → sklearn fit → SciPy tests on residuals or subgroup metrics for model monitoring.
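The workflow above could be sketched roughly as follows. The data is synthetic, and LinearRegression and the Shapiro-Wilk test are assumptions picked for illustration; the notes do not prescribe a particular model or test.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=200)
df.loc[0, "x1"] = np.nan  # inject one missing value to clean up

# 1. Pandas clean
clean = df.dropna()

# 2. NumPy feature matrix and target vector
X = clean[["x1", "x2"]].to_numpy(dtype=float)
y = clean["y"].to_numpy(dtype=float)

# 3. sklearn fit (train split only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# 4. SciPy test on held-out residuals (normality check as one example)
residuals = y_test - model.predict(X_test)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W={stat:.3f}, p={p_value:.3f}")
```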
Distance example
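import numpy as np
from scipy.spatial.distance import cdist

# Three 2-D points, one per row: shape (n_samples, n_features) = (3, 2)
X = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)

# Pairwise Euclidean distances between all rows; D has shape (3, 3)
D = cdist(X, X, metric='euclidean')
print(D)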
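As a cross-check, the same matrix can be computed with scikit-learn's pairwise_distances. This sketch only compares the two results on the toy points above; it makes no claim about how either library computes them internally.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

X = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)

D_scipy = cdist(X, X, metric="euclidean")
D_sklearn = pairwise_distances(X, metric="euclidean")

# Both are (3, 3) symmetric matrices with zeros on the diagonal.
print(np.allclose(D_scipy, D_sklearn))
```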
Important interview questions and answers
- Q: X shape convention?
A: (n_samples, n_features); rows are observations, columns are features.
- Q: SciPy in sklearn?
A: sklearn uses it internally (sparse linear algebra, stats); you still call scipy.stats explicitly for formal inference.
Self-check
- What shape should X have for sklearn?
- Name one SciPy module sklearn may use internally.
Pitfall: Fit scalers on train split only—same leakage rule as Pandas ML pipelines.
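A minimal sketch of the pitfall above, using StandardScaler on synthetic data. The scaler learns its mean and standard deviation from the training rows only and then applies those same statistics to the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first, so test rows never influence the fitted statistics.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)      # fit on train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse train statistics

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero
print(X_test_scaled.mean(axis=0).round(3))   # close to zero, not exactly
```

Calling fit or fit_transform on the full X before splitting would let test-set statistics leak into the preprocessing step.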
Interview prep
- Q: X shape?
A: (n_samples, n_features) float matrix for sklearn.
- Q: Leakage?
A: Fit preprocessors on train only.