NumPy performance comes from contiguous memory, ufuncs, and optional BLAS/LAPACK backends. Profile before micro-optimizing; vectorization beats Python loops by orders of magnitude.
Anti-patterns
- Python
forloops over elements - Repeated allocation inside loops
- Object dtype arrays
- Excessive fancy indexing in hot paths
Best practices
- Express logic with ufuncs and axis reductions
- Preallocate output with
np.emptyandout=parameter - Use broadcasting instead of tile/repeat when possible
- Consider float32 if precision allows
Timing sketch
import numpy as np
a = np.arange(1_000_000)
print('vectorized sum:', a.sum())
print('prefer this over sum(a[i] for i in range(len(a)))')
Important interview questions and answers
- Q: Why loops slow?
A: Python per-element overhead; NumPy batches in C. - Q: BLAS?
A: Basic Linear Algebra Subprograms—optimized matmul behind @ operator.
Self-check
- Name two vectorization best practices.
- Why avoid object dtype for numeric work?
Pitfall: Object dtype arrays disable fast vectorized paths.
Interview prep
- Anti-pattern?
Python for-loop over millions of elements.
- BLAS?
Optimized linear algebra behind @ operator.