Parquet is a columnar binary format that preserves dtypes, supports compression, and usually loads faster than CSV. Use to_parquet and read_parquet in production pipelines; both need pyarrow or fastparquet installed locally.
Parquet vs CSV
| Feature | CSV | Parquet |
|---|---|---|
| Schema | Inferred each read | Embedded types |
| Size | Text, large | Compressed binary |
| Speed | Slow parse | Fast column reads |
| Human readable | Yes | No |
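
The Schema row can be checked with a small round trip. The sketch below is illustrative only: the `when` column and the in-memory buffers are assumptions chosen so that nothing touches disk, and the Parquet half is skipped when pyarrow is not installed.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),  # hypothetical column
})

# CSV round trip: the file carries no type information, so the datetime
# column is read back as text unless parse_dates is passed.
from_csv = pd.read_csv(io.StringIO(df.to_csv(index=False)))
csv_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["when"])
print("CSV keeps datetime dtype:", csv_keeps_datetime)

# Parquet round trip through an in-memory buffer (requires pyarrow).
try:
    buf = io.BytesIO()
    df.to_parquet(buf, engine="pyarrow", index=False)
    buf.seek(0)
    from_parquet = pd.read_parquet(buf, engine="pyarrow")
    parquet_keeps_datetime = pd.api.types.is_datetime64_any_dtype(from_parquet["when"])
except ImportError:
    parquet_keeps_datetime = None  # pyarrow missing: comparison skipped
print("Parquet keeps datetime dtype:", parquet_keeps_datetime)
```

Passing parse_dates to read_csv would recover the dates, but the reader has to know to ask for it; with Parquet the type is stored in the file.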
Conceptual API
import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'val': [1.5, 2.5]})
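# Run the next two lines locally (they need pyarrow or fastparquet and a writable disk):
# df.to_parquet('data.parquet', index=False)
# df2 = pd.read_parquet('data.parquet')
print(df.dtypes)  # id is int64, val is float64; Parquet stores these types in the file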
Playground note
This playground has no persistent disk, so read through the API here and run the Parquet I/O on your own machine after pip install pyarrow.
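Before running Parquet I/O locally, it can help to confirm that an engine is installed. The following is a minimal sketch using only the standard library; the helper name `available_parquet_engines` is hypothetical and not part of pandas.

```python
import importlib.util

def available_parquet_engines():
    """Return the installed packages that pandas can use as Parquet engines."""
    candidates = ("pyarrow", "fastparquet")
    # find_spec returns None when a top-level package is not installed.
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

engines = available_parquet_engines()
if engines:
    print("Parquet engines found:", ", ".join(engines))
else:
    print("No Parquet engine found; try: pip install pyarrow")
```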
Important interview questions and answers
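- Q: Why index=False?
A: It leaves the index out of the file, as with CSV. With the default index=None, pandas keeps a RangeIndex as metadata rather than as a column; any other index type is written as columns.
- Q: Column pruning?
A: Parquet readers can load a subset of columns, which is efficient for wide tables.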
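Column pruning can be tried with the columns parameter of read_parquet. This is a sketch under assumptions: the three-column frame is made up for illustration, the file lives in an in-memory buffer instead of on disk, and the example is skipped if pyarrow is not installed.

```python
import io
import pandas as pd

# Hypothetical table standing in for a wide dataset.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "val": [1.5, 2.5, 3.5],
    "note": ["a", "b", "c"],
})

try:
    buf = io.BytesIO()
    wide.to_parquet(buf, engine="pyarrow", index=False)
    buf.seek(0)
    # Only the requested columns are read; "note" is never loaded.
    subset = pd.read_parquet(buf, engine="pyarrow", columns=["id", "val"])
    loaded_columns = list(subset.columns)
except ImportError:
    loaded_columns = None  # pyarrow missing: example skipped
print(loaded_columns)
```

On a three-column frame the saving is negligible; the benefit shows up on tables with many columns, where most of the file can be left unread.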
Self-check
- Name two advantages of Parquet over CSV.
- Which engine packages enable Parquet in pandas?
Tip: Practice to_parquet and read_parquet locally after running pip install pyarrow.
Interview prep
- Parquet vs CSV?
Parquet: typed, compressed, columnar; CSV: human-readable text.
- pyarrow?
The engine pandas tries first for read_parquet and to_parquet; fastparquet is the fallback.