Multimodal models accept images, audio, or video alongside text—useful for support screenshots, diagram Q&A, and accessibility.
Use cases
- Upload UI bug screenshot → steps to reproduce
- Invoice image → structured JSON (with validation)
- Alt-text generation for images
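The "with validation" step in the invoice use case can be sketched as follows. This is a minimal illustration, not a definitive implementation: the field names (`invoice_number`, `currency`, `line_items`, `total`, `amount`) are a hypothetical schema, and the model call that produces the JSON string is assumed to happen elsewhere.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("invoice_number", "currency", "line_items", "total")


def parse_invoice(model_output: str) -> dict:
    """Parse and validate model output; raise ValueError on any problem."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["line_items"], list) or not data["line_items"]:
        raise ValueError("line_items must be a non-empty list")
    try:
        # Convert via str so Decimal avoids binary float rounding.
        items_sum = sum(Decimal(str(item["amount"])) for item in data["line_items"])
        total = Decimal(str(data["total"]))
    except (KeyError, TypeError, InvalidOperation) as exc:
        raise ValueError(f"bad amount: {exc!r}") from exc
    if not (items_sum.is_finite() and total.is_finite()):
        raise ValueError("amounts must be finite numbers")
    # Cross-check: a misread digit usually breaks the arithmetic.
    if items_sum != total:
        raise ValueError(f"line items sum to {items_sum}, but total is {total}")
    return data
```

Rejecting output whose line items do not add up to the stated total catches many misread digits before they reach downstream systems.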
Costs and limits
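Vision tokens are expensive, and the count generally grows with image resolution: resize images and crop to the relevant region before upload. Also redact faces and serial numbers, which limits PII exposure.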
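As a rough sketch of the resize step, the helper below computes target dimensions that fit a maximum side length while preserving aspect ratio. The function name `fit_within` and the default of 1024 pixels are assumptions for illustration; the right limit depends on the model and the task.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale: that adds cost without adding detail
    # Multiply before dividing to keep the arithmetic exact where possible.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


print(fit_within(3840, 2160))  # (1024, 576)
```

The actual pixel resampling would be done with an image library; Pillow's `Image.thumbnail`, for example, shrinks an image in place while preserving aspect ratio.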
Safety
Moderate uploaded media; block CSAM and biometric abuse per policy and law.
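One way to picture an upload gate is sketched below. Everything specific here is an assumption: the MIME allow-list, the 10 MiB size cap, the label names, and the `classify` hook, which stands in for whatever moderation service or classifier a deployment actually uses.

```python
from typing import Callable, Iterable

ALLOWED_MIME = {"image/png", "image/jpeg", "image/webp"}  # assumed allow-list
MAX_BYTES = 10 * 1024 * 1024                               # assumed 10 MiB cap
BLOCKED_LABELS = {"csam", "biometric_abuse"}               # hypothetical label names


def gate_upload(
    data: bytes,
    mime: str,
    classify: Callable[[bytes], Iterable[str]],
) -> tuple[bool, str]:
    """Return (allowed, reason). `classify` is a caller-supplied moderation hook."""
    # Cheap checks first, so the classifier only sees plausible uploads.
    if mime not in ALLOWED_MIME:
        return False, f"unsupported type: {mime}"
    if len(data) > MAX_BYTES:
        return False, "file too large"
    hits = BLOCKED_LABELS & set(classify(data))
    if hits:
        return False, "blocked by moderation: " + ", ".join(sorted(hits))
    return True, "ok"
```

The gate runs before the media reaches the model, so a blocked upload never incurs vision tokens or enters logs downstream.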
Important interview questions and answers
- Q: Why resize images?
A: Reduces tokens, latency, and accidental PII exposure.
Self-check
- Name two multimodal use cases.
- Name one cost control.
Pitfall: Uploading full-resolution screenshots. Resize them and redact serial numbers first.
Interview prep
- Q: Why resize images?
A: It cuts vision tokens, latency, and accidental PII in pixels.
- Q: Why moderate uploads?
A: To block abusive or illegal media per policy and law.