Safety means reducing harmful, biased, or policy-violating outputs while keeping the product useful. It takes alignment training, runtime filters, and product design working together.
Layers
- Pretraining data curation (vendor)
- RLHF / preference tuning (vendor)
- System policies and refusals (your prompts)
- Moderation APIs and blocklists (your stack)
- Human review for edge cases
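The layers you control (blocklists, moderation, human review) can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: `BLOCKLIST`, `run_pipeline`, the `model` and `moderation_score` callables, and the two thresholds are all hypothetical stand-ins for whatever model client and moderation service your stack actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical blocklist terms, for illustration only.
BLOCKLIST = {"forbidden-term"}


@dataclass
class Decision:
    action: str      # "allow", "block", or "review"
    layer: str       # which layer made the call
    text: str = ""   # model output, when it is safe to pass along


def blocklist_hit(text: str) -> bool:
    """Return True if any blocklisted term appears in the text."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def run_pipeline(
    user_input: str,
    model: Callable[[str], str],               # assumed: wraps the vendor model call
    moderation_score: Callable[[str], float],  # assumed: returns a risk score in [0, 1]
    block_at: float = 0.9,                     # assumed threshold, tune per policy
    review_at: float = 0.5,                    # assumed threshold, tune per policy
) -> Decision:
    # Layer 1: cheap blocklist check before the model is called.
    if blocklist_hit(user_input):
        return Decision("block", "blocklist")

    # Layer 2: the vendor-aligned model generates a response.
    output = model(user_input)

    # Layer 3: moderation scores the output.
    score = moderation_score(output)
    if score >= block_at:
        return Decision("block", "moderation")

    # Layer 4: uncertain cases go to a human instead of the user.
    if score >= review_at:
        return Decision("review", "human_review", output)

    return Decision("allow", "none", output)
```

Calling `run_pipeline("hello", lambda s: "hi", lambda s: 0.1)` with these stub callables returns an "allow" decision. The point of the structure is that each layer can stop a request independently, so a miss in one layer can still be caught by the next.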
Policy examples
Refuse requests for illegal activity, avoid giving medical or legal advice without disclaimers, block hate speech and harassment, and protect minors. Tailor each policy to your jurisdiction and industry.
Trade-offs
Over-refusal frustrates users; under-refusal creates liability. Measure both task success and safety incidents.
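One way to track both sides of this trade-off is to compute two rates from a labeled evaluation set. The sketch below assumes each test case carries a reviewed ground-truth label. `EvalCase` and `refusal_rates` are hypothetical names, and the rates are only proxies for task success and safety incidents.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvalCase:
    should_refuse: bool  # ground-truth label from a reviewed test set
    refused: bool        # what the system actually did


def refusal_rates(cases: List[EvalCase]) -> Tuple[float, float]:
    """Return (over_refusal_rate, under_refusal_rate).

    over_refusal_rate: share of benign requests that were refused.
    under_refusal_rate: share of harmful requests that were answered.
    """
    benign = [c for c in cases if not c.should_refuse]
    harmful = [c for c in cases if c.should_refuse]

    over = sum(c.refused for c in benign) / len(benign) if benign else 0.0
    under = sum(not c.refused for c in harmful) / len(harmful) if harmful else 0.0
    return over, under


# Made-up example data: 4 benign requests (1 refused), 2 harmful (1 answered).
cases = [
    EvalCase(should_refuse=False, refused=False),
    EvalCase(should_refuse=False, refused=False),
    EvalCase(should_refuse=False, refused=False),
    EvalCase(should_refuse=False, refused=True),
    EvalCase(should_refuse=True, refused=True),
    EvalCase(should_refuse=True, refused=False),
]
over_rate, under_rate = refusal_rates(cases)
print(over_rate, under_rate)  # 0.25 0.5
```

Reporting the two numbers together makes it harder to improve one by quietly worsening the other.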
Important interview questions and answers
- Q: Is alignment only prompt engineering?
A: No—it's training plus inference-time controls plus UX.
Self-check
- Name three safety layers.
- What is over-refusal?
Tip: Layer vendor alignment, your own policies, and moderation. No single control is enough.
Interview prep
- Q: What are the alignment layers?
A: Vendor training plus your policies, moderation, and human review.
- Q: What is over-refusal?
A: Excessive blocking that harms UX. Balance safety and utility metrics.