Master these 31 carefully curated interview questions to ace your next AI/ML interview.
AI is the broad field of intelligent machines; ML is a subset using data to learn patterns; DL is a subset of ML using neural networks.
AI: any technique enabling machines to mimic human intelligence. ML: algorithms that learn from data without explicit programming (supervised, unsupervised, reinforcement). DL: multi-layer neural networks that learn hierarchical representations. Relationship: AI ⊃ ML ⊃ DL. Examples: AI (Siri), ML (spam filter), DL (image recognition, GPT).
Supervised learning trains on labeled data (input→output pairs); unsupervised learning finds patterns in unlabeled data.
Supervised: classification (spam/not spam), regression (predict price). Algorithms: Linear Regression, Decision Trees, SVM, Neural Networks. Unsupervised: clustering (customer segments), dimensionality reduction (PCA), association rules. Algorithms: K-Means, DBSCAN, Hierarchical Clustering, t-SNE. Semi-supervised: mix of labeled and unlabeled data. Self-supervised: creates labels from data itself (BERT, GPT).
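To make the contrast concrete, here is a library-free sketch on toy 1-D data (all values invented): a supervised nearest-neighbor prediction from labeled pairs, next to an unsupervised two-center k-means that recovers the same grouping without labels.

```python
def nearest_neighbor_predict(train, query):
    """Supervised: labeled (x, label) pairs; return the label of the nearest x."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]

def kmeans_1d(points, c1, c2, steps=10):
    """Unsupervised: no labels; alternately assign points to the closer
    center and move each center to the mean of its assigned points."""
    for _ in range(steps):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if a:
            c1 = sum(a) / len(a)
        if b:
            c2 = sum(b) / len(b)
    return c1, c2

labeled = [(1.0, "spam"), (1.2, "spam"), (8.0, "ham"), (8.3, "ham")]
nearest_neighbor_predict(labeled, 1.1)        # → "spam"
kmeans_1d([1.0, 1.2, 8.0, 8.3], 0.0, 10.0)    # → centers near 1.1 and 8.15
```

The supervised function needs the labels to answer; the unsupervised one only discovers structure, and naming the clusters is left to a human.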
Overfitting is when a model learns training noise instead of patterns, performing well on training but poorly on new data.
Signs: high training accuracy, low test accuracy. Prevention: (1) More training data. (2) Regularization (L1/L2). (3) Cross-validation. (4) Dropout (neural networks). (5) Early stopping. (6) Simpler model. (7) Data augmentation. (8) Ensemble methods. (9) Feature selection. Underfitting: model too simple, poor on both train and test. Balance bias-variance tradeoff.
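Early stopping (point 5) is simple to show in isolation. This minimal sketch assumes you already have a list of per-epoch validation losses: training halts once the loss fails to improve for `patience` epochs, and the best epoch's weights are the ones you keep.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch whose weights we would keep: stop training once
    validation loss fails to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # validation loss is degrading: overfitting has begun
    return best_epoch

# Validation loss improves, then degrades: keep epoch 2's weights.
early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9])  # → 2
```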
Classification: accuracy, precision, recall, F1-score, AUC-ROC. Regression: MSE, RMSE, MAE, R-squared.
Classification: Accuracy (overall correctness), Precision (true positives / predicted positives), Recall (true positives / actual positives), F1 (harmonic mean of precision/recall), AUC-ROC (discrimination ability). Regression: MSE (mean squared error), RMSE (root MSE, same units), MAE (mean absolute error), R² (variance explained). Choose based on business need: medical diagnosis needs high recall; spam filter needs high precision.
A neural network is layers of interconnected nodes (neurons) that learn to transform inputs into outputs through weighted connections.
Architecture: input layer, hidden layers, output layer. Each neuron: weighted sum of inputs → activation function → output. Training: forward pass (predict), loss calculation, backward pass (gradients via backpropagation), weight update (optimizer like SGD, Adam). Activation functions: ReLU (hidden layers), Sigmoid (binary output), Softmax (multi-class). Deep networks: many hidden layers enable learning complex hierarchical features.
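The per-neuron computation above (weighted sum → activation → output) can be sketched in plain Python. The weights and inputs below are arbitrary illustrative values, not a trained network.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias, act):
    """One neuron: weighted sum of inputs plus bias, then activation."""
    return act(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Forward pass: 2 inputs → 2 hidden ReLU neurons → 1 sigmoid output.
x = [1.0, 2.0]
h = [neuron(x, [0.5, -0.25], 0.1, relu),   # hidden unit 1 → 0.1
     neuron(x, [0.3, 0.8], -0.2, relu)]    # hidden unit 2 → 1.7
y = neuron(h, [1.0, -1.0], 0.0, sigmoid)   # output probability ≈ 0.168
```

Training would then compare `y` to a label, compute a loss, and push gradients back through these same weighted sums (backpropagation).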
Bias is error from wrong assumptions (underfitting); variance is error from sensitivity to training data noise (overfitting).
High bias: model too simple, misses patterns (linear model for nonlinear data). High variance: model too complex, memorizes noise. Total error = bias² + variance + irreducible error. Balance: (1) Model complexity vs data size. (2) Regularization reduces variance. (3) Ensemble methods (bagging reduces variance, boosting reduces bias). (4) Cross-validation to estimate both. Goal: minimize total error, not just one component.
CNNs process spatial data (images) using convolutional filters; RNNs process sequential data (text, time series) with memory cells.
CNN: convolutional layers extract spatial features (edges → shapes → objects), pooling layers reduce dimensions. Used for: image classification, object detection, segmentation. RNN: hidden state carries information across time steps. Variants: LSTM (long-term memory), GRU (simpler). Used for: text, speech, time series. Modern: Transformers largely replaced RNNs for NLP (attention mechanism avoids sequential bottleneck). Vision Transformers (ViT) challenging CNNs.
Transfer learning uses a pre-trained model on a large dataset as a starting point, then fine-tunes it for a specific task with less data.
Process: (1) Take model pre-trained on large dataset (ImageNet, Wikipedia). (2) Freeze early layers (general features). (3) Replace final layers for your task. (4) Fine-tune on your smaller dataset. Benefits: less data needed, faster training, better performance. Examples: BERT/GPT fine-tuned for classification, ResNet fine-tuned for medical imaging. Foundation models (GPT-4, DALL-E) are the ultimate transfer learning — trained once, used for many tasks.
Gradient descent optimizes model parameters by iteratively moving in the direction that reduces the loss function.
Variants: (1) Batch GD: uses entire dataset per step (stable but slow). (2) Stochastic GD (SGD): one sample per step (noisy but fast). (3) Mini-batch GD: compromise (most common, batch size 32-256). Optimizers: Momentum (accelerates convergence), RMSprop (adaptive learning rate), Adam (combines momentum + RMSprop, most popular). Learning rate scheduling: warmup, cosine annealing, step decay. Gradient clipping prevents exploding gradients.
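A minimal mini-batch gradient descent sketch, fitting y = 2x with squared error on a toy dataset; the learning rate, batch size, and epoch count are chosen for illustration, not tuned.

```python
import random

def minibatch_gd(data, lr=0.02, epochs=200, batch_size=2):
    """Fit y = w*x by mini-batch gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # stochastic ordering
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # d/dw of mean((w*x - y)^2) over the batch = mean(2*(w*x - y)*x)
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                        # step against the gradient
    return w

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = minibatch_gd(data)   # converges toward the true slope 2.0
```

Setting `batch_size=len(data)` gives batch GD and `batch_size=1` gives SGD, so all three variants are the same loop with different slicing.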
Transformers use self-attention mechanisms to process sequences in parallel, powering models like GPT, BERT, and Vision Transformers.
Architecture: encoder-decoder with multi-head self-attention and feed-forward layers. Self-attention: each token attends to all other tokens (computes relevance scores). Positional encoding adds sequence order. Benefits over RNNs: parallel processing, long-range dependencies, scalable. BERT: encoder-only (understanding). GPT: decoder-only (generation). T5: encoder-decoder (both). Scale: frontier models like GPT-4 are estimated to have on the order of a trillion parameters. Attention complexity: O(n²) — ongoing research to reduce it (Flash Attention, sparse attention).
RAG combines a retrieval system with a generative model, fetching relevant documents to ground the model's responses in factual data.
Architecture: (1) Document ingestion: chunk documents, generate embeddings, store in vector database (Pinecone, Weaviate, FAISS). (2) Query: embed user question, retrieve top-k similar chunks via vector similarity search. (3) Generation: feed retrieved context + question to LLM for grounded response. Benefits: reduces hallucination, uses up-to-date information, auditable sources. Challenges: chunk size optimization, retrieval quality, context window limits. Tools: LangChain, LlamaIndex.
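A toy end-to-end sketch of the retrieve-then-augment flow. Bag-of-words counts stand in for learned embeddings, and the chunk texts and query are invented; a real pipeline would use an embedding model and a vector database instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = ["the refund policy allows returns within 30 days",
          "our offices are closed on public holidays"]
context = retrieve("what is the refund policy", chunks)
prompt = (f"Answer using this context: {context[0]}\n"
          f"Question: what is the refund policy")
# `prompt` would now be sent to the LLM for a grounded answer.
```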
GANs consist of a Generator (creates fake data) and Discriminator (distinguishes real from fake), training adversarially.
Architecture: Generator creates samples from noise, Discriminator classifies real vs fake. Both train simultaneously — Generator improves at fooling Discriminator, Discriminator improves at detecting fakes. Loss: minimax game theory. Variants: DCGAN (convolutional), StyleGAN (high-res faces), CycleGAN (domain transfer), Pix2Pix (paired image translation). Challenges: mode collapse, training instability, evaluation difficulty. Applications: image synthesis, data augmentation, super-resolution. Being replaced by diffusion models for image generation.
Likely data drift, training-serving skew, or feature pipeline differences between training and production environments.
Common causes: (1) Data drift: production data distribution differs from training data. (2) Training-serving skew: feature computation differs. (3) Data leakage: training used future information. (4) Concept drift: underlying patterns changed over time. (5) Scale issues: model can't handle production volume. Solutions: (1) Monitor input distributions. (2) A/B testing. (3) Shadow deployment. (4) Regular retraining. (5) Feature stores for consistency. (6) Canary deployments.
Use collaborative filtering (user-item interactions), content-based filtering (item features), or hybrid approaches with matrix factorization.
Approaches: (1) Collaborative filtering: find similar users/items based on behavior. Matrix factorization (SVD, ALS) for scalability. (2) Content-based: recommend items similar to what user liked (TF-IDF, embeddings). (3) Hybrid: combine both. (4) Deep learning: neural collaborative filtering, two-tower models. (5) Cold start: use content features for new users/items. Evaluation: precision@k, recall@k, NDCG, MAP. Production: candidate generation → ranking → re-ranking pipeline.
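The matrix factorization idea can be sketched without libraries: learn a small vector per user and per item so their dot product approximates observed ratings. The ratings and hyperparameters below are invented for illustration.

```python
import random

def factorize(ratings, k=2, lr=0.01, epochs=2000, seed=0):
    """Tiny matrix factorization via SGD: for each observed (user, item,
    rating), nudge both vectors to shrink the prediction error."""
    rng = random.Random(seed)
    U = {u: [rng.uniform(0, 0.1) for _ in range(k)] for u, _, _ in ratings}
    V = {i: [rng.uniform(0, 0.1) for _ in range(k)] for _, i, _ in ratings}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                    V[i][f] + lr * err * U[u][f])
    return U, V

ratings = [("alice", "matrix", 5), ("alice", "titanic", 1),
           ("bob", "matrix", 4), ("bob", "titanic", 1)]
U, V = factorize(ratings)
predict = lambda u, i: sum(a * b for a, b in zip(U[u], V[i]))
# predict("alice", "matrix") approaches 5; predict("alice", "titanic") stays low
```

Production systems solve the same objective at scale with ALS or neural two-tower models, then layer ranking stages on top of these candidate scores.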
Google uses BERT/MUM for query understanding, RankBrain for ranking, and neural embeddings for semantic search.
ML in Search: (1) BERT: understands query context and intent. (2) MUM: multimodal, multilingual understanding. (3) RankBrain: ML-based ranking signal. (4) Neural matching: understands vague queries. (5) Passage indexing: finds relevant passages within pages. (6) Spam detection: ML identifies low-quality content. (7) Featured snippets: extract direct answers. (8) Knowledge Graph: structured information. Search handles an estimated 8.5 billion queries daily.
LLMs are transformer-based models trained on massive text corpora to predict the next token, enabling text generation and understanding.
Training: (1) Pre-training: predict next token on trillion-token dataset (self-supervised). (2) Fine-tuning: instruction tuning on curated datasets. (3) RLHF: human feedback aligns model with human preferences. Architecture: decoder-only transformer with billions of parameters. Inference: autoregressive generation (one token at a time). Capabilities emerge from scale: reasoning, coding, translation. Challenges: hallucination, computational cost, safety alignment, context window limits.
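Autoregressive generation itself is easy to illustrate. Here a hand-made bigram probability table stands in for a real LLM, and greedy decoding repeatedly appends the most probable next token until an end-of-sequence token.

```python
# Invented next-token probabilities; a real LLM computes these with a
# transformer over the whole context, not a lookup table.
bigram = {"the": {"cat": 0.6, "dog": 0.4},
          "cat": {"sat": 0.9, "ran": 0.1},
          "dog": {"ran": 1.0},
          "sat": {"<eos>": 1.0},
          "ran": {"<eos>": 1.0}}

def generate(token, max_len=10):
    """Greedy autoregressive decoding: one token at a time."""
    out = [token]
    for _ in range(max_len):
        nxt = max(bigram[token].items(), key=lambda kv: kv[1])[0]
        if nxt == "<eos>":
            break
        out.append(nxt)
        token = nxt
    return out

generate("the")  # → ["the", "cat", "sat"]
```

Sampling instead of taking the argmax (temperature, top-p) is what makes real LLM output varied rather than deterministic.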
Bias is error from wrong assumptions (underfitting); variance is error from sensitivity to training data (overfitting). Balance both.
High bias: model is too simple, misses patterns, high training error (underfitting). High variance: model is too complex, captures noise, low training error but high test error (overfitting). Tradeoff: reducing bias increases variance and vice versa. Solutions for high bias: more features, complex model, less regularization. Solutions for high variance: more data, regularization (L1/L2), dropout, cross-validation, ensemble methods. Ideal: low bias & low variance (rarely perfect). Diagnosis: learning curves, train vs validation error comparison.
Supervised uses labeled data; unsupervised finds patterns in unlabeled data; reinforcement learning learns through rewards and actions.
Supervised: input-output pairs. Classification (categorical output): logistic regression, SVM, decision trees, neural networks. Regression (continuous): linear regression, random forest. Unsupervised: no labels. Clustering (K-means, DBSCAN), dimensionality reduction (PCA, t-SNE), anomaly detection. Reinforcement: agent takes actions in environment, receives rewards. Q-learning, policy gradient, PPO. Applications: game AI (AlphaGo), robotics, recommendation systems. Semi-supervised: mix of labeled and unlabeled. Self-supervised: creates labels from data itself (BERT, GPT pretraining).
Transformers use self-attention mechanism to process entire sequences in parallel, replacing RNNs for NLP and beyond.
Architecture: encoder (understanding) + decoder (generation). Self-attention: each token attends to all other tokens, learning relationships. Multi-head attention: multiple attention mechanisms in parallel capture different relationships. Positional encoding: adds position information (transformers have no inherent sequence notion). Query/Key/Value: output = softmax(QK^T / sqrt(d_k)) V, where QK^T produces the raw scores and scaling by sqrt(d_k) stabilizes gradients. Advantages over RNNs: parallel processing, captures long-range dependencies, scalable. Models: BERT (encoder-only), GPT (decoder-only), T5 (encoder-decoder). Used for: NLP, vision (ViT), audio, multimodal (CLIP).
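The scaled dot-product formula can be computed directly in plain Python: a single head, no learned projections, toy 2-d vectors chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]                       # q·k for every key
        weights = softmax(scores)                   # attention distribution
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])     # weighted sum of values
    return out

# The query matches the first key, so most weight lands on V[0].
attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```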
Fine-tuning adapts a pretrained model to specific tasks by training on domain-specific data with lower learning rates.
Full fine-tuning: update all model weights on task data. Expensive for large models. Parameter-Efficient Fine-Tuning (PEFT): LoRA (Low-Rank Adaptation — adds trainable rank-decomposition matrices, ~0.1% parameters), QLoRA (quantized + LoRA), Adapters (small layers between transformer blocks). Instruction tuning: train on instruction-response pairs. RLHF (Reinforcement Learning from Human Feedback): align model outputs with human preferences. Data requirements: hundreds to thousands of examples. Tools: Hugging Face transformers, PEFT library, Axolotl, LitGPT. Catastrophic forgetting: model loses general knowledge — mitigate with replay or elastic weight consolidation.
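The core LoRA idea (replace a full weight update with a low-rank product BA) reduces to a few lines of matrix arithmetic. The 4x4 weight and rank-1 factors below are illustrative numbers, not a real adapter.

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Frozen pretrained weight W (4x4 identity here for clarity).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Trainable low-rank factors: B is 4x1, A is 1x4 (rank r = 1).
B = [[0.1], [0.2], [0.0], [0.0]]
A = [[0.5, 0.0, 0.5, 0.0]]

# Effective weight at inference: W + B @ A. Only B and A were trained:
# 8 numbers instead of the 16 in W (the savings are dramatic at LLM scale).
delta = matmul(B, A)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
```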
RAG combines retrieval of relevant documents with LLM generation, grounding responses in factual data and reducing hallucinations.
Pipeline: (1) Index: chunk documents, generate embeddings, store in vector database. (2) Retrieve: embed user query, find similar chunks (cosine similarity). (3) Augment: add retrieved context to LLM prompt. (4) Generate: LLM produces answer grounded in retrieved documents. Vector databases: Pinecone, Weaviate, ChromaDB, Qdrant. Embedding models: OpenAI ada-002, BGE, Cohere. Chunking strategies: fixed size, semantic, recursive. Advanced: re-ranking retrieved results, hybrid search (semantic + keyword), query expansion, multi-step retrieval. Benefits: no fine-tuning needed, updatable knowledge.
Check for data drift, training-serving skew, overfitting, feature pipeline issues, and data quality differences.
Causes: (1) Data drift: production data distribution differs from training (concept drift, data drift). (2) Training-serving skew: feature computation differs between training and serving pipelines. (3) Overfitting: model memorized training data. (4) Data leakage: training used future or unavailable information. (5) Feature issues: missing values handled differently, categorical encoding mismatches. (6) Scale differences: training on clean data, production has noise. Monitoring: track feature distributions, prediction confidence, model metrics over time. Solutions: retrain on recent data, feature store for consistency, A/B testing, shadow deployment.
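Monitoring feature distributions can start as simply as a standardized mean-shift check per feature. The numbers below are invented; production monitoring typically uses tests like Kolmogorov-Smirnov or PSI instead.

```python
import statistics

def mean_shift_alert(train_vals, prod_vals, threshold=2.0):
    """Crude data-drift check: flag a feature whose production mean has
    moved more than `threshold` training standard deviations."""
    mu = statistics.mean(train_vals)
    sd = statistics.stdev(train_vals)
    shift = abs(statistics.mean(prod_vals) - mu) / sd
    return shift > threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]          # feature seen during training
mean_shift_alert(train, [10.2, 9.8, 10.1])    # → False: no drift
mean_shift_alert(train, [14.0, 15.0, 14.5])   # → True: distribution moved
```

An alert like this would trigger investigation or retraining before prediction quality silently degrades.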
Use resampling (SMOTE, oversampling, undersampling), class weights, ensemble methods, and appropriate metrics (F1, AUPRC).
Data-level: (1) Oversampling minority (SMOTE — synthetic examples, ADASYN). (2) Undersampling majority (random, Tomek links, NearMiss). (3) Combined: SMOTE + Tomek. Algorithm-level: (1) Class weights: class_weight='balanced' adjusts loss. (2) Cost-sensitive learning: higher penalty for minority misclassification. (3) Ensemble: BalancedRandomForest, EasyEnsemble. Evaluation: avoid accuracy (misleading). Use precision, recall, F1-score, AUROC, AUPRC. Threshold tuning: adjust classification threshold based on business needs. Collection: gather more minority class data when possible.
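The class_weight='balanced' option mentioned above is inverse-frequency weighting. A small sketch mirroring sklearn's formula n_samples / (n_classes * count_per_class):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Mimic sklearn's class_weight='balanced': n / (k * count_c).
    Rare classes get proportionally larger weight in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = ["neg"] * 90 + ["pos"] * 10
balanced_class_weights(labels)
# → {"neg": 0.5556, "pos": 5.0}: minority errors cost 9x more
```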
Attention allows models to focus on relevant parts of input when producing output, weighting importance of different elements.
Concept: instead of fixed-size context vector, attention scores determine how much to focus on each input element. Types: (1) Bahdanau attention (additive): learned alignment function. (2) Luong attention (multiplicative): dot product between states. (3) Self-attention: each element attends to all elements in same sequence (Transformers). Multi-head: parallel attention with different learned projections — captures different types of relationships. Cross-attention: decoder attends to encoder outputs. Flash Attention: memory-efficient implementation. Attention weights are interpretable — show what the model focuses on.
Embeddings are dense vector representations of data (text, images) in continuous space where similar items are closer together.
Text embeddings: Word2Vec, GloVe (word-level), BERT/Sentence-BERT (sentence-level), OpenAI embeddings (document-level). Properties: semantic similarity = cosine similarity in vector space. 'king' - 'man' + 'woman' ≈ 'queen'. Applications: (1) Semantic search: find similar documents. (2) Clustering: group similar items. (3) Recommendation: user/item embeddings. (4) RAG: retrieve relevant context for LLMs. (5) Classification: embedding + classifier. Dimension: typically 384-3072 dimensions. Storage: vector databases for efficient similarity search. Fine-tuning: train on domain-specific similarity pairs.
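The "similar items are closer" property is measured with cosine similarity. The 3-d vectors below are invented for illustration; real embeddings have the hundreds to thousands of dimensions noted above.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings: related concepts point in similar directions.
cat, kitten, car = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]
cosine(cat, kitten) > cosine(cat, car)   # → True: "cat" is nearer "kitten"
```

Semantic search, clustering, and RAG retrieval all reduce to ranking candidates by exactly this score.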
Transfer learning uses a model pretrained on large datasets as starting point for new tasks, requiring less data and training time.
Concept: knowledge from one task helps another. ImageNet-pretrained CNN for medical imaging. GPT trained on internet text fine-tuned for customer service. Approaches: (1) Feature extraction: freeze pretrained layers, train new head. (2) Fine-tuning: unfreeze some/all layers, train with low learning rate. (3) Domain adaptation: align source and target distributions. Benefits: less training data needed, faster convergence, better performance. Popular pretrained models: BERT, GPT, ResNet, ViT, CLIP. Foundation models: large pretrained models adaptable to many downstream tasks. Practice: almost all modern ML uses transfer learning.
Accuracy, precision, recall, F1-score, AUROC, and confusion matrix — chosen based on business context and class balance.
Accuracy: correct/total — misleading for imbalanced data (99% accuracy on 99:1 split by predicting majority). Precision: TP/(TP+FP) — of predicted positives, how many correct. Important when false positives costly (spam filter). Recall: TP/(TP+FN) — of actual positives, how many found. Important when false negatives costly (cancer detection). F1-score: harmonic mean of precision and recall. AUROC: area under ROC curve, threshold-independent. Confusion matrix: TP, TN, FP, FN visualization. Multi-class: macro/micro/weighted averaging. Business-specific metrics often matter more than ML metrics.
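These definitions are easy to verify from raw confusion-matrix counts. The counts below are invented to show how a roughly 99:1 class split makes accuracy look excellent while precision and recall tell the real story.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                     # of predicted positives
    recall = tp / (tp + fn)                        # of actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Imbalanced data: 10 real positives in 1000 samples; model finds 8.
p, r, f1, acc = classification_metrics(tp=8, fp=5, fn=2, tn=985)
# acc = 0.993 looks great, but precision is only ~0.62 and recall 0.8
```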
Containerize model with Docker, serve via REST/gRPC API, implement monitoring, versioning, and A/B testing.
Serving: (1) REST API (Flask, FastAPI, TensorFlow Serving). (2) Batch: scheduled prediction on datasets. (3) Edge: ONNX Runtime, TensorRT for mobile/IoT. Infrastructure: Docker + Kubernetes, serverless (AWS Lambda + SageMaker). MLOps: (1) Model registry (MLflow): version, stage, metadata. (2) CI/CD for ML: data validation → training → evaluation → deployment. (3) Monitoring: data drift detection, prediction quality, latency. (4) A/B testing: compare model versions on real traffic. Tools: MLflow, Kubeflow, SageMaker, Vertex AI. Feature store: Feast for consistent feature computation. Model format: ONNX for framework-agnostic deployment.
Define problem, collect and explore data, select features, choose models, train, evaluate, deploy, and iterate.
CRISP-DM framework: (1) Business understanding: what problem, what metric matters, baseline. (2) Data understanding: EDA (distributions, correlations, missing values, outliers). (3) Data preparation: cleaning, feature engineering, encoding, normalization, train/val/test split. (4) Modeling: start simple (baseline), iterate complexity. Cross-validation. Hyperparameter tuning (grid search, Bayesian optimization). (5) Evaluation: test set metrics, business metrics, fairness analysis. (6) Deployment: API, monitoring, CI/CD. Common mistake: jumping to complex models before understanding data. Rule of thumb: roughly 80% of the effort goes into data preparation.
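The train/val/test split in step (3) is worth getting right before any modeling. A minimal sketch, assuming a 70/15/15 split and shuffling once with a fixed seed for reproducibility:

```python
import random

def train_val_test_split(rows, val=0.15, test=0.15, seed=42):
    """Shuffle once, then slice into disjoint train/validation/test sets."""
    rows = rows[:]                         # don't mutate the caller's data
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test, n_val = int(n * test), int(n * val)
    return (rows[n_test + n_val:],         # train
            rows[n_test:n_test + n_val],   # validation (model selection)
            rows[:n_test])                 # test (touched once, at the end)

train, val, test = train_val_test_split(list(range(100)))
# 70 / 15 / 15 rows, disjoint by construction
```

For time series, replace the shuffle with a chronological cut so the model never trains on the future.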
Batch processes large datasets periodically (recommendations); real-time processes individual requests instantly (fraud detection, search).
Batch inference: pre-compute predictions for all items/users, store results, serve from cache/database. Use when: latency not critical, predictions don't need fresh features. Tools: Spark, Airflow scheduled jobs. Real-time inference: model hosted as service, processes each request on-demand. Use when: immediate response needed, features change frequently. Tools: TensorFlow Serving, TorchServe, Triton. Near-real-time: streaming (Kafka + Flink + model). Hybrid: batch for bulk recommendations, real-time for re-ranking with fresh signals. Cost: batch is cheaper (shared resources), real-time needs always-on infrastructure.
Ready to master AI/ML?
Start learning with our comprehensive course and practice these questions.