Master these 31 carefully curated interview questions to ace your next SQL interview.
SQL manages relational databases. DDL (define schema), DML (manipulate data), DCL (permissions), TCL (transactions).
DDL: CREATE, ALTER, DROP, TRUNCATE. DML: SELECT, INSERT, UPDATE, DELETE. DCL: GRANT, REVOKE. TCL: COMMIT, ROLLBACK, SAVEPOINT. SQL is declarative. Standard ANSI SQL with vendor extensions (PostgreSQL, MySQL, SQL Server).
INNER (matching rows), LEFT (all left + matching), RIGHT (all right + matching), FULL (all from both), CROSS (cartesian).
INNER JOIN: only matching rows. LEFT JOIN: all from left, NULL for non-matching right. FULL OUTER: all from both. CROSS JOIN: every combination. SELF JOIN: table joined with itself. Performance: CROSS is the most expensive; the cost of the others depends on indexes and row counts.
Indexes are data structures speeding up retrieval by creating sorted references to rows, like a book's index.
Types: B-tree (default), Hash (exact match), GIN (full-text), GiST (geometric). Benefits: faster SELECT, WHERE, JOIN. Costs: slower INSERT/UPDATE, extra storage. Index columns in WHERE/JOIN/ORDER BY. EXPLAIN ANALYZE to verify usage.
Window functions calculate across related rows without collapsing them: ROW_NUMBER, RANK, LAG, LEAD over PARTITION BY.
Syntax: function() OVER (PARTITION BY col ORDER BY col). Functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM, AVG. Unlike GROUP BY, preserves individual rows. Use for rankings, running totals, comparing to previous rows.
CTEs (WITH clause) are temporary named result sets improving readability and enabling recursive queries for hierarchies.
WITH cte AS (SELECT ...) SELECT * FROM cte. Recursive: WITH RECURSIVE for hierarchical data (org charts, categories). CTEs are not always materialized — optimizer may inline them.
Normalization eliminates data redundancy by organizing into related tables (1NF-3NF). Denormalization adds redundancy for read speed.
1NF: atomic values. 2NF: no partial dependencies. 3NF: no transitive dependencies. Benefits: integrity, reduced storage. Denormalization: intentional redundancy to avoid joins. Normalize for OLTP, denormalize for OLAP.
Use EXPLAIN ANALYZE, add indexes, rewrite queries, optimize JOINs, use pagination, and partition large tables.
Steps: (1) EXPLAIN ANALYZE for execution plan. (2) Add indexes for sequential scans. (3) Check JOIN order. (4) Covering indexes. (5) Avoid SELECT *. (6) Replace subqueries with JOINs. (7) Pagination. (8) Partition large tables. (9) Connection pooling. (10) Materialized views for complex aggregations.
Atomicity (all or nothing), Consistency (valid state), Isolation (concurrent safety), Durability (survives crashes).
Isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE. Lower levels trade safety for performance. BEGIN/COMMIT/ROLLBACK. PostgreSQL default: READ COMMITTED. SERIALIZABLE prevents all anomalies but may need retry logic.
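Atomicity is the easiest of the four properties to demonstrate. A minimal sketch, using SQLite via Python's sqlite3 for portability — the `accounts` table and its values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# Atomicity: a failed transfer rolls back BOTH legs, not just one.
try:
    with conn:  # context manager: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE id = 2")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # both rows unchanged: {1: 100, 2: 50}
```

Neither account is left half-updated — the partial transfer never becomes visible.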
Profile with EXPLAIN, add composite indexes, use materialized views, consider partitioning and pre-aggregation.
Steps: (1) EXPLAIN ANALYZE. (2) Add composite indexes. (3) Materialized views for aggregations. (4) Partition by date. (5) Pre-aggregate in summary tables. (6) Consider columnar storage for analytics. (7) Parallel query. (8) Application-level caching.
Core: users, products, categories, orders, order_items, addresses, payments, reviews with proper foreign keys and indexes.
Tables: users(id, email), products(id, title, price, category_id, stock), categories(id, name, parent_id), orders(id, user_id, status, total), order_items(order_id, product_id, quantity, price), addresses, payments, reviews. Indexes on foreign keys and frequently queried columns.
WHERE filters rows before grouping; HAVING filters groups after GROUP BY aggregation.
WHERE: applied to individual rows before GROUP BY, cannot use aggregate functions (SUM, COUNT, AVG). HAVING: applied to groups after GROUP BY, can use aggregate functions. Example: SELECT dept, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept HAVING COUNT(*) > 5 — first filters rows by salary, then groups and filters groups by count. HAVING without GROUP BY applies to entire result as one group. Performance: WHERE reduces data before aggregation (more efficient). Use WHERE for row-level filters, HAVING for aggregate conditions.
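The example above is easy to verify in a runnable sketch (SQLite via Python's sqlite3 here; the `employees` data is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [
    ("eng", 90000), ("eng", 60000), ("eng", 40000),
    ("sales", 55000), ("sales", 30000),
])

# WHERE filters rows first; HAVING then filters the resulting groups.
rows = conn.execute("""
    SELECT dept, COUNT(*) AS n
    FROM employees
    WHERE salary > 50000          -- row filter: runs before grouping
    GROUP BY dept
    HAVING COUNT(*) >= 2          -- group filter: runs after aggregation
""").fetchall()
print(rows)  # [('eng', 2)]
```

Sales has one qualifying row, so its group is dropped by HAVING even though the rows passed WHERE.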
Joins combine rows from two or more tables based on related columns. Types: INNER, LEFT, RIGHT, FULL, CROSS, SELF.
INNER JOIN: returns matching rows from both tables. LEFT JOIN: all rows from left + matching from right (NULL for no match). RIGHT JOIN: all rows from right + matching from left. FULL OUTER JOIN: all rows from both (NULLs where no match). CROSS JOIN: cartesian product (every row × every row). SELF JOIN: table joined with itself (hierarchical data, comparisons). Natural join: auto-matches same-named columns (avoid — fragile). Performance: join order matters, indexes on join columns critical. Use EXPLAIN to analyze join performance.
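The INNER vs LEFT distinction is clearest side by side. A sketch using SQLite via Python's sqlite3, with hypothetical `customers`/`orders` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1);  -- Bob has no orders
""")

inner = conn.execute("""
    SELECT c.name, o.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

left = conn.execute("""
    SELECT c.name, o.id FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(inner)  # [('Ada', 10)] — only matching rows
print(left)   # [('Ada', 10), ('Bob', None)] — Bob kept, order columns NULL
```

The customer without a match vanishes from the INNER JOIN but survives the LEFT JOIN with NULLs.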
UNION combines results and removes duplicates (with sort overhead); UNION ALL keeps all rows including duplicates (faster).
UNION: combines result sets, removes duplicate rows (implicit DISTINCT + sort). UNION ALL: combines without removing duplicates — faster because no sorting/comparison. Rules: same number of columns, compatible data types, column names from first SELECT. Use UNION ALL when you know there are no duplicates or duplicates are acceptable — much better performance. INTERSECT: rows in both. EXCEPT/MINUS: rows in first but not second. UNION ALL is almost always preferred over UNION for performance.
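The dedup behaviour can be shown with two tiny queries (SQLite via Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION removes the duplicate 1; UNION ALL keeps every row.
union = conn.execute(
    "SELECT 1 AS n UNION SELECT 1 UNION SELECT 2"
).fetchall()
union_all = conn.execute(
    "SELECT 1 AS n UNION ALL SELECT 1 UNION ALL SELECT 2"
).fetchall()

print(union)      # two distinct values
print(union_all)  # three rows, duplicate kept
```

On large result sets the implicit deduplication is where UNION's overhead comes from.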
Indexes are data structures (usually B-tree) that speed up data retrieval at the cost of slower writes and additional storage.
B-tree index: default, ordered structure for range and equality queries. Hash index: equality lookups only (faster for exact match). Composite index: multiple columns — column order matters (leftmost prefix rule). Covering index: includes all query columns, avoiding table lookup. Unique index: enforces uniqueness. Partial index: indexes subset of rows (WHERE condition). Full-text index: text search. GIN/GiST: for arrays, JSON, geometric data (PostgreSQL). Trade-offs: faster reads, slower writes (index maintenance), disk space. Use EXPLAIN ANALYZE to verify index usage.
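Verifying index usage with the planner is worth practicing. A sketch using SQLite via Python's sqlite3 — SQLite's `EXPLAIN QUERY PLAN` plays the role of PostgreSQL's `EXPLAIN ANALYZE`, and the `users` table is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# Without an index, an equality lookup on email scans the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
before = plan[-1][-1]  # plan detail text, e.g. 'SCAN users'

conn.execute("CREATE UNIQUE INDEX idx_users_email ON users(email)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
after = plan[-1][-1]  # now a SEARCH using idx_users_email

print(before)
print(after)
```

The plan flips from a full scan to an index search once the index exists — the same check you would do with `EXPLAIN ANALYZE` in PostgreSQL.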
Window functions perform calculations across a set of rows related to the current row, without collapsing rows like GROUP BY.
Syntax: function() OVER (PARTITION BY col ORDER BY col ROWS BETWEEN ...). Functions: ROW_NUMBER() (unique sequential), RANK() (gaps on ties), DENSE_RANK() (no gaps), NTILE(n) (divide into buckets), LAG/LEAD (previous/next row), FIRST_VALUE/LAST_VALUE (LAST_VALUE needs an explicit frame — the default frame ends at the current row), SUM/AVG/COUNT as running totals. Frame: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total). PARTITION BY creates independent windows. Use cases: running totals, rankings, moving averages, comparing current to previous row, top-N per group. More powerful than GROUP BY for analytical queries.
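RANK's tie behaviour and per-partition running totals come together in one query. A sketch using SQLite via Python's sqlite3 (window functions need SQLite 3.25+; the `sales` data is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10), ("a", 30), ("b", 20), ("b", 20)])

# RANK leaves gaps after ties; the running SUM restarts in each partition.
rows = conn.execute("""
    SELECT dept, amount,
           RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rnk,
           SUM(amount) OVER (PARTITION BY dept ORDER BY amount DESC
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW) AS running
    FROM sales
    ORDER BY dept, rnk
""").fetchall()
for r in rows:
    print(r)
```

Note that every input row is preserved — the aggregate is attached alongside each row instead of collapsing the group, which is the key contrast with GROUP BY. Dept b's two tied rows both get rank 1.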
Normalization organizes data to reduce redundancy and improve integrity through progressive normal forms (1NF through 5NF).
1NF: atomic values, no repeating groups. 2NF: 1NF + no partial dependencies (all non-key columns depend on entire primary key). 3NF: 2NF + no transitive dependencies (non-key columns don't depend on other non-key columns). BCNF: every determinant is a candidate key. 4NF: no multi-valued dependencies. 5NF: no join dependencies. Denormalization: intentionally adding redundancy for read performance (reporting, analytics). Most applications normalize to 3NF. Data warehouses often denormalize (star/snowflake schema). Balance normalization with query performance.
Stored procedures are precompiled SQL programs stored in the database, offering performance, security, and code reuse benefits.
Benefits: precompiled execution plans (faster), reduced network traffic (batch operations), centralized business logic, security (grant EXECUTE without table access), transaction management. Drawbacks: harder to version control, database-vendor lock-in, debugging complexity, can hide business logic. Use for: complex data operations, batch processing, data validation rules, audit logging. Avoid for: simple queries, logic that changes frequently, portability requirements. Functions return values; procedures perform actions.
Isolation levels control how transactions interact: Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Read Uncommitted: sees uncommitted changes (dirty reads). Read Committed: only committed data (PostgreSQL default). Repeatable Read: consistent snapshot throughout transaction (MySQL InnoDB default). Serializable: full isolation, transactions appear sequential. Problems: Dirty reads (reading uncommitted data), Non-repeatable reads (row changes between reads), Phantom reads (new rows appear matching query). Higher isolation = more consistency but lower concurrency. MVCC (Multi-Version Concurrency Control) in PostgreSQL/MySQL avoids locking by keeping row versions.
Use EXPLAIN ANALYZE, add indexes, rewrite joins, reduce result set early, avoid SELECT *, and use query caching.
Steps: (1) EXPLAIN ANALYZE — identify full table scans, nested loops. (2) Add indexes on WHERE/JOIN/ORDER BY columns. (3) Avoid SELECT * — fetch only needed columns. (4) Replace subqueries with JOINs when possible. (5) Use EXISTS instead of IN for correlated subqueries. (6) Limit results early (WHERE before GROUP BY). (7) Avoid functions on indexed columns in WHERE (UPPER(col) = 'VALUE' prevents index use — apply the function to the constant side instead). (8) Partition large tables. (9) Materialized views for complex aggregations. (10) Query caching (Redis). (11) Check for N+1 queries in application code. (12) Database-specific: PostgreSQL pg_stat_statements.
CTEs define named temporary result sets within a query using WITH clause, improving readability and enabling recursive queries.
Syntax: WITH cte_name AS (SELECT ...) SELECT * FROM cte_name. Multiple CTEs: WITH cte1 AS (...), cte2 AS (...) SELECT .... Recursive CTE: WITH RECURSIVE tree AS (base UNION ALL recursive) — for hierarchical data (org charts, category trees, graphs). Benefits: readability, reuse within query, recursive traversal. Performance: PostgreSQL always materialized CTEs before v12 and may inline them from v12 on; MySQL 8+ typically inlines. CTE vs subquery: usually same performance, CTE more readable. CTE vs temp table: CTE exists only for one query; temp table persists for the session.
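A recursive CTE walking an org chart is a common interview exercise. A sketch using SQLite via Python's sqlite3, with a hypothetical `employees` hierarchy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'ceo', NULL), (2, 'vp', 1), (3, 'eng', 2), (4, 'eng2', 2);
""")

# Base case: the row with no manager. Recursive case: everyone whose
# manager is already in the result, with depth incremented each level.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)  # [('ceo', 0), ('vp', 1), ('eng', 2), ('eng2', 2)]
```

The same pattern traverses category trees or any adjacency-list hierarchy.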
Use migration tools, apply incrementally, test on staging, ensure backward compatibility, and plan rollback strategies.
Best practices: (1) Migration tools: Flyway, Liquibase, Django migrations, Knex.js. (2) Version control migrations. (3) Forward-only: migration + rollback script. (4) Backward compatible: add columns with defaults before code deploy, remove later. (5) Schema changes: add column → deploy code → backfill data → add constraints. (6) Large table changes: pt-online-schema-change (MySQL), pg_repack (PostgreSQL). (7) Test on staging with production-size data. (8) Blue-green deploy: run old and new code simultaneously during migration.
Core tables: users, products, categories, orders, order_items, payments, addresses with proper normalization and indexes.
Schema: users (id, email, name, password_hash). products (id, name, price, category_id, stock, sku, description). categories (id, name, parent_id for hierarchy). orders (id, user_id, total, status, created_at). order_items (order_id, product_id, quantity, price_at_purchase — snapshot price). addresses (user_id, type, street, city, zip). payments (order_id, method, amount, status, transaction_id). Indexes: user email, product sku, order user_id+status. Considerations: soft deletes, audit trail, inventory management (optimistic locking), search (full-text or Elasticsearch).
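A minimal DDL sketch of the core of that schema, using SQLite via Python's sqlite3 — table and column names follow the list above but are illustrative, and SQLite needs an explicit pragma to enforce foreign keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK checks
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT UNIQUE NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE order_items (
    order_id INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    price_at_purchase INTEGER NOT NULL,  -- snapshot, not the live product price
    PRIMARY KEY (order_id, product_id));
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
""")

# The foreign key rejects an order for a non-existent user.
fk_enforced = False
try:
    conn.execute("INSERT INTO orders (user_id) VALUES (999)")
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)  # True
```

The composite index on (user_id, status) serves the common "this user's open orders" query in one index scan.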
Sharding horizontally partitions data across multiple databases based on a shard key, enabling horizontal scaling.
Strategy: choose shard key (user_id, region) that distributes data evenly. Types: range-based (id 1-1000 → shard 1), hash-based (hash(id) % num_shards), directory-based (lookup table). Challenges: cross-shard queries (expensive), rebalancing when adding shards, no cross-shard foreign keys, distributed transactions. Alternatives before sharding: read replicas, caching, vertical partitioning, connection pooling. Tools: Vitess (MySQL), Citus (PostgreSQL). Start with single database, optimize, then shard when truly necessary. Most applications never need sharding.
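Hash-based routing can be sketched in a few lines. This is illustrative application-side logic, not how Vitess or Citus route internally; the shard count and key are assumptions:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Hash-based routing: stable and evenly spread, but adding a shard
    remaps most keys (one reason consistent hashing exists)."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every lookup for the same key deterministically lands on the same shard.
assignments = {uid: shard_for(uid) for uid in range(1000)}
print(shard_for(42) == shard_for(42))  # True — routing is stable
```

A query on the shard key touches one shard; anything else fans out to all of them, which is the cross-shard cost mentioned above.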
CAP theorem states distributed systems can guarantee only two of three: Consistency, Availability, and Partition tolerance.
Consistency: all nodes see same data simultaneously. Availability: every request gets a response. Partition tolerance: system works despite network splits. CP systems: prioritize consistency (e.g., MongoDB with majority read/write concerns). AP systems: prioritize availability (Cassandra, DynamoDB). CA: impossible in distributed systems (partitions are inevitable). In practice: choose between consistency and availability during partition. PACELC extends CAP: during Partition choose A or C; Else (no partition) choose Latency or Consistency. Most systems default to eventual consistency for better performance.
SQL databases are relational with structured schemas and ACID transactions; NoSQL offers flexible schemas with horizontal scaling.
SQL (PostgreSQL, MySQL): structured schemas, ACID transactions, powerful joins, strong consistency, vertical scaling primarily. NoSQL types: Document (MongoDB — flexible JSON), Key-Value (Redis — fast lookups), Column-family (Cassandra — wide rows), Graph (Neo4j — relationships). NoSQL benefits: horizontal scaling, flexible schema, high write throughput. SQL benefits: data integrity, complex queries, transaction support. Choose SQL for: financial data, complex relationships, ACID requirements. Choose NoSQL for: high scale, flexible data, simple queries, eventual consistency acceptable.
DELETE removes specific rows with rollback; TRUNCATE removes all rows fast without logging each; DROP removes the entire table.
DELETE: DML, removes rows matching WHERE (all if no WHERE), logged per row, can rollback, triggers fire, slower for bulk. TRUNCATE: DDL, removes ALL rows, minimal logging (deallocates pages), faster, resets auto-increment, cannot rollback in some databases. DROP: DDL, removes table structure and data entirely, cannot rollback. Performance: TRUNCATE >> DELETE for clearing tables. DELETE with WHERE for selective removal. CASCADE option in DROP/TRUNCATE affects dependent objects. Foreign key constraints may prevent TRUNCATE.
Triggers are stored programs that automatically execute in response to INSERT, UPDATE, or DELETE events on a table.
Types: BEFORE (validate/modify data pre-operation), AFTER (log/audit post-operation), INSTEAD OF (override operation on views). Row-level: fires for each affected row. Statement-level: fires once per statement. Access: OLD (pre-change values), NEW (post-change values). Use cases: audit trails, data validation, auto-updating timestamps, maintaining denormalized data, cascade operations. Caution: hidden logic (hard to debug), performance impact, can cause infinite loops (trigger fires trigger). Alternative: application-level logic for complex business rules.
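An audit-trail trigger using OLD/NEW is a typical example. A sketch in SQLite via Python's sqlite3 — trigger syntax varies slightly by vendor, and the tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER);
    CREATE TABLE price_audit (product_id INTEGER, old_price INTEGER, new_price INTEGER);

    -- AFTER row-level trigger: OLD/NEW expose pre- and post-change values.
    CREATE TRIGGER trg_price_audit
    AFTER UPDATE OF price ON products
    FOR EACH ROW WHEN OLD.price <> NEW.price
    BEGIN
        INSERT INTO price_audit VALUES (OLD.id, OLD.price, NEW.price);
    END;

    INSERT INTO products VALUES (1, 100);
    UPDATE products SET price = 120 WHERE id = 1;
""")

audit = conn.execute("SELECT * FROM price_audit").fetchall()
print(audit)  # [(1, 100, 120)] — the change was logged automatically
```

The WHEN clause keeps no-op updates out of the audit table, which is one way to limit trigger overhead.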
Deadlock occurs when transactions wait for each other's locks indefinitely; databases detect and kill one transaction to resolve.
Scenario: Transaction A locks row 1, waits for row 2. Transaction B locks row 2, waits for row 1 — circular wait. Database detects deadlock and rolls back one transaction (victim). Prevention: (1) Access tables/rows in consistent order. (2) Keep transactions short. (3) Use appropriate isolation level. (4) Add indexes (reduces locking). (5) Use row-level locking (not table). (6) SELECT FOR UPDATE NOWAIT to fail instead of wait. (7) Retry logic for deadlock victims. Monitoring: MySQL: SHOW ENGINE INNODB STATUS. PostgreSQL: pg_stat_activity, deadlock_timeout setting.
Use keyset/cursor pagination instead of OFFSET, limit result set, add proper indexes, and consider caching.
OFFSET pagination: SELECT * FROM items ORDER BY id LIMIT 20 OFFSET 10000 — scans and discards 10000 rows. Slow for large offsets. Keyset pagination: WHERE id > last_seen_id ORDER BY id LIMIT 20 — uses index, constant performance regardless of page. Cursor-based: encode last item's sort values as cursor token. Requirements: stable sort column, index on sort column. Additional: estimated total count (SELECT reltuples FROM pg_class vs exact COUNT), infinite scroll vs page numbers, cache frequently accessed pages, materialized views for complex queries.
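Keyset pagination in miniature (SQLite via Python's sqlite3; the `items` table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 101)])

def page(last_seen_id: int, size: int = 20):
    """Keyset pagination: seek past the last id via the PK index —
    no rows are scanned and discarded, unlike OFFSET."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, size)).fetchall()

first = page(0)
second = page(first[-1][0])  # client passes back the last id it saw
print(first[0], second[0])   # (1, 'item-1') (21, 'item-21')
```

The cost per page is constant regardless of how deep the client has paged, provided the sort column is indexed and stable.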
Use built-in full-text indexes (PostgreSQL tsvector, MySQL FULLTEXT), or external search engines like Elasticsearch for complex needs.
PostgreSQL: CREATE INDEX idx ON articles USING GIN (to_tsvector('english', title || body)). Query: WHERE to_tsvector('english', title) @@ to_tsquery('search & terms'). ts_rank() for relevance scoring. MySQL: FULLTEXT index, MATCH...AGAINST syntax. Limitations: SQL full-text works for basic search. For advanced needs: Elasticsearch (distributed, fuzzy matching, facets, synonyms, autocomplete), MeiliSearch (lightweight), or Typesense. Architecture: sync data from DB to search engine, search engine for queries, DB for source of truth. Consider: indexing latency, data consistency, autocomplete requirements.
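The same idea can be tried locally with SQLite's FTS5, its analogue of tsvector/FULLTEXT (requires an FTS5-enabled build, which standard Python distributions include; the articles are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: a built-in inverted full-text index.
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [
    ("Indexing basics", "B-tree indexes speed up lookups"),
    ("Full-text search", "tokenized terms enable keyword queries"),
])

# MATCH queries the inverted index; rank orders by relevance.
hits = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank",
    ("indexes",)).fetchall()
print(hits)  # [('Indexing basics',)]
```

The default tokenizer does no stemming, so "indexes" matches the literal token but not "indexing" — the kind of gap where Elasticsearch-style analyzers earn their keep.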
Ready to master SQL?
Start learning with our comprehensive course and practice these questions.