99 SQL Interview Questions and Answers (2026)

Q1.
What are the differences between `DELETE`, `TRUNCATE`, and `DROP` in terms of transaction logging, speed, and rollback ability?

Junior

All three remove data, but at different levels: DELETE removes rows (DML), TRUNCATE empties a whole table fast (usually DDL), and DROP removes the table object itself.

DELETE:
- Row-by-row, honors WHERE and fires triggers.
- Fully logged per row, so slower on large tables but fully rollback-able in a transaction.
TRUNCATE:
- Deallocates data pages instead of logging each row, so it's very fast.
- No WHERE, typically resets identity/auto-increment, and doesn't fire row triggers.
- Rollback depends on the engine: recoverable in a transaction on SQL Server/PostgreSQL, but implicitly commits in MySQL/Oracle.
DROP:
- Removes the table structure, data, indexes, and constraints entirely.
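- DDL; auto-commits in MySQL and Oracle, so it cannot be rolled back there. PostgreSQL and SQL Server treat DROP TABLE as transactional, so it can be rolled back until the transaction commits.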
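
The rollback behavior of DELETE can be sketched with Python's built-in sqlite3 module, used here only as a convenient stand-in engine (an assumption for illustration). SQLite has no TRUNCATE statement, so only DELETE is shown; the `logs` table and its rows are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction boundaries below are controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, level TEXT)")  # hypothetical table
conn.executemany("INSERT INTO logs (level) VALUES (?)",
                 [("info",), ("debug",), ("debug",)])

conn.execute("BEGIN")
# DELETE honors WHERE and reports how many rows it removed.
deleted = conn.execute("DELETE FROM logs WHERE level = 'debug'").rowcount
conn.execute("ROLLBACK")  # undo the delete

rows_after_rollback = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(deleted, rows_after_rollback)
```

The delete removes two rows inside the transaction, and the rollback restores the table to three rows.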

Q2.
What is a `CASE` expression and how would you use it inside a `SELECT` or `ORDER BY` clause?

Junior

A CASE expression is SQL's inline if/then/else: it evaluates conditions in order and returns the first matching result. It's an expression, not a statement, so it can appear anywhere a value is expected.

Two forms:
- Searched: CASE WHEN condition THEN ... ELSE ... END, each branch a full boolean test.
- Simple: CASE expr WHEN value THEN ... END, compares one expression to values.
In SELECT: Derive labels/buckets or do conditional aggregation like SUM(CASE WHEN ... THEN 1 ELSE 0 END).
In ORDER BY: Impose custom sort priority, e.g. push a status to the top regardless of alphabetical order.
Returns NULL if no branch matches and no ELSE is given.

```sql
SELECT name,
       CASE WHEN score >= 90 THEN 'A'
            WHEN score >= 80 THEN 'B'
            ELSE 'C' END AS grade
FROM students
ORDER BY CASE WHEN status = 'active' THEN 0 ELSE 1 END, name;
```

Q3.
What are the main categories of SQL statements, `DDL`, `DML`, `DCL`, and `TCL`, and what does each include?

Junior

SQL statements group into four families by what they act on: structure (DDL), data (DML), permissions (DCL), and transactions (TCL).

DDL (Data Definition Language): Defines schema objects: CREATE, ALTER, DROP, TRUNCATE. Often auto-commits.
DML (Data Manipulation Language): Reads and changes data: SELECT, INSERT, UPDATE, DELETE. (Some call SELECT DQL.)
DCL (Data Control Language): Manages permissions: GRANT, REVOKE.
TCL (Transaction Control Language): Controls transaction boundaries: COMMIT, ROLLBACK, SAVEPOINT.

Q4.
What is the relational model, and how do relations, tuples, attributes, and domains map to tables, rows, columns, and data types?

Junior

The relational model (Codd, 1970) represents data as relations (sets of tuples) governed by set theory and predicate logic. Its formal terms map directly onto the table vocabulary practitioners use every day.

Relation → table: An unordered set of tuples sharing the same attributes.
Tuple → row: A single record; because a relation is a set, it has no duplicate tuples and no inherent order.
Attribute → column: A named property with a defined type.
Domain → data type: The set of permissible values an attribute may take (e.g. integers, dates).
Caveat: SQL is a loose implementation: SQL tables allow duplicate rows and use NULL, so they are technically "bags," not pure mathematical relations.

Q5.
Why is using `SELECT *` generally discouraged in production queries?

Junior

SELECT * fetches every column, which wastes I/O and network, defeats index-only plans, and makes queries fragile to schema changes. Naming columns explicitly is clearer and safer in production.

Performance cost:
- Reads and transfers columns you don't need, including large text/blob fields.
- Prevents covering-index optimizations, forcing extra lookups to the base table.
Fragility:
- Column additions/reordering can break code relying on position or shift result shape.
- Ambiguous or duplicate column names in joins.
Clarity: Explicit lists document exactly what a query depends on.
Acceptable uses: Ad-hoc exploration or EXISTS (SELECT * ...), where the projection is never materialized.

Q6.
What is a `Self-Join`, and in what real-world scenario would you use one?

Junior

A self-join is a join where a table is joined to itself, using table aliases to treat it as two logical copies. It is used to relate rows within the same table, typically for hierarchical or comparative relationships.

How it works: You give the table two aliases and join on a column that references another row in the same table.
Classic scenario (employee/manager hierarchy): An employees table has a manager_id pointing to another employee's id; a self-join pairs each employee with their manager.
Other uses: Finding pairs (e.g., customers in the same city) or comparing rows to each other (e.g., duplicate detection).

```sql
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
```

Q7.
Explain the difference between `JOIN` and `UNION`.

Junior

A JOIN combines columns from multiple tables side by side (horizontally) based on a related condition, while a UNION stacks the rows of two result sets on top of each other (vertically).

JOIN widens the result:
- Merges rows from different tables into wider rows using a join condition (e.g., matching keys).
- The tables usually have different columns.
UNION lengthens the result: Appends rows from one query below another; column count and types must align.
Mnemonic: JOIN adds columns; UNION adds rows.
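
As a rough illustration of "JOIN adds columns; UNION adds rows", the sketch below runs both against an in-memory SQLite database through Python's sqlite3 module (an assumed stand-in engine). The `customers`, `suppliers`, and `orders` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO suppliers VALUES (1, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50);
""")

# JOIN: columns from two tables end up side by side in one wider row.
joined = conn.execute("""
    SELECT c.name, o.id, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchall()

# UNION ALL: rows of the second query are stacked under the first.
stacked = conn.execute("""
    SELECT name FROM customers
    UNION ALL
    SELECT name FROM suppliers
    ORDER BY name
""").fetchall()

print(joined)
print(stacked)
```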

Q8.
What are the different types of `JOIN`s and how do they handle non-matching rows?

Junior

The main JOIN types differ in how they treat rows that have no match on the other side: some drop them, some keep them and pad with NULL.

INNER JOIN: Returns only rows with a match on both sides; non-matching rows are discarded.
LEFT (OUTER) JOIN: Keeps all left rows; unmatched right columns become NULL.
RIGHT (OUTER) JOIN: Keeps all right rows; unmatched left columns become NULL.
FULL (OUTER) JOIN: Keeps all rows from both sides; missing values on either side are NULL.
CROSS JOIN: Cartesian product: every left row paired with every right row, no match condition.
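
A minimal sketch of how three of these join types treat an unmatched row, using Python's sqlite3 module as an assumed stand-in engine. RIGHT and FULL joins are left out because older SQLite versions do not support them. The tables and the `run` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');  -- Bo has no orders
    INSERT INTO orders VALUES (10, 1);
""")

def run(join_clause):
    """Run the same projection with a different join clause (illustrative helper)."""
    sql = f"SELECT c.name, o.id FROM customers c {join_clause} ORDER BY c.name, o.id"
    return conn.execute(sql).fetchall()

inner = run("INNER JOIN orders o ON o.customer_id = c.id")  # unmatched Bo is dropped
left = run("LEFT JOIN orders o ON o.customer_id = c.id")    # Bo kept, order id is NULL
cross = run("CROSS JOIN orders o")                          # every customer x every order

print(inner, left, cross, sep="\n")
```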

Q9.
Explain the difference between an `INNER JOIN` and an `OUTER JOIN` (`LEFT/RIGHT/FULL`).

Junior

An INNER JOIN returns only rows that match in both tables, while an OUTER JOIN preserves unmatched rows from one or both sides and fills the missing columns with NULL.

INNER JOIN: Intersection: rows present in both tables based on the join condition.
LEFT OUTER JOIN: All rows from the left table plus matches; unmatched right side is NULL.
RIGHT OUTER JOIN: Mirror of LEFT: all rows from the right table, unmatched left side is NULL.
FULL OUTER JOIN: Union of both: all rows from both tables, with NULL where either side is missing.
Practical tip: Use a LEFT JOIN with a WHERE right.key IS NULL filter to find rows in one table that have no counterpart in the other.
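
The anti-join pattern from the practical tip might look like the sketch below, run through Python's sqlite3 module as an assumed stand-in engine. The `customers` and `orders` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# LEFT JOIN keeps every customer; rows with no matching order have o.id NULL,
# so filtering on that NULL leaves only customers without orders.
customers_without_orders = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
    ORDER BY c.name
""").fetchall()

print(customers_without_orders)
```

Testing a non-nullable column of the right table (here its primary key) avoids confusing a genuine NULL value with a missing match.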

Q10.
Explain the difference between an `INNER JOIN` and a `LEFT JOIN`. When would you use a `CROSS JOIN` or a `SELF JOIN` in a real-world application?

Junior

An INNER JOIN returns only matching rows from both tables, while a LEFT JOIN returns all rows from the left table plus matches (NULL where none exist). CROSS JOIN and SELF JOIN serve more specialized needs.

INNER JOIN vs LEFT JOIN:
- INNER JOIN: intersection, unmatched rows dropped.
- LEFT JOIN: all left rows kept; use it to preserve records that may have no related data (e.g., customers with no orders).
CROSS JOIN in practice: Generating all combinations, such as pairing every product with every available color/size, or building a calendar grid.
SELF JOIN in practice: Relating rows within one table, like matching each employee to their manager in the same employees table, or finding duplicate/related records.

```sql
-- Customers and their orders, keeping customers with none
SELECT c.name, o.id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;
```

Q11.
Explain the concept of referential integrity and how foreign keys enforce it.

Junior

Referential integrity is the guarantee that a foreign key value always refers to an existing, valid row in the parent table (or is NULL). Foreign keys enforce it by rejecting or handling operations that would leave a child row pointing at a non-existent parent.

What it enforces:
- No orphan rows: you can't insert or update a child with a key that has no matching parent.
- You can't delete or change a parent's key while children still reference it, unless a cascade or SET NULL rule applies.
How the FK enforces it:
- The referenced parent column must be a primary or unique key.
- The database checks the constraint on every write: normally as part of each statement, or at commit time when the constraint is declared deferrable and deferred (in engines that support deferral).
Reaction rules: RESTRICT/NO ACTION block the change; CASCADE, SET NULL, SET DEFAULT propagate it.
Note: a NULL foreign key is allowed and means 'no relationship', which does not violate integrity.

```sql
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT REFERENCES customers(id) ON DELETE RESTRICT
);
-- Fails: no customer 999 exists
INSERT INTO orders (id, customer_id) VALUES (1, 999);
```

Q12.
What is the purpose of a Foreign Key constraint? Explain the difference between `ON DELETE CASCADE` and `ON DELETE SET NULL` in terms of data integrity.

Junior

A foreign key constraint enforces referential integrity by guaranteeing that a value in the child table corresponds to a real row in the parent table. ON DELETE CASCADE and ON DELETE SET NULL differ in whether the child rows are removed or simply detached when the parent disappears.

Purpose of a foreign key:
- Ties related tables together and rejects orphaned references (e.g. an order pointing to a deleted customer).
- Documents the relationship and lets the DB, not application code, enforce it.
ON DELETE CASCADE:
- Child rows are deleted with the parent: good when children are meaningless without the parent (order line items).
- Integrity view: no dangling rows, but risk of unintentionally deleting large chains of data.
ON DELETE SET NULL:
- Child rows survive but their FK becomes NULL: good when the child is still valuable independently (an employee whose department is dropped).
- Integrity view: preserves history/records at the cost of losing the relationship link; requires a nullable column.
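
The two behaviors can be compared side by side in a small sketch using Python's sqlite3 module (an assumed stand-in engine, which enforces foreign keys only after `PRAGMA foreign_keys = ON`). All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        department_id INTEGER REFERENCES departments(id) ON DELETE SET NULL
    );
    CREATE TABLE order_lines (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id) ON DELETE CASCADE
    );
    INSERT INTO departments VALUES (1);
    INSERT INTO employees VALUES (100, 1);
    INSERT INTO orders VALUES (1);
    INSERT INTO order_lines VALUES (1000, 1), (1001, 1);

    DELETE FROM departments WHERE id = 1;  -- SET NULL: employee survives, link is cleared
    DELETE FROM orders WHERE id = 1;       -- CASCADE: order lines are deleted too
""")

employees_after = conn.execute("SELECT id, department_id FROM employees").fetchall()
lines_after = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
print(employees_after, lines_after)
```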

Q13.
What are `CHECK`, `DEFAULT`, and `NOT NULL` constraints, and how do they help enforce data integrity?

Junior

These are column-level constraints that keep bad data out at the database level: NOT NULL requires a value, DEFAULT supplies one when none is given, and CHECK validates that values satisfy a condition.

NOT NULL: Rejects rows where the column is missing a value: enforces that a fact is always known (e.g. every user needs an email).
DEFAULT: Provides a fallback value when the INSERT omits the column (e.g. status DEFAULT 'active' or created_at DEFAULT CURRENT_TIMESTAMP).
CHECK:
- Enforces a boolean rule per row (e.g. CHECK (price >= 0)); the write fails if it evaluates to false.
- Note: a CHECK passes when the condition is NULL (unknown), so pair with NOT NULL when needed.
Together they push validation into the schema, so integrity holds regardless of which application or query writes the data.

```sql
CREATE TABLE products (
  id       INT PRIMARY KEY,
  name     VARCHAR(100) NOT NULL,
  price    NUMERIC CHECK (price >= 0),
  status   VARCHAR(20) DEFAULT 'active'
);
```
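
The note that a CHECK passes on NULL can be seen in a short sketch with Python's sqlite3 module (an assumed stand-in engine). The table mirrors the `products` example in simplified form, and the inserted rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price NUMERIC CHECK (price >= 0)
    )
""")

# NULL price: the CHECK evaluates to UNKNOWN, which does not count as a violation.
conn.execute("INSERT INTO products (id, name, price) VALUES (1, 'sample', NULL)")

try:
    # Negative price: the CHECK evaluates to false, so the insert is rejected.
    conn.execute("INSERT INTO products (id, name, price) VALUES (2, 'bad', -1)")
    negative_rejected = False
except sqlite3.IntegrityError:
    negative_rejected = True

stored = conn.execute("SELECT id, price FROM products").fetchall()
print(negative_rejected, stored)
```

Declaring the column as `NOT NULL` as well would close the gap that lets the NULL price through.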

Q14.
What is the difference between `WHERE` and `HAVING`, and why can't you use an aggregate function like `SUM()` in a `WHERE` clause?

Junior

WHERE filters individual rows before grouping and aggregation, while HAVING filters groups after aggregation. You can't use an aggregate like SUM() in WHERE because the aggregate doesn't exist yet at the point WHERE runs.

WHERE:
- Applied row-by-row before GROUP BY; can only reference raw column values, not aggregates.
- Reduces rows early, so it's cheaper and can use indexes.
HAVING: Applied after grouping; can reference aggregate results (e.g. HAVING SUM(amount) > 1000).
Why no aggregate in WHERE: Logical order of evaluation is FROM then WHERE then GROUP BY then HAVING; aggregates are computed at the grouping step, after WHERE has already finished.
Best practice: filter with WHERE whenever possible and reserve HAVING for conditions on aggregates.

```sql
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE status = 'paid'        -- filters rows first
GROUP BY customer_id
HAVING SUM(amount) > 1000;   -- filters groups after
```

Q15.
What do the `GROUP BY` and `HAVING` clauses do, and what are the common aggregate functions like `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`?

Junior

GROUP BY collapses rows that share the same values into one row per group so aggregate functions can summarize each group, and HAVING filters those resulting groups. Aggregate functions compute a single value across the rows of each group.

GROUP BY: Partitions rows by the listed columns; every non-aggregated column in the SELECT must appear in GROUP BY (some engines also accept columns that are functionally dependent on the grouped key).
HAVING: Filters the grouped output using aggregate conditions (the group-level counterpart to WHERE).
Common aggregates:
- COUNT(): number of rows (COUNT(*) counts all, COUNT(col) skips NULL).
- SUM() and AVG(): total and mean of numeric values.
- MIN() and MAX(): smallest and largest value.
Note: most aggregates ignore NULL values (except COUNT(*)).

Q16.
How does the `DISTINCT` keyword work, and what are the performance considerations of using it?

Junior

DISTINCT removes duplicate rows from a result set, returning only unique combinations of the selected columns. It's convenient but can be costly because the database must compare and deduplicate every row, usually via a sort or hash.

How it works:
- Applies across all columns in the SELECT list, not just the first one: DISTINCT a, b dedupes on the pair.
- Treats multiple NULLs as equal, so they collapse to one.
Performance considerations:
- Requires a sort or hash aggregation over the full result: extra CPU and memory, possibly spilling to disk on large sets.
- An index on the deduplicated columns can let the planner avoid a sort.
Watch out for:
- Using DISTINCT to mask duplicates caused by a bad JOIN: fix the join instead.
- For existence checks prefer EXISTS over SELECT DISTINCT, and use GROUP BY when you also need aggregates.

Q17.
How does `COUNT(*)` differ from `COUNT(column)` and `COUNT(DISTINCT column)`?

Junior

All three count rows, but they differ in what they include: COUNT(*) counts every row, COUNT(column) skips NULLs, and COUNT(DISTINCT column) counts unique non-NULL values.

COUNT(*) counts all rows:
- Includes rows with NULLs; it never inspects column values, just row existence.
- Optimizers often satisfy it from an index or table metadata.
COUNT(column) counts non-NULL values in that column: Rows where the column is NULL are excluded, so it can be less than COUNT(*).
COUNT(DISTINCT column) counts unique non-NULL values:
- Duplicates collapse to one; NULLs still ignored.
- Usually the most expensive: requires deduplication (sort or hash).
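
A minimal sketch of the three counts on the same column, using Python's sqlite3 module as an assumed stand-in engine. The `users` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO users (city) VALUES ('Oslo'), ('Oslo'), ('Rome'), (NULL);
""")

# One pass, three different answers about the same four rows.
counts = conn.execute("""
    SELECT COUNT(*),              -- every row, NULL city included
           COUNT(city),           -- rows where city is not NULL
           COUNT(DISTINCT city)   -- unique non-NULL cities
    FROM users
""").fetchone()

print(counts)
```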

Q18.
What is ER modeling, and how do you represent entities, relationships, and cardinality when designing a relational schema?

Junior

ER (Entity-Relationship) modeling is a conceptual design technique that describes data as entities, the attributes that describe them, and the relationships between them, before it becomes physical tables. It lets you reason about structure and cardinality independent of any specific database.

Entities:
- Real-world things you store data about (Customer, Order); each becomes a table with a primary key.
- Attributes become columns; the identifying attribute is the key.
Relationships: Associations between entities (a Customer places an Order), implemented with foreign keys.
Cardinality:
- One-to-one: rare, often merged or split for security/sparsity reasons.
- One-to-many: the "many" side holds a foreign key to the "one" side.
- Many-to-many: resolved with a junction (bridge) table holding both foreign keys.
Optionality: Whether participation is mandatory or optional maps to NOT NULL vs nullable foreign keys.

Q19.
What do `COALESCE` and `NULLIF` do, and how do they help handle `NULL` values?

Junior

Both are NULL-handling functions: COALESCE returns the first non-NULL argument from a list (a substitute/default), while NULLIF turns a specific value into NULL by comparing two expressions.

COALESCE(a, b, c, ...):
- Returns the first argument that is not NULL, or NULL if all are NULL.
- Great for defaults/fallbacks: COALESCE(discount, 0) or picking the first available column.
- Standard SQL and portable; prefer it over vendor-specific ISNULL or IFNULL.
NULLIF(a, b):
- Returns NULL if a = b, otherwise returns a.
- Common use: guard against divide-by-zero with x / NULLIF(y, 0) (turns a 0 denominator into NULL instead of erroring).
- Also handy to treat sentinel values (like empty string) as NULL: NULLIF(name, '').
They pair well together: Use NULLIF to create NULLs, then COALESCE to collapse them into a safe default.

```sql
SELECT COALESCE(nickname, first_name, 'Guest') AS display_name,
       revenue / NULLIF(orders, 0) AS avg_order_value
FROM users;
```

Q20.
Explain the logical order of a SQL query's execution. Why does `SELECT` happen after `WHERE` and `GROUP BY`, and how does this affect using aliases?

Mid

SQL is declarative: you write SELECT first, but the engine evaluates clauses in a different logical order, resolving the data source before it can project or sort columns. This ordering explains why aliases defined in SELECT aren't visible to WHERE or GROUP BY.

FROM / JOIN: Assemble and join the source tables into a working row set.
WHERE: Filter individual rows before any grouping happens.
GROUP BY: Collapse rows into groups.
HAVING: Filter the groups (can reference aggregates).
SELECT: Compute expressions and assign column aliases.
ORDER BY: Sort the final projected result.
LIMIT / OFFSET: Trim the sorted output.

Why SELECT is late: Filtering and grouping decide which rows exist; projection only makes sense after that set is fixed.
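Alias visibility: In standard SQL a SELECT alias can't be used in WHERE, GROUP BY, or HAVING (they run earlier), but it can be used in ORDER BY (which runs later). Some engines relax this: MySQL and PostgreSQL, for example, accept SELECT aliases in GROUP BY.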

Q21.
What is the `MERGE` statement (upsert), and when would you use it?

Mid

MERGE (an "upsert") combines insert, update, and optionally delete in one statement: it compares a source data set against a target table and applies the right action per row based on whether a match exists.

How it works: Join source to target ON a key, then define WHEN MATCHED (update/delete) and WHEN NOT MATCHED (insert) clauses.
When to use:
- Sync/ETL loads where incoming rows may be new or existing.
- Maintaining dimension tables or applying a batch of changes atomically.
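Cautions: Syntax and concurrency behavior vary by engine. SQL Server's MERGE has a history of known bugs, so many teams there avoid it; MySQL has no MERGE and uses INSERT ... ON DUPLICATE KEY UPDATE; PostgreSQL gained MERGE only in version 15, and INSERT ... ON CONFLICT remains its usual upsert idiom.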

```sql
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount   -- target column left unqualified; some engines (e.g. PostgreSQL) reject an alias here
WHEN NOT MATCHED THEN
  INSERT (id, amount) VALUES (s.id, s.amount);
```
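
For engines without MERGE, the INSERT ... ON CONFLICT style of upsert can be sketched as below. This uses Python's sqlite3 module as an assumed stand-in; SQLite accepts this syntax from version 3.24 onward. The `target` table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO target VALUES (1, 10)")

# "excluded" refers to the row that the INSERT tried to add.
upsert = """
    INSERT INTO target (id, amount) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
"""
# id 1 already exists (updated); id 2 is new (inserted).
conn.executemany(upsert, [(1, 99), (2, 20)])

rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
print(rows)
```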

Q22.
What factors should you consider when choosing data types for columns, such as `CHAR` vs `VARCHAR` or the right numeric type?

Mid

Choose the smallest, most precise type that faithfully represents the data and enforces its constraints: this saves storage, speeds comparisons and indexing, and prevents invalid values. Match the type to the data's real shape and semantics.

CHAR vs VARCHAR:
- CHAR(n) is fixed-width (padded), good only for truly constant-length codes (e.g. country codes).
- VARCHAR(n) stores variable length, better for most strings; set a sensible max for validation.
Numeric types:
- Use exact DECIMAL/NUMERIC for money to avoid rounding errors; never use FLOAT for currency.
- Pick integer width (SMALLINT, INT, BIGINT) to fit the expected range.
Dates and time: Use real date/time types (DATE, TIMESTAMP), and prefer timezone-aware types for global data.
General principles:
- Smaller types mean more rows per page and faster indexes.
- The type is a constraint: it rejects bad data at the storage layer.
- Keep join-key types identical across tables to avoid implicit conversions that kill index usage.
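
The rounding problem behind the "never use FLOAT for currency" advice is easy to reproduce. The sketch below uses plain Python, with `decimal.Decimal` standing in for a SQL DECIMAL/NUMERIC column (an analogy, not a database test).

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
float_is_exact = (float_total == 0.3)

# Exact decimal arithmetic, as DECIMAL/NUMERIC columns provide.
exact_total = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = (exact_total == Decimal("0.30"))

print(float_is_exact, decimal_is_exact)
```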

Q23.
What is the difference between `UNION` and `UNION ALL`, and when would you prefer one over the other?

Mid

Both combine the result sets of two queries with matching columns, but UNION removes duplicate rows while UNION ALL keeps every row. Prefer UNION ALL unless you actually need deduplication.

UNION deduplicates: It must sort or hash the combined rows to eliminate duplicates, which adds cost.
UNION ALL keeps everything: No dedup step, so it is faster and uses fewer resources: it just concatenates results.
Requirements for both: Same number of columns and compatible data types in the same order.
When to prefer which:
- Use UNION ALL when duplicates are impossible or acceptable (better performance).
- Use UNION only when you genuinely need a distinct result set.
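
A small sketch of the difference, using Python's sqlite3 module as an assumed stand-in engine. The two signup tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_signups (email TEXT);
    CREATE TABLE store_signups (email TEXT);
    INSERT INTO web_signups VALUES ('a@example.com'), ('b@example.com');
    INSERT INTO store_signups VALUES ('b@example.com'), ('c@example.com');
""")

# UNION removes the duplicate b@example.com.
union_rows = conn.execute("""
    SELECT email FROM web_signups
    UNION
    SELECT email FROM store_signups
    ORDER BY email
""").fetchall()

# UNION ALL keeps every row from both inputs.
union_all_rows = conn.execute("""
    SELECT email FROM web_signups
    UNION ALL
    SELECT email FROM store_signups
    ORDER BY email
""").fetchall()

print(len(union_rows), len(union_all_rows))
```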

Q24.
When would you use a `CROSS JOIN`, and what are the performance risks?

Mid

A CROSS JOIN produces the Cartesian product of two tables (every combination of rows), so you use it deliberately when you actually want all pairings, and you must watch out for the row-count explosion.

Legitimate uses:
- Generating combinations: e.g., every product paired with every size, or building a matrix of options.
- Creating dense date/number series by cross joining against a generated set.
- Test-data or scaffolding queries where all permutations are intentional.
Performance risks:
- Output size is rows(A) × rows(B): two 10,000-row tables yield 100 million rows.
- An accidental CROSS JOIN (a missing join condition) is a common cause of runaway, slow queries.
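
The rows(A) × rows(B) growth can be checked on a tiny scale with Python's sqlite3 module (an assumed stand-in engine). The `products` and `sizes` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT);
    CREATE TABLE sizes (label TEXT);
    INSERT INTO products VALUES ('shirt'), ('hoodie'), ('cap');
    INSERT INTO sizes VALUES ('S'), ('M');
""")

# Every product paired with every size: 3 x 2 rows.
combos = conn.execute("""
    SELECT p.name, s.label
    FROM products p
    CROSS JOIN sizes s
    ORDER BY p.name, s.label
""").fetchall()

print(len(combos), combos[0])
```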

Q25.
What is the difference between the `INTERSECT` and `EXCEPT` set operators, and how do they differ from a `JOIN`?

Mid

Both are set operators that combine two result sets row-wise (comparing whole rows), whereas a JOIN combines tables column-wise by matching a condition. INTERSECT returns rows in both queries; EXCEPT returns rows in the first but not the second.

INTERSECT: Returns distinct rows present in both result sets; both queries must have compatible column counts and types.
EXCEPT (called MINUS in Oracle): Returns distinct rows from the first query that do not appear in the second; operand order matters, so A EXCEPT B differs from B EXCEPT A.
Row-wise vs column-wise:
- Set operators compare entire rows and are duplicate-eliminating by default (add ALL to keep duplicates, where the engine supports INTERSECT ALL / EXCEPT ALL).
- A JOIN stitches columns from two tables together based on a predicate and can multiply rows (many-to-many).
Practical note: an anti-join written with EXCEPT compares all columns, whereas a LEFT JOIN ... IS NULL matches only on join keys.
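
Both operators, and the effect of swapping the operands of EXCEPT, can be sketched with Python's sqlite3 module (an assumed stand-in engine). Tables `a` and `b` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")

def rows(sql):
    """Run a query and return all rows (illustrative helper)."""
    return conn.execute(sql).fetchall()

in_both = rows("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")  # duplicates collapse
only_in_a = rows("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")
only_in_b = rows("SELECT x FROM b EXCEPT SELECT x FROM a ORDER BY x")   # operands swapped

print(in_both, only_in_a, only_in_b)
```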

Q26.
What is the difference between the `ON` and `USING` clauses in a `JOIN`, and what is a `NATURAL JOIN`?

Mid

ON specifies an arbitrary join predicate; USING is shorthand when both tables share identically named columns and merges them into one; NATURAL JOIN auto-joins on all commonly named columns with no explicit condition.

ON:
- Most flexible: any boolean expression, columns can have different names, supports inequalities.
- Both join columns remain in the output (e.g. a.id and b.id).
USING:
- Requires the column name to exist in both tables; equates them by equality only.
- Collapses the shared column into a single output column (unqualified).
NATURAL JOIN:
- Implicitly joins on every column with a matching name in both tables.
- Risky in practice: adding a coincidentally-named column later (like created_at) silently changes join behavior, so many teams avoid it.

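```sql
-- Same matches, provided customer_id is the only column name the tables share
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id;  -- keeps both customer_id columns
SELECT * FROM orders JOIN customers USING (customer_id);                    -- one merged customer_id column
SELECT * FROM orders NATURAL JOIN customers;  -- joins on all shared column names
```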

Q27.
What is the difference between a primary key and a unique key, can a table have multiple unique keys, and can a unique key contain a `NULL` value?

Mid

A primary key uniquely identifies each row and disallows NULLs (there can be only one per table); a unique key also enforces uniqueness but permits NULLs, and a table can have many of them. Both are typically backed by an index.

Primary key: At most one per table; NOT NULL is implicit; conceptually the row's main identity.
Unique key:
- Multiple allowed per table (e.g. unique email and unique username).
- Enforces a secondary business constraint beyond the PK.
NULLs in a unique key:
- Allowed, because NULL is not equal to NULL, so uniqueness isn't violated.
- Vendor behavior differs: most (PostgreSQL, Oracle) permit multiple NULLs; SQL Server allows only one NULL per single-column unique index.
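
The multiple-NULLs behavior can be observed in SQLite, which follows the same rule as PostgreSQL here. The sketch uses Python's sqlite3 module; the `users` table is hypothetical, and other engines (notably SQL Server) would behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# Two NULL emails: accepted, because NULL is never equal to NULL.
conn.execute("INSERT INTO users (email) VALUES (NULL)")
conn.execute("INSERT INTO users (email) VALUES (NULL)")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    # A second identical non-NULL value violates the UNIQUE constraint.
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]
print(null_count, duplicate_rejected)
```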

Q28.
What are 'Cascading Actions' like `ON DELETE CASCADE` in foreign keys, and what are the risks of using them in a complex schema?

Mid

Cascading actions tell a foreign key to automatically propagate changes from a parent row to its child rows: ON DELETE CASCADE deletes children when the parent is deleted, and ON UPDATE CASCADE updates child keys when the parent key changes. They keep referential integrity without manual cleanup, but hide destructive effects.

Available actions: CASCADE (propagate), SET NULL, SET DEFAULT, RESTRICT/NO ACTION (block the operation).
Benefits: Guarantees no orphaned child rows and reduces boilerplate cleanup code.
Risks in complex schemas:
- Chained cascades: one delete can silently wipe many tables across several levels.
- Hard to audit: the destructive impact isn't visible in the DELETE statement itself.
- Performance: a single delete can lock and rewrite huge numbers of child rows.
- Conflicts: multiple cascade paths to the same table can error (notably in SQL Server) or interact with triggers unexpectedly.
Rule of thumb: use for tightly-owned children (e.g. order lines under an order); prefer RESTRICT or soft deletes for shared or business-critical data.

Q29.
What are the trade-offs of using a `UUID` or Auto-increment ID (Surrogate) versus a Social Security Number or Email (Natural) as a Primary Key?

Mid

A surrogate key (auto-increment or UUID) is a system-generated, meaningless identifier, while a natural key (SSN, email) carries business meaning. Surrogates are generally preferred because natural keys tend to change and leak sensitive data, though natural keys can simplify certain lookups.

Surrogate keys (advantages):
- Immutable: never need to change, so foreign keys stay stable.
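- Compact (for integer IDs) and uniform, and don't expose real-world data; UUIDs are wider (16 bytes) and random ones can fragment B-tree indexes, but they can be generated without a central sequence.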
- Not tied to business rules that may evolve.
Surrogate keys (drawbacks):
- Add a meaningless column; still need a unique constraint on the natural attribute to prevent duplicates.
- Require a join to get the human-meaningful value.
Natural keys (drawbacks):
- Mutable: an email or SSN can change or be reassigned, cascading through FKs.
- Privacy/security risk: SSNs shouldn't be scattered as keys.
- Often wider and slower as index/FK targets.
Common practice: use a surrogate PK and enforce the natural attribute with a separate UNIQUE constraint.

Q30.
What is referential integrity, and can you explain the different On Delete actions like `CASCADE`, `SET NULL`, and `RESTRICT`?

Mid

Referential integrity is the rule that a foreign key value must either match an existing key in the referenced (parent) table or be NULL: it prevents orphaned rows that point to parents that don't exist. ON DELETE actions define what happens to child rows when the parent they reference is deleted.

Referential integrity:
- Enforced by a FOREIGN KEY constraint linking a child column to a parent's primary/unique key.
- Blocks inserts of non-existent references and controls deletes/updates of parents.
CASCADE: Deleting the parent automatically deletes the matching child rows: keeps things clean but can cascade further than expected.
SET NULL: Child's foreign key is set to NULL, preserving the child row but severing the link (the column must be nullable).
RESTRICT / NO ACTION: Prevents deleting the parent while children exist; the delete errors out. RESTRICT checks immediately and cannot be deferred; NO ACTION checks at the end of the statement, or at commit if the constraint is deferred.
SET DEFAULT also exists: sets the FK to its column default.

Q31.
What is a candidate key, and how does it relate to primary keys and composite keys?

Mid

A candidate key is any minimal set of columns that uniquely identifies each row in a table. The primary key is the one candidate key you choose as the main identifier, and a composite key is simply a key made of more than one column.

Candidate key:
- Uniquely identifies rows and is minimal (no column can be removed without losing uniqueness).
- A table can have several: e.g. a user's id, email, and username may each be candidate keys.
Primary key:
- The single candidate key chosen as the row identifier; implies UNIQUE plus NOT NULL.
- The remaining candidate keys are called alternate keys (often enforced with UNIQUE constraints).
Composite key: A candidate/primary key spanning multiple columns because no single column is unique (e.g. (order_id, product_id) in a line-items table).

Q32.
What are the trade-offs between using a `Subquery` and a `Common Table Expression (CTE)`? When would you prefer a `Recursive CTE`?

Mid

A subquery is an inline query nested in another statement; a CTE (WITH) is a named, readable result defined up front. They are usually equivalent in performance, so the choice is mostly about readability and reuse; a recursive CTE is for hierarchical or iterative data a plain subquery cannot express.

Subquery:
- Compact for one-off logic embedded in a WHERE, SELECT, or FROM.
- Hard to read when deeply nested, and it cannot be referenced twice: reusing it means repeating the whole subquery.
CTE:
- Named and defined at the top, so complex queries read top-to-bottom.
- Can be referenced multiple times in the same statement.
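- Caveat: in some engines a CTE is an optimization fence (materialized), which can hurt; in others it's inlined. PostgreSQL, for example, always materialized CTEs before version 12 and now inlines them when they are non-recursive, side-effect-free, and referenced once.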
Prefer a Recursive CTE when data references itself:
- Hierarchies (employee/manager, folders), graph traversal, or generating number/date sequences.
- It has an anchor member plus a recursive member unioned together, iterating until no new rows appear.
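
The anchor-plus-recursive-member structure might look like the sketch below, which walks an employee/manager hierarchy. It uses Python's sqlite3 module as an assumed stand-in engine; the `employees` rows and the `reports` CTE name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ava', NULL), (2, 'Ben', 1), (3, 'Cal', 2), (4, 'Dee', 1);
""")

chain = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        -- Anchor member: the root of the hierarchy (no manager).
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: employees whose manager is already in the result.
        SELECT e.id, e.name, r.depth + 1
        FROM employees e
        JOIN reports r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth, name
""").fetchall()

print(chain)
```

The recursion stops on its own once a pass adds no new rows, which here happens after the deepest employee (Cal) is reached.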

Q33.
What is the difference between `IN` and `EXISTS`, and which is generally more efficient?

Mid

Both test membership against a set, but IN compares a value against a list of returned values, while EXISTS checks whether a subquery returns any row at all. Modern optimizers often rewrite them into the same plan, so "which is faster" depends on data and NULL handling more than the keyword.

IN evaluates the full value list:
- Good when the subquery returns a small, distinct set.
- NULL trap: NOT IN with a NULL in the list returns no rows, because comparisons become UNKNOWN.
EXISTS short-circuits on the first match:
- Often better for large or correlated subqueries: it stops as soon as one row qualifies.
- NULL-safe: NOT EXISTS behaves predictably where NOT IN does not.
Rule of thumb: small in-memory list, use IN; correlated existence check or NULL concerns, prefer EXISTS.
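
The NOT IN trap is easiest to believe when seen. The sketch below uses Python's sqlite3 module as an assumed stand-in engine; the tables are hypothetical, and one order deliberately has a NULL `customer_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1), (11, NULL);  -- note the NULL
""")

# NOT IN: for Bo, "2 NOT IN (1, NULL)" is UNKNOWN, so the row is filtered out.
not_in_rows = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

# NOT EXISTS: only asks whether a matching order row exists.
not_exists_rows = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

print(not_in_rows, not_exists_rows)
```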

Q34.
What is the difference between a Correlated Subquery and a Non-correlated Subquery?

Mid

A non-correlated subquery is self-contained and runs once, independent of the outer query; a correlated subquery references outer-query columns and is conceptually evaluated once per outer row.

Non-correlated subquery:
- Can be executed on its own, produces a fixed result, and is computed a single time.
- Example: WHERE salary > (SELECT AVG(salary) FROM emp).
Correlated subquery:
- References an outer column, so it cannot run standalone and re-evaluates per outer row.
- Example: WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept).
Practical impact: Non-correlated is typically cheaper; correlated is more expressive but risks row-by-row cost.
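
The two example predicates above can be run against a small table to see how their results differ. This sketch uses Python's sqlite3 module as an assumed stand-in engine; the `emp` rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES
        ('Ava', 'eng', 120), ('Ben', 'eng', 100),
        ('Cal', 'ops', 60),  ('Dee', 'ops', 50);
""")

# Non-correlated: the inner average is one fixed value for the whole query.
above_company_avg = conn.execute("""
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
""").fetchall()

# Correlated: the inner average depends on the outer row's department.
above_dept_avg = conn.execute("""
    SELECT name FROM emp e1
    WHERE salary > (SELECT AVG(salary) FROM emp e2 WHERE e2.dept = e1.dept)
    ORDER BY name
""").fetchall()

print(above_company_avg, above_dept_avg)
```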

Q35.
What do the `ANY` and `ALL` operators do in a subquery comparison, and can you give an example?

Mid

ANY and ALL compare a value against a set returned by a subquery: ANY is true if the comparison holds for at least one value, while ALL is true only if it holds for every value.

ANY (synonym SOME):
- x > ANY(...) means greater than the minimum returned value.
- = ANY(...) is equivalent to IN.
ALL:
- x > ALL(...) means greater than the maximum returned value.
- <> ALL(...) is equivalent to NOT IN.
NULL/empty caveat: ALL against an empty set is true and ANY against an empty set is false; a NULL in the set can turn either result into UNKNOWN, the same trap as NOT IN.

sql

-- Employees earning more than every employee in dept 10
SELECT name FROM emp
WHERE salary > ALL (SELECT salary FROM emp WHERE dept = 10);

Q36.
What is the difference between a scalar subquery and a derived table?

Mid

A scalar subquery returns exactly one row and one column (a single value), whereas a derived table is a subquery in the FROM clause that returns a full result set treated as a temporary table.

Scalar subquery:
- Used where a single value is expected: in SELECT, WHERE, or an expression.
- Errors if it returns more than one row; returns NULL if it returns none.
- Example: SELECT name, (SELECT MAX(amt) FROM orders o WHERE o.cid = c.id) FROM cust c.
Derived table:
- Appears in FROM, must have an alias, and can return many rows and columns.
- You join to it and select from its columns like any table.
- Example: FROM (SELECT cid, SUM(amt) t FROM orders GROUP BY cid) d.
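
The two example queries above can be tried with a few rows. This sketch uses Python's built-in sqlite3 module; the cust and orders data are hypothetical, with one customer who has no orders to show the NULL case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cust (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (cid INTEGER, amt INTEGER);
    INSERT INTO cust VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 50), (1, 70), (2, 20);
""")

# Scalar subquery: one value per outer row; NULL for Cy, who has no orders.
scalar = conn.execute("""
    SELECT name, (SELECT MAX(amt) FROM orders o WHERE o.cid = c.id)
    FROM cust c ORDER BY name
""").fetchall()

# Derived table: a full result set in FROM, aliased as d and joined like a table.
derived = conn.execute("""
    SELECT c.name, d.t
    FROM cust c
    JOIN (SELECT cid, SUM(amt) AS t FROM orders GROUP BY cid) d ON d.cid = c.id
    ORDER BY c.name
""").fetchall()

print(scalar, derived)
```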

Q37.
What is the difference between a temporary table and a `CTE`, and when would you choose one over the other?

Mid

A CTE is a named query that exists only within the single statement that defines it; a temporary table is a physical object that persists for the session (or transaction) and stores actual data. Choose a CTE for readability within one query, and a temp table when you must reuse a materialized result across multiple statements.

CTE:
- Scoped to one statement; disappears immediately after.
- Often inlined by the optimizer (no guaranteed materialization), and cannot be indexed.
- Best for clarity, decomposing complex logic, and recursion.
Temporary table:
- Persists across many statements in the session; explicitly created and dropped.
- Physically stores rows, can carry indexes and statistics, aiding the optimizer.
- Best when an expensive intermediate result is reused several times or is very large.
Rule of thumb: Single query and readability, use a CTE; reuse across steps or need indexes, use a temp table.
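
The scoping difference shows up as soon as a second statement runs. Below is a minimal sketch with Python's built-in sqlite3 module; the table and index names are illustrative, and temp-table syntax varies between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (cid INTEGER, amt INTEGER);
    INSERT INTO orders VALUES (1, 50), (1, 70), (2, 20);

    -- Temp table: materialized once, indexable, lives for the session
    CREATE TEMP TABLE totals AS
        SELECT cid, SUM(amt) AS total FROM orders GROUP BY cid;
    CREATE INDEX idx_totals_cid ON totals (cid);
""")

# CTE: visible only inside this one statement.
cte_rows = conn.execute("""
    WITH totals_cte AS (SELECT cid, SUM(amt) AS total FROM orders GROUP BY cid)
    SELECT cid, total FROM totals_cte ORDER BY cid
""").fetchall()

# A later statement cannot see the CTE name.
try:
    conn.execute("SELECT * FROM totals_cte")
    cte_visible_later = True
except sqlite3.OperationalError:
    cte_visible_later = False

# The temp table is reused by several independent statements.
big_spenders = conn.execute(
    "SELECT cid FROM totals WHERE total > 100 ORDER BY cid").fetchall()
total_count = conn.execute("SELECT COUNT(*) FROM totals").fetchone()[0]

print(cte_rows, cte_visible_later, big_spenders, total_count)
```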

Q38.
Explain the difference between `2NF` and `3NF`. Why might a high-scale production system choose to denormalize certain tables?

Mid

Both build on 1NF, but they eliminate different kinds of redundancy: 2NF removes partial dependencies on part of a composite key, while 3NF removes transitive dependencies on non-key columns. Denormalization deliberately reverses this to trade write-time integrity for read-time speed.

2NF (Second Normal Form):
- Must be in 1NF and every non-key column depends on the whole primary key, not just part of it.
- Only relevant with composite keys: a partial dependency is a non-key attribute determined by a subset of the key.
3NF (Third Normal Form):
- Must be in 2NF and have no transitive dependency: a non-key column determined by another non-key column.
- Example: storing zip_code and city where zip determines city violates 3NF.
Why denormalize at scale:
- Fully normalized reads require many joins; at high query volume those joins become the bottleneck.
- Duplicating columns or precomputing aggregates cuts joins and speeds reads, especially for read-heavy or analytics workloads.
- Cost: writes must keep copies consistent (triggers, app logic, or accepted eventual consistency).

Q39.
What is database normalization and what are the specific goals of `1NF`, `2NF`, and `3NF`?

Mid

Normalization is the process of organizing columns and tables to eliminate redundancy and update anomalies by ensuring every fact is stored in exactly one place. The normal forms are progressive: each assumes the previous is met.

1NF (First Normal Form):
- Every column holds a single atomic value (no repeating groups, no arrays in a cell).
- Each row is unique, identified by a primary key.
2NF (Second Normal Form):
- In 1NF plus no partial dependency: every non-key column depends on the entire primary key.
- Matters only when the key is composite.
3NF (Third Normal Form):
- In 2NF plus no transitive dependency: non-key columns must not depend on other non-key columns.
- Informally: every non-key attribute depends on "the key, the whole key, and nothing but the key."
Overall goal: Prevent insert, update, and delete anomalies so a single change touches a single row.
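
As a concrete illustration of the 3NF goal, the sketch below moves a zip-to-city dependency into its own table so that renaming a city touches one row. It assumes Python's built-in sqlite3 module; the schema and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- city depends on zip_code (a non-key column), so it gets its own table
    CREATE TABLE zip_codes (zip_code TEXT PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE customers (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        zip_code TEXT NOT NULL REFERENCES zip_codes(zip_code)
    );
    INSERT INTO zip_codes VALUES ('10001', 'New York');
    INSERT INTO customers VALUES (1, 'Ann', '10001'), (2, 'Bob', '10001');
""")

# The fact "10001 is in this city" is stored once, so one row changes.
cur = conn.execute("UPDATE zip_codes SET city = 'NYC' WHERE zip_code = '10001'")
rows_touched = cur.rowcount

cities = conn.execute("""
    SELECT c.name, z.city
    FROM customers c JOIN zip_codes z USING (zip_code)
    ORDER BY c.name
""").fetchall()

print(rows_touched, cities)
```

Had city been stored on every customer row, the same change would need one update per customer, and missing one would leave the data inconsistent.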

Q40.
What is the difference between a standard View and a Materialized View? When is the extra storage cost of a Materialized View justified?

Mid

A standard view is a saved query that runs fresh every time you reference it, storing no data; a materialized view physically stores the query's result set on disk and must be refreshed to reflect underlying changes. You pay storage and staleness for much faster reads.

Standard View:
- Just a stored SELECT; always returns current data because it re-executes on access.
- No storage cost, but no performance gain: an expensive query stays expensive.
Materialized View:
- Results are precomputed and persisted, so reads are fast.
- Data can be stale until a refresh (e.g. REFRESH MATERIALIZED VIEW in PostgreSQL), which may be on-demand, scheduled, or incremental depending on the engine.
When the storage cost is justified:
- Expensive aggregations or joins that are read far more often than the base data changes.
- Dashboards and reporting where slight staleness is acceptable in exchange for low latency.
- Not justified when data must always be current or writes are frequent relative to reads.

Q41.
What is an updatable `view`, and what conditions must a `view` meet to allow `inserts` and `updates` through it?

Mid

An updatable view is a view through which you can run INSERT, UPDATE, or DELETE, and the changes propagate to the underlying base table. It's only allowed when the engine can unambiguously map each affected view row back to exactly one base-table row.

Core requirement (unambiguous row mapping): Each row in the view must correspond to exactly one row in one base table.
Conditions that typically must hold:
- Based on a single table (no joins across multiple tables in the general case).
- No aggregate functions, GROUP BY, HAVING, or DISTINCT.
- No UNION, set operators, or window functions.
- Columns must be simple references, not computed expressions (you can't insert into a derived column).
- Any NOT NULL base columns without defaults must be exposed so inserts can supply them.
Guarding row visibility (WITH CHECK OPTION): Prevents inserting/updating rows that would fall outside the view's WHERE filter (and vanish from the view).
Escape hatch for complex views: Use an INSTEAD OF trigger to define custom write logic when a view isn't naturally updatable.
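
The INSTEAD OF escape hatch can be sketched with Python's built-in sqlite3 module. SQLite is a convenient test bed here because its views are read-only unless such a trigger exists; other engines accept direct writes to simple views. The users table and active_users view are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        active INTEGER NOT NULL DEFAULT 1
    );
    CREATE VIEW active_users AS
        SELECT id, name FROM users WHERE active = 1;

    -- Custom write logic: route inserts on the view to the base table
    CREATE TRIGGER active_users_insert
    INSTEAD OF INSERT ON active_users
    BEGIN
        INSERT INTO users (id, name, active) VALUES (NEW.id, NEW.name, 1);
    END;
""")

conn.execute("INSERT INTO active_users (id, name) VALUES (1, 'Ann')")
base_rows = conn.execute("SELECT id, name, active FROM users").fetchall()

print(base_rows)
```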

Q42.
What is a `trigger` in a relational database, and what are some appropriate use cases and pitfalls of using them?

Mid

A trigger is procedural code the database automatically executes in response to a data event (INSERT, UPDATE, DELETE) on a table. It's powerful for enforcing invariants close to the data, but its implicit, hidden execution makes it easy to misuse.

Common types: BEFORE/AFTER for row/statement events; INSTEAD OF to make views writable.
Appropriate use cases:
- Audit logging: record who changed what and when.
- Enforcing complex integrity rules that constraints can't express.
- Maintaining derived/denormalized data or history tables automatically.
Pitfalls:
- Hidden behavior: logic fires invisibly, making bugs hard to trace.
- Performance cost: runs inside the DML transaction, adding latency to every write.
- Cascading/recursive triggers can fire each other and cause hard-to-follow chains.
- Row-vs-statement mismatch: a set-based update fires once per statement or per row depending on definition, easy to get wrong.
Guideline: prefer declarative constraints (CHECK, FOREIGN KEY) first; reach for triggers only when the rule truly can't be expressed declaratively.
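
An audit-logging trigger, the first use case above, might look like this. The sketch uses Python's built-in sqlite3 module with invented table names; trigger syntax differs across engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE accounts_audit (
        account_id INTEGER, old_balance INTEGER, new_balance INTEGER
    );

    -- Row-level trigger: fires once for every updated row
    CREATE TRIGGER accounts_audit_update
    AFTER UPDATE OF balance ON accounts
    FOR EACH ROW
    BEGIN
        INSERT INTO accounts_audit VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO accounts VALUES (1, 100), (2, 200);
""")

# One set-based statement, two rows changed, two audit entries written.
conn.execute("UPDATE accounts SET balance = balance + 10")
audit = conn.execute(
    "SELECT * FROM accounts_audit ORDER BY account_id").fetchall()

print(audit)
```

The UPDATE statement itself gives no hint that a second table is being written, which is the hidden-behavior pitfall in practice.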

Q43.
What is the difference between a `stored procedure` and a `user-defined function` in SQL?

Mid

Both are stored, reusable database code, but a function is meant to compute and return a value for use inside queries, while a procedure performs actions and is invoked as a standalone command.

Return behavior: A UDF must return a value (scalar or table); a procedure need not return anything, but can hand back values through output parameters and, in many engines, one or more result sets.
Where they can be used: A UDF can be embedded in SELECT, WHERE, JOIN; a procedure is called with CALL/EXEC and cannot appear inside a query.
Side effects: Functions are generally expected to be side-effect free (some engines, such as SQL Server, forbid DML against tables inside a function); procedures can perform INSERT/UPDATE/DELETE, manage transactions, and run DDL.
Transactions: Procedures can COMMIT/ROLLBACK; functions typically cannot control transactions.
Rule of thumb: use a function to compute a value inside a query; use a procedure to perform a unit of work.

Q44.
What is a `cursor`, and why is it generally discouraged in favor of set-based operations?

Mid

A cursor is a database object that lets you process a result set one row at a time, imperatively. It's discouraged because it forces row-by-row (RBAR) iteration, which is far slower than letting the engine operate on the whole set at once.

How it works: You DECLARE, OPEN, FETCH each row in a loop, then CLOSE and DEALLOCATE.
Why set-based is preferred:
- The optimizer can parallelize, use indexes, and batch I/O across the whole set.
- Cursors add per-row overhead and often hold locks/resources for the loop's duration.
- Declarative set logic is shorter and expresses intent, not mechanics.
When a cursor is acceptable:
- Genuinely sequential, order-dependent work (e.g. running admin/maintenance tasks per object).
- Small result sets where clarity outweighs performance.

sql

-- Cursor (row-by-row): slow
-- ... loop and UPDATE one row at a time ...

-- Set-based equivalent: one statement, optimizer handles it
UPDATE employees
SET bonus = salary * 0.10
WHERE department = 'Sales';

Q45.
What is the physical difference between `clustered` and `non-clustered` indexes in how data is stored on disk, and why can you only have one `clustered index`?

Mid

A clustered index defines the physical order of the table's rows on disk: the table's data pages ARE the leaf level of that index. A non-clustered index is a separate structure that stores the key plus a pointer back to the row. You can have only one clustered index because data can be physically sorted in exactly one order.

Clustered index:
- The B-tree's leaf pages contain the actual full rows, ordered by the index key.
- There's no separate copy of the data: the table is the index (often called a clustered/index-organized table).
Non-clustered index:
- A separate B-tree whose leaves hold the index key plus a row locator (the clustered key, or a physical row ID / heap RID).
- Reading extra columns requires a lookup back into the table (a key/bookmark lookup).
- You can have many per table.
Why only one clustered index:
- Sorting the physical rows two different ways at once is impossible, so only one order can be the physical one.
- Choose the clustered key carefully (often the primary key): it affects range scans and is embedded in every non-clustered index.

Q46.
What is a composite (multi-column) index, and can you explain the leftmost prefix rule and why column order matters?

Mid

A composite index is a single index built on two or more columns, sorted by the first column, then the second within ties, and so on. The leftmost prefix rule says the index can only be used for a seek when your query filters on a contiguous prefix of those columns starting from the first, which is exactly why column order matters.

How the sort works: An index on (a, b, c) is like a phone book sorted by last name, then first name: ordered by a, then b within each a, then c.
Leftmost prefix rule:
- Usable for filters on a, (a, b), or (a, b, c).
- Not usable (for a seek) for a query filtering only on b, or only c, because there's no leading anchor.
- A gap breaks it: filtering a and c (skipping b) can seek on a but not efficiently use c.
Why order matters:
- Put the most selective and most frequently filtered column first (subject to your query patterns).
- A range predicate on an early column stops later columns from being used for seeking beyond that point.
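
The rule can be observed in a query plan. The sketch below assumes Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the table t, the index idx_abc, and the plan helper are illustrative, and the plan wording differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, payload TEXT);
    CREATE INDEX idx_abc ON t (a, b, c);
""")

def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

plan_a = plan("SELECT * FROM t WHERE a = 1")             # prefix (a)
plan_ab = plan("SELECT * FROM t WHERE a = 1 AND b = 2")  # prefix (a, b)
plan_b = plan("SELECT * FROM t WHERE b = 2")             # no leading column

print(plan_a)
print(plan_ab)
print(plan_b)
```

The first two plans search idx_abc, while the b-only query falls back to scanning the table because it has no leading anchor.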

Q47.
What is a 'covering index,' and how does it enable an 'index-only scan'? Why is this significantly faster than a standard `index seek`?

Mid

A covering index is one that contains every column a query needs (both filter and output columns), so the engine can answer the query entirely from the index. This enables an index-only scan: the database never touches the base table, which is what makes it fast.

What 'covering' means:
- All columns referenced in SELECT, WHERE, JOIN, ORDER BY are present in the index.
- Extra output-only columns can ride along via INCLUDE (stored at the leaf, not in the sorted key).
Index-only scan: The engine reads the index leaves and returns results directly, skipping the base table entirely.
Why it's faster than a normal seek:
- A standard non-clustered seek finds matching keys, then does a key/bookmark lookup into the table for the other columns, one extra random I/O per row.
- Covering eliminates those lookups: fewer I/Os and less random access, especially costly when many rows match.
Trade-off: Wider indexes take more storage and slow writes, so cover targeted, hot queries rather than everything.
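
Whether an index covers a query is visible in the plan. This sketch uses Python's built-in sqlite3 module; SQLite has no INCLUDE clause, so the extra column is placed in the index key instead. The orders table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER,
        total INTEGER, notes TEXT
    );
    CREATE INDEX idx_cust_total ON orders (customer_id, total);
""")

def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

# Every referenced column is in the index: answered from the index alone.
covered = plan("SELECT total FROM orders WHERE customer_id = 7")

# notes is not in the index: each match needs a lookup into the table.
not_covered = plan("SELECT total, notes FROM orders WHERE customer_id = 7")

print(covered)
print(not_covered)
```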

Q48.
When can adding an index actually hurt performance? Explain the trade-off between read speed and write throughput.

Mid

Indexes speed up reads but every one must be maintained on every write, so adding them trades faster SELECT for slower INSERT/UPDATE/DELETE, plus more storage and memory pressure.

Write amplification:
- Each index is a separate structure that must be updated in sync with the table, so N indexes mean N extra maintenance operations per row change.
- Updating an indexed column can cause page splits and node rebalancing.
Storage and memory cost: Indexes take disk space and compete for the buffer cache; too many can evict useful data pages.
Optimizer overhead and redundancy:
- More candidate indexes means more plans to evaluate, and duplicate/overlapping indexes add cost with no benefit.
- Low-selectivity indexes (e.g. a boolean flag) are often ignored yet still maintained.
When it hurts most:
- Write-heavy or bulk-load tables, where read savings don't justify write cost.
- Rule of thumb: index for your real query patterns, then drop unused indexes.

Q49.
Explain the 'N+1 Query Problem' from a database perspective. Why does it happen, and how can it be mitigated?

Mid

The N+1 problem is when code runs one query to fetch a list of parent rows, then fires one additional query per row to load related data: 1 + N round trips instead of a single join. It's usually caused by ORM lazy loading in a loop.

Why it happens:
- ORMs load associations lazily, so accessing order.customer inside a loop silently triggers a query each iteration.
- The cost is dominated by round-trip latency, not the queries themselves: 1000 tiny queries are far slower than one.
How to mitigate:
- Eager loading: fetch parents and children together (JOIN or the ORM's JOIN FETCH/includes).
- Batch loading: collect the IDs and issue one WHERE id IN (...) query (the dataloader pattern).
- Detection: watch query logs or APM for a burst of identical queries differing only by a key.
Caveat: eager loading everything can over-fetch, so load only the associations you actually use.
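
The pattern and its fix can be counted directly. The sketch below stands in for an ORM with plain Python and the built-in sqlite3 module; the run helper, the stats counter, and the tables are all made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

stats = {"queries": 0}

def run(sql, params=()):
    """Execute a query and count it as one round trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()

# N+1: one query for the list, then one more per order inside the loop.
orders = run("SELECT id, customer_id FROM orders ORDER BY id")
lazy = [
    (oid, run("SELECT name FROM customers WHERE id = ?", (cid,))[0][0])
    for oid, cid in orders
]
n_plus_one_queries = stats["queries"]

# Eager loading: a single JOIN returns the same data.
stats["queries"] = 0
eager = run("""
    SELECT o.id, c.name
    FROM orders o JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
""")
join_queries = stats["queries"]

print(n_plus_one_queries, join_queries)
```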

Q50.
How do you read an `EXPLAIN` plan, and what is the difference between an Index Seek and an Index Scan?

Mid

An EXPLAIN plan shows the tree of operators the optimizer chose to run a query, read bottom-up (or inner-to-outer). The key thing to spot is the access method: an Index Seek navigates directly to the rows it needs, while an Index Scan reads the whole index.

How to read it:
- Each node is an operation (scan, seek, join, sort, aggregate); children feed rows to their parent.
- Focus on access type, estimated vs actual rows, and the most expensive operators.
- Use EXPLAIN ANALYZE to get real timings and row counts, not just estimates.
Index Seek: Traverses the B-Tree to the matching key(s) and reads only relevant rows: efficient for selective predicates.
Index Scan: Reads every entry in the index (or a large chunk); acceptable when the query needs most rows, otherwise a red flag.
Watch for: a full Table Scan on a large table, a scan where you expected a seek (often a non-SARGable predicate or missing index), and big gaps between estimated and actual rows (stale statistics).

Q51.
Explain the difference between an index scan and an index seek.

Mid

Both use an index, but a seek jumps straight to the qualifying rows using the tree structure, whereas a scan reads through the index entries. A seek is targeted; a scan is exhaustive.

Index Seek:
- Navigates the B-Tree from root to the specific key or key range, touching only matching leaves.
- Efficient for selective (WHERE id = 42) or narrow-range predicates.
Index Scan:
- Reads all (or most) leaf entries in order, then filters.
- Chosen when the query returns a large fraction of rows, the predicate isn't SARGable, or statistics say a scan is cheaper.
Key trade-off:
- Selectivity decides: for few rows a seek wins; for most rows a scan avoids repeated tree traversals and random I/O.
- A scan isn't always bad, but an unexpected scan on a selective query usually points to a missing index or a function on the column.
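
The "function on the column" case is easy to reproduce. This sketch uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, where SEARCH roughly corresponds to a seek and SCAN to a scan; the users table and plan helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

# SARGable: the bare column is compared, so the B-tree can be navigated.
seek_plan = plan("SELECT name FROM users WHERE email = 'a@example.com'")

# Non-SARGable: wrapping the column in a function hides it from the index.
scan_plan = plan("SELECT name FROM users WHERE lower(email) = 'a@example.com'")

print(seek_plan)
print(scan_plan)
```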

Q52.
How do you use an `EXPLAIN` plan to diagnose a slow-running query?

Mid

Run EXPLAIN ANALYZE to see the actual execution plan with real timings, then look for the operators consuming the most time or rows and work out why the optimizer chose them.

Step 1: get the real plan: Prefer EXPLAIN ANALYZE (actual rows and time) over plain EXPLAIN (estimates only).
Step 2: find the hotspots:
- Identify the costliest nodes and full Table Scans on large tables.
- Compare estimated vs actual rows: a big mismatch means stale statistics or bad cardinality estimates.
Step 3: check specific culprits:
- Scans where a seek was expected (missing index or non-SARGable predicate).
- Expensive join strategies (nested loops over big inputs), and costly Sort or spill-to-disk operations.
Step 4: fix and re-measure: Add or adjust an index, rewrite the predicate, or update statistics, then re-run to confirm the plan changed.

Q53.
What is a partial (filtered) index, and when is it useful?

Mid

A partial (filtered) index only indexes rows that satisfy a WHERE predicate, so it's smaller and cheaper to maintain than a full index, and it's ideal when queries repeatedly target a well-defined subset of rows.

Definition: Built with a predicate: e.g. CREATE INDEX ... WHERE active = true, so only matching rows have entries.
Why it helps:
- Smaller index means less disk, faster scans, and cheaper inserts/updates for rows outside the predicate.
- The flip side: the optimizer can only use it when the query's predicate implies the index predicate, so every row the query needs is guaranteed to be in the index.
Good use cases:
- Indexing only "hot" rows (e.g. status = 'pending') when most rows are irrelevant.
- Enforcing conditional uniqueness (e.g. one active record per user via a partial unique index).
- Excluding many NULL or default-valued rows to shrink the index.
Caveat: the query must match the filter, or the planner falls back to a full index/scan.
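
Conditional uniqueness, the second use case above, can be sketched with a partial unique index. The example assumes Python's built-in sqlite3 module; the subscriptions table and index name are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        active  INTEGER NOT NULL
    );
    -- Only rows with active = 1 get index entries, so uniqueness
    -- is enforced among active rows only.
    CREATE UNIQUE INDEX one_active_per_user
        ON subscriptions (user_id) WHERE active = 1;

    -- Two inactive rows and one active row for the same user are fine.
    INSERT INTO subscriptions (user_id, active) VALUES (1, 0), (1, 0), (1, 1);
""")

try:
    conn.execute("INSERT INTO subscriptions (user_id, active) VALUES (1, 1)")
    second_active_allowed = True
except sqlite3.IntegrityError:
    second_active_allowed = False

print(second_active_allowed)
```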

Q54.
What is the difference between a full table scan and an `index seek`, and when might a full scan actually be preferable?

Mid

A full table scan reads every row, while an index seek navigates a B-tree to jump directly to the qualifying rows; a full scan is preferable when the query touches a large fraction of the table, because sequential reads beat many random index lookups.

Index seek:
- Uses the index structure to locate matching rows without reading the whole table.
- Great for highly selective predicates returning few rows.
- May require extra random I/O to fetch non-indexed columns (a lookup/bookmark).
Full table scan: Reads all pages sequentially, which is efficient per-row and cache-friendly.
When a full scan wins:
- Low selectivity: the query returns a large percentage of rows, so an index adds random-lookup overhead for little benefit.
- Small tables where the whole thing fits in a page or two.
- No usable index, or one that isn't selective (many duplicate values).
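Rule of thumb: the optimizer's cost model switches to a scan once a query is expected to touch more than a modest share of the table (often cited as roughly 5-20% of rows, though the exact point depends on the engine, row width, and storage).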

Q55.
How can caching query results help performance, and what are the risks of stale cached data?

Mid

Caching stores the result of an expensive query so repeated reads are served from fast memory instead of re-executing against the database, but it trades freshness for speed: cached data can become stale when the underlying rows change.

How it helps:
- Avoids repeated compute (joins, aggregations) and reduces DB load and latency.
- Layers: application cache, a dedicated store (Redis, Memcached), or DB-level result/buffer caches.
The staleness risk:
- After a write, the cache still returns old data until it's invalidated or expires.
- Worst case: users act on incorrect data (prices, balances, inventory).
Mitigation strategies:
- TTL expiry: cheap, but a window of staleness remains.
- Explicit invalidation on write (write-through or delete-on-update): fresher, but harder to get right.
- Match strategy to tolerance: cache read-mostly, staleness-tolerant data aggressively; avoid caching strongly consistent values.
Remember: cache invalidation is one of the genuinely hard problems; keep keys and dependencies simple.
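
The TTL and explicit-invalidation strategies can be sketched in a few lines of plain Python. QueryCache is a hypothetical in-process cache, not a real library, and the dict named db stands in for the database; a fake clock makes the expiry visible without sleeping.

```python
import time

class QueryCache:
    """Tiny result cache with TTL expiry and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]          # hit: may be stale
        value = loader()             # miss or expired: run the "query"
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        self._store.pop(key, None)

now = [0.0]                          # controllable clock for the demo
cache = QueryCache(ttl_seconds=60, clock=lambda: now[0])
db = {"price": 10}

first = cache.get("price", lambda: db["price"])      # loads from db
db["price"] = 12                                     # a write happens
stale = cache.get("price", lambda: db["price"])      # old value served
cache.invalidate("price")                            # delete-on-update
fresh = cache.get("price", lambda: db["price"])      # reloaded
db["price"] = 15
now[0] = 61.0                                        # TTL has elapsed
after_ttl = cache.get("price", lambda: db["price"])  # reloaded again

print(first, stale, fresh, after_ttl)
```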

Q56.
Explain the `ACID` properties and why they are important for a relational database.

Mid

ACID is the set of guarantees that make transactions reliable: Atomicity, Consistency, Isolation, and Durability. Together they ensure that even with concurrency and failures the database stays correct.

Atomicity: A transaction is all-or-nothing: on failure it fully rolls back, leaving no partial changes.
Consistency: A transaction moves the database from one valid state to another, preserving constraints (keys, checks, triggers).
Isolation: Concurrent transactions don't corrupt each other; at the strictest level (SERIALIZABLE) results match some serial ordering, and weaker isolation levels relax that guarantee in exchange for concurrency.
Durability: Once committed, changes survive crashes, typically via a write-ahead log flushed to disk.
Why it matters: Enables trustworthy multi-step operations (e.g. money transfers) without leaving inconsistent or lost data.
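
Atomicity and consistency can be seen in a failed money transfer. The sketch assumes Python's built-in sqlite3 module; the accounts table and transfer function are hypothetical, and the CHECK constraint plays the role of the business rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO accounts VALUES (1, 100), (2, 50);
""")

def transfer(src, dst, amount):
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst))
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, src))  # violates the CHECK when funds are short

try:
    transfer(1, 2, 500)
    failed = False
except sqlite3.IntegrityError:
    failed = True

balances = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(failed, balances)
```

The credit to account 2 had already been applied when the debit failed, yet the rollback leaves both balances exactly as they started.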

Q57.
When would you choose optimistic versus pessimistic locking for a high-concurrency web application?

Mid

Choose optimistic locking when write conflicts are rare and you want maximum concurrency; choose pessimistic locking when conflicts are frequent or a conflict is expensive to retry. Most high-concurrency web apps default to optimistic because holding DB locks across user think-time doesn't scale.

Optimistic locking:
- No lock held; you read a version/timestamp and check it hasn't changed at write time (UPDATE ... WHERE version = ?).
- Conflict causes the update to affect 0 rows, so the app retries or reports a conflict.
- Best when contention is low and transactions are short; avoids blocking.
Pessimistic locking:
- Acquire a lock up front (SELECT ... FOR UPDATE) so no one else can modify the row.
- Best when conflicts are common or retry is costly (e.g. inventory decrement under heavy demand).
- Risk: reduced concurrency and deadlocks; never hold locks across user interaction.
Deciding factor: Estimate conflict probability and retry cost: low conflict, optimistic; high conflict, pessimistic.
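
The optimistic version check can be sketched with Python's built-in sqlite3 module. The accounts table and update_balance helper are illustrative; a real application would re-read the row and retry or report the conflict when the function returns False.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER);
    INSERT INTO accounts VALUES (42, 100, 7);
""")

def update_balance(account_id, new_balance, expected_version):
    """Write only if the row still has the version we read earlier."""
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_balance, account_id, expected_version))
    return cur.rowcount == 1  # 0 rows affected means someone else won

# Two writers both read version 7; only the first write succeeds.
first_writer = update_balance(42, 150, 7)
second_writer = update_balance(42, 90, 7)

row = conn.execute(
    "SELECT balance, version FROM accounts WHERE id = 42").fetchone()
print(first_writer, second_writer, row)
```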

Q58.
Explain the difference between optimistic and pessimistic locking.

Mid

Both protect against concurrent modification, but they differ in when they check for conflict: pessimistic locking assumes conflict is likely and locks data up front; optimistic locking assumes conflict is rare and only verifies at commit time.

Pessimistic locking:
- Acquires a lock before reading/writing (e.g. SELECT ... FOR UPDATE), blocking others until commit.
- Best for high-contention or long critical sections; risks blocking and deadlocks.
Optimistic locking:
- No lock held; reads a version/timestamp, and on update checks it hasn't changed (UPDATE ... WHERE version = :v).
- If the row changed, the update affects 0 rows and the app retries. Best for low contention.
Trade-off: Pessimistic locking pays the locking cost on every access; optimistic locking pays only when a conflict occurs (via retries), so it scales better under low contention.

sql

-- Optimistic: retry if rowcount = 0
UPDATE accounts SET balance = 100, version = version + 1
WHERE id = 42 AND version = 7;

Q59.
What is a deadlock, how does a database engine detect them, and what strategies can you use in application code to prevent them?

Mid

A deadlock is a cycle where two or more transactions each hold a lock the other needs, so none can proceed. Engines detect the cycle and abort a victim; application code prevents them mainly by making lock acquisition consistent and short.

What it is: Txn A holds lock on row 1 and wants row 2; Txn B holds row 2 and wants row 1: circular wait.
How engines detect it:
- Maintain a wait-for graph and periodically check for cycles; on a cycle, roll back the cheapest victim with a deadlock error.
- Some also use a lock-timeout as a fallback.
Prevention strategies:
- Acquire locks in a consistent order everywhere (e.g. always by ascending id).
- Keep transactions short and touch fewer rows; commit promptly.
- Lock all needed rows up front with a single SELECT ... FOR UPDATE (in a consistent order) instead of acquiring them piecemeal across several statements.
- Retry with backoff: deadlocks are expected, so catch the error and re-run the transaction.
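
The consistent-ordering rule can be illustrated outside the database. In the sketch below, in-process threading.Lock objects stand in for row locks (an assumption made only for the demo), and two threads transfer in opposite directions, which is the classic deadlock setup.

```python
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}  # stand-ins for row locks
balances = {1: 100, 2: 100}

def transfer(src, dst, amount):
    # Always lock the lower id first, whatever the transfer direction,
    # so a circular wait cannot form.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        balances[src] -= amount
        balances[dst] += amount

def worker(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)

threads = [
    threading.Thread(target=worker, args=(1, 2)),
    threading.Thread(target=worker, args=(2, 1)),  # opposite direction
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)

finished = not any(t.is_alive() for t in threads)
print(finished, balances)
```

If each thread instead locked src before dst, the two workers could each hold one lock while waiting for the other.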

Q60.
What is a savepoint in a transaction, and how do `COMMIT`, `ROLLBACK`, and `SAVEPOINT` work together?

Mid

A savepoint is a named marker inside a transaction that lets you roll back part of the work without discarding the entire transaction. COMMIT makes all changes permanent, ROLLBACK undoes them (fully or to a savepoint), and SAVEPOINT defines the partial rollback points.

SAVEPOINT name: Sets a marker mid-transaction so you can later undo only the statements that came after it.
ROLLBACK TO SAVEPOINT name:
- Undoes work back to the marker but keeps the transaction open and earlier work intact.
- Savepoints created after that marker are discarded.
COMMIT: Ends the transaction and persists all surviving changes; all savepoints are released.
ROLLBACK (no target): Aborts the whole transaction, discarding everything including all savepoints.
Typical use: Attempt a risky step; if it fails, roll back to just before it and continue instead of losing the entire unit of work.

sql

BEGIN;
INSERT INTO orders(id, total) VALUES (1, 100);
SAVEPOINT before_items;
INSERT INTO items(order_id, sku) VALUES (1, 'BAD');  -- oops
ROLLBACK TO SAVEPOINT before_items;  -- undo item only
INSERT INTO items(order_id, sku) VALUES (1, 'GOOD');
COMMIT;  -- order + good item persist

Q61.
Explain the difference between vertical and horizontal partitioning.

Mid

Vertical partitioning splits a table by columns (fewer columns per partition), while horizontal partitioning splits it by rows (fewer rows per partition). One narrows the table; the other shortens it.

Vertical partitioning (by columns):
- Split a wide table into groups of columns, often separating hot, frequently-read columns from large or rarely-used ones (e.g. moving a BLOB/text column out).
- Improves cache efficiency and I/O for common queries; requires a join to recombine columns.
Horizontal partitioning (by rows):
- Split rows into partitions by a key such as date range or hash of an ID.
- Each partition has the same schema; enables partition pruning and easier archival.
Rule of thumb: Too many columns hurting reads, go vertical; too many rows hurting scans/maintenance, go horizontal.

Q62.
Explain the difference between a Read Replica and Sharding.

Mid

A read replica is a full copy of the database used to scale reads, while sharding splits the data into disjoint pieces across servers to scale writes and storage. Replicas duplicate all data; shards divide it.

Read replica:
- Each replica holds the entire dataset, kept in sync from the primary.
- Scales read throughput; writes still funnel to a single primary.
- Introduces replication lag (eventual consistency) but is simple to add.
Sharding:
- Each shard holds only a subset of the data, chosen by a shard key.
- Scales writes, storage, and reads together, since load is partitioned.
- Adds routing logic and makes cross-shard joins/transactions difficult.
Bottom line: Read replicas fix a read bottleneck; sharding fixes a write/storage bottleneck. They are complementary and often used together.
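
The routing logic that sharding adds can be as small as a hash over the shard key. This is a sketch only: the shard names and the shard_for function are hypothetical, and real systems often use consistent hashing or a lookup table so shards can be added without remapping most keys.

```python
import zlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical names

def shard_for(user_id: int) -> str:
    """Map a shard key to one shard with a stable hash."""
    key = str(user_id).encode("utf-8")
    # crc32 is stable across processes, unlike Python's built-in hash()
    # for strings, so every app server routes a key the same way.
    return SHARDS[zlib.crc32(key) % len(SHARDS)]

placement = {user_id: shard_for(user_id) for user_id in range(1, 9)}
print(placement)
```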

Q63.
What is Connection Pooling, and why is it necessary for high-scale applications?

Mid

Connection pooling reuses a fixed set of open database connections across many requests instead of opening and closing one per request. It's necessary at scale because establishing a DB connection is expensive and databases cap how many concurrent connections they can hold.

Why raw connections are costly:
- Each new connection needs a TCP handshake, authentication, and server-side memory/process allocation.
- Doing this per request adds latency and wastes resources.
How a pool works:
- A pool keeps N ready connections; a request borrows one, uses it, and returns it.
- Requests wait briefly if all are in use, bounding concurrency instead of overwhelming the DB.
Why it matters at high scale:
- Databases have a hard max-connections limit; thousands of clients each opening connections exhausts it and causes errors.
- Pooling caps and multiplexes usage, protecting the DB and cutting per-query overhead.
- Tools: application pools (HikariCP, SQLAlchemy) or external poolers like PgBouncer for serverless/many-instance apps.
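
The borrow-and-return cycle can be sketched with a queue. ConnectionPool here is a toy written for illustration, not the API of any of the tools named above, and in-memory sqlite3 connections stand in for real database connections.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Toy pool: a fixed number of connections shared by many requests."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())   # pay the connect cost once, up front

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # waits if all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)        # return it for the next request

created = []

def make_conn():
    conn = sqlite3.connect(":memory:")
    created.append(conn)
    return conn

pool = ConnectionPool(size=2, factory=make_conn)

results = []
for _ in range(10):                     # ten "requests"
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])

print(len(created), results)
```

Ten requests are served while only two connections are ever opened.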

Q64.
What is the difference between an `OLTP` and an `OLAP` database workload?

Mid

OLTP (Online Transaction Processing) handles many small, concurrent read/write transactions in real time; OLAP (Online Analytical Processing) runs fewer, large read-heavy queries that aggregate huge volumes for analysis.

OLTP (operational, write-intensive):
- Short transactions (insert an order, update a balance) with high concurrency and low latency.
- Highly normalized schemas to avoid update anomalies and keep writes cheap.
- Examples: banking, e-commerce checkout, booking systems.
OLAP (analytical, read-intensive):
- Complex queries scanning millions of rows with aggregations and joins across time.
- Denormalized/star schemas and columnar storage optimize scans and roll-ups.
- Examples: dashboards, forecasting, business intelligence.
Key contrast:
- OLTP optimizes throughput of many small writes; OLAP optimizes speed of few big reads.
- Data flows from OLTP into OLAP via ETL, so they usually live in separate systems.

Q65.
Explain the star schema vs the snowflake schema in the context of data warehousing.

Mid

Both organize a fact table surrounded by dimensions; the difference is normalization. A star schema keeps dimensions flat (denormalized) in one table each, while a snowflake schema normalizes dimensions into multiple related sub-tables.

Star schema:
- One central fact table joined directly to denormalized dimension tables (forms a star shape).
- Fewer joins means faster, simpler analytical queries.
- Trade-off: redundant data and larger dimension tables.
Snowflake schema:
- Dimensions are normalized into hierarchies (e.g. Product to Category to Department).
- Saves storage and reduces redundancy, easier to maintain consistency.
- Trade-off: more joins, more complex queries, often slower.
Rule of thumb: Favor star for query performance and BI-tool simplicity; snowflake when storage/consistency of large hierarchies matters.
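
The difference in join count can be sketched with a toy warehouse. The example below uses Python's sqlite3 module; all table and column names are hypothetical and chosen only to show the same question answered against both layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0);

-- Star: the category is stored directly (and redundantly) on the dimension.
CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
INSERT INTO dim_product_star VALUES
    (1, 'Pen', 'Office'), (2, 'Desk', 'Furniture'), (3, 'Ink', 'Office');

-- Snowflake: the category is normalized into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
INSERT INTO dim_category VALUES (1, 'Office'), (2, 'Furniture');
INSERT INTO dim_product_snow VALUES (1, 'Pen', 1), (2, 'Desk', 2), (3, 'Ink', 1);
""")

# Star: one join from the fact table to the flat dimension.
star_rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# Snowflake: same answer, but an extra join to walk the hierarchy.
snow_rows = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
""").fetchall()
print(star_rows, snow_rows)
```

Both queries return the same totals; the snowflake version stores each category name once at the cost of one more join per level of the hierarchy.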

Q66.
What are fact tables and dimension tables in a data warehouse, and what role does `ETL` play?

Mid

A fact table stores the measurable business events (metrics), and dimension tables store the descriptive context you slice those metrics by. ETL is the pipeline that extracts source data, transforms/cleans it, and loads it into these tables.

Fact tables:
- Hold numeric measures (sales amount, quantity) plus foreign keys to dimensions.
- Typically long and narrow: many rows, few columns; grain (one row per what) must be defined clearly.
Dimension tables:
- Hold descriptive attributes (customer, product, date) used for filtering and grouping.
- Wide and shorter; often handle history via slowly changing dimensions (SCDs).
ETL's role:
- Extract: pull from OLTP systems, files, APIs.
- Transform: clean, deduplicate, conform keys, compute measures, apply business rules.
- Load: populate dimensions first (to generate surrogate keys), then facts referencing them.
- ELT is the modern variant: load raw first, transform inside the warehouse.
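
The load order described above can be sketched in a few lines. This is a toy pipeline using Python's sqlite3 module, with made-up source rows and hypothetical table names (dim_customer, fact_orders); a production pipeline would add validation, batching, and error handling.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical data).
raw_orders = [
    ("  alice ", "2024-01-05", "40.00"),
    ("BOB", "2024-01-06", "15.50"),
    ("Alice", "2024-01-07", "9.50"),
]

# Transform: normalize customer names and cast amounts to numbers.
clean_orders = [
    (name.strip().title(), day, float(amount)) for name, day, amount in raw_orders
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE fact_orders (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    order_date TEXT,
    amount REAL
);
""")

with conn:
    # Load the dimension first so every customer has a surrogate key;
    # OR IGNORE skips names that are already present (deduplication).
    conn.executemany(
        "INSERT OR IGNORE INTO dim_customer (name) VALUES (?)",
        [(name,) for name, _, _ in clean_orders],
    )
    # Then load facts, looking up the surrogate key by the natural key (name).
    conn.executemany(
        "INSERT INTO fact_orders (customer_key, order_date, amount) "
        "SELECT customer_key, ?, ? FROM dim_customer WHERE name = ?",
        [(day, amount, name) for name, day, amount in clean_orders],
    )

customer_count = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
totals = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_orders f JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(customer_count, totals)
```

The three source spellings of "Alice" collapse into one dimension row, and both of her orders end up pointing at the same surrogate key.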

Q67.
Can you explain the underlying mechanism of a `SQL` injection attack, and how do parameterized queries or prepared statements actually solve the problem at the protocol level?

Mid

SQL injection happens when user input is concatenated into a query string, so the database parses attacker data as SQL code. Parameterized queries fix it by separating the query structure from the data: the SQL is compiled first, and values are sent separately so they can never be interpreted as code.

The mechanism of the attack:
- With string building, input like ' OR '1'='1 changes the query's logic instead of being treated as a literal.
- The database has no way to know which part came from the developer and which from the user: it's all one text blob it parses.
Why prepared statements solve it:
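- The statement is sent with placeholders and parsed into a fixed structure before any values arrive (the engine may still choose the execution plan later, but the statement's shape can no longer change).
- Parameters are then transmitted out-of-band as typed data, not concatenated into the SQL text, so they bind to value slots only.
- At the wire protocol level this is often two messages (Parse/Prepare then Bind/Execute), so a value cannot become new SQL tokens. Some drivers emulate prepared statements client-side by escaping values into the text, in which case safety rests on correct escaping rather than on protocol separation.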
Caveat: Identifiers (table/column names) can't be parameterized: validate those against an allowlist.

sql

-- Vulnerable: input becomes part of the SQL text
-- "SELECT * FROM users WHERE name = '" + input + "'"

-- Safe: structure compiled first, value bound separately
SELECT * FROM users WHERE name = ?;   -- driver binds input as data
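
The same contrast can be run end to end. The sketch below uses Python's built-in sqlite3 module, which binds `?` values to the prepared statement rather than splicing them into the text; the users table and the payload are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])

payload = "nobody' OR '1'='1"  # attacker-controlled input

# Vulnerable: the payload is spliced into the SQL text and parsed as code,
# turning the filter into  name = 'nobody' OR '1'='1'  (always true).
unsafe_sql = "SELECT name FROM users WHERE name = '" + payload + "'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safe: the payload is bound to the placeholder as a plain string value,
# so it is compared against the name column and matches nothing.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (payload,)).fetchall()

print(leaked)  # every user comes back
print(safe)    # no rows
```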

Q68.
How does SQL's three-valued logic work regarding `NULL` values?

Mid

SQL uses three-valued logic because NULL means "unknown," so every comparison can yield TRUE, FALSE, or UNKNOWN. Arithmetic on NULL yields NULL, a comparison with NULL yields UNKNOWN, and only rows whose predicate evaluates to TRUE are returned.

Comparisons with NULL yield UNKNOWN:
- NULL = NULL is UNKNOWN, not TRUE; so is NULL <> 5.
- Use IS NULL / IS NOT NULL to test for it.
Logical operators propagate UNKNOWN unless the other operand already decides the result: TRUE OR UNKNOWN is TRUE; FALSE AND UNKNOWN is FALSE; but TRUE AND UNKNOWN, FALSE OR UNKNOWN, and NOT UNKNOWN are all UNKNOWN.
Filters keep only TRUE:
- WHERE, ON, and HAVING discard both FALSE and UNKNOWN rows.
- CHECK constraints do the opposite: they pass unless the result is explicitly FALSE.
Common gotchas:
- NOT IN with a NULL in the list can silently return no rows.
- Aggregates like SUM/AVG ignore NULL, as does COUNT(col), but COUNT(*) counts every row.
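
These rules are easy to check interactively. The sketch below uses Python's sqlite3 module, where SQL NULL/UNKNOWN comes back as Python `None` and TRUE/FALSE as 1/0; the table `t` is a throwaway example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]

null_eq_null = scalar("SELECT NULL = NULL")      # UNKNOWN
null_is_null = scalar("SELECT NULL IS NULL")     # TRUE
true_or_unknown = scalar("SELECT 1 OR NULL")     # TRUE decides the OR
false_and_unknown = scalar("SELECT 0 AND NULL")  # FALSE decides the AND
true_and_unknown = scalar("SELECT 1 AND NULL")   # UNKNOWN

conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

# NOT IN with a NULL in the list: no row can ever evaluate to TRUE.
not_in_with_null = conn.execute("SELECT x FROM t WHERE x NOT IN (1, NULL)").fetchall()

# Aggregates: COUNT(*) counts rows, COUNT(x) and SUM(x) skip the NULL.
conn.execute("INSERT INTO t VALUES (NULL)")
counts = conn.execute("SELECT COUNT(*), COUNT(x), SUM(x) FROM t").fetchone()

print(null_eq_null, null_is_null, true_or_unknown, false_and_unknown, true_and_unknown)
print(not_in_with_null, counts)
```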

Q69.
What is a Window Function, and what is the difference between `RANK()`, `DENSE_RANK()`, and `ROW_NUMBER()`?

Mid

A window function computes a value across a set of rows related to the current row (a "window") without collapsing them into one row like GROUP BY does. The three ranking functions differ in how they handle ties.

Window function basics:
- Defined with OVER (PARTITION BY ... ORDER BY ...): partition splits rows into groups, order defines ranking/running order.
- Each input row still appears in the output, with the computed column alongside it.
ROW_NUMBER(): Assigns a unique sequential number; ties are broken arbitrarily unless the ORDER BY includes a unique tie-breaker (1, 2, 3, 4).
RANK(): Ties get the same rank, then it skips numbers (1, 2, 2, 4).
DENSE_RANK(): Ties get the same rank, with no gaps (1, 2, 2, 3).

sql

SELECT name, score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
FROM players;
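
To see the tie behavior on real rows, the query can be run against a small table with one tie. This sketch uses Python's sqlite3 module (window functions need SQLite 3.25 or newer); the player data is invented, and `name` is added to the ROW_NUMBER() ordering so its output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Ann", 95), ("Bo", 90), ("Cy", 90), ("Di", 80)],  # Bo and Cy tie
)

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS rn,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS drnk
    FROM players
    ORDER BY rn
""").fetchall()

for row in rows:
    print(row)
# ('Ann', 1, 1, 1)
# ('Bo', 2, 2, 2)
# ('Cy', 3, 2, 2)
# ('Di', 4, 4, 3)
```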

Q70.
How would you use SQL `GRANT` and `REVOKE` commands to secure a microservice's access to a database following the principle of least privilege?

Mid

Least privilege means each microservice gets its own database role granted only the exact operations it needs, on only the objects it touches. You use GRANT to hand out those minimal rights and REVOKE to strip anything default or excessive.

Give each service its own role: Never share the owner/superuser account; a dedicated login limits blast radius if credentials leak.
Grant only needed verbs on needed tables:
- A read service gets SELECT only; a write service gets SELECT, INSERT, UPDATE but perhaps not DELETE.
- Avoid GRANT ALL and schema-wide DDL rights (CREATE, DROP).
Revoke the defaults: Postgres grants some rights to the PUBLIC pseudo-role by default (CONNECT on the database and, before version 15, CREATE in the public schema); revoke those so nothing is implicitly accessible.
Prefer column-level and view-based access: Grant on specific columns or expose a view to hide sensitive fields.

sql

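-- PostgreSQL syntax; role and table names are illustrative
CREATE ROLE orders_svc LOGIN PASSWORD '...';
REVOKE CREATE ON SCHEMA public FROM PUBLIC;             -- no ad-hoc objects (was allowed by default before v15)
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM PUBLIC;  -- drop any blanket table grants
GRANT SELECT, INSERT, UPDATE ON orders TO orders_svc;
GRANT SELECT ON customers TO orders_svc;  -- read-only elsewhere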

Q71.
How do you implement pagination in SQL, and what are the trade-offs between `OFFSET/LIMIT` and keyset (seek) pagination?

Mid

Pagination returns results in chunks. OFFSET/LIMIT skips N rows then takes M: simple but slow on deep pages. Keyset (seek) pagination instead filters by the last-seen sorted value, staying fast at any depth.

OFFSET / LIMIT:
- Easy, supports jumping to arbitrary pages and showing total page counts.
- Cost: the DB must scan and discard all skipped rows, so each deeper page is slower than the last (page 10,000 reads and throws away every row before it).
- Consistency bug: inserts/deletes between requests shift rows, causing duplicates or skips.
Keyset / seek pagination:
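- Uses WHERE (sort_col, id) > (last_val, last_id) with a matching index, so it seeks directly to the next slice (engines without row-value comparison, such as SQL Server, need the expanded form: sort_col > last_val OR (sort_col = last_val AND id > last_id)).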
- Constant performance regardless of depth; stable under concurrent inserts.
- Limits: no random page jumps, and the sort column(s) must be unique/tie-broken and indexed.
Rule of thumb: Use offset for small admin tables with page numbers; use keyset for infinite scroll, APIs, and deep large datasets.

sql

-- OFFSET/LIMIT (slow when offset is large)
SELECT * FROM events ORDER BY created_at, id LIMIT 20 OFFSET 10000;

-- Keyset: pass the last row's (created_at, id) from the previous page
SELECT * FROM events
WHERE (created_at, id) > ('2024-01-01 10:00', 5123)
ORDER BY created_at, id
LIMIT 20;
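
A full keyset loop looks roughly like the sketch below, written with Python's sqlite3 module (row-value comparison needs SQLite 3.15 or newer). The events data is synthetic, with two rows per timestamp so the `id` tie-breaker actually matters; `first_page` and `next_page` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
# Ten events, two per minute, so created_at alone is not unique.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"2024-01-01 10:{(i - 1) // 2:02d}") for i in range(1, 11)],
)
# Index matching the sort order so each page can be reached with a seek.
conn.execute("CREATE INDEX events_created_id ON events (created_at, id)")

PAGE_SIZE = 3

def first_page():
    return conn.execute(
        "SELECT created_at, id FROM events ORDER BY created_at, id LIMIT ?",
        (PAGE_SIZE,),
    ).fetchall()

def next_page(last_created_at, last_id):
    # Seek past the last row the client has already seen.
    return conn.execute(
        "SELECT created_at, id FROM events "
        "WHERE (created_at, id) > (?, ?) "
        "ORDER BY created_at, id LIMIT ?",
        (last_created_at, last_id, PAGE_SIZE),
    ).fetchall()

# Walk every page, feeding the last row of each page into the next request.
keyset_ids = []
page = first_page()
while page:
    keyset_ids.extend(row[1] for row in page)
    page = next_page(*page[-1])

# The OFFSET equivalent of "page 2" for comparison.
offset_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM events ORDER BY created_at, id LIMIT ? OFFSET ?",
        (PAGE_SIZE, PAGE_SIZE),
    )
]
print(keyset_ids, offset_ids)
```

In an API, the `(created_at, id)` pair of the last row is typically returned to the client as an opaque cursor and sent back with the next request.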

Q72.
How do `LAG()` and `LEAD()` window functions work, and what problems do they solve?

Mid

LAG() and LEAD() are offset window functions that read a value from a row before or after the current row within an ordered partition, without a self-join. LAG() looks backward and LEAD() looks forward.

Syntax:
- LAG(expr, offset, default): offset defaults to 1, default is returned when no row exists (e.g. the first row).
- Require ORDER BY inside OVER(); use PARTITION BY to reset per group.
Problems they solve:
- Period-over-period change: compare this row's value to the previous one (day-over-day growth, deltas).
- Detecting gaps or transitions: compare current vs prior status/timestamp.
- Computing durations between consecutive events per user.
Why not a self-join:
- They avoid correlated subqueries/self-joins, which are more verbose and slower.
- Handle boundaries cleanly via the default argument instead of leaving NULLs unhandled.

sql

SELECT day, amount,
  LAG(amount, 1, 0) OVER (ORDER BY day) AS prev_amount,            -- default 0 on the first row
  amount - LAG(amount) OVER (ORDER BY day) AS day_over_day_change  -- NULL on the first row (no default)
FROM sales;
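
A slightly fuller sketch shows PARTITION BY resetting the window per group and LEAD() looking forward. It uses Python's sqlite3 module (SQLite 3.25 or newer); the `region` column and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2024-01-01", 100),
        ("north", "2024-01-02", 130),
        ("north", "2024-01-03", 90),
        ("south", "2024-01-01", 50),
        ("south", "2024-01-02", 70),
    ],
)

rows = conn.execute("""
    SELECT region, day, amount,
           -- previous day's amount within the same region; NULL on the first day
           amount - LAG(amount) OVER (PARTITION BY region ORDER BY day) AS change_from_prev,
           -- next day's amount within the same region; 0 when there is no next day
           LEAD(amount, 1, 0) OVER (PARTITION BY region ORDER BY day) AS next_amount
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
# ('north', '2024-01-01', 100, None, 130)
# ('north', '2024-01-02', 130, 30, 90)
# ('north', '2024-01-03', 90, -40, 0)
# ('south', '2024-01-01', 50, None, 70)
# ('south', '2024-01-02', 70, 20, 0)
```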

Q73.
Explain the difference between a `Nested Loop Join`, a `Hash Join`, and a `Merge Join`. When would the query optimizer choose one over the others?

Senior

These are three physical algorithms the optimizer uses to execute a logical join; it picks based on table sizes, indexes, sort order, and estimated row counts to minimize cost.

Nested Loop Join:
- For each row in the outer table, probe the inner table for matches.
- Best when one side is small and the inner side has an index on the join key; cheap for small result sets.
Hash Join:
- Builds a hash table (in memory, spilling to disk if it doesn't fit) on the smaller input's join key, then probes it with the larger input.
- Best for large, unsorted, unindexed equality joins; only works for equijoins.
Merge Join:
- Both inputs are sorted on the join key, then merged in a single pass like a zipper.
- Best when inputs are already sorted (e.g., from an index) or for large joins where sorting is cheap.
How the optimizer decides: Uses statistics and cost estimates: small/indexed favors nested loop, large unsorted favors hash, pre-sorted favors merge.
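
The three strategies can be sketched in plain Python over lists of `(key, value)` tuples. These are teaching versions of an inner equijoin, not how any particular engine implements them; the function names and sample data are made up for the example.

```python
from collections import defaultdict

def nested_loop_join(outer, inner):
    # Compare every outer row with every inner row: O(n * m) without an index.
    return [(ok, ov, iv) for ok, ov in outer for ik, iv in inner if ok == ik]

def hash_join(build, probe):
    # Build phase: hash the (ideally smaller) input on the join key.
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    # Probe phase: one hash lookup per row of the other input.
    return [(key, bv, pv) for key, pv in probe for bv in table.get(key, [])]

def merge_join(left, right):
    # Both inputs must be sorted on the join key; here we sort explicitly.
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left row with the whole run of equal keys on the right.
            run = j
            while run < len(right) and right[run][0] == lk:
                out.append((lk, left[i][1], right[run][1]))
                run += 1
            i += 1  # the next left row may share the key, so j stays at the run start
    return out

customers = [(1, "Ann"), (2, "Bo"), (3, "Cy")]
orders = [(1, "o-100"), (3, "o-101"), (1, "o-102"), (4, "o-103")]

expected = [(1, "Ann", "o-100"), (1, "Ann", "o-102"), (3, "Cy", "o-101")]
for join in (nested_loop_join, hash_join, merge_join):
    print(join.__name__, sorted(join(customers, orders)) == expected)
```

All three return the same rows; they differ only in how much work it takes to find the matches, which is exactly what the optimizer's cost model is estimating.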

Q74.
What are semi-joins and anti-joins, and how are they typically expressed in `SQL`?

Senior
