Preparing for a data science interview in 2026? This guide covers the most common and most important interview questions — with model answers — for SQL, Python, statistics, machine learning, and system design.

How Data Science Interviews Work in 2026

Most data science interviews follow a standard structure:

HR Screen (15-30 min) — Background, motivation, salary expectations
Technical Phone Screen (45-60 min) — SQL + Python fundamentals
Take-Home Assignment (3-5 hours) — EDA on a provided dataset
Onsite/Virtual Interviews (4-6 hours total) — ML, statistics, case study, system design
Hiring Manager Chat (30 min) — Culture fit, project discussion

SQL Interview Questions

Q1: Find the second highest salary in an employees table

-- Method 1: Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

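-- Method 2: DENSE_RANK (extends to the Nth highest; tied salaries share a rank)
-- The alias is salary_rank because RANK is a reserved word in some databases (e.g. MySQL 8)
SELECT DISTINCT salary FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;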
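
To try both methods without a database server, the queries can be run against an in-memory SQLite table from Python. This is only a sketch: the sample rows and the salary_rank alias are made up for illustration, and the window-function query assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data; two employees tie for the top salary.
rows = [("Ana", 120), ("Ben", 120), ("Cy", 100), ("Di", 100), ("Eve", 90)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# Method 1: the largest salary strictly below the maximum.
subquery_sql = """
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
"""
second_by_subquery = conn.execute(subquery_sql).fetchone()[0]

# Method 2: DENSE_RANK gives tied salaries the same rank, so rank 2 is
# the second distinct salary. DISTINCT collapses the tied rows.
window_sql = """
    SELECT DISTINCT salary FROM (
        SELECT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
        FROM employees
    ) ranked
    WHERE salary_rank = 2
"""
second_by_rank = [r[0] for r in conn.execute(window_sql)]

print(second_by_subquery, second_by_rank)
```

Both methods agree on this data. The window-function version is the one that generalizes: changing the filter to 3 or N gives the Nth highest salary.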

Q2: Find customers who haven't placed an order in the last 90 days

SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
    AND o.order_date >= CURRENT_DATE - INTERVAL '90 days'
WHERE o.order_id IS NULL;
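
The key detail in this query is that the date filter sits in the ON clause, not in WHERE. The sketch below runs the same anti-join pattern in SQLite from Python to show the effect. The tables, rows, and the fixed as_of date are hypothetical, and SQLite's date() function stands in for the INTERVAL syntax used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    -- Ana ordered recently, Ben only long ago, Cy never ordered.
    INSERT INTO orders VALUES (10, 1, '2026-01-10'),
                              (11, 2, '2025-06-01');
""")

# A fixed reference date keeps the example reproducible; a real query
# would use the current date instead.
as_of = "2026-01-31"

inactive_sql = """
    SELECT c.customer_id, c.name
    FROM customers c
    LEFT JOIN orders o
           ON c.customer_id = o.customer_id
          AND o.order_date >= date(:as_of, '-90 days')
    WHERE o.order_id IS NULL
    ORDER BY c.customer_id
"""
inactive = conn.execute(inactive_sql, {"as_of": as_of}).fetchall()
print(inactive)
```

Because only recent orders can match in the join, a customer with old orders (Ben) and a customer with no orders at all (Cy) both end up with a NULL order_id and are returned.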

Q3: Calculate 7-day rolling average of daily sales

SELECT
    date,
    daily_sales,
    -- The frame counts rows, not days: this assumes exactly one row per calendar date
    AVG(daily_sales) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM daily_sales_table
ORDER BY date;
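
To see what that window frame computes, here is a minimal pure-Python sketch of the same trailing average. The function name and the sample numbers are illustrative only. Like the SQL frame, it averages over fewer than seven values for the first six rows.

```python
from collections import deque

def rolling_mean(values, window=7):
    """Trailing mean over the current value and up to window-1 earlier ones."""
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    result = []
    for value in values:
        recent.append(value)
        result.append(sum(recent) / len(recent))
    return result

daily_sales = [10, 20, 30, 40, 50, 60, 70, 80]
averages = rolling_mean(daily_sales)
print(averages)
```

If an interviewer asks about the early rows, one common follow-up is to return NULL until a full seven rows are available, which in SQL means adding a row-count check around the average.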

Q4: Identify duplicate records

SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

Python / Pandas Interview Questions

Q5: Given a DataFrame with sales data, find the top 3 products by revenue for each region

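import pandas as pd

# Assumes df has columns 'region', 'product' and 'revenue', possibly with many rows per product
# Step 1: total revenue per product within each region
totals = df.groupby(['region', 'product'], as_index=False)['revenue'].sum()

# Step 2: rank products within each region (dense ranking keeps ties, so a region can return more than 3 rows)
totals['rank'] = totals.groupby('region')['revenue'].rank(method='dense', ascending=False)
top3 = totals[totals['rank'] <= 3].sort_values(['region', 'rank'])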

Q6: How do you handle missing data in Pandas?

Answer: Depends on context and proportion of missing data:

Drop rows when only a small share is affected (under 5% is a common rule of thumb) and the values appear to be missing at random: df.dropna(subset=['col'])
Mean/median imputation for numerical columns: df['col'].fillna(df['col'].median())
Mode imputation for categorical: df['col'].fillna(df['col'].mode()[0])
Forward fill for time series: df.ffill() (the older df.fillna(method='ffill') is deprecated since pandas 2.1)
Predictive imputation (KNN or regression) for high-impact features
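
A short sketch of the simpler strategies on a toy DataFrame may help when answering this aloud. The column names and values are invented for illustration, and the choice of strategy per column is an assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric, a categorical and a time-ordered column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
    "price": [100.0, np.nan, np.nan, 130.0],
})

# Fraction of missing values per column: the first thing to check.
missing_share = df.isna().mean()

filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # numeric: median
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # categorical: mode
filled["price"] = filled["price"].ffill()                         # ordered: carry last value forward

print(missing_share)
print(filled)
```

When the imputed data feeds a model, the median and mode should be computed on the training split only and then reused on validation and test data, to avoid leakage.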

Q7: What is the difference between apply, map, and applymap?

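df['col'].map(func): Series only, element-wise; also accepts a dict or Series as a lookup table, and na_action='ignore' skips NaN values
df['col'].apply(func): element-wise on a Series; on a DataFrame, df.apply(func, axis=...) passes whole columns (axis=0) or rows (axis=1) to func
df.applymap(func): DataFrame only, every element (deprecated since pandas 2.1; use df.map() instead)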
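
The differences are easier to remember with a small example. The sketch below uses an invented two-column DataFrame and sticks to Series.map, Series.apply and DataFrame.apply, which behave the same way across recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, 2, 3], "price": [10.0, 20.0, 30.0]})

# Series.map with a dict works as a lookup; keys that are not found become NaN.
labels = df["qty"].map({1: "single", 2: "pair"})

# Series.apply calls the function once per element.
doubled = df["qty"].apply(lambda q: q * 2)

# DataFrame.apply with axis=1 passes one row at a time to the function.
totals = df.apply(lambda row: row["qty"] * row["price"], axis=1)

# DataFrame.apply with the default axis=0 passes one column at a time.
col_max = df.apply(lambda col: col.max())

print(labels.tolist(), doubled.tolist(), totals.tolist(), col_max.to_dict())
```

For simple arithmetic like the row totals, a vectorized expression such as df["qty"] * df["price"] is usually faster than apply; the apply form is shown only to illustrate the row-wise call.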

Statistics Interview Questions

Q8: What is p-value? What does p < 0.05 mean?

Answer: The p-value is the probability of observing results at least as extreme as the actual results, assuming the null hypothesis is true. p < 0.05 means that, if the null hypothesis were true, data this extreme would be expected in fewer than 5% of repeated experiments; at a significance level of 0.05 we therefore reject the null hypothesis. Important caveats: the p-value is not the probability that the null hypothesis is true or that the result arose by chance, and statistical significance is not the same as practical significance.
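
A concrete calculation can make the definition stick. The sketch below computes an exact two-sided p-value for a coin-flip experiment under the null hypothesis of a fair coin. The function name and the 60-heads-in-100-flips scenario are illustrative assumptions.

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count the equally likely flip sequences whose head count is at
    # least as far from the expected value as the one observed.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p_value = fair_coin_p_value(heads=60, flips=100)
print(round(p_value, 4))
```

For 60 heads in 100 flips the p-value comes out just above 0.05, so at a 0.05 significance level the fair-coin hypothesis would not be rejected, even though 60% heads can look lopsided.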

Q9: Explain Type I and Type II errors

Type I error (False Positive): Rejecting the null hypothesis when it's actually true. In a spam filter: marking a legitimate email as spam.
Type II error (False Negative): Failing to reject the null hypothesis when it's actually false. In a spam filter: letting actual spam pass through.
Which is worse depends on context. In disease screening, a Type II error (missing a disease) is often costlier than a Type I error, because a false alarm can usually be ruled out by a follow-up test.
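
Both error rates can be estimated by simulation, which is a useful way to explain them in an interview. The sketch below repeatedly runs a two-sided z-test of the null hypothesis "mean = 0" on simulated data. The sample size, effect size of 0.5, known sigma of 1 and the function name are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Share of simulated z-tests of H0: mean = 0 that reject H0."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (sum(sample) / n) / (sigma / n ** 0.5)
        if abs(z) > critical:
            rejections += 1
    return rejections / trials

# H0 is true here, so every rejection is a Type I error (false positive).
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false here, so every failure to reject is a Type II error (false negative).
type_2_rate = 1 - rejection_rate(true_mean=0.5)

print(type_1_rate, type_2_rate)
```

The Type I rate should land close to the chosen alpha of 0.05, while the Type II rate depends on the effect size and sample size: a larger n or a larger true effect lowers it.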

Q10: What is the Central Limit Theorem and why does it matter?

Answer: The CLT states that the sampling distribution of the mean of independent, identically distributed observations approaches a normal distribution as sample size increases, whatever the shape of the population distribution, provided its variance is finite. This matters because it justifies using many statistical tests (t-tests, z-tests) on non-normally distributed data when sample sizes are large enough. n > 30 is a common rule of thumb, but heavily skewed or heavy-tailed data can need considerably more.
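
The theorem can be checked with a quick simulation. The sketch below draws sample means from a strongly right-skewed Exponential(1) population and reports their spread and skewness for several sample sizes. The helper names, the choice of population and the number of replicates are assumptions for illustration.

```python
import random
from statistics import mean, stdev

def sample_means(n, replicates=2000, seed=0):
    """Means of `replicates` samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(replicates)
    ]

def skewness(values):
    """Sample skewness; close to 0 for a symmetric distribution."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

for n in (1, 5, 50):
    means = sample_means(n)
    # Columns: sample size, standard deviation of the means, skewness of the means
    print(n, round(stdev(means), 3), round(skewness(means), 2))
```

As n grows, the skewness of the sample means should shrink toward 0 and their standard deviation should approach 1/sqrt(n), which is the standard error of the mean for this population.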

Machine Learning Interview Questions

Q11: What is the bias-variance tradeoff?

Answer: Bias is error from overly simple or wrong assumptions (underfitting). Variance is error from sensitivity to fluctuations in the training data (overfitting). Increasing model complexity typically reduces bias but increases variance. The goal is to find the level of complexity that minimizes total expected error, which is the sum of squared bias, variance, and irreducible noise. Techniques: cross-validation to detect overfitting, regularization (Ridge/Lasso) to reduce variance, and ensemble methods (bagging mainly reduces variance, boosting mainly reduces bias).
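
If asked to be more precise, the trade-off can be written as the standard decomposition of expected squared error at a point x. The notation is an assumption for this sketch: y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a randomly drawn training set, with expectations taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why tuning complexity trades one against the other while the noise term sets a floor on achievable error.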

Q12: How do you handle class imbalance?

Answer:

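Resampling: Oversample the minority class (e.g. SMOTE) or undersample the majority class, on the training split only so that evaluation data keeps the real class ratio
Class weights: class_weight='balanced' in sklearn
Algorithm choice: Tree-based ensembles often cope with imbalance better than an unweighted logistic regression, but this is not guaranteed, so compare on validation data
Metric choice: Use precision, recall, F1, or PR-AUC/ROC-AUC instead of accuracy
Threshold tuning: Move the classification threshold away from the default 0.5 based on the relative cost of false positives and false negatives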
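
The metric and threshold points can be illustrated without any ML library. The sketch below uses a hand-made set of labels and model scores (hypothetical values, 2 positives in 10 examples) to show why accuracy misleads and how lowering the threshold changes recall and F1.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F1 for the positive class at a given cutoff."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.40, 0.60]

# Always predicting the majority class looks accurate but finds no positives.
baseline_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)

at_default = precision_recall_f1(y_true, scores, threshold=0.5)
at_lowered = precision_recall_f1(y_true, scores, threshold=0.35)

print(baseline_accuracy, at_default, at_lowered)
```

On this toy data the lower threshold catches both positives at the cost of one false positive, and F1 improves. On real data the threshold should be chosen on a validation set, not on the test set.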

Q13: Explain Random Forest vs Gradient Boosting

Aspect	Random Forest	Gradient Boosting (e.g. XGBoost)
Building method	Independent trees, can be built in parallel	Trees built sequentially, each correcting the previous ones
Mainly reduces	Variance (bagging)	Bias (boosting)
Overfitting risk	Lower	Higher (needs tuning)
Training speed	Often faster	Often slower
Accuracy	Good	Often better when well tuned
Use when	Quick baseline	Competition/production

Behavioural / Case Interview Questions

Q14: How would you approach a new data science project?

Answer framework (CRISP-DM):

Business understanding — define the problem and success metric
Data understanding — EDA, identify data sources and quality issues
Data preparation — cleaning, feature engineering
Modeling — baseline model first, then iterate
Evaluation — does it meet business success criteria?
Deployment — how does it get into production?

Q15: Tell me about a time your analysis was wrong

What interviewers want: Intellectual honesty, debugging approach, lessons learned. Describe what was wrong, how you discovered it (ideally you caught it), and what process change prevented it from happening again.

FAQ

How many rounds of interviews do data science jobs have?

Typically 3-5 rounds: HR screen, technical screen, take-home test, technical interviews (1-2 rounds), and a final culture/hiring manager round. FAANG companies can have 5-7 rounds.

What SQL level is needed for data science interviews?

Intermediate to advanced: JOINs (all types), GROUP BY + HAVING, window functions (ROW_NUMBER, RANK, LAG, LEAD, moving averages), subqueries and CTEs, and basic query optimization concepts.

Do I need to know machine learning theory for data analyst interviews?

For pure data analyst roles: basic understanding of regression and classification is usually enough. For data scientist roles: expect deeper questions on model selection, evaluation, and trade-offs.

UrbanObserver

Data Science Interview Questions 2026 — With Answers (SQL, Python, ML)
