Pandas is one of the most widely used Python libraries for data analysis, and most data scientists reach for it regularly. This tutorial takes you from zero to proficient in Pandas, with code examples you can run as you go.
What is Pandas?
Pandas is an open-source Python library for data manipulation and analysis. It provides two primary data structures: the DataFrame (like a table/spreadsheet) and the Series (a single column). With Pandas, you can load, clean, transform, and analyze data with just a few lines of code.
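To make the two structures concrete, here is a minimal sketch with made-up values: selecting a single column from a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [28, 34],
})

ages = df["age"]  # selecting one column yields a Series

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
print(df.shape)             # (2, 2): two rows, two columns
print(ages.shape)           # (2,): one dimension
```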
Installing Pandas
pip install pandas numpy matplotlib
Creating DataFrames
import pandas as pd
import numpy as np
# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 25, 31],
    'salary': [65000, 82000, 55000, 90000],
    'department': ['Data', 'Engineering', 'Data', 'Product']
}
df = pd.DataFrame(data)
print(df)
Reading Data from Files
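The readers in this section need real files or a database to run. As a sketch that works without either, read_csv also accepts a file-like object, so an in-memory string can stand in for a file. The CSV text here is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = "name,age,salary\nAlice,28,65000\nBob,34,82000\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)            # (2, 3)
print(df_csv["salary"].sum())  # 147000
```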
# CSV
df = pd.read_csv('data.csv')
# Excel (reading .xlsx files requires the openpyxl package)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# From SQL database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@host/db')
df = pd.read_sql('SELECT * FROM customers', engine)
Exploring Your Data
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
print(df.dtypes) # Column data types
df.info() # Overview + null counts (prints directly and returns None)
print(df.describe()) # Statistical summary
print(df.columns.tolist()) # Column names
Selecting Data
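With the default integer index, iloc and loc can look interchangeable. A small sketch with a string index makes the difference visible: iloc works on positions and excludes the slice end, while loc works on labels and includes it. The frame `people` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a string index so labels and positions differ
people = pd.DataFrame(
    {"age": [28, 34, 25], "salary": [65000, 82000, 55000]},
    index=["alice", "bob", "charlie"],
)

by_position = people.iloc[0:2]           # positions 0 and 1; end excluded
by_label = people.loc["alice":"bob"]     # labels alice through bob; end included
one_value = people.loc["bob", "salary"]  # single cell by row and column label

print(by_position.index.tolist())  # ['alice', 'bob']
print(by_label.index.tolist())     # ['alice', 'bob']
print(one_value)                   # 82000
```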
# Select one column (returns Series)
names = df['name']
# Select multiple columns (returns DataFrame)
subset = df[['name', 'salary']]
# Select by row index (iloc)
first_row = df.iloc[0] # First row
rows_1_to_3 = df.iloc[1:4] # Rows 1, 2, 3
# Select with loc (accepts row labels or, as here, a boolean mask)
row = df.loc[df['name'] == 'Alice']
# Filter rows with conditions
high_earners = df[df['salary'] > 70000]
data_team = df[df['department'] == 'Data']
data_high_earners = df[(df['department'] == 'Data') & (df['salary'] > 60000)]
Cleaning Data
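The sample DataFrame from earlier has no missing values or duplicates, so the cleaning calls in this section have nothing to fix when run against it. A minimal sketch on a small made-up frame (`raw` and its values are hypothetical) shows the same calls doing visible work:

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one duplicate row, one missing age, one missing department
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Charlie", "Diana"],
    "age": [28, 34, 34, np.nan, 31],
    "department": ["Data", "Engineering", "Engineering", "Data", None],
})

missing = raw.isnull().sum()          # missing values per column
clean = raw.drop_duplicates().copy()  # drops the repeated Bob row
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["department"] = clean["department"].fillna("Unknown")
clean["age"] = clean["age"].astype(int)  # safe only once no NaN remains

print(missing["age"], missing["department"])  # 1 1
print(len(clean))                             # 4
print(clean["age"].tolist())                  # [28, 34, 31, 31]
```

The order matters here: astype(int) raises an error on a column that still contains NaN, so the fill comes first.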
# Check missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values
df['age'] = df['age'].fillna(df['age'].median())
df['department'] = df['department'].fillna('Unknown')
# Remove duplicates
df = df.drop_duplicates()
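# Rename columns (stored in a new variable so later examples can still use 'name' and 'salary')
df_renamed = df.rename(columns={'name': 'employee_name', 'salary': 'annual_salary'})
# Change data types
df['age'] = df['age'].astype(int)
# Assumes the data has a 'hire_date' column of date strings (the sample DataFrame above does not)
df['hire_date'] = pd.to_datetime(df['hire_date'])
Transforming Data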
# Add new columns
df['salary_monthly'] = df['salary'] / 12
df['senior'] = df['age'].apply(lambda x: 'Senior' if x >= 30 else 'Junior')
# Apply function to column
df['name_upper'] = df['name'].str.upper()
df['name_length'] = df['name'].str.len()
# Replace values
df['department'] = df['department'].replace({'Data': 'Data Science'})
# Sorting
df_sorted = df.sort_values('salary', ascending=False)
df_sorted2 = df.sort_values(['department', 'salary'], ascending=[True, False])
Grouping and Aggregation
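A groupby splits the rows by key, applies an aggregation to each group, and combines the results into one table. As a sketch of what that returns, the sample data from the start of the tutorial is rebuilt here under the hypothetical name `staff` so the block runs on its own.

```python
import pandas as pd

# Same values as the tutorial's sample data; 'staff' is a hypothetical name
staff = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [65000, 82000, 55000, 90000],
    "department": ["Data", "Engineering", "Data", "Product"],
})

summary = staff.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("name", "count"),
).reset_index()  # turn the group keys back into a regular column

print(summary["department"].tolist())  # ['Data', 'Engineering', 'Product']
print(summary["avg_salary"].tolist())  # [60000.0, 82000.0, 90000.0]
print(summary["headcount"].tolist())   # [2, 1, 1]
```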
# Group by department and calculate stats
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max', 'count'],
    'age': 'mean'
}).round(2)
print(dept_stats)
# Simple group by
avg_salary = df.groupby('department')['salary'].mean()
# Multiple aggregations
result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    count=('name', 'count'),
    max_age=('age', 'max')
).reset_index()
Merging and Joining DataFrames
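The join types in this section differ only in which unmatched rows survive. A minimal sketch, using the same two small frames as the section's own snippet, shows which employee IDs each join keeps:

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [65000, 82000, 90000]})

inner = pd.merge(employees, salaries, on="emp_id", how="inner")
left = pd.merge(employees, salaries, on="emp_id", how="left")

print(inner["emp_id"].tolist())        # [1, 2]: only IDs present in both frames
print(left["emp_id"].tolist())         # [1, 2, 3]: every employee is kept
print(left["salary"].isna().tolist())  # [False, False, True]: Charlie has no salary row
```

Employee 4 appears in neither result because it exists only in the right-hand frame.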
# Sample data
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
salaries = pd.DataFrame({'emp_id': [1, 2, 4], 'salary': [65000, 82000, 90000]})
# Inner join (only matching rows)
result = pd.merge(employees, salaries, on='emp_id', how='inner')
# Left join (all from left, matching from right)
result = pd.merge(employees, salaries, on='emp_id', how='left')
# Stack DataFrames vertically (df1 and df2 stand for any two DataFrames with the same columns)
combined = pd.concat([df1, df2], ignore_index=True)
Pivot Tables
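A pivot table spreads one column's values across the column axis and aggregates within each cell. The sketch below rebuilds the sample data with its Senior/Junior labels under the hypothetical name `staff`, so the result can be checked by hand.

```python
import pandas as pd

# Sample rows with the Senior/Junior label already attached; 'staff' is a hypothetical name
staff = pd.DataFrame({
    "department": ["Data", "Engineering", "Data", "Product"],
    "senior": ["Junior", "Senior", "Junior", "Senior"],
    "salary": [65000, 82000, 55000, 90000],
})

pivot = staff.pivot_table(
    values="salary",
    index="department",
    columns="senior",
    aggfunc="mean",
    fill_value=0,  # cells with no matching rows get 0 instead of NaN
)

print(pivot.shape)                   # (3, 2)
print(pivot.loc["Data", "Junior"])   # mean of 65000 and 55000
print(pivot.loc["Data", "Senior"])   # no senior staff in Data, so the fill value
```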
# Like Excel pivot tables
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='senior',
    aggfunc='mean',
    fill_value=0
).round(2)
print(pivot)
Exporting Data
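To preview what an export will contain before writing a file, to_csv and to_json return a string when no path is given. A small sketch with made-up data (`df_out` is a hypothetical name):

```python
import json
import pandas as pd

df_out = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [65000, 82000]})

csv_text = df_out.to_csv(index=False)         # no path given: returns a string
json_text = df_out.to_json(orient="records")  # one JSON object per row

csv_lines = csv_text.splitlines()
records = json.loads(json_text)

print(csv_lines)  # ['name,salary', 'Alice,65000', 'Bob,82000']
print(records)    # [{'name': 'Alice', 'salary': 65000}, {'name': 'Bob', 'salary': 82000}]
```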
# To CSV
df.to_csv('output.csv', index=False)
# To Excel
df.to_excel('output.xlsx', sheet_name='Results', index=False)
# To JSON
df.to_json('output.json', orient='records')
# Multiple sheets in Excel (df1 and df2 stand for any two DataFrames)
with pd.ExcelWriter('report.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Summary', index=False)
    df2.to_excel(writer, sheet_name='Details', index=False)
FAQ
What is pandas used for in data science?
Pandas is used for loading data from files and databases, cleaning messy data, transforming and reshaping datasets, calculating statistics, merging multiple data sources, and preparing data for machine learning.
Should I learn NumPy before Pandas?
You can learn them simultaneously. Pandas is built on NumPy, so knowing NumPy basics helps, but you can become productive in Pandas without deep NumPy knowledge.
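As a small illustration of that relationship, the sketch below pulls the NumPy array out of a Series and passes a Series straight to a NumPy function. The values are made up.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([65000, 82000, 55000, 90000], name="salary")

as_array = salaries.to_numpy()     # the values as a NumPy array
log_salaries = np.log10(salaries)  # NumPy functions accept a Series and return a Series

print(type(as_array).__name__)      # ndarray
print(type(log_salaries).__name__)  # Series
print(as_array.mean())              # 73000.0
```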
How do I speed up slow Pandas operations?
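Use vectorized operations instead of loops, use .apply() with caution because it calls a Python function once per element, consider swifter or modin for parallel processing, and consider Polars for very large datasets (it is often cited as 5-10x faster than Pandas, though the actual speedup depends on the workload).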
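As a sketch of what "vectorized instead of .apply()" can look like, the Senior/Junior label from the Transforming Data section can be computed with numpy.where in one pass over the column. Both versions give the same labels. No timings are shown because they depend on data size and hardware.

```python
import numpy as np
import pandas as pd

ages = pd.Series([28, 34, 25, 31], name="age")

# Row by row: calls the Python lambda once per element
via_apply = ages.apply(lambda x: "Senior" if x >= 30 else "Junior")

# Vectorized: one comparison over the whole column, evaluated by NumPy
via_where = pd.Series(
    np.where(ages >= 30, "Senior", "Junior"),
    index=ages.index,
    name="age",
)

print(via_apply.tolist())                        # ['Junior', 'Senior', 'Junior', 'Senior']
print(via_apply.tolist() == via_where.tolist())  # True
```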



