Data Science Learning Path: Beyond NumPy and Pandas
Introduction
After mastering NumPy and Pandas, you're ready to dive into more specialized libraries that form the backbone of modern data science and machine learning. This guide will walk you through the essential libraries to learn next, with practical examples and real-world applications.
Learning Path Overview
Data Visualization
Machine Learning
Statistical Analysis
Deep Learning
Model Deployment and Production
1. Data Visualization Libraries
Understanding Data Visualization
Data visualization is crucial in data science for:
Exploring data patterns and relationships
Communicating findings to stakeholders
Identifying outliers and anomalies
Making data-driven decisions
Matplotlib
Matplotlib is the foundation of Python visualization libraries. Think of it as a low-level library that gives you precise control over every aspect of your plots.
Key Concepts:
Figure: The overall window or page that contains your plots
Axes: The actual plotting area
Artists: Everything you see in a plot (lines, text, legends)
import matplotlib.pyplot as plt
import numpy as np
# Basic plotting
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)')
plt.title('Basic Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.legend()
plt.grid(True)
plt.show()
What this code does:
Creates 100 evenly spaced numbers from 0 to 10 using np.linspace
Calculates the sine of these numbers
Sets up a figure with specified dimensions
Creates a line plot with labels and grid
Displays the final plot
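The calls above use Matplotlib's implicit pyplot interface. The same plot can be built with the object-oriented interface, which makes the Figure and Axes concepts from the list above explicit; a minimal sketch:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

# Object-oriented interface: the Figure holds one or more Axes
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, label='sin(x)')   # the line is an Artist attached to the Axes
ax.set_title('Basic Sine Wave (object-oriented interface)')
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.legend()
ax.grid(True)
plt.show()
The object-oriented style scales better to multi-panel figures, which is exactly what the Seaborn example below relies on.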
Seaborn
Seaborn is built on top of Matplotlib and specializes in statistical visualization. It provides:
Higher-level interface for statistical graphics
Beautiful default styles
Built-in themes for professional-looking plots
Integration with Pandas DataFrames
import seaborn as sns
# Statistical plotting
tips = sns.load_dataset('tips')
# Create multiple visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Distribution plot
sns.histplot(data=tips, x='total_bill', ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Bills')
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1])
axes[0, 1].set_title('Bills by Day')
# Violin plot
sns.violinplot(data=tips, x='day', y='total_bill', ax=axes[1, 0])
axes[1, 0].set_title('Bill Distribution by Day')
# Scatter plot with regression line
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1, 1])
axes[1, 1].set_title('Tips vs Bills')
plt.tight_layout()
plt.show()
What each plot tells us:
Histogram (histplot):
Shows the distribution of bill amounts
Helps identify the most common bill ranges
Reveals any skewness in the data
Box Plot (boxplot):
Shows median, quartiles, and outliers
Compares distributions across categories
Identifies unusual values
Violin Plot (violinplot):
Combines box plot with kernel density estimation
Shows full distribution shape
Better for comparing distributions than box plots
Regression Plot (regplot):
Shows relationship between two variables
Adds regression line automatically
Includes confidence interval
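The built-in themes mentioned at the start of this subsection are applied globally with sns.set_theme; 'whitegrid' and 'deep' below are just two of Seaborn's built-in style and palette names, and any plot drawn after the call picks them up:
import seaborn as sns
import matplotlib.pyplot as plt

# Switch the global theme; this affects every plot drawn afterwards
sns.set_theme(style='whitegrid', palette='deep')

tips = sns.load_dataset('tips')
sns.boxplot(data=tips, x='day', y='total_bill')
plt.show()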
Plotly
Plotly creates interactive visualizations that users can explore in web browsers. Key features:
Interactive plots (zoom, pan, hover information)
Export to HTML for sharing
Integration with dashboarding tools
import plotly.express as px
# Interactive scatter plot of the iris dataset
df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length',
                 color='species', size='petal_length',
                 hover_data=['petal_width'])
fig.show()
This plot allows users to:
Hover over points to see details
Zoom into specific regions
Pan across the plot
Save an interactive HTML version for sharing (see the export snippet below)
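To share the interactive version outside a notebook, the figure can be exported to a standalone HTML file; the filename below is just an example:
# Export the interactive figure as a self-contained HTML page
fig.write_html('iris_scatter.html', include_plotlyjs='cdn')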
2. Machine Learning with Scikit-learn
Understanding Machine Learning Pipelines
A machine learning pipeline is a sequence of data processing and modeling steps. It helps:
Ensure reproducibility
Prevent data leakage
Streamline the modeling process
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Sample pipeline
class MLPipeline:
    def __init__(self):
        self.scaler = StandardScaler()
        self.model = RandomForestClassifier()

    def prepare_data(self, X, y):
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        return X_train_scaled, X_test_scaled, y_train, y_test

    def train_and_evaluate(self, X, y):
        # Prepare data
        X_train, X_test, y_train, y_test = self.prepare_data(X, y)
        # Train model
        self.model.fit(X_train, y_train)
        # Evaluate
        y_pred = self.model.predict(X_test)
        print(classification_report(y_test, y_pred))
        return self.model
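A quick usage sketch of this class on one of scikit-learn's bundled datasets (the breast cancer dataset here is only an example; any feature matrix X and label vector y will do):
from sklearn.datasets import load_breast_cancer

# Run the whole pipeline end to end on a small example dataset
X, y = load_breast_cancer(return_X_y=True)
pipeline = MLPipeline()
trained_model = pipeline.train_and_evaluate(X, y)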
Pipeline Components Explained:
StandardScaler:
Standardizes features by removing the mean and scaling to unit variance
Important for scale-sensitive algorithms such as SVMs and k-nearest neighbors (tree-based models like random forests are largely unaffected, but scaling is a common default step)
Helps gradient-based optimizers converge faster
RandomForestClassifier:
Ensemble learning method
Builds multiple decision trees
Combines their predictions for better accuracy
Train-Test Split:
Separates data into training and testing sets
Gives an honest estimate of how well the model generalizes to unseen data
A typical split is 80% training, 20% testing
Cross-Validation and Model Selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def optimize_model(X, y):
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier())
    ])
    # Parameters to search
    param_grid = {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [None, 10, 20],
        'classifier__min_samples_split': [2, 5, 10]
    }
    # Grid search with cross-validation
    grid_search = GridSearchCV(
        pipeline, param_grid, cv=5,
        scoring='accuracy', n_jobs=-1
    )
    grid_search.fit(X, y)
    return grid_search.best_estimator_, grid_search.best_params_
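Calling it looks like this; the iris dataset is used purely for illustration, and the search can take a while because each of the 27 parameter combinations is cross-validated 5 times:
from sklearn.datasets import load_iris

# Run the grid search on a small example dataset
X, y = load_iris(return_X_y=True)
best_model, best_params = optimize_model(X, y)
print(best_params)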
3. Statistical Analysis with SciPy and StatsModels
Hypothesis Testing
import numpy as np
from scipy import stats
import statsmodels.api as sm

def statistical_analysis(group1, group2):
    # Independent two-sample t-test
    t_stat, p_value = stats.ttest_ind(group1, group2)
    # Effect size (Cohen's d): mean difference divided by the pooled standard deviation
    pooled_std = np.sqrt(
        ((len(group1) - 1) * np.std(group1, ddof=1) ** 2 +
         (len(group2) - 1) * np.std(group2, ddof=1) ** 2) /
        (len(group1) + len(group2) - 2)
    )
    effect_size = (np.mean(group1) - np.mean(group2)) / pooled_std
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'effect_size': effect_size
    }
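To see it in action, here is a sketch with two simulated groups; the means, spread, and sample sizes are arbitrary:
rng = np.random.default_rng(42)

# Two samples drawn from normal distributions with slightly different means
group1 = rng.normal(loc=5.0, scale=1.0, size=100)
group2 = rng.normal(loc=5.5, scale=1.0, size=100)

results = statistical_analysis(group1, group2)
print(results)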
Time Series Analysis
def analyze_time_series(data):
    # Decompose time series into trend, seasonal and residual components
    decomposition = sm.tsa.seasonal_decompose(data, period=12)
    # Check stationarity with the augmented Dickey-Fuller test
    adf_test = sm.tsa.stattools.adfuller(data)
    # Fit ARIMA model
    model = sm.tsa.ARIMA(data, order=(1, 1, 1))
    results = model.fit()
    return {
        'decomposition': decomposition,
        'adf_test': adf_test,
        'model_results': results
    }
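A minimal sketch of calling this on a synthetic monthly series; the trend, seasonality, and noise below are made up purely so the example runs:
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise
months = pd.date_range('2015-01-01', periods=96, freq='MS')
values = (0.5 * np.arange(96)
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
          + np.random.default_rng(0).normal(scale=2, size=96))
series = pd.Series(values, index=months)

results = analyze_time_series(series)
print(results['adf_test'][1])  # p-value of the ADF stationarity test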
4. Deep Learning with TensorFlow/Keras
Basic Neural Network
import tensorflow as tf
from tensorflow.keras import layers, models

def create_neural_network(input_shape, num_classes):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
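A short training sketch on random data, just to show the expected input shapes; real features and integer class labels would replace the dummy arrays:
import numpy as np

# Dummy data: 1000 samples, 20 features, 3 classes
X_dummy = np.random.rand(1000, 20).astype('float32')
y_dummy = np.random.randint(0, 3, size=1000)

model = create_neural_network(input_shape=(20,), num_classes=3)
model.fit(X_dummy, y_dummy, epochs=5, batch_size=32, validation_split=0.2)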
Convolutional Neural Network (CNN)
def create_cnn(input_shape, num_classes):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
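For image data the input shape includes a channel dimension; a sketch with MNIST-sized dummy images (28x28 pixels, one grayscale channel, 10 classes):
# MNIST-sized dummy images and labels
X_img = np.random.rand(256, 28, 28, 1).astype('float32')
y_img = np.random.randint(0, 10, size=256)

cnn = create_cnn(input_shape=(28, 28, 1), num_classes=10)
cnn.summary()                      # inspect layer shapes and parameter counts
cnn.fit(X_img, y_img, epochs=2, batch_size=32)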
5. Model Deployment and Production
FastAPI Implementation
from fastapi import FastAPI
import joblib

app = FastAPI()

# Load the trained model from disk
model = joblib.load('model.pkl')

@app.post("/predict")
async def predict(data: dict):
    # Preprocess input (preprocess_input is application-specific; define it to
    # turn the incoming JSON into the feature array the model expects)
    processed_data = preprocess_input(data)
    # Make prediction
    prediction = model.predict(processed_data)
    return {"prediction": prediction.tolist()}
Model Monitoring
import mlflow
import mlflow.sklearn

def train_with_monitoring(X, y):
    mlflow.start_run()
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    # Train model
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    # Log metrics (this is training accuracy; log a held-out score in practice)
    mlflow.log_metric("accuracy", model.score(X, y))
    # Save model
    mlflow.sklearn.log_model(model, "model")
    mlflow.end_run()
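A usage sketch, again on a bundled scikit-learn dataset chosen only as an example; after one or more runs, the logged parameters and metrics can be browsed in MLflow's local tracking UI:
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
train_with_monitoring(X, y)

# Then launch the tracking UI from a terminal and open it in a browser:
#   mlflow ui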
Learning Path Recommendations
Start with Data Visualization:
Master Matplotlib basics
Learn Seaborn for statistical visualization
Explore Plotly for interactive visualizations
Machine Learning Fundamentals:
Begin with scikit-learn
Focus on understanding algorithms
Practice with real datasets
Statistical Analysis:
Learn hypothesis testing
Understand statistical models
Master time series analysis
Deep Learning:
Start with TensorFlow/Keras basics
Build simple neural networks
Progress to CNNs and RNNs
Production and Deployment:
Learn FastAPI or Flask
Understand model serving
Practice MLOps basics
Additional Resources
Online Courses:
Coursera Machine Learning Specialization
Fast.ai Deep Learning Course
Stanford CS231n for Computer Vision
Books:
"Hands-On Machine Learning with Scikit-Learn and TensorFlow"
"Deep Learning with Python"
"Python for Data Analysis"
Practice Resources:
Kaggle Competitions
GitHub Projects
Real-world datasets