Pandas With MySQL
Pandas can read from and write to a MySQL database directly, making it straightforward to pull data into a DataFrame for analysis, or push processed data back into a table.
Connecting
Pandas' database functions require a SQLAlchemy engine as the connection — they do not accept a raw PyMySQL connection. The engine is created once and reused for all operations.
python
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os
load_dotenv()
# Connection string format:
# mysql+pymysql://<user>:<password>@<host>:<port>/<database>
engine = create_engine(
    f"mysql+pymysql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '3306')}"
    f"/{os.environ['DB_NAME']}"
)
The mysql+pymysql prefix tells SQLAlchemy to use PyMySQL as the driver. The engine manages a connection pool automatically, so you don't need to open or close connections yourself.
.env
The connection parameters come from a .env file; load_dotenv() reads that file and exports its variables into the environment:
bash
DB_HOST=
DB_USER=
DB_PASSWORD=
DB_NAME=
DB_PORT=3306
Reading Data
Load an Entire Table
python
df = pd.read_sql("SELECT * FROM users", engine)
print(df)
read_sql returns a DataFrame where each column matches a table column, and types are inferred automatically (integers, floats, datetimes, etc.).
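The round trip is easy to try without a MySQL server: an in-memory SQLite engine accepts the same read_sql/to_sql calls, since only the engine URL is MySQL-specific. A minimal sketch, with a made-up users table:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for MySQL here; only the URL differs.
engine = create_engine("sqlite:///:memory:")

# Seed a small table, then read it back.
pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]}).to_sql(
    "users", con=engine, index=False
)

df = pd.read_sql("SELECT * FROM users", engine)
print(df.dtypes)  # id comes back as an integer dtype, name as object
```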
Load a Subset of Columns
python
df = pd.read_sql("SELECT id, name, email FROM users", engine)
Selecting only the columns you need is good practice: it reduces memory usage and query time on large tables.
Parameterized Queries
Never interpolate user input into a query string. Pass parameters separately instead:
python
# Single value
df = pd.read_sql(
    "SELECT * FROM orders WHERE user_id = %(uid)s",
    engine,
    params={"uid": 42},
)
# Multiple values with IN
df = pd.read_sql(
    "SELECT * FROM orders WHERE status IN %(statuses)s",
    engine,
    params={"statuses": ("pending", "processing")},
)
Read in Chunks
For very large tables, use chunksize to process data in batches rather than loading everything into memory at once:
python
for chunk in pd.read_sql("SELECT * FROM events", engine, chunksize=10_000):
    process(chunk)  # Each chunk is a DataFrame
Filtering and Transforming in Pandas
Once the data is in a DataFrame, you can filter and reshape it using standard Pandas operations. This is useful for exploratory work and prototyping — though for large datasets it is more efficient to filter in SQL before loading.
python
# Exact match
df[df["name"] == "Alice"]
# Numeric comparison
df[df["id"] > 5]
# Multiple conditions — wrap each in parentheses
df[(df["role"] == "admin") & (df["id"] > 5)]
# String contains
df[df["email"].str.contains("@example.com")]
# Select specific columns after filtering
df[df["role"] == "admin"][["id", "name", "email"]]
Writing Data
Append Rows to an Existing Table
python
df = pd.read_csv("new_users.csv")
df.to_sql(
    "users",
    con=engine,
    if_exists="append",  # Add rows to the existing table
    index=False,         # Don't write the DataFrame index as a column
)
if_exists Options
| Value | Behaviour |
|---|---|
| "append" | Add rows to the existing table |
| "replace" | Drop the table, recreate it, then insert (destructive) |
| "fail" | Raise an error if the table already exists (default) |
Use "replace" with caution — it drops and recreates the table, so indexes, foreign keys, and constraints are lost.
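The difference between the options is easy to demonstrate with a throwaway in-memory SQLite engine (the behaviour is the same against MySQL; the table name here is made up):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
df = pd.DataFrame({"id": [1, 2]})

df.to_sql("demo", con=engine, index=False)                       # table created with 2 rows
df.to_sql("demo", con=engine, if_exists="append", index=False)   # rows added on top
after_append = pd.read_sql("SELECT COUNT(*) AS n FROM demo", engine)["n"][0]

df.to_sql("demo", con=engine, if_exists="replace", index=False)  # dropped and recreated
after_replace = pd.read_sql("SELECT COUNT(*) AS n FROM demo", engine)["n"][0]

print(after_append, after_replace)  # 4 2
```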
Write in Batches
By default to_sql inserts rows one at a time, which can be very slow for large DataFrames. Set method="multi" to batch rows into multi-row INSERT statements, and tune chunksize to control how many rows go in each statement:
python
df.to_sql(
    "users",
    con=engine,
    if_exists="append",
    index=False,
    method="multi",  # Batch inserts
    chunksize=500,   # Rows per INSERT statement
)
Import Formats
Pandas can read from many file formats before writing to MySQL. All of these return a DataFrame that can be passed directly to to_sql.
python
df = pd.read_csv("data.csv") # CSV
df = pd.read_csv("data.tsv", sep="\t") # TSV (tab-separated)
df = pd.read_excel("data.xlsx", sheet_name=0) # Excel (requires openpyxl)
df = pd.read_json("data.json") # JSON array or object
df = pd.read_parquet("data.parquet") # Parquet (requires pyarrow)
df = pd.read_feather("data.feather") # Feather (requires pyarrow)
Install optional dependencies as needed:
bash
pip install openpyxl # Excel support
pip install pyarrow # Parquet and Feather support
Export Formats
Similarly, a DataFrame loaded from MySQL can be written to many formats:
python
df = pd.read_sql("SELECT * FROM users", engine)
df.to_csv("export.csv", index=False) # CSV
df.to_csv("export.tsv", sep="\t", index=False) # TSV
df.to_excel("export.xlsx", sheet_name="Users", index=False) # Excel
df.to_json("export.json", orient="records", indent=2) # JSON
df.to_parquet("export.parquet", index=False) # Parquet
df.to_markdown("export.md", index=False) # Markdown table (requires tabulate)
orient Options for JSON
The orient parameter controls the JSON structure:
| Value | Output shape |
|---|---|
| "records" | [{"id": 1, "name": "Alice"}, ...] (most common) |
| "columns" | {"id": {"0": 1, "1": 2}, "name": {"0": "Alice", "1": "Bob"}} |
| "index" | {"0": {"id": 1, "name": "Alice"}, ...} |
| "values" | Raw 2D array of row values, no keys |
"records" is usually the right choice when sharing data with other systems.
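A quick sketch of the two most common shapes (the keys follow the DataFrame's columns and index):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

records = df.to_json(orient="records")  # list of row objects
columns = df.to_json(orient="columns")  # column -> {index: value} mapping

print(records)
print(columns)
```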