Pandas With MySQL

Pandas can read from and write to a MySQL database directly, making it straightforward to pull data into a DataFrame for analysis, or push processed data back into a table.

Connecting

Pandas' database functions require a SQLAlchemy engine as the connection — they do not accept a raw PyMySQL connection. The engine is created once and reused for all operations.

python
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

load_dotenv()

# Connection string format:
# mysql+pymysql://<user>:<password>@<host>:<port>/<database>
engine = create_engine(
    f"mysql+pymysql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '3306')}"
    f"/{os.environ['DB_NAME']}"
)

The mysql+pymysql prefix tells SQLAlchemy to use PyMySQL as the driver. The engine manages a connection pool automatically — you don't need to open or close connections manually.
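A quick way to confirm the engine works is to run a trivial query. The sketch below uses an in-memory SQLite engine so it runs without a MySQL server; in practice you would reuse the mysql+pymysql engine created above.

```python
from sqlalchemy import create_engine, text

# Stand-in engine for illustration; swap in the MySQL URL from above.
engine = create_engine("sqlite://")

# The engine hands out pooled connections; the context manager returns
# the connection to the pool when the block exits.
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

print(value)  # 1
```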

.env

This setup reads the connection parameters from a .env file; python-dotenv loads the variables from that file:

bash
DB_HOST=
DB_USER=
DB_PASSWORD=
DB_NAME=
DB_PORT=3306

Reading Data

Load an Entire Table

python
df = pd.read_sql("SELECT * FROM users", engine)
print(df)

read_sql returns a DataFrame where each column matches a table column, and types are inferred automatically (integers, floats, datetime, etc.).
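The dtype inference is easy to see on a small round trip. This sketch uses an in-memory SQLite engine as a stand-in for the MySQL engine; the pandas API is identical, and the sample data is made up.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine

# Write a small sample table, then read it back.
pd.DataFrame(
    {"id": [1, 2], "name": ["Alice", "Bob"], "score": [9.5, 7.0]}
).to_sql("users", engine, index=False)

df = pd.read_sql("SELECT * FROM users", engine)
print(df.dtypes)  # id -> int64, name -> object, score -> float64
```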

Load a Subset of Columns

python
df = pd.read_sql("SELECT id, name, email FROM users", engine)

Selecting only the columns you need is good practice — it reduces memory usage and query time on large tables.
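The memory saving is easy to verify on a synthetic DataFrame (the column names and sizes below are made up for illustration):

```python
import pandas as pd

# Hypothetical table with a wide free-text column.
df = pd.DataFrame({
    "id": range(1_000),
    "name": ["user"] * 1_000,
    "bio": ["a fairly long free-text biography field"] * 1_000,
})

# deep=True counts the actual string storage, not just pointers.
full_bytes = df.memory_usage(deep=True).sum()
subset_bytes = df[["id", "name"]].memory_usage(deep=True).sum()

print(subset_bytes < full_bytes)  # True
```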

Parameterized Queries

Never interpolate user input into a query string. Pass parameters separately instead:

python
# Single value
df = pd.read_sql(
    "SELECT * FROM orders WHERE user_id = %(uid)s",
    engine,
    params={"uid": 42}
)

# Multiple values with IN
df = pd.read_sql(
    "SELECT * FROM orders WHERE status IN %(statuses)s",
    engine,
    params={"statuses": ("pending", "processing")}
)

Read in Chunks

For very large tables, use chunksize to process data in batches rather than loading everything into memory at once:

python
for chunk in pd.read_sql("SELECT * FROM events", engine, chunksize=10_000):
    process(chunk)  # Each chunk is a DataFrame
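A common chunked pattern is a running aggregate, where each batch updates a total and is then discarded. The sketch below demonstrates this with an in-memory SQLite engine and made-up data so it runs without a MySQL server; the loop is identical against MySQL.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine
pd.DataFrame({"amount": range(1, 101)}).to_sql("events", engine, index=False)

total_rows = 0
total_amount = 0
for chunk in pd.read_sql("SELECT amount FROM events", engine, chunksize=25):
    total_rows += len(chunk)                    # rows seen so far
    total_amount += int(chunk["amount"].sum())  # running sum

print(total_rows, total_amount)  # 100 5050
```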

Filtering and Transforming in Pandas

Once the data is in a DataFrame, you can filter and reshape it using standard Pandas operations. This is useful for exploratory work and prototyping — though for large datasets it is more efficient to filter in SQL before loading.

python
# Exact match
df[df["name"] == "Alice"]

# Numeric comparison
df[df["id"] > 5]

# Multiple conditions — wrap each in parentheses
df[(df["role"] == "admin") & (df["id"] > 5)]

# String contains
df[df["email"].str.contains("@example.com")]

# Select specific columns after filtering
df[df["role"] == "admin"][["id", "name", "email"]]
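The same filters can also be written with DataFrame.query, which reads closer to SQL (the sample data below is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [3, 7, 9],
    "role": ["admin", "admin", "user"],
    "name": ["Alice", "Bob", "Carol"],
})

# Equivalent to df[(df["role"] == "admin") & (df["id"] > 5)]
admins = df.query("role == 'admin' and id > 5")
print(admins["name"].tolist())  # ['Bob']
```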

Writing Data

Append Rows to an Existing Table

python
df = pd.read_csv("new_users.csv")

df.to_sql(
    "users",
    con=engine,
    if_exists="append",   # Add rows to the existing table
    index=False,          # Don't write the DataFrame index as a column
)

if_exists Options

Value       Behaviour
"append"    Add rows to the existing table
"replace"   Drop the table, recreate it, then insert (destructive)
"fail"      Raise an error if the table already exists (default)

Use "replace" with caution — it drops and recreates the table, so indexes, foreign keys, and constraints are lost.
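One defensive pattern, sketched here with SQLAlchemy's inspector, is to choose the mode at runtime instead of hard-coding "replace". An in-memory SQLite engine stands in for the MySQL engine so the snippet is runnable as-is.

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite://")  # stand-in for the MySQL engine
df = pd.DataFrame({"id": [1], "name": ["Alice"]})

# Append if the table already exists, otherwise let to_sql create it.
# Never "replace", so an existing table is never dropped.
mode = "append" if inspect(engine).has_table("users") else "fail"
df.to_sql("users", con=engine, if_exists=mode, index=False)

count = pd.read_sql("SELECT COUNT(*) AS n FROM users", engine)["n"][0]
print(count)  # 1
```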

Write in Batches

By default, to_sql issues a separate INSERT for every row, which is very slow for large DataFrames. Set method="multi" to batch multiple rows into each INSERT, and tune chunksize to control how many rows go in each statement:

python
df.to_sql(
    "users",
    con=engine,
    if_exists="append",
    index=False,
    method="multi",     # Batch inserts
    chunksize=500,      # Rows per INSERT statement
)

Import Formats

Pandas can read from many file formats before writing to MySQL. All of these return a DataFrame that can be passed directly to to_sql.

python
df = pd.read_csv("data.csv")                    # CSV
df = pd.read_csv("data.tsv", sep="\t")          # TSV (tab-separated)
df = pd.read_excel("data.xlsx", sheet_name=0)   # Excel (requires openpyxl)
df = pd.read_json("data.json")                  # JSON array or object
df = pd.read_parquet("data.parquet")            # Parquet (requires pyarrow)
df = pd.read_feather("data.feather")            # Feather (requires pyarrow)

Install optional dependencies as needed:

bash
pip install openpyxl   # Excel support
pip install pyarrow    # Parquet and Feather support

Export Formats

Similarly, a DataFrame loaded from MySQL can be written to many formats:

python
df = pd.read_sql("SELECT * FROM users", engine)

df.to_csv("export.csv", index=False)                         # CSV
df.to_csv("export.tsv", sep="\t", index=False)               # TSV
df.to_excel("export.xlsx", sheet_name="Users", index=False)  # Excel
df.to_json("export.json", orient="records", indent=2)        # JSON
df.to_parquet("export.parquet", index=False)                 # Parquet
df.to_markdown("export.md", index=False)                     # Markdown table (requires tabulate)

orient Options for JSON

The orient parameter controls the JSON structure:

Value       Output shape
"records"   [{"id": 1, "name": "Alice"}, ...] (most common)
"columns"   {"id": {"0": 1, "1": 2}, "name": {"0": "Alice", "1": "Bob"}}
"index"     {"0": {"id": 1, "name": "Alice"}, ...}
"values"    Raw 2D array, no keys

"records" is usually the right choice when sharing data with other systems.
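The difference between the orients is easiest to see side by side on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# List of row objects: one dict per row, index dropped.
records = df.to_json(orient="records")
print(records)  # [{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]

# Bare 2D array: values only, no column names.
values = df.to_json(orient="values")
print(values)  # [[1,"Alice"],[2,"Bob"]]
```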

Lapland University of Applied Sciences

© 2026 Juha Petäjäjärvi