Creating SQL insert statements programmatically from CSV with Python
On tonight's episode of "Reinventing the wheel" we have this useful little script I made. The problem I was facing: I needed to create a lot of SQL insert statements to import data, originally sourced from a CSV file, into a test database so I could try out a new feature. Yes, I realize many databases have built-in ways to do this. Yes, I realize there are probably libraries for this already. I still wanted to give it a go for my slightly specific syntax requirements, as I suspected I'd need such a script a LOT for testing. I was right. Nowadays, with Copilot riffing off ideas like this, it was a pretty low-effort creation that was nonetheless super useful. If you're looking for a mega-optimized, production-ready version of this task I'd honestly look elsewhere, but this made a solid dev tool. Man, I love dev tools.
The requirements
To make the needs of the script more specific, I had CSV data like this:
name,count,creation-date,url
test name,2,20231102,not-a-real-url
another test name,3,20231101,
Note the last value of the last row being empty. Many, many rows of this, with potential null values and mixed data types. What I wanted at the end was a bunch of statements like this:
INSERT INTO testing-data (ID, NAME, COUNT, CREATION_DATE, URL) VALUES (1,'test name',2,'20231102','not-a-real-url');
INSERT INTO testing-data (ID, NAME, COUNT, CREATION_DATE, URL) VALUES (2,'another test name',3,'20231101',NULL);
This allowed me to insert each row of the CSV on its own into the table testing-data, with the types and table name being configurable. So a pretty simple algorithm here. For each row of the CSV, capture each column, handle its data type, and assemble it into an INSERT INTO string, with the column names extracted and used for the first part of the INSERT statement, plus an ID at the beginning. The values themselves are then used for the VALUES portion of the statement, including an ID column inserted at the beginning that increments with each row. Handle nulls. Add a semicolon at the end so it won't explode. Sounds a bit like an interview question.
The solution
What I came up with was this.
import csv


def csv_to_sql_insert(csv_file_path, types, table_name, insert_columns):
    """Function to create SQL INSERT statements based on data from a CSV file

    Parameters:
    csv_file_path -- path to a CSV file containing the data to insert
    types -- a list of types for each column in the CSV file. Types can be
             either NUMBER, VARCHAR, or DATE
    table_name -- the name of the table to insert into
    insert_columns -- the column names for the INSERT statement, including
                      the leading ID column

    Each generated SQL INSERT statement is printed to the console.
    """
    with open(
        csv_file_path,
        "r",
    ) as csv_file:
        # read the header row to work out how many columns there are
        csv_reader = csv.reader(csv_file, delimiter=",")
        first_row = next(csv_reader)
        num_cols = len(first_row)
        # rewind and re-create the reader so the main loop sees every row
        csv_file.seek(0)
        csv_reader = csv.reader(csv_file, delimiter=",")
        for idx, row in enumerate(csv_reader):
            # skip the first header row
            if idx == 0:
                continue
            formatted_columns = ", ".join(insert_columns)
            sql_insert = f"INSERT INTO {table_name} ({formatted_columns}) VALUES "
            sql_insert += "("
            # add an id to the data rows
            sql_insert += f"{idx},"
            quote_pattern = "'"
            for i in range(num_cols):
                value_to_add = ""
                if row[i] == "":
                    value_to_add = "NULL"
                elif types[i] == "NUMBER":
                    # strip currency symbols, spaces and thousands separators
                    row_number = row[i]
                    row_number = str(row_number).replace("$", "")
                    row_number = row_number.replace(" ", "")
                    row_number = row_number.replace(",", "")
                    value_to_add = row_number
                elif types[i] == "VARCHAR":
                    value_to_add = quote_pattern + row[i] + quote_pattern
                elif types[i] == "DATE":
                    value_to_add = quote_pattern + row[i] + quote_pattern
                if i == num_cols - 1:
                    sql_insert += value_to_add
                else:
                    sql_insert += value_to_add + ","
            sql_insert += ");"
            print(sql_insert)
if __name__ == "__main__":
    path = "sample_data.csv"
    query_types = [
        "VARCHAR",
        "NUMBER",
        "DATE",
        "VARCHAR",
    ]
    column_names = [
        "ID",
        "NAME",
        "COUNT",
        "CREATION_DATE",
        "URL",
    ]
    query_table_name = "testing-data"
    csv_to_sql_insert(path, query_types, query_table_name, column_names)
Here the CSV data lives right next to the script in a file called sample_data.csv, and the types, table name, and column names are hard coded into the main block of the script. Those could just as easily be factored out into parameters for the script itself, and honestly probably should be, especially if this script were to sit under automation of some kind.
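If I did go that route, argparse is probably what I'd reach for. Here's a minimal sketch of what the main block might look like instead; note this isn't part of the script above, and the flag names and the comma-separated format for the types and columns are just placeholders I made up.

import argparse

# Rough sketch of a command line wrapper around csv_to_sql_insert.
# Assumes it lives in the same file as the function above and replaces
# the hard-coded main block.
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Turn a CSV file into SQL INSERT statements"
    )
    parser.add_argument("csv_path", help="path to the CSV file")
    parser.add_argument("--table", required=True, help="table to insert into")
    parser.add_argument(
        "--types",
        required=True,
        help="comma separated column types, e.g. VARCHAR,NUMBER,DATE,VARCHAR",
    )
    parser.add_argument(
        "--columns",
        required=True,
        help="comma separated column names for the INSERT, including ID",
    )
    args = parser.parse_args()

    csv_to_sql_insert(
        args.csv_path,
        args.types.split(","),
        args.table,
        args.columns.split(","),
    )

With that in place you could run something like python the_script.py sample_data.csv --table testing-data --types VARCHAR,NUMBER,DATE,VARCHAR --columns ID,NAME,COUNT,CREATION_DATE,URL and redirect the output to a .sql file, without ever touching the code.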
From there things are pretty straightforward. Using Python's excellent with syntax the file is opened, the header row is read to count the columns, and then the reader is rewound to the start of the file (seek(0)) before the script begins building each SQL statement up in the sql_insert variable. One tricky bit here was handling quotes. The quote_pattern variable controls whether to use double or single quotes, SQL being kind of a funny beast that accepts different syntaxes in different systems. In my case I wanted single quotes. From there I handle the empty value case and convert it to NULL, along with cases for each of the data types used in the query_types param.
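One thing the script above deliberately doesn't handle (my test data never needed it) is values that themselves contain a single quote. If that ever came up, a small helper along these lines would probably do the trick, doubling the embedded quote, which is the standard SQL way of escaping it. The helper name is just my own sketch, not something in the original script.

def quote_value(value, quote_pattern="'"):
    # Wrap a VARCHAR/DATE value in quotes, doubling any embedded quote
    # characters so the generated SQL doesn't break.
    escaped = value.replace(quote_pattern, quote_pattern * 2)
    return quote_pattern + escaped + quote_pattern


# e.g. quote_value("it's a test") -> "'it''s a test'"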
Further musings on AI
Copilot proved really useful for this script as I already had a pretty good idea of what I wanted. I find Copilot generally works best when given specific, narrow directions, not just "make me a script that does X", although it can occasionally handle that too. In this case it was great for smashing out syntax like string replaces, and even inferring the next if case I needed to handle. Like IntelliSense on steroids. I have a feeling the docstring I added at the beginning helped fuel the model to know what I wanted to achieve and with what values, though I'm not going to dig too deep into that lest I create some kind of Docstring Cargo Cult for AI.
I swear SQL and Python will be with us forever. Such useful tools.