Most undergrad programs for Data Science don’t give students the tools they need to operate in a modern data environment. Starting with the most basic engineering skills, concepts like SSH, Git, Web Servers, Virtual Environments, and Containers were not part of the curriculum I went through. This is actually an issue for standard Computer Science education too, but it gets compounded in Data Science because there’s another missing piece: Data Engineering.
Leaving aside the dearth of programs that actually teach you how to be a Data Engineer, there are basic skills and concepts from Data Engineering that Data Scientists need to succeed but have a hard time finding education for. Moving data around with Python scripts, database organization concepts like the Star Schema, frameworks like Airflow and Kafka, and ETL basics are immensely helpful skills and knowledge to have on a data team, but typically reside in the heads of Data Engineers (or whatever the equivalent position on your team is) only.
In the spirit of this popular Stitch Fix post about how vertically integrated Data Scientists should actually be, I think that all of the skills I mentioned above make a stronger, more effective Data Scientist. I’ve been picking up a lot of this stuff at work by working on Data Engineering tickets and choosing high-unknown projects, and I wanted to solidify it (and build on it) by building something myself.
I’m very into data and I’m very into sneakers, which got me wondering: how could I track my sneaker usage? When did I buy things, and when did I sell them? How often do I wear different colors? Do the seasons of the year impact the materials I choose? How much money do I spend a month on kicks? I could gather all of this data manually in a spreadsheet, but it seemed like a great candidate for a proper pipeline. Here’s the idea:
The project starts with a web application that lets the user (me) input sneaker data – like purchases, walks, and cleans – into a form. Backend logic takes that data and does two things: (1) sends a lifecycle event (LCE) to the events table, and (2) updates or creates a record in the sneakers table. Downstream, the data is transformed and loaded into a sneaker dimension table by an Airflow job. That table (and the current view) are the basis for any analytics that you’d want to do later on.
The paragraph above is pretty standard in a lot of these “building data pipelines” posts, but a few months ago I had absolutely no idea what big chunks of it meant. Let’s work on some definitions:
Web Application: instead of inserting rows into tables manually, it would be nice to be able to write things down in HTML forms and have some code insert it into those tables. The web app – running on a Flask server (to be explained) – is the way to do that. It has a front end (HTML, CSS) and a backend (Python).
Lifecycle Events: in product data, things that users do are usually captured as events – adding a sneaker, wearing a sneaker, selling a sneaker – each with an accompanying set of details. An example lifecycle event in our case would be that I wore a pair of sneakers, with the sneaker id, the sneaker name, and the time that I wore them. These events are very granular data, and it takes a bunch of work to turn them into data that can be used for actual analytics.
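As a sketch, a “wear” event like that could be built as a plain Python dict before being written to the events table – the field names here are my assumptions for illustration, not the app’s actual schema:

```python
from datetime import datetime, timezone

def make_wear_event(sneaker_id, sneaker_name, worn_at=None):
    """Build a 'wear' lifecycle event as a plain dict (hypothetical field names)."""
    return {
        "event_type": "wear",
        "sneaker_id": sneaker_id,
        "sneaker_name": sneaker_name,
        # Timestamps are what make granular events usable downstream.
        "event_time": (worn_at or datetime.now(timezone.utc)).isoformat(),
    }

event = make_wear_event(3, "Suede Common Projects")
```

Each row in the events table would then be one of these dicts, one per thing that happened.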
Updating Records: the “production” database in this project is the sneakers table, and the goal is to keep an up to date record of each sneaker in the collection for application purposes (not analytics purposes). Any time a major event happens (a purchase or a sale in our case), we update that table. For example, a new purchase adds a new row with the accompanying details, or a new sale marks that a given sneaker was sold, when, and how.
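A minimal sketch of those two update paths, using an in-memory SQLite table so it runs anywhere (the real project uses Postgres, and the column names here are assumptions):

```python
import sqlite3

# Hypothetical, pared-down version of the "production" sneakers table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sneakers (
        sneaker_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        sold       INTEGER NOT NULL DEFAULT 0,
        sold_at    TEXT,
        sold_how   TEXT
    )
""")

def record_purchase(conn, name):
    """A purchase adds a new row; returns the new sneaker_id."""
    cur = conn.execute("INSERT INTO sneakers (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid

def record_sale(conn, sneaker_id, sold_at, sold_how):
    """A sale marks an existing row as sold, recording when and how."""
    conn.execute(
        "UPDATE sneakers SET sold = 1, sold_at = ?, sold_how = ? WHERE sneaker_id = ?",
        (sold_at, sold_how, sneaker_id),
    )
    conn.commit()

sid = record_purchase(conn, "Suede Common Projects")
record_sale(conn, sid, "2019-06-01", "consignment")
```

The point is that this table always reflects the present state – it answers “what do I own right now?”, not “what happened over time?”.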
Sneaker Dimension Table: a dimension table is a core element of analytics pipelines – it represents the state of a given object (a sneaker!) at a given point in time. In practice, if we were to pick a given pair – let’s say my Suede Common Projects – the dimension table should be able to tell me, for example, how many times they had been worn as of two weeks ago. It’s also common to set up a current view that only contains the up-to-date records.
(Dimension tables are part of the Star Schema, which is a particular way of organizing your tables. I’m implementing it as I’ve learned at work (historical dimensions), but there are other approaches too.)
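To make the historical-dimension idea concrete, here’s a sketch in in-memory SQLite: each row is one sneaker’s state over a validity window, and an open-ended `valid_to` marks the current record. Column names and the sentinel date are my assumptions, not the project’s actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_sneakers (
        sneaker_id INTEGER,
        wear_count INTEGER,
        valid_from TEXT,
        valid_to   TEXT
    )
""")
conn.executemany(
    "INSERT INTO dim_sneakers VALUES (?, ?, ?, ?)",
    [
        (1, 4, "2019-05-01", "2019-05-15"),  # older state
        (1, 7, "2019-05-15", "9999-12-31"),  # current state (open-ended)
    ],
)

def wear_count_as_of(conn, sneaker_id, as_of):
    """Return the wear count for a sneaker as of a given date."""
    cur = conn.execute(
        """
        SELECT wear_count FROM dim_sneakers
        WHERE sneaker_id = ? AND valid_from <= ? AND ? < valid_to
        """,
        (sneaker_id, as_of, as_of),
    )
    return cur.fetchone()[0]
```

The “current view” mentioned above would just filter for rows whose `valid_to` is still the open-ended sentinel.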
One of the biggest challenges in getting better at Data Engineering is the unspoken knowledge – assumptions about how things should be structured and built. There’s no good way to get this kind of knowledge without working in an organization with production data pipelines that you can learn from, or a mentor of some sort.
The structure of:
is a pretty standard one, and fit my analytics needs for this project.
This first post is about the top part of the stack: the web application and basic data tables. Here goes!
I’ve built websites before (my personal one is on Github Pages), but I’ve always struggled to understand why I’d need a web server. What exactly is it, and what does it give me? Why can’t I just serve HTML and CSS files with absolute URL paths?
I think the best way to explain web servers is to start with a set of things we want to do with this sneaker application. Here’s what we need:
These needs all roll up into one broad idea: we need a strong connection between our frontend and our database.
There’s a clear frontend (HTML) and backend (Python + Database) here, and the simplest way to explain a web server is that it lets you create both of those. It also allows for URL routing, serving static assets (images, CSS) efficiently, and limiting load from users. You’ll find some combination of these features in most online explanations.
Without some sort of web server, it would be really difficult to move data around and execute functions whenever a user does something.
Flask is a Python package that gives you just about the most barebones web server you can build in Python, and it’s perfect for this kind of application. I’ll skip the Flask introduction – there are plenty of great ones out there – and cover how I decided to use it and the issues that beginners might run into.
Remember how most of our needs centered around communicating between the front end and the database? Flask has a bunch of methods and subpackages that exist solely for that purpose. And I totally ignored them!
I like Postgres, and wanted to get some experience setting up databases and connections myself, especially on a remote server. I built the sneakers, manufacturers, and sneakers_events tables through psql, the interactive terminal that Postgres gives you to run commands.
An important part of modern Data Engineering (as far as I can tell) is what I’d call Janky Python Scripts(TM) – self contained Python programs that move data around, pull from APIs, etc. I wanted to get more experience building these from scratch and using Python packages like psycopg2, so I built the database connection myself.
In practice, that meant that instead of using methods like Sneakers.query.all(), I built my own functions to query the sneakers database, and put them in a separate data_scripts module. Some example (static) methods that I defined were list_available_sneakers, insert_sneaker_row, and insert_sneaker_event.
Flask’s WTF extension (nope, not that WTF) lets you make simple forms through Flask. You define them in a forms.py file as classes that inherit from a FlaskForm base class, and can choose from a bunch of input field types like IntegerField, StringField, and SelectField.
I wanted to use a SelectField to let me choose a pair of sneakers that I wanted to wear, and that required listing all of the available sneakers from the sneakers table in the database. That wasn’t too hard (I used that list_available_sneakers function I mentioned), but I kept running into the same problem: I needed to reload the entire server to get those choices to update once I added a new sneaker. That was no good.
It took me a while, but I eventually figured out that I could define a constructor (__init__) that pulled those choices (via list_available_sneakers) whenever the form was initialized, and store them in a choices variable. I passed that choices variable to the SelectField and all was well.
In Flask, you use a routes.py file to execute backend logic whenever a particular page loads. For example, if you want some stuff to happen when a user hits http://sneakers.com/index.html, you put that logic in the appropriate function in the routes.py file. You add a little @app.route('/somepath', methods=['GET', 'POST']) decorator before the function you want to run, and it works just like that.
Another cool part of Flask is automatic URL routing – if you define a function to run at a given URL, you can reference that URL anywhere with the url_for() function. For example: if I define an add_sneakers function that’s supposed to run whenever users hit a URL, the url_for('add_sneakers') call will always point to the right place, even if I later change the path in the route decorator.
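Both ideas together fit in a few lines – the routes and return values here are placeholders, not the app’s real ones:

```python
from flask import Flask, url_for

app = Flask(__name__)

@app.route("/add", methods=["GET", "POST"])
def add_sneakers():
    # In the real app this would render the form and handle submissions.
    return "add sneakers page"

@app.route("/")
def index():
    # url_for resolves by function name, so this link keeps working
    # even if the "/add" path in the decorator changes later.
    return f'<a href="{url_for("add_sneakers")}">Add a sneaker</a>'
```

Running this with flask run and visiting the root page would show a link that always points at whatever path add_sneakers is currently mounted on.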
Earlier, I mentioned that one of the points of a web server is to efficiently serve static assets like images to users. Some angry Stack Overflow users have argued against using Flask for this, but you can do it if you need to – it’s just not intuitive. Instead of storing CSS or image files in your normal directories, you need to create a special static folder.
You can pull files from that folder using url_for('static', filename=filename).
After a few days of searching and a lot of redoing, the finished product looked like this:
Here’s a screenshot of the homepage:
And here’s what I was talking about with loading choices (it only shows the sneakers that I currently own):
The next part of this series will take a look at the second part of this project – building ETL and a downstream table with Airflow. Stay tuned! You can check out the SNQL app repo here.