BiteofanApple
by Brian Schrader

Primitive Tech is Here for the Long Haul

Posted on Fri, 24 Nov 2017 at 02:46 PM

John's most recent Primitive Technology blog post has me pretty excited. The new video is great, as his videos tend to be, but the post mentions something a bit more exciting.

I bought a new property to shoot primitive technology videos on. The new area is dense tropical rainforest with a permanent creek. Starting completely from scratch, my first project was to build a simple dome hut and make a fire.

This is just me speculating here, but he wouldn't be buying a whole new plot of land to shoot videos on if he wasn't planning on making more videos long term. While I never had any reason to believe he'd be stopping anytime soon, such a big purchase is great news for him, his channel, and fans like me.

Primitive Technology Blog →

The FCC Moves to Dismantle Net Neutrality

Posted on Thu, 23 Nov 2017 at 09:18 PM

Jonathan Shieber for TechCrunch →

Federal Communications Commission Chairman Ajit Pai today made good on his long-standing pledge to tackle regulations established in the last administration designed to protect the distribution of internet content.

On Tuesday, Pai distributed to the other commissioners at the FCC a draft of his suggested rule changes under the auspices of the “Restoring Internet Freedom Order.”

The move sets up a December 14 vote at the FCC that could have broad ramifications for the entire internet. Under the rules established by the Obama administration, internet providers are required to provide open access to their networks for all digital content.

I'm really sad to see the FCC going forward with its plan to dismantle the Net Neutrality protections. What's worse is that Ajit Pai seems to know full well what this will mean for the web, and seemingly no amount of public comment against his proposal can dissuade him.

It seems like the only recourse we really have now is either to chip away at Ajit Pai's resolve with public comments before the December vote, or to wait for some act of Congress to reverse the decision: something I can't even imagine them doing.

Help Protect the Internet by Contacting Your Representative →

MyGeneRank: Behind the Scenes of the Newest ResearchKit App

Posted on Wed, 25 Oct 2017 at 10:11 AM

I'm super excited to announce that MyGeneRank, an app that I've been working on at my jobby-job at the Scripps Translational Science Institute for a year and a half, is now available on the App Store, and the source code is available on GitHub!

I've wanted to talk about this project for a while, and I've written many unpublished posts about how it works; now the time is finally right. If you're looking for the scientific or research parts, I'll leave those to the paper we published. I want to talk more about my experiences and what I've learned in building the system.

As a quick overview: MyGeneRank is a ResearchKit-based research study app aimed at providing users with their genetic risk for certain diseases and measuring their reactions to this information; the first disease covered is coronary artery disease. I'm definitely not a doctor, statistician, or biologist, and everyone else on the team handled all of the scientific work, but I am a software developer, and as the sole developer I worked on the vast majority of the API, computation engine, website, and iOS app development, as well as the DevOps and system administration. I learned a hell of a lot during the last year and a half, and looking back, I'm not sure how I even got this far. The source code is available, at least for the API (iOS source is coming), so now you too can see my mistakes and pass judgement! 🎉

From a Technical Perspective

Broadly, MyGeneRank's backend has three main parts: a database (Postgres), a Django REST API (which is open source), and what we've called a Computation Engine. All of it runs in-house, maintained by yours truly. The API and database are pretty self-explanatory, and the "engine" is really a Celery cluster and Redis queue which runs, among other things, a series of Python-wrapped command-line tools and custom R scripts to calculate a person's genetic risk given their 23andMe genotype data. While the computation stuff is a sort of special case (what with the CLI tools and R scripts), the API's design goal was to stick as close to industry-standard practice as possible. It's 90% covered with tests, leverages Travis-CI, uses DRF and Celery for the vast majority of its work, and everything runs in Docker containers on CentOS.
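To give a flavor of what that engine pattern looks like, here's a minimal sketch of a Celery task wrapping a command-line R step. The script name, arguments, and broker URL are hypothetical stand-ins, not MyGeneRank's actual pipeline:

# Hypothetical sketch: a Celery task that shells out to an R script.
import subprocess

from celery import Celery

app = Celery('engine', broker='redis://localhost:6379/0')

@app.task
def run_r_step(genotype_path, chromosome):
    # Workers across the cluster pull these tasks off the Redis queue
    # and wrap the external tool in a plain subprocess call.
    result = subprocess.run(
        ['Rscript', 'risk_step.R', genotype_path, str(chromosome)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()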

If this stack sounds familiar to readers of this site, then you're catching on. In my post about Adventurer's Codex's stack, I spelled out basically the same setup. In truth, the stack for AC was heavily influenced by MyGeneRank: I took everything I learned building MyGeneRank and ported it to Adventurer's Codex a year later. That's how developers work: we do things once, then copy-pasta it everywhere.

Scientific Computing at Scale: Performance and Throughput

MyGeneRank has very demanding computational needs. Currently, we have 178 cores and almost a terabyte of RAM powering the app and its backend. It turns out that calculating a person's genetic risk, even using genotyping data and not NGS, requires a lot of computational power. Scaling this kind of intense scientific computation for public use was one of the most challenging (and enjoyable) parts of the project. But even now I don't really have many concrete answers to the problem other than these twin suggestions: add more cores, and make your work as functional, and therefore as parallelizable, as possible.

Into the Weeds for a Bit

The calculations needed to return a given user's genetic risk score can be broken into roughly 110 individual tasks, and the work is mostly trivial to parallelize. What takes ~110 minutes of CPU time per user can be done in 3.5-4 minutes of wall time on our current system, but as any web developer knows, even that kind of processing time is hard to scale.

The first couple of tasks run in series, and then two chunks of tasks run in parallel. The first chunk contains a single task which calculates the user's genetic ancestry, and the other chunk has 52 two-part tasks. This means that at any one time, 53 tasks per user are running during the bulk of the computation. These first 52 tasks take ~1.5-1.8 minutes each depending on which chromosome they're processing (some are bigger than others), and then the second part takes about the same amount of time per chunk. The genetic ancestry calculation takes ~3.2 minutes. Once all of these tasks are complete, there's a final step that calculates the actual risk score, which is nearly instantaneous.

What this means is that the time from start to finish for a given user's score is parallelizable up to ~54 cores; beyond that it's core speed that matters, which is harder to improve. The extra cores we have allow us to calculate more scores at once, but even with our huge core count, we can only calculate ~3-4 users' scores at a time. The good news is that all of the steps are really good at keeping memory use low; CPUs are the bottleneck here.
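For the curious, that fan-out/fan-in shape maps nicely onto Celery's canvas primitives. This is just a sketch under those assumptions; the task names and bodies are hypothetical stand-ins, not the real pipeline:

# Hypothetical sketch of the 52-chunks-plus-ancestry fan-out.
from celery import Celery, chain, chord, group

app = Celery('engine', broker='redis://localhost:6379/0')

@app.task
def component(genotype_path, chunk): ...   # part one of a two-part task

@app.task
def refine(component_result): ...          # part two of a two-part task

@app.task
def ancestry(genotype_path): ...           # the single ~3.2 minute task

@app.task
def risk_score(results): ...               # final, nearly instant step

def score_pipeline(genotype_path):
    # 52 two-part chains plus the ancestry task run in parallel
    # (53 tasks at once); the chord fires the final risk score
    # calculation once all of them have finished.
    header = group(
        [chain(component.s(genotype_path, c), refine.s()) for c in range(52)]
        + [ancestry.s(genotype_path)]
    )
    return chord(header, risk_score.s())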

Improving API and website-level performance is much more straightforward than doing the same for the backend. Like most sites, MyGeneRank sits behind an Nginx reverse proxy with some out-of-the-box microcaching for popular pages.

At the time of writing, I'm not sure what the load will be like when we finally announce the study publicly, but I've spent a lot of time worrying about and trying to ensure that the site can handle the loads we hope for. There have been a lot of interesting news stories and blog posts over the years about what kind of download numbers an app, and especially a research app, can expect, and I wanted to build MyGeneRank with those kinds of numbers in mind. Once the project has hit its first month, I'm going to do a retrospective on how it all went, and we'll see if my performance enhancements were enough.

Lessons Learned

There are a lot of little things that I've learned in building MyGeneRank (and later Adventurer's Codex). When we started, I'd worked on a few toy iOS apps and a few corporate web projects, but MyGeneRank turned out to be on a completely different scale.

Before MyGeneRank, I'd never used Django or Django REST. I'd heard of them, and had a friend who used them, but aside from a few toy projects in Flask, my web experience was in front-ends or Java/Spring (and I guess PHP). My work at that time was mostly in writing analysis pipelines in Python, and since it's my preferred language, I wanted to use it for MyGeneRank. To this day, the structure of the API project is a little wonky and apps aren't where they should be; both are cruft from those early days. I try not to worry about it too much since I was learning as I went, and this kind of legacy cruft is impossible to avoid unless you knew everything at the start, and we most assuredly didn't. I can say that Django/Django REST has shown me just how boring building websites really is, because it does most of it for you automatically and supports anything you'd ever really need right out of the box; you should definitely use it.
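To illustrate just how much you get for free, here's essentially DRF's own quickstart pattern (a generic example, not MyGeneRank's code): a serializer, a viewset, and a router give you a complete, browsable CRUD API in a couple dozen lines:

# Generic DRF example: serializer + viewset + router = a full CRUD API.
from django.contrib.auth.models import User
from django.urls import include, path
from rest_framework import routers, serializers, viewsets

class UserSerializer(serializers.ModelSerializer):
    class Meta:
        model = User
        fields = ('id', 'username', 'email')

class UserViewSet(viewsets.ModelViewSet):
    queryset = User.objects.all()
    serializer_class = UserSerializer

router = routers.DefaultRouter()
router.register(r'users', UserViewSet)

urlpatterns = [path('api/', include(router.urls))]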

The modern web is really complex and there's a reason that it takes so many skilled developers to build large systems. Server setup, administration, DevOps, reporting, and application development are all sub-disciplines unto themselves (which is why they're separate jobs at most places). And I've found that jumping in and out of these different worlds can result in a sort of Programmer's Jet Lag as your body adjusts to the new environment after spending days in a completely different one.

On the native side, Apple's frameworks can be fun to use, and their OS frameworks, documentation, and user guides are world-class, but their tooling can also be frustrating and slow at times. The iOS app is written entirely in Swift, and that has had some major effects on its development. Swift's tooling is still very new, and the language has changed drastically since it came out. Having worked in both, I can say that, while I do enjoy Swift, nothing has made me appreciate the maturity of Python more.

Overall, my advice for building these kinds of systems is the same as when I wrote a similar post about Adventurer's Codex:

...ask people who've done it before... The internet is great, but it's actually pretty difficult to find out how to design modern web systems from scratch with just a vague notion and Google.

MyGeneRank, to me, represents my passage from a junior to a senior developer in a lot of ways. By no means have I learned all there is to know, but having now built two large web projects, and being the sole developer for one of them, I feel like a different person from the one who started the project a year and a half ago. I'd love to know what you all think of the source, and if you find a bug, please file an issue.

OAuth Over XMPP

Posted on Tue, 24 Oct 2017 at 06:24 PM

As I've said before, Adventurer's Codex uses XMPP for its real-time features. During development we ran into a couple of interesting challenges with integrating such a mature system into our new-ish web stack, one of which was user authentication.

The majority of Adventurer's Codex uses an OAuth provider model for user authentication, but Ejabberd (our XMPP server) requires that the username and password be sent at connection time. Obviously we didn't want to have two different auth schemes to support, and we didn't want our client app to store any passwords (hence OAuth). We spent a while hunting for different possible solutions, and in the end we stumbled onto a really simple one.

Ejabberd allows authentication to be handled by an external script, which lets us use our core database as the auth backend. We could, in principle, use a Django management command to query our database, hash the password we were given by the client, and compare it to the one stored in our database, but not only is that a lot of work and error-prone, it's too coupled to our database layer, and it still requires the client to store the user's password.

In the end we went with what might seem like the obvious solution: just keep using OAuth. After the client receives the initial user data at load time, it sends the same OAuth token as the password along with the user's XMPP JID to Ejabberd. Ejabberd then calls out to an external script which makes an HTTP request to our API to see if the user exists and that the token is valid. Clean and simple.
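The linked script below has the real details, but the overall shape is simple: ejabberd writes length-prefixed commands like auth:user:server:password to the script's stdin and reads a two-byte result back. Here's a minimal sketch of that loop; the token-check endpoint and its parameters are hypothetical stand-ins for our actual API:

#!/usr/bin/env python
# Hypothetical sketch of an ejabberd external auth script.
import struct
import sys

import requests

TOKEN_CHECK_URL = 'https://example.com/api/check_token'  # stand-in URL

def token_is_valid(username, token):
    # Ask the main API whether this OAuth token is valid for the user.
    resp = requests.get(TOKEN_CHECK_URL, params={'user': username, 'token': token})
    return resp.status_code == 200

while True:
    header = sys.stdin.buffer.read(2)
    if len(header) < 2:
        break
    (length,) = struct.unpack('>h', header)
    fields = sys.stdin.buffer.read(length).decode('utf-8').split(':')
    if fields[0] == 'auth':
        # The "password" field is really the client's OAuth token.
        _op, user, _server, token = fields[:4]
        result = 1 if token_is_valid(user, token) else 0
    else:
        result = 0
    sys.stdout.buffer.write(struct.pack('>hh', 2, result))
    sys.stdout.buffer.flush()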

A visualization of the OAuth over XMPP process.

There are a few major advantages to using this method. First, the client no longer has to store user passwords, which not only makes our implementation simpler but also protects our users from a whole host of attacks. Second, the user's XMPP session is now bound to the same limits as the rest of their access to the site, which greatly simplifies permissions handling. Third, and perhaps most interesting, Ejabberd is no longer tied to either the Django CLI or the database and can be spun off as essentially a separate microservice on another machine.

Check out the Ejabberd Auth Script →

todolist update

Posted on Tue, 10 Oct 2017 at 03:35 PM

Over the last week, I've made some changes to my todolist script: I've cleaned up the printing a bit and removed the temp file.

todolist terminal output

I had to remove the temp file because it was actually causing performance problems with BBEdit. Since the temp file came into and out of existence every few seconds, BBEdit's project view would dutifully redraw the project file list twice in quick succession, wasting quite a bit of CPU power[1] and sometimes causing my MacBook's fan to spin up. Now that the temp file is gone, that problem is too.

1. I'm not sure why BBEdit needs so much power to redraw the project list, and I've reached out to their support. Hopefully they can resolve the issue. At the very least my script is better behaved now, so it's no longer an issue for me.

todolist →

Mini-Rant About Documentation

Posted on Mon, 09 Oct 2017 at 03:31 PM

I want to talk about documentation. iOS[1], Nginx, Python, DRF, Django, Celery, and Postgres all have excellent documentation, but documentation only helps when your question is "How does this thing work and what does it do?" Documentation, at least code-level docs, is useless when it comes to figuring out what you need in the first place. Celery can tell you how to use Celery, but it isn't as great at telling you why you might need it. I've become convinced that user guides are as important as, if not more important than, code-level documentation, and we as a community need more of them.

1. To their credit, iOS and really all of Apple's developer resources have excellent user guides that explain not only how to use a thing, but why and where you might need it (thinking about it, this could be because iOS and macOS have been around long enough to develop these kinds of docs).

todolist

Posted on Thu, 28 Sep 2017 at 02:28 PM

I've talked before about how I use TODO comments in my code to lay out what I want to do before actually doing it. To help me keep track of all of these TODOs in my code, I wrote a little script yesterday, and I've put it on GitHub for anyone who's interested.

The script looks through all of the code (by default Python code) in a given destination directory, greps for the TODO comments, and prints them nicely in a constantly updated list in the terminal. The output looks like this:

Todolist Terminal Window

Writing this script, I learned a couple of new things about terminal commands, like how to clear the screen without deleting the scrollback or just printing newlines (i.e. what clear does). I've put the script in my /usr/local/bin and called it todolist, so now I can invoke it from anywhere and get a nice little list of what I've put off working on.
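For the curious, here's a stripped-down sketch of the idea (not the actual script; the defaults and polling details are simplified):

#!/usr/bin/env python
# Stripped-down sketch: scan for TODO comments and redraw the list.
import os
import sys
import time

def find_todos(root, extension='.py'):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as f:
                for lineno, line in enumerate(f, 1):
                    if 'TODO' in line:
                        yield '%s:%d: %s' % (path, lineno, line.strip())

while True:
    # Clear the visible screen without erasing the scrollback buffer.
    sys.stdout.write('\033[H\033[2J')
    for todo in find_todos(sys.argv[1] if len(sys.argv) > 1 else '.'):
        print(todo)
    time.sleep(5)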

todolist on GitHub →

Accidental DevOps

Posted on Tue, 26 Sep 2017 at 12:30 PM

Since I became a developer, I've always worked on small (3-5 person) or single-person teams. Even at my current job, I'm the lead and only full-time developer. In more recent projects (including Adventurer's Codex) this means that I'm the DevOps guy and System Admin as well. I'm by no means an expert in either, but I can do both.

I started learning how to manage and administer servers when I started this site back in 2012. Back then I never thought that all of those hours spent configuring Apache and PHP would lead to anything, but those countless hours of frustration taught me the basics. Fast-forward 5 years and I'm developing three major projects (two unannounced) and I'm DevOps and SysAdmin for all three. It's crazy to think about.

I'd highly recommend that any new developer follow the same general path I did: start a project or blog and learn to deploy it yourself. I started with a cheap old-style webhost and FTP, and slowly moved to managing the whole stack on Linode. I'm using Docker on new projects, and for now, I'm scripting my own deploys (though this could change soon if I migrate one project to Ansible).

As developers, it's sometimes easy to forget that we write software that actually runs on some actual hardware in some actual datacenter somewhere. Knowing how to do many of the things that DevOps engineers and SysAdmins do will not only make you a better developer, it gives you the ability to do more on your own. You often don't need tons of layers of software to deploy your own if you know how to do it from the ground up (especially if it's a smaller project). Those tools make it easier, sure, but they're not required.

Getting Back on the Horse

Posted on Mon, 25 Sep 2017 at 04:20 PM

It's been a while since my last post. It's been unusually quiet here this whole summer, and while a number of personal issues cropped up that derailed me from blogging, I've really just gotten out of the habit of writing regularly. This is me forcing myself back onto that horse. I've got some exciting news coming, and between work and Adventurer's Codex, I've been keeping myself way too busy.

On the Adventurer's Codex front: we're in the middle of a large refactor caused by our ongoing migration to Webpack, which should hopefully fix a bug caused by our fairly primitive current build and deploy system. We're starting to see the light at the end of the tunnel now, and hopefully it won't take much longer before we're back to writing new, cool features. The original build and deploy system was basically a lot of manual work and a shell script. When we wrote it, I had no idea what Webpack was or how to deploy a modern front-end app; now I do. That's what happens when you learn on the go; sometimes you have to step back and fix your past mistakes.

In my dwindling spare time, I've been working on another project that I hope to announce soon, so look for more to come there.

TODOs as a Templating System

Posted on Mon, 31 Jul 2017 at 03:24 PM

When I sit down to start a new feature or project, the blank page or empty function can be extremely intimidating: a void of infinite complexity. I'm sure lots of developers do this, and maybe most don't realize it, but I've found that TODO comments are super useful in helping to abstract away nitpicky details and focus on the overall purpose of the code as I'm writing it. Let's say that we want to validate some parameters from an HTTP request and kick off a background task that sends an email to a list of requested users. First off, we need to handle the request and kick off the task, but there are a bunch of validation steps and database queries we need to make before we can do that, and we haven't even written the task function yet. That's where TODOs come in.

class MassEmailView(APIView):
    # TODO: check if user has permission to send mass mail
    def post(self, request):
        # TODO: Get users from the request
        users = []
        for user in users:
            # TODO: send the message
            pass
        return Response(None, status=200)

Right off the bat I know that I need to get a list of users and do something with each of them. In a lot of ways I'm basically writing pseudo-code and slowly filling in the blanks with real code. Next, let's say we write the background task.

# ---- tasks.py ----
from celery import shared_task
from django.core.mail import send_mail

@shared_task
def send_email(user_email, subject, message_text):
    # Pass plain values (not model instances) so the task arguments
    # serialize cleanly onto the queue.
    send_mail(subject, message_text, None, [user_email])

# ---- views.py ----
class MassEmailView(APIView):
    # TODO: check if user has permission to send mass mail
    def post(self, request):
        # TODO: Get users, subject, and text from the request
        users = []
        subject = ''
        text = ''
        for user in users:
            tasks.send_email.delay(user.email, subject, text)
        return Response(None, status=200)

Slowly the code is coming together. I've written the background task and updated my view. The basic structure is there, but I haven't done the work of parsing the request or any error handling, so let's move on to that.

# ---- tasks.py ----
from celery import shared_task
from django.core.mail import send_mail

@shared_task
def send_email(user_email, subject, message_text):
    send_mail(subject, message_text, None, [user_email])

# ---- views.py ----
from django.contrib.auth.models import User
from django.core.exceptions import ObjectDoesNotExist
from rest_framework.authentication import SessionAuthentication
from rest_framework.permissions import IsAdminUser
from rest_framework.response import Response
from rest_framework.views import APIView

from . import tasks

class MassEmailView(APIView):
    authentication_classes = (SessionAuthentication,)
    permission_classes = (IsAdminUser,)

    def post(self, request):
        try:
            users = [
                User.objects.get(username=username)
                for username in request.POST['users'].split(',')
            ]
        except ObjectDoesNotExist:
            return Response(INVALID_USER_RESPONSE, status=400)

        subject = request.POST['subject']
        text = request.POST['message_text']
        for user in users:
            tasks.send_email.delay(user.email, subject, text)
        return Response(None, status=200)

Now that we're done, it's clear that the TODO comments were hiding quite a bit of complexity, but the overall structure is the same. Just because our code is read by the computer from top to bottom doesn't mean we have to write it that way. Sometimes it helps to start with a rough outline of the whole picture, and slowly color it in bit by bit.
