Easy and Ethical Traffic Monitoring with GoAccess

Traffic monitoring is a staple for web businesses, but for some reason, we've outsourced a pretty simple problem to mischievous third parties. While there are well-behaved traffic monitoring platforms, I've developed a few homegrown solutions that have worked really well for me and my business. If you're looking for an easy traffic monitoring solution, and you're conscious of your visitors' privacy, you should try one of these solutions. I promise, they're pretty simple.

Option 1: Just Don't

You always have the option to just not do traffic monitoring. Oftentimes we convince ourselves that the data we collect is precious or useful when it fulfills no real business or personal need.

If you're a blogger, then traffic might matter to you, but it probably shouldn't. Back when I used to use Google Analytics I also had very few visitors to this site. Was it useful to know that 13 people had seen my article? Not really, but it felt useful. In the end it was just another stat for me to endlessly refresh. Progress bars are fun to watch, but you'd probably be better off writing another post, or just going for a walk.

If you own a business that sells a product, then remember this: it's not actually relevant how many hits your website gets. It's important how many products you sell. At one point, Going Indie was featured on Product Hunt, which was awesome, but that featuring resulted in very few actual sales. Was it worth my time to endlessly refresh the PH dashboard? No, and I kinda wish I didn't have the option.

Real-time dashboards are addictive dopamine factories. Sometimes it's better to just avoid them.

Option 2: Use GoAccess

If you need to have some sort of traffic monitoring, then give GoAccess a try. GoAccess aggregates webserver access logs and provides reports either live in the shell, or as really elegant and self-contained HTML files.

I've used GoAccess for years, and it's become my default solution for traffic monitoring. I've automated my reporting using my new helper RPi. Every week, the RPi generates and aggregates the reports for my various websites and emails them to me.

A sample GoAccess HTML report

There are downsides to GoAccess though. Since it's using access logs, the numbers are inflated by bots and browser prefetching. GoAccess has ways to filter out some of those things, but in most cases, I've just gotten used to the numbers being bigger than they really should be.

One upside to using server-side traffic monitoring is that your stats are unaffected by people who are using ad-blockers or who refuse to enable JavaScript (are there still people doing that?).

Option 3: Roll Your Own

For some projects, I've needed more reliable and accurate traffic stats. To do that, I decided it would be best to roll my own. As I said earlier, traffic monitoring is a pretty simple problem-domain—as long as you're willing to live with some margins of error. My California policy blog uses a homegrown traffic monitoring solution that is so maddeningly simple, I will include it below in its entirety—formatted for readability.

(function() {
    if (window.fetch) setTimeout(function() {
        fetch('/pageview?u=' + window.location.pathname)
    }, 2000)
})()

This snippet sets a timer for two seconds and then fires a request off to /pageview which simply returns a 200 response. The site is statically generated—just like this one—so it can't do any processing or custom request handling, and there's an empty file called pageview in the webroot directory. I join all of my access logs together, remove anything that doesn't contain a request to /pageview and voila!

zcat /var/log/nginx/access*gz | grep pageview > $STATSFILE;
cat /var/log/nginx/access.log | grep pageview >> $STATSFILE;

/usr/local/bin/goaccess \
    -f $STATSFILE \
    --ignore-crawlers \
    -p /etc/goaccess.conf \
    > $REPORTFILE;

These reports won't include any requests made by searchbots, any request that didn't execute the JavaScript, or any request made by a user that didn't keep the page open for at least two seconds. This solution gives me simple and effective traffic stats that leverage the data my servers were already collecting, with no additional or accidental data collection required!
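For anyone who would rather do the filtering step in Python than with zcat and grep, here is a minimal sketch. It assumes Nginx's default combined log format and the `u` query parameter from the snippet above; the function names `open_log` and `count_pageviews` are made up for illustration, and unlike GoAccess's --ignore-crawlers it does no bot filtering at all.

```python
import gzip
import re
from collections import Counter
from pathlib import Path
from urllib.parse import parse_qs, urlsplit

# Matches the request part of a combined-format log line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def open_log(path):
    """Open a plain or gzip-compressed log file as text (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open("r", encoding="utf-8", errors="replace")


def count_pageviews(lines):
    """Count /pageview beacon hits, keyed by the page path in the `u` parameter."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if url.path != "/pageview":
            continue  # an ordinary request, not the JavaScript beacon
        pages = parse_qs(url.query).get("u")
        if pages:
            counts[pages[0]] += 1
    return counts
```

Feeding it the lines from each file returned by `open_log` gives a rough per-page count, which is about all the order-of-magnitude question needs.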

What Really Matters

Traffic monitoring tools are useful but addictive, and it's easy to get caught up in the data they collect and convince yourself that it's more useful than it really is. At the end of the day, I just need to know, roughly, how many people read one of my articles or how many visited the homepage of a service I run. I don't need to know who they were or anything else about them, and I don't want more data than I need.

Due to the limitations of server-side monitoring—even with my JS snippet—GoAccess can't provide you with exact traffic numbers; nothing can. But like I said, you probably don't need exact numbers. You probably only really need the order of magnitude, which server-side monitoring can easily provide.

How I Use Docker (for Now)

In a recent episode of Indie Dev Life I went into some detail about how I use Docker to host my software. I discussed my experiences with and guidelines for using Docker in production. This post is a continuation of that discussion.

I've been using Docker to run my software in production ever since the launch of Adventurer's Codex and MyGeneRank back in 2017. In my technical discussion blog posts for both projects, I talked a little bit about Docker and its place in the stack. I also discuss Docker and its role as a deployment tool briefly in Going Indie.

Over the years I’ve managed to tune my services to be incredibly easy to upgrade. For example, since Nine9s is written in Python and uses Docker, a deploy is simply a git pull and docker-compose up. Nowadays, even those steps are automated by a bash script. Having such a simple process means that I can deploy quickly, and it lessens the cognitive burden associated with upgrading a service, even when that service has gone without changes for months.
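To make the two-step deploy concrete, here is a minimal sketch of what such an automation could look like. It is written in Python rather than bash, and it is not the actual script: the `--ff-only`, `-d`, and `--build` flags, the `DEPLOY_STEPS` list, and the `deploy` function are all assumptions chosen for illustration.

```python
import subprocess

# Hypothetical deploy steps; the flags are one reasonable choice, not the real script.
DEPLOY_STEPS = [
    ["git", "pull", "--ff-only"],               # fetch the latest code, refuse merge commits
    ["docker-compose", "up", "-d", "--build"],  # rebuild images and restart in the background
]


def deploy(project_dir, run=subprocess.run):
    """Run each deploy step inside project_dir, stopping at the first failure."""
    for step in DEPLOY_STEPS:
        # check=True raises CalledProcessError if a step exits non-zero.
        run(step, cwd=project_dir, check=True)
```

Passing `run` as a parameter is only there so the function can be exercised without touching git or Docker.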

Over time, Docker's role in my software has morphed and evolved. During the initial launch of Adventurer's Codex, I depended heavily on community-built Docker files for large portions of the architecture. But over time Docker has actually shrunk to fill a much more limited role.

The Problem Docker Solves (for Me)

Context

I use Linode for my server hosting, so I'm already operating within a VM, and depending on the software, I might have multiple virtual servers powering a given service. Docker simply provides isolation for processes on the same VM. I do not use Docker Swarm, and I've always just used the community edition of Docker.

To me, Docker has become a tool that makes it easy to upgrade and manage my own code and other supporting services. All of my code runs in a Docker container, but so do other systems that my code depends on. For example, Pine.blog and Nine9s both use memcache for template caching since support for it is built into Django—my preferred web framework. Each web server runs Nginx on the host which reverse-proxies to Docker containers running my Django apps.

Both services also perform asynchronous processing via worker nodes. These workers are running inside of Docker. Pine.blog's workers are spread across various machines and pass requests through their own custom forward caching proxy containers backed by a shared Redis instance also in Docker.

This setup ensures that I can easily upgrade my own code, and it ensures that exploitable services like memcache aren't exposed to the outside world.

In short, I've found that Docker works great for parts of the stack that are either upgraded frequently or for parts of the stack that are largely extraneous and that only need to communicate with other parts on the same machine.

I've largely stopped using Docker in cases where there are external tools that rely on things being installed on the host machine, or where the software requires more nuanced control. Nginx is a great example. All of my new projects have Nginx installed on the host, not in Docker. This is because so many tools from log monitoring to certbot are designed to run on a version of Nginx installed globally. I use Nginx as both a webserver for static content and a reverse-proxy to my Django apps. If you want to use Nginx in Docker, I'd suggest only using it for the former case. The latter is better installed on the host.

I'm still torn about running my databases and task brokers in Docker. Docker (without Swarm) really annoys me when I'm configuring services that need to be accessed by outside actors. Docker punches through CentOS firewalls which renders most of my typical tactics for securing things moot. I've also started to question the usefulness of Docker when I'm configuring a machine that serves only one purpose. Docker is great at isolating multiple pieces of a stack from each other, but on a single-purpose VM it seems like it's just another useless layer that's only there for consistency.

Docker on CentOS is particularly irritating as the devicemapper doesn't seem to release disk space that it no longer needs. This means that your server is slowly losing useful disk space every time you update and rebuild your containers. After about 3 years of upgrades, Pine.blog's main server has lost about 20GB of storage to this bug. Needless to say, I'm investigating a move to Ubuntu in the near future.

What about Docker in Development?

As with Docker in production, I have mixed feelings about the role Docker plays in my development. I dev on a MacBook Pro, and my Django apps run in a plain-old virtual environment. No Docker there. That said, I do use Docker to run extraneous services like Redis, memcache, or that forward caching proxy.

I stopped running my Django apps in Docker a while back for much the same reason that I no longer run Nginx in Docker. Even with Docker's recommended fixes, Django's management CLI is frustrating to use through Docker and I've had more than one issue with Docker's buffering of log output during development.

Docker: Four Years In

Overall, I really like Docker. It makes deployments super simple: just git pull and docker-compose up (or use my fancy shell script that does zero-downtime deploys). That said, I'm certainly not a Docker purist. I use Docker in a way that reduces the friction of my deploys, and I'm starting to use it less and less when it's just another layer that serves little purpose.

Like every tool, Docker has its role to play, but in my experience it's not the silver bullet that many people think. I haven't used Docker on AWS via ECS, so I can't comment on that. Perhaps that's where Docker really shines. I still prefer a more traditional hosting strategy. Either way, Docker will remain an important tool in my toolbelt for the foreseeable future.

Lessons on Variable Naming from Breakfast Burritos

This morning I ordered a breakfast burrito from a local taco shop. Normally this would not be news and obviously would not warrant a blog post or any in-depth analysis, but it was early and I hadn't yet had coffee, so my mind was loose and my thoughts wandering. As I looked over the menu, I pondered the two vegetarian breakfast burrito options:

  • Mushroom burrito filled with mushrooms, potatoes, eggs, and cheese
  • Potato burrito filled with potatoes, eggs, beans, and cheese

At the counter I asked for the potato breakfast burrito, and I intended to order the latter of the two, but it occurred to me that they both contained potatoes and therefore my order was ambiguous. What after all makes a burrito with potatoes, eggs, cheese, and mushrooms deserve a different name than a burrito with potatoes, beans, eggs, and cheese? What makes the latter not a bean breakfast burrito, as the beans are the item that is unique to the latter burrito whereas potatoes are common to both? Are potatoes a more significant ingredient? If so, why?

I received my order—which was correct by the way—and went home, but as I walked I wondered, how is it that the cashier and I understood each other? There was so much ambiguity in the names of those menu items. How were we able to make sense of the obvious ambiguity?

Naming is Really Hard

If you haven't seen the connection by now, let me drop the pretext. These same questions also relate to how we choose to name our variables and our functions in code. Naming after all is hard, and I think my burrito example helps explain why.

It is often said that the three hardest problems in computer science are naming and off-by-one errors.

In a more rigorous naming system, I assume that most people would come to the conclusion that the second burrito is probably misnamed. It should be called the "bean breakfast burrito" since, as I mentioned, the beans are the distinct ingredient that makes the latter burrito not strictly a subset of the former.

That said, beans are not normally considered a main ingredient in a burrito. In the conventional burrito naming scheme, more appealing or distinct ingredients, or ingredients not considered to be condiments, take precedence. This naming scheme is the reason why a burrito with carne asada, pico de gallo, and guacamole would be simply called a carne asada burrito and not a guacamole burrito.

These same conventions exist when we name variables and functions. We can imagine a scenario where we have a list of users and need to filter out which users have recently logged in and which among those have active subscriptions to our service.

def get_active_subscribed_users():
    all_users = get_all_users()
    active_users = (user for user in all_users if user.is_active)
    <variable> = (user for user in active_users if user.has_active_subscription)

The first two variable names are fairly obvious, the question becomes: what do we name the third variable so that it is not ambiguous? We could of course call this new variable active_users_with_active_subscriptions, but to many that would be too long, and to my eyes that makes it seem that this variable contains a list of (user, subscription) pairs.

We could name the value active_users, actively_subscribed_users, or even just relevant_users if what relevancy means is clear enough in context. Some developers prefer to simply refer to these as users, but I find that incredibly confusing. Others may prefer to define the variable users and then redefine it as they filter down the list to suit their needs, which I find even more confusing and unclear.

In practice I tend to prefer the third option along with a comment explaining what I mean by "relevant". This only exacerbates our problems though. If two groups of "relevant" users meet in a new context, their names would clash and we would need to find new names for these groups.

The context here is key. If we instead fetched the same list from another function call, we could drop the qualifier entirely.

def get_active_subscribed_users():
    users = get_active_users()
    # We can avoid the question entirely if we simply return the list here.
    return (user for user in users if user.has_active_subscription)

Names are a Leaky Abstraction

As with our breakfast burritos, we could simply default to the names being a list of the components, but that can become overly burdensome very quickly. Our potato burrito would be unceremoniously called the "potato, egg, bean, and cheese breakfast burrito", which is unambiguous but also cumbersome. It can also cause problems, as forgetting to mention a single component could confuse the reader and lead them to believe that a reference to a potato, egg, and bean burrito was not the same as your potato, egg, bean, and cheese burrito even if you were both referring to the same thing.

As programmers we aren't taxed by the character; we can have longer variable names, but ideally those names should be descriptive, succinct, and distinct. Issues arise when names, by their nature, don't convey the whole story. Names almost always convey a summary of their true meaning. They can't effectively convey the context in which the name was given or the inherent value of the named thing. Out of context a name might be confusing, but that confusion may vanish when used in the appropriate context.

Likewise, in some contexts a potato breakfast burrito is the same thing as a mushroom burrito, but today it wasn't.

Building a Personalized Newsletter with Bash and a Raspberry Pi

I use Pinboard to save articles I've read and, increasingly, to save articles I want to read. That said I rarely go back and actually read things once they disappear into the Pinboard void. This isn't an uncommon problem, I know, but I think I've devised a simple solution.

I recently set up a Raspberry Pi and mounted it under my desk. I've been playing with RPis for years, but I'd never found a recurring need for them; they've always been toys with fleeting amusement value. But this time around, I've configured it as both a local web server and Samba file share. This allows me to quickly and easily share files with the RPi and, since I configured it to send emails through my Fastmail account, it can now alert me whenever I want.

My Pinboard Weekly Newsletter

Now that everything on the RPi is set up and easily accessible, I wrote up a simple bash script to pull my most recent bookmarks from Pinboard, filter out the stuff I've already read, and draft an email with everything from the past week that I still haven't gotten to.

I've posted a simplified version on GitHub, but my real script isn't much more complex (all told it comes out to 55 lines of code), and it's run with a simple, weekly cron job.
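The filtering step at the heart of that script can be sketched in a few lines. This is a Python illustration rather than the bash original, and it assumes bookmarks shaped like the JSON that Pinboard's v1 API returns, with `href`, `description`, `time`, and `toread` fields; the function names `unread_from_last_week` and `draft_email_body` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unread_from_last_week(bookmarks, now=None):
    """Return bookmarks saved in the last 7 days that are still marked unread."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=7)
    unread = []
    for bookmark in bookmarks:
        # Assumes UTC timestamps such as "2021-03-05T10:00:00Z".
        saved_at = datetime.strptime(
            bookmark["time"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if bookmark.get("toread") == "yes" and saved_at >= cutoff:
            unread.append(bookmark)
    return unread


def draft_email_body(bookmarks):
    """Format each bookmark as a title line followed by its URL."""
    return "\n\n".join(f"{b['description']}\n{b['href']}" for b in bookmarks)
```

Fetching the bookmarks and actually sending the mail are left out on purpose, since both depend on credentials and on how the machine is set up to send email.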

Here's a sample of the newsletter email—and yes, my RPi's name is Demin.

Hopefully this weekly newsletter reminds me to actually go back and read the interesting news and articles I've collected during the week (or it will help remind me just how unimportant certain things really are when you've had a week to let them sit).

If you use Pinboard, and you constantly find yourself saving articles and never reading them, give my script a try. If you do, let me know what you think!

Why All My Servers Have an 8GB Empty File

Last night I was listening to the latest Under the Radar, where Marco Arment dove into nerdy detail about his recent Overcast server issues. The discussion was great, and you should listen to it, but Marco's recent server troubles were pretty similar to my own server issues from last year, and so I figured I'd share my life-hack solution for anyone out there with the same problem.

The what and where

Both hosts, Marco Arment and David Smith, run their own servers on Linode—as do I—and I found myself nodding along in solidarity with Marco as he discussed his toils during a painful database server migration. Here's the crux of what happened in Marco's own words:

The disk filled up, and that's one thing you don't want on a Linux server—or a Mac for that matter. When the disk is full nothing good happens.

One thing Marco said hit me particularly close to home:

Server administration, when you're an indie, is very lonely.

During my major downtime problem last year, I felt incredibly isolated and frustrated. There was no one to help me and no time to spare. My site was down and it was down for a while. My problem was basically the same: my database server filled up (but for a different reason). And as Marco said, when the disk is full, nothing good happens.

In the days after I fixed my server issues, I wanted to ensure that even if things got filled up again, I would never have trouble fixing the problem.

A cheap hack? Yes. Effective? Also Yes.

On Linux servers it can be incredibly difficult for any process to succeed if the disk is full. Copy commands and even deletions can fail or take forever as memory tries to swap to a full disk and there's very little you can do to free up large chunks of space. But what if there was a way to free up a large chunk of space on disk right when you need it most? Enter the dd command [1].

As of last year, all of my servers have an 8GB empty spacer.img file that does absolutely nothing except take up space. That way in a moment of full-disk crisis I can simply delete it and buy myself some critical time to debug and fix the problem. 8GB is a significant amount of space, but storage is cheap enough these days that hoarding that much space is basically unnoticeable... until I really need it. Then it makes all the difference in the world.
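Creating such a spacer does not require dd specifically. Here is a minimal Python sketch; the `create_spacer` function and its `chunk_bytes` parameter are made up for illustration. It writes real zeros rather than just setting the file length, on the assumption that a sparse file would not actually reserve any blocks.

```python
import os


def create_spacer(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of zeros to path so the space is actually allocated."""
    zeros = bytes(chunk_bytes)
    remaining = size_bytes
    with open(path, "wb") as spacer:
        while remaining > 0:
            # The final chunk may be shorter than chunk_bytes.
            written = spacer.write(zeros[:remaining])
            remaining -= written
        spacer.flush()
        os.fsync(spacer.fileno())  # push the data to disk before returning
```

Calling `create_spacer("/spacer.img", 8 * 1024**3)` would produce an 8GB file like the one described above. On filesystems that compress or deduplicate data, a file full of zeros may not reserve real space, so it is worth checking free space afterwards.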

That's it. That's why I keep a useless file on disk at all times: so I can one day delete it. This solution is super simple, trivial to implement, and easy to utilize. Obviously the real solution is to not fill up the database server, but as with Marco's migration woes, sometimes servers do fill up because of simple mistakes or design flaws. When that time comes, it's good to have a plan, because otherwise you're stuck with a full disk and a really bad day.

[1] There are lots of tools you can use to do this besides dd. I just prefer it.

All Too Quiet

Other than my last post, it's been pretty quiet here lately. I've spent the majority of the last two months wrapping up the newest update to Pine.blog, which launched silently about a month ago. I meant to do a sort of announcement or retrospective post on the launch, but I just never got around to it.

A new IndieDevLife is coming, but I haven't had the mental bandwidth to record an episode lately. I've been spending a lot of time reading and writing policy proposals for Democracy & Progress and I've been pouring the remainder of my time into a new iOS app.

That's right, I've started yet another new app! It's in a private, early beta now, and I'm expecting the launch in May. No more details just yet, but I promise this one will be particularly special. I'm doing a lot that I've never done before, and playing with APIs I'd never heard of until now. It's fun stuff.

California, Democracy, and Progress

Last week I published the first proposal on Democracy & Progress, my new public policy blog. It's about Democracy Vouchers and how California should adopt them.

You should read the post, then please consider subscribing. The blog is in its initial launch phase, so your subscription means more than it normally would.

I've been wanting to write about policy for a long time, but I could never quite figure out the tone or the topic scope. I finally settled on the idea of discussing California politics through the lens of improving and promoting democracy. It's a big topic, and there's a lot to discuss, so I hope you'll follow along and let me know what you think.

Over the last few years, I've become pretty immersed in the policy world's conversations. I read a lot of policy books, articles, and papers, I follow a lot of political writers, and I listen to a lot of politics podcasts. Over time, I started to develop my own policy outlook and then I wanted to participate in the conversation to add what I thought was a different angle on the discussion. Last year I started writing Op-Eds and publishing some in my local paper, but I also wanted to do more than that. I just couldn't figure out what my angle would be, what kinds of topics or ideas I wanted to cover, and through what lens I would cover them.

A few years back, while I was listening to the Ezra Klein Show, Ezra lamented that we as a society didn't spend more time focusing on local and state politics, where our time and energy is often better spent. Collectively, we don't focus on state and local politics, and yet it's only there that a lot of policy solutions can be implemented. That conversation stuck with me, and over time the drive to write about state politics has only gotten deeper. It was last summer, when I read All Politics is Local, that the idea for what would become D&P really started to form.

While I was scoping out the policy-blog space, I did a lot of searching around and while I found lots of medium to long-form policy blogs focusing on the national federal government (a lot of which I was already following), I didn't find a lot of the same thing at the state level. It was then I realized I had found my niche.

Politics can be a difficult thing to discuss in public, so that's why I wanted to focus on policy, not politics. Hopefully the blog can stay far away from the concerns of the day and avoid kindling a partisan fervor. At D&P, we're going to focus on solutions. California is a solidly Democratic state (the party not the governing strategy), so partisan squabbling is less of an issue—which is a blessing—but there are still plenty of difficult issues.

As part of my work for D&P, I've started compiling a list of resources for people to help them follow California politics. I know I wish I'd had something like this to get me started, so hopefully I can pay it forward.

Please consider giving D&P a follow in your favorite feed reader via RSS, on Twitter @dem_and_prog, or sign up for the newsletter. Here's to a better, more policy-focused California.

Various Goings On

I've been a little scatter-brained over the past few weeks. I've started lots of little projects and finished almost none of them. Hopefully, they'll all start to wrap up soon. I mentioned on a previous Indie Dev Life that I was working on an update to the Pine.blog iOS app, and that is still true. It's coming along nicely. I think I'm about 80% done with it, but I've reached the infamous second 80% and it's become a bit of a slog, so I did what I usually do in this situation: literally anything else.

To that end, I've been experimenting with a few other projects. I've started the process of adding a dark mode to Pine.blog's Web UI, which is coming along nicely. I've also started building out Micropub support for Pine.blog. Both are features I've wanted for a long time, but never gotten to. Hopefully they'll be done around the same time as the iOS refresh. I've always struggled with putting together a coherent design for the Pine.blog web UI, and it shows. With the newest refactor of Pine.blog for iOS though, I've finally developed a coherent design and I've started using the same layout and components on both iOS and the web. It's looking really good, and hopefully this new design will last a while and bring Pine.blog up to modern standards.

On Monday the holidays are officially over and I'll force myself to finish the iOS release for Pine.blog. Usually I try to keep myself pretty focused, but for now I'll keep tinkering — it's a nice thing to do every once in a while.

Breadcrumbs and Pinboard

I use Pinboard, and I have for some time. For years, I've dumped the occasional interesting article there and then largely forgotten about it. Recently though, I've decided to really dive in and start using the service to solve a problem I routinely have.

I read a lot of articles, as I'm sure many of you do, and I often refer to them in conversation or in a blog post. When I do reference an article I like to cite it, even in conversation, but I often can't find the article anymore nor can I remember enough about it to effectively search for it. I think it's important to cite your sources, even casually, because it helps you recognize your own filter bubble, and it helps to ground you in the realm of facts. It also helps the other people you're conversing with better understand and counter your argument if they can tell where you're getting your information from. Plus, if you can find the original source, you can always verify that you're remembering the information correctly.

In an effort to forever solve this problem, I've started archiving essentially every article I read in Pinboard, and I do this whether or not I think the article was any good. This effectively turns Pinboard into a breadcrumb trail that I can use to retrace my path on the Web and hopefully dig up and verify any information that I half-remember.

A while back, I started taking notes when I read books so I can more easily remember and refer back to information I may need, but I don't want to keep a notebook for everything I read online. Enter Pinboard.

There are a few things that my system is lacking so far. Most importantly, I'm still having trouble finding some articles when I know that I've bookmarked them. It seems that either Pinboard's search isn't working how I'd expect or I'm not giving it enough metadata to search through when bookmarking articles.

Hopefully, as I get more accustomed to this new way of using Pinboard, I'll drastically increase how often I can recall some useful fact and also produce a worthy citation.

History, Myth, and Talking Cows

I started reading The Early History Of Rome by Livy a little over two years ago, but today I finally finished it. It's a good book and a fun and illustrative read, but there's a reason it took me so long to get through.

Livy, or Titus Livius, as he was actually known, was a Roman writer born in 64 or 59 BC, so his writing style is... strange by modern standards. I would often have to re-read multiple pages after realizing that I had no idea who was doing what. This is compounded by the fact that there are literally hundreds of names and, in a very Roman tradition, they are all incredibly similar. I'm also a fairly slow reader, so between the constant re-reading, my overall slow reading speed, and huge reading backlog, I could only finish 20 pages before getting distracted with another easier book.

That said, Livy's work is certainly worth reading if only for some of the truly amazing stories he tells. Sometimes the stories are so completely outlandish that I have to stop and remind myself that this isn't a fantasy story, it's history. Now, obviously there is myth and legend interspersed with it, and in these early histories they're effectively inseparable. After all, it's claimed —by multiple Roman authors— that the city was founded by two children nursed by a wolf, before the oldest kills the youngest, becomes king, and ascends to heaven in a cloud. It's an interesting read.

I'd like to share just one passage that I earmarked early on in the book. For context, know that the Romans were incredibly superstitious. They were constantly on the lookout for signs from the gods and they rarely did anything without performing some sacred ritual and seeking approval from the gods (see the story of the Sacred Chickens). In this passage, the consuls for the year (461 BC) had just been elected, and war would soon come to the Romans though they didn't know it yet.

The year was marked by ominous signs: fires blazed in the sky, there was a violent earthquake, and a cow talked — there was a rumor that a cow had talked the previous year, but nobody believed it: this year they did. Nor was this all: it rained lumps of meat. Thousands of birds (we are told) seized and devoured the pieces in mid-air, while what fell to the ground lay scattered about for several days without going putrid. The Sibylline Books were consulted by two officials, who found in them the prediction that danger threatened, from 'a concourse of alien men' who might attack 'the high places of the City, with the shedding of blood'. There was also found, amongst other things, a warning to avoid factious politics.

– Livy, History of Rome, 3.10

There is so much that I love about this passage, but my absolute favorite thing is that Livy reports that a cow talked, but for some reason the first time this happened it was dismissed, and that a cow talking is apparently a warning to avoid factious politics. If that's the case, then I kinda wish a cow would talk today.

I've already purchased the second volume of Livy's work and it's on my shelf ready to go. Hopefully I can get through this one a bit quicker.
