BiteofanApple
by Brian Schrader

Clarifying Structure

Posted on Thu, 26 Feb 2015

In moving between my day job, programming in Java, and my personal projects, usually in Python, there tends to be a lot of bleedover from one language/paradigm to the other. I love Python. It's a fun, straightforward, powerful language with lots of great features and frameworks built up around it, but Python's dynamic nature can lead to problems with readability when developers take it for granted.

Say one thing for Java, say that it's picky. It wants one class per file, getters and setters are explicit, and it can be very verbose (in the wrong ways). One thing Java does have, though, is clear structure. Classes have members, members are laid out in advance, and you know what to expect. Python passes dictionaries like Java throws NullPointerExceptions: everywhere. Unlike NullPointerExceptions, though, Python's reliance on dictionaries is one of my favorite features of the language. But relying on a data type that has no default structure means that anyone reading your code has to decipher not only the meaning of the code, but also the structure of your data. Clarifying the structure of your dictionaries explicitly helps readability and forces you to adhere to that structure down the line.

Here's some sample code that doesn't clarify ahead of time what the structure of the dictionary will be:

    results = {}
    results['users'] = get_users()
    results['posts'] = [post for post in get_posts()]
    if my_user_id not in results['users']:
        results['users'].append(my_user_id)
    results['last_post_times'] = [last_post_time for last_post_time in get_times()]
    for lpt in results['last_post_times']:
        if lpt['user'] not in results['users']:
            raise SomeError
    return results

Although this example isn't too hard to follow, the structure of results can't be determined from the first line. You have to walk through the code to see that it ends up looking like this:

    >>> results = {
            'users': [
                'terry.gilliam',
                'eric.idle',
                'graham.chapman',
                'john.cleese',
                'michael.palin',
                'terry.jones'
            ],
            'posts': [
                { 
                    'user': 'john.cleese',  
                    'content': 'How to defend yourself against a man armed with a banana.'
                },
                {
                    'user': 'eric.idle',
                    'content': '@john.cleese What about a pointed stick?'
                }
            ],
            'last_post_times': [...]
        }

Another way to lay out this code is shown below. Here, we clarify the structure that results will have up front.

    results = { 'users': [], 'posts': [], 'last_post_times': [] }
    # The rest is the same...
    results['users'] = get_users()
    results['posts'] = [post for post in get_posts()]
    if my_user_id not in results['users']:
        results['users'].append(my_user_id)
    results['last_post_times'] = [last_post_time for last_post_time in get_times()]
    for lpt in results['last_post_times']:
        if lpt['user'] not in results['users']:
            raise SomeError
    return results

Now, from looking only at the first line, we know the structure that results will take. We don't have to decipher it. There's no meaningful performance difference between the two approaches, but laying out the structure of the dictionary ahead of time makes the code easier to scan.
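
To push the idea a step further, the standard library's collections.namedtuple can declare the structure once and enforce it, rather than merely document it. This is just a sketch of that option, reusing the get_users, get_posts, and get_times helpers from the examples above:

    from collections import namedtuple

    # The structure is declared once, by name, and enforced on construction.
    Results = namedtuple('Results', ['users', 'posts', 'last_post_times'])

    def build_results(my_user_id):
        users = get_users()
        if my_user_id not in users:
            users.append(my_user_id)
        return Results(
            users=users,
            posts=list(get_posts()),
            last_post_times=list(get_times()),
        )

Readers then see results.users instead of results['users'], and a typo in a field name fails loudly instead of silently creating a new key.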

Python will let you be sloppy in ways that Java just won't. Overall I much prefer working in Python, but because it's so forgiving and dynamic, I'm constantly finding myself forcing structure onto my code to make it more understandable when I eventually come back to it.

Microblog Crawler Diary: Signals

Posted on Tue, 24 Feb 2015

I've been turning this issue over in my head all day, and I think I've got a workable solution. Here's the problem: since the web-app and the crawler are two separate processes (completely unrelated, since the crawler is a cron job and the web-app runs under mod_wsgi), they don't have any way to communicate. Normally this isn't a problem, since they don't really need to communicate. The crawler does its thing and inserts records into a cache, and the web-app reads from the cache. Simple. However, my goal now is to support server notifications for new posts. That is, when a user posts a message, any service that has registered for notifications will be pinged to let it know that a new post is available. That way the server doesn't have to constantly poll URLs to receive to-the-minute updates.

Sending these messages is simple. Receiving them is hard. Upon receiving a message, the app server needs to notify the crawler that it should go and fetch a given URL (and possibly even a given item). Interrupting a process that is completely unrelated, and that could be right in the middle of something else, is not an easy problem to solve. Here's my solution though. Currently, when the crawler starts up, it checks to see if there is a pid file in the system's temp directory. If there is, it quits; if not, it creates one and sets about its business. The pid file tells the crawler that an instance of itself is already running. The web-app will read the pid from that file, write the link it's supposed to crawl to another file, and send a signal to the crawler process with that pid.
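
A minimal sketch of both sides, assuming a fixed pid-file path and SIGUSR1 as the wake-up signal (neither detail is settled yet):

    import os
    import signal
    import sys
    import tempfile

    PID_FILE = os.path.join(tempfile.gettempdir(), 'crawler.pid')
    LINK_FILE = os.path.join(tempfile.gettempdir(), 'crawler-link.txt')

    def acquire_pid_file():
        """Crawler startup: quit if another instance is already running."""
        if os.path.exists(PID_FILE):
            sys.exit(0)
        with open(PID_FILE, 'w') as f:
            f.write(str(os.getpid()))

    def notify_crawler(link):
        """Web-app side: record the link, then poke the crawler."""
        with open(LINK_FILE, 'w') as f:
            f.write(link)
        with open(PID_FILE) as f:
            pid = int(f.read())
        os.kill(pid, signal.SIGUSR1)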

The crawler will have a handler for that signal that pauses the crawl (if one is in progress), reads in the link that is supposed to be crawled, and crawls it. Once it's done, it records the link and inserts the new post into the cache. The crawler then resumes its job of crawling the list of links, and if it comes to the same link recorded by the signal handler, it moves on without processing it (since it has already done so).
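
On the crawler side, the handler might look something like this (continuing the sketch above; fetch_and_cache and links_to_crawl are hypothetical stand-ins for the real crawl step and crawl list):

    already_crawled = set()

    def handle_notification(signum, frame):
        # Read the link the web-app wrote, crawl it right away, and
        # remember it so the main loop can skip it later.
        with open(LINK_FILE) as f:
            link = f.read().strip()
        fetch_and_cache(link)  # hypothetical crawl-and-insert helper
        already_crawled.add(link)

    signal.signal(signal.SIGUSR1, handle_notification)

    # Main loop: skip anything the handler already took care of.
    for link in links_to_crawl:
        if link in already_crawled:
            continue
        fetch_and_cache(link)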

It does seem a bit primitive, but it should be straightforward to implement. With this in place, the crawler can increase its default crawling interval and fall back to just being notified when anything new comes along.


Side note: I will have to be careful about which signals are allowed to pass to the crawler. If left unchecked, spammers could easily stop the crawler dead in its tracks (since I would essentially be sending an OS interrupt for every message received). To mitigate this, the web-app will keep a whitelist of servers it has requested updates from and only accept messages from those servers.
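
That check could be as simple as this sketch (the names here are mine, not the project's):

    # Servers the web-app has explicitly requested updates from.
    subscribed_servers = set()

    def accept_notification(sender, link):
        # Silently drop pings from servers we never subscribed to.
        if sender not in subscribed_servers:
            return
        notify_crawler(link)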

Microblog Crawler Diary: Timezones

Posted on Sun, 22 Feb 2015

Akin to Brent Simmons' Vesper Sync Diary, I think I'm gonna start posting updates and concerns that I have as I continue to develop the various parts of Microblogger.

So far, my biggest issue, aside from learning Flask, has been constructing the user's home timeline. For those who are unfamiliar with Microblogger: the home timeline looks just like Twitter's. It consists of a reverse-chronological list of posts by the people the user follows. Since Microblogger has to construct this timeline from disparate XML feeds, I have to account for future developers not including timezones in their pubdate strings.

Here's the basic problem. Consider that I have a correctly formatted microblog post:

    <item>
        <guid>{some guid}</guid>
        <description>Hi!</description>
        <pubdate>Wed, Feb 28 2014 08:00:00 EST</pubdate>
    </item>

Now consider that I have an incorrectly formatted one:

    <item>
        <guid>{some guid}</guid>
        <description>Hi!</description>
        <pubdate>Wed, Feb 28 2014 05:01:00</pubdate>
    </item>

Which post comes first? Initially you would think the top one, since 08:00 is later than 05:01. However, if we know that the second was posted in Pacific Standard Time, then 05:01 PST is really 08:01 EST, which makes it the newer post, so it should actually come first. Accounting for bad formatting is arguably something I should hold off on doing until I encounter it, and I've thought about that a lot. However, I think I've come up with a very simple way to work around this problem for a large percentage of the cases where it happens.

My workaround is to use the timezone present in the HTTP response timestamp.

    HTTP/1.1 200 OK
    Date: Wed, Feb 28 2014 08:31:00 EST

This will tell me the timezone that the server is in, and from there I can assume the most likely timezone for the pubdate is the same.

So that's how I'm proceeding: If there is no timezone present in the pubdate, then use the timezone from the HTTP response header.
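
A sketch of that fallback using the standard library's email.utils.parsedate_tz (the function name parse_pubdate is mine):

    from email.utils import parsedate_tz, mktime_tz

    def parse_pubdate(pubdate, http_date):
        # parsedate_tz returns a 10-tuple whose last item is the UTC
        # offset in seconds, or None if no timezone was present.
        parsed = parsedate_tz(pubdate)
        if parsed[9] is None:
            # No timezone in the pubdate: borrow the one from the
            # HTTP Date header instead.
            server = parsedate_tz(http_date)
            parsed = parsed[:9] + (server[9],)
        return mktime_tz(parsed)  # seconds since the epoch, in UTC

Posts can then be sorted by that timestamp to build the reverse-chronological timeline.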

Elon Musk is Crazy

Posted on Tue, 20 Jan 2015

It would be an incremental process, and proceeds from the Earth internet could help pay for the $10 billion investment in the colony and internet on Mars, Musk said.

"People should not expect this to be active sooner than five years," he said. "But we see it as a long-term revenue source for SpaceX to be able to fund a city on Mars..." - Elon Musk

The man is crazy, really crazy, but in a good way. I like it.

Elon Musk Wants to Bring the Internet to Mars →

Update: Apparently Google is in on the project too.

Fight for Space - Why Space?

Posted on Thu, 15 Jan 2015

Space exploration is important for our economy, our society, and our future as a species. The upcoming documentary, "Fight for Space," was funded on Kickstarter almost two years ago, and the team has been hard at work making the case for space. Great work guys, I can't wait for the finished film.

The Results of SpaceX's Attempt to Recover Falcon

Posted on Thu, 15 Jan 2015

Amazing, really. For a first test, it's impressive they did as well as they did. Congrats, SpaceX.

Parabolic Arc →


Subscribe to the RSS Feed. Check out my code on GitHub
BiteofanApple is licensed under a Creative Commons Attribution 4.0 International License.