BiteofanApple
by Brian Schrader

Roderick on the Line: Space Exploration is Human Nature

Posted on Tue, 10 Mar 2015

In the absence of an answer to the question, "Why?" all you can do is explore.

Magnificent episode. Really worth a listen.

  • 00.00 - Why?
  • 01.50 - One-way trip to Mars
  • 05.15 - Why? Exactly!
  • 07.39 - Reasons against space travel
  • 08.45 - Privatization of Space Exploration

Why Everyone Was Wrong About Net Neutrality

Posted on Sun, 08 Mar 2015

An interesting take on why the new FCC regulations may be here to stay.

Looking to the future, there's one last thing that everyone might be wrong about. The general assumption is that the new rules will be met with fierce and protracted litigation (perhaps decades of it, warn the greatest doomsdayers). I've said myself that there will be litigation, and it is true that, in our times, most serious regulation is immediately challenged in court, almost as a kind of corporate reflex. Verizon and A.T. & T. have both already threatened to sue. But maybe this prediction is wrong, too.

Why Everyone Was Wrong About Net Neutrality →

Microblog Crawler v1.3 Released

Posted on Sun, 08 Mar 2015

I'm pleased to announce that version 1.3 of the Microblog Crawler is now available on GitHub and PyPi!

To install, use:

pip install MicroblogCrawler

Release Notes

The big news: Version 1.3 is now multiprocessed!

Among other things, version 1.3 also includes a number of fixes and improvements.

  • The on_item callback now receives the feed information as its second parameter. This is a breaking change in the API (see the sketch after these notes).
  • The on_info callback now receives a dictionary of all of the info fields in a given feed. Previous versions received a (name, value) tuple.
  • Multiprocessing now allows the crawler to process 4 feeds at once (or more if you override the value).
  • Fixed a number of bugs that allowed duplicates.
  • Fixed an issue where feed crawl times could be reported inaccurately.
  • Fixed the timezone problem: feeds without timezones are now parsed according to their HTTP response timezone.

It also adds a bunch of 'Good Citizen' features:

  • The crawler now sends a proper user agent and reports subscriber counts to remote servers.
  • The crawler is now HTTP status code aware, and static files are not re-parsed if they have not been modified (HTTP 304).
  • Automatic 301 redirection behavior, with a MAX_REDIRECTS limit.
  • Support for returning specific error codes from other HTTP headers.
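Roughly, the new-style callbacks look something like this (the field names here are just for illustration, and registering the callbacks with the crawler isn't shown):

    # A rough sketch of the 1.3 callback changes. The 'title', 'link', and
    # 'description' field names are placeholders, not a documented API.

    def on_info(info):
        # As of 1.3, on_info receives a dictionary of all of the feed's info
        # fields instead of a (name, value) tuple.
        print('Crawling feed: {0} ({1})'.format(info.get('title'), info.get('link')))

    def on_item(item, feed):
        # As of 1.3, the feed information arrives as the second parameter.
        # This is the breaking change noted above.
        print('{0}: {1}'.format(feed.get('title'), item.get('description')))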

Clarifying Structure

Posted on Thu, 26 Feb 2015

In moving between my day job, programming in Java, and my personal projects, usually in Python, there tends to be a lot of bleedover from one language/paradigm to the other. I love Python. It's a fun, straightforward, powerful language with lots of great features and frameworks built up around it, but Python's dynamic nature can lead to problems with readability and understandability when developers take it for granted.

Say one thing for Java, say that it's picky. It wants one class per file, getters and setters are explicit, and it can be very verbose (in the wrong ways). One thing Java does have, though, is clear structure. Classes have members, members are laid out in advance, and you know what to expect. Python passes dictionaries like Java throws NullPointerExceptions: everywhere. Unlike NullPointerExceptions, though, Python's reliance on dictionaries is one of my favorite features of the language. But relying on a data type that has no default structure means that anyone reading your code has to decipher not only the meaning of the code, but also the structure of your data. Clarifying the structure of your dictionaries explicitly helps readability and forces you to adhere to that structure down the line.

Here's some sample code that doesn't clarify ahead of time what the structure of the dictionary will be:

    results = {}
    results['users'] = get_users()
    results['posts'] = [post for post in get_posts()]
    if my_user_id not in results['users']:
        results['users'].append(my_user_id)
    results['last_post_times'] = [last_post_time for last_post_time in get_times()]
    for lpt in results['last_post_times']:
        if lpt['user'] not in results['users']:
            raise SomeError
    return results

Although it's not too hard to work out, the structure of results can't be determined from the first line alone. You have to walk through the code to see that it ends up looking something like this:

    >>> results = {
            'users': [
                'terry.gilliam',
                'eric.idle',
                'graham.chapman',
                'john.cleese',
                'michael.palin',
                'terry.jones'
            ],
            'posts': [
                { 
                    'user': 'john.cleese',  
                    'content': 'How to defend yourself against a man armed with a banana.'
                },
                {
                    'user': 'eric.idle',
                    'content': '@john.cleese What about a pointed stick?'
                }
            ],
            'last_post_times': [...]
        }

Another way of laying out this code is to clarify, up front, the structure that results will have:

    results = { 'users': [], 'posts': [], 'last_post_times': [] }
    # The rest is the same...
    results['users'] = get_users()
    results['posts'] = [post for post in get_posts()]
    if my_user_id not in results['users']:
        results['users'].append(my_user_id)
    results['last_post_times'] = [last_post_time for last_post_time in get_times()]
    for lpt in results['last_post_times']:
        if lpt['user'] not in results['users']:
            raise SomeError
    return results

Now, from looking only at the first line, we know the structure that results will take. We don't have to decipher it. There's no performance difference between the two approaches, but laying out the structure of the dictionary ahead of time makes the code easier to scan.

Python will let you be sloppy in ways that Java just won't. Overall I much prefer working in Python, but because it's so forgiving and dynamic, I'm constantly finding myself forcing structure onto my code to make it more understandable when I eventually come back to it.

Update: Someone at the local Python meetup group told me about namedtuples, saying, "If you're looking for that structure, I'm wondering if you should be using a dictionary."
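For reference, the namedtuple version of the example above would look something like this (reusing the same get_users(), get_posts(), and get_times() helpers):

    from collections import namedtuple

    # The structure is now part of the type itself: a Results value can only
    # ever have these three fields.
    Results = namedtuple('Results', ['users', 'posts', 'last_post_times'])

    def build_results(my_user_id):
        users = get_users()
        if my_user_id not in users:
            users.append(my_user_id)
        last_post_times = [last_post_time for last_post_time in get_times()]
        for lpt in last_post_times:
            if lpt['user'] not in users:
                raise SomeError
        return Results(
            users=users,
            posts=[post for post in get_posts()],
            last_post_times=last_post_times,
        )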

Microblog Crawler Diary: Signals

Posted on Tue, 24 Feb 2015

I've been turning this issue over in my head all day, and I think I've got a workable solution. Here's the problem: Since the web-app and the crawler are two separate processes (completely unrelated since the crawler is a cron job and the web-app is running under mod_wsgi) they don't have any way to communicate. Normally this isn't a problem since they don't really need to communicate. The crawler does its thing, and inserts records into a cache, and the web-app reads from the cache. Simple. However, my goal now is to support server notifications for new posts. That is, when a user posts a message, any service that has registered for notifications will be pinged to let them know that a new post is available. That way the server doesn't have to poll URLs constantly to receive to-the-minute updates.

Sending these messages is simple. Receiving them is hard. Upon receiving a message, the app server needs to notify the crawler that it should go and fetch a given URL (and possibly even a given item). Interrupting a process that is completely unrelated, and that could be right in the middle of something else, is not an easy problem to solve. Here's my solution, though. Currently, when the crawler starts up, it checks to see if there is a pid file in the system's temp directory. If there is, it quits; if not, it creates one and sets about its business. The pid file tells the crawler that an instance of itself is already running. The web-app will read the pid from that file, write the link it's supposed to crawl to another file, and send a signal to the crawler process with that pid.

The crawler will have a handler for that signal that will pause the crawling (if any is being done), read in the link that is supposed to be crawled, and crawl it. Once it is done, it records the link and inserts the new post into the cache. The crawler then resumes its job of crawling the list of links, and if a link is the same as the one recorded in the signal handler, it moves on without processing it (since it has already done so).

It does seem a bit primitive, but it should be straightforward to implement. With this implemented, the crawler can increase its default crawling interval and fall back to just being notified if anything new comes along.
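Roughly, the two sides of that scheme look something like this (the file names, the choice of SIGUSR1, and the fetch_and_cache() helper are placeholders for illustration, not the actual Microblogger code):

    import os
    import signal
    import tempfile

    # Assumed locations for the pid file and for the link the web-app wants crawled.
    PID_FILE = os.path.join(tempfile.gettempdir(), 'crawler.pid')
    LINK_FILE = os.path.join(tempfile.gettempdir(), 'crawler_link.txt')

    already_crawled = set()

    def fetch_and_cache(link):
        """Assumed helper: crawl a single feed and insert any new posts into the cache."""
        pass

    # --- Crawler side ---

    def handle_notification(signum, frame):
        # Pause the normal crawl, fetch the requested link right away, and
        # remember it so the main loop can skip it when it comes around again.
        with open(LINK_FILE) as f:
            link = f.read().strip()
        fetch_and_cache(link)
        already_crawled.add(link)

    def start_crawler():
        if os.path.exists(PID_FILE):
            return  # another instance is already running
        with open(PID_FILE, 'w') as f:
            f.write(str(os.getpid()))
        signal.signal(signal.SIGUSR1, handle_notification)
        # ...the main crawl loop runs here, skipping anything in already_crawled...

    # --- Web-app side ---

    def notify_crawler(link):
        # Write the link for the crawler to pick up, then poke its process.
        with open(LINK_FILE, 'w') as f:
            f.write(link)
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGUSR1)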


Side note: I will have to be careful about which signals are allowed to pass to the crawler. If unchecked, spammers could easily stop the crawler dead in its tracks (since I would essentially be sending OS interrupts for every message received). To mitigate this, the web-app will keep a whitelist of servers it has requested updates from and only accept messages from those servers, as sketched below.
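Something like this (with hypothetical host names):

    from urllib.parse import urlparse

    # Hosts the web-app has requested updates from, maintained elsewhere as
    # subscriptions are added and removed (hypothetical values).
    SUBSCRIBED_HOSTS = {'example.com', 'another-blog.net'}

    def should_accept_notification(feed_url):
        # Only pass a notification on to the crawler if it came from a feed
        # we actually subscribed to; everything else is dropped.
        return urlparse(feed_url).hostname in SUBSCRIBED_HOSTS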

Microblog Crawler Diary: Timezones

Posted on Sun, 22 Feb 2015

Akin to Brent Simmons' Vesper Sync Diary, I think I'm gonna start posting updates and concerns that I have as I continue to develop the various parts of Microblogger.

So far, my biggest issue, aside from learning Flask, has been constructing the user's home timeline. For those who are unfamiliar with Microblogger, the home timeline basically looks just like Twitter's: a reverse-chronological list of posts by the people the user follows. Since Microblogger has to construct this timeline from disparate XML feeds, I have to account for future developers not including timezones in their pubdate strings.

Here's the basic problem. Consider that I have a correctly formatted microblog post:

<item>
    <guid>{some guid}</guid>
    <description>Hi!</description>
    <pubdate>Wed, Feb 28 2014 08:00:00 EST</pubdate>
</item>

Now consider that I have an incorrectly formatted one:

<item>
    <guid>{some guid}</guid>
    <description>Hi!</description>
    <pubdate>Wed, Feb 28 2014 05:01:00</pubdate>
</item>

Which post comes first? Initially you would think the top one would come first, since it was posted after the second. However, if we know that the second was actually posted in Pacific Standard Time, then it should come first, not three hours behind. Accounting for bad formatting is arguably something I should hold off on until I encounter it, and I've thought about that a lot. However, I think I've come up with a very simple way to work around this problem for a large percentage of the cases where it happens.

My workaround is to use the timezone present in the HTTP response timestamp.

HTTP/1.1 200 OK
Date: Wed, Feb 28 2014 08:31:00 EST

This will tell me the timezone that the server is in, and from there I can assume the most likely timezone for the pubdate is the same.

So that's how I'm proceeding: If there is no timezone present in the pubdate, then use the timezone from the HTTP response header.
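Roughly, that fallback looks something like this (using python-dateutil and requests just for illustration; this is a sketch of the idea, not the crawler's actual implementation):

    from dateutil import parser as dateparser
    import requests

    def parse_pubdate(pubdate_str, response):
        """Parse an item's pubdate, borrowing the timezone from the HTTP
        Date header when the pubdate doesn't specify one."""
        published = dateparser.parse(pubdate_str)
        if published.tzinfo is None:
            server_date = dateparser.parse(response.headers['Date'])
            published = published.replace(tzinfo=server_date.tzinfo)
        return published

    # Example: the badly formatted pubdate from above, paired with the timezone
    # of the feed's HTTP response (the feed URL is hypothetical).
    response = requests.get('http://example.com/feed.xml')
    print(parse_pubdate('Wed, Feb 28 2014 05:01:00', response))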


Subscribe to the RSS Feed. Check out my code on GitHub
BiteofanApple is licensed under a Creative Commons Attribution 4.0 International License.