BiteofanApple
by Brian Schrader

Python multiprocessing and unittest

Posted on Tue, 28 Apr 2015

I've been having an issue with unit testing Microblogger when my tests need to use Python's multiprocessing module. I've been looking at this code for days now and I can't seem to find the bug. I'm hoping that by writing down my thoughts here, I can think through the problem.

Basically, the test is trying to verify that a User object can be created with information from a remote XML feed. The test gives the User module a URL and tells it to fetch all information at that resource.

    def test_cache_user(self):
        user = User(remote_url='')
        self.assertEqual(user._status, dl.CACHED)
        self.assertEqual(user.username, 'sonicrocketman')

The cache_users function starts up a crawler to go out and parse the contents of the URLs provided.

    def cache_users(users):
        from crawler.crawler import OnDemandCrawler
        remote_links = [user._feed_url for user in users]
        user_dicts = OnDemandCrawler().get_user_info(remote_links)

Everything is still fine to this point. Inside the OnDemandCrawler().get_user_info() method, the OnDemandCrawler crawls the given URLs and then calls self.on_finish(). This is when things get funky.

    def on_finish(self):
        self.stop(now=True)

The stop command tells the crawler to shut down; the now keyword tells it to force-stop the crawling process instead of waiting for a clean exit.

If we look at the source of microblogcrawler (v1.4.1), we see that stop does the following:

    def stop(self, now=False):
        if now:
            # Try to close the crawler and if it fails,
            # then ignore the error. This is a known issue
            # with Python multiprocessing.
            self._stop_crawling = True

The curious part is that self._stop_crawling = True line. In the tests for microblogcrawler, both forcing the crawler to stop and stopping it normally work fine. The issue arises when trying to stop it from inside a unit test. For some reason the crawler doesn't stop.

Here's a sample crawler and the output it produces when run as a unit test:

    class SomeCrawler(FeedCrawler):

        def on_start(self):
            print 'Starting up...' + str(self._stop_crawling)

        def on_finish(self):
            print 'Finishing up...' + str(self._stop_crawling)
            self.stop(now=True)
            print 'Should be done now...' + str(self._stop_crawling)

>>> python -m crawler_test
>>> Starting up...False        # Correct
>>> Finishing up...False       # Correct
>>> Should be done now...True  # Correct
>>> Starting up...False        # lolwut?

For some reason the crawler isn't receiving the signal to stop. Watching it in Activity Monitor, it appears to stop (the 4 worker processes are closed), but then the crawler spawns 4 new worker processes and does it all over again.

The last step of this process is inside the crawler itself. The crawling process is controlled by the self._stop_crawling attribute:

    def _do_crawl(self):
        # Start crawling.
        while not self._stop_crawling:
            # Do work...

From this code, if the _stop_crawling attribute is set to True, then the crawler should finish the round it's on and close down, but the value of the attribute doesn't seem to be sticking when it's assigned in the stop method above.
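One possibility worth ruling out: under multiprocessing, a plain instance attribute assigned in one process isn't visible to code already running in another process, because each process works on its own copy of the object. A process-safe flag like multiprocessing.Event, by contrast, does propagate across the boundary. Here's a minimal sketch of the difference; the crawl_loop function and flag name are my own stand-ins, not microblogcrawler code:

```python
import multiprocessing
import time

def crawl_loop(stop_flag):
    # Stand-in for _do_crawl: spin until the shared flag is set.
    while not stop_flag.is_set():
        time.sleep(0.01)

def run_demo():
    # An Event is visible across the process boundary; a plain
    # `self._stop_crawling = True` set in the parent would only
    # change the parent's copy of the attribute.
    stop_flag = multiprocessing.Event()
    worker = multiprocessing.Process(target=crawl_loop, args=(stop_flag,))
    worker.start()
    stop_flag.set()           # the child process sees this
    worker.join(timeout=5)
    return worker.is_alive()  # False once the loop has exited

if __name__ == '__main__':
    print(run_demo())
```

If the crawler's worker processes read _stop_crawling from their own copy of the object, an assignment made in the test process would never reach them, which would match the behavior above.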

If anyone has any ideas as to what the issue could be, I'd love to hear them. I'm pretty much out of ideas now. As I said before, the tests in the microblog crawler (which are not unit tests) work fine. The issue only comes up when running a test suite through unittest itself.

Microblog Crawler v1.4(.1)

Posted on Sat, 25 Apr 2015

Version 1.4.1 of my MicroblogCrawler is out on PyPI! Technically v1.4 came out last week, but it had a fairly large bug that needed fixing. 1.4.1 has patched it, and it's ready for prime time.

v1.4.1 is full of enhancements, a few of which are listed here:

  • Calling stop now actually stops the crawler. This was due to a nasty bug in Python's multiprocessing module (issue 9400). The crawler now alerts you when such a problem arises by reporting it through the on_error callback.
  • Fixed a bug that caused feeds to throw errors if no pubdate element was found. Such elements are now discarded rather than parsed, and on_error is called.
  • Fixed a major bug when attempting to stop the crawler immediately.

The full version notes are available here.

The major enhancement in this version (besides the graceful exiting) was the addition of a workaround for a bug in Python's multiprocessing module. The bug has to do with what happens to exceptions raised in child processes. When they are raised, they are pickled and sent back to the parent process. The problem arises when an exception is not pickleable: the child process hangs and never exits. The interesting thing is that the bug was first reported in 2010 and affects all versions of Python since then (i.e. 2.7, 3.2, 3.3, 3.4). This bug has been baffling me since I started converting the crawler to be multiprocessed, and it's nice to finally have a workaround.
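The failure mode is easy to reproduce in isolation: an exception carrying an unpicklable payload (an open file handle, say) simply can't make the trip back to the parent process. A quick sketch, where the BadError class is hypothetical and just for illustration:

```python
import pickle

class BadError(Exception):
    def __init__(self, fileobj):
        super().__init__('crawl failed')
        # An open file handle can't be pickled, so this exception
        # could never be sent back to a parent process.
        self.fileobj = fileobj

err = BadError(open(__file__))
try:
    pickle.dumps(err)
    picklable = True
except TypeError:
    picklable = False
print(picklable)  # False
```

When multiprocessing hits an exception like this in a worker, the pickling step fails and the parent never hears back, which is the hang described above.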

If anyone out there is using MicroblogCrawler, I'd love to hear from you, and pull requests are very welcome!

PEP 484 - Type Hints for Python

Posted on Wed, 08 Apr 2015

I'm for any improvements that will help my favorite language run smoother, with fewer errors, and maybe faster someday*.

Guido van Rossum, in the PEP:

This PEP aims to provide a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and performance optimizations utilizing type information. Of these goals, static analysis is the most important. This includes support for off-line type checkers such as mypy, as well as providing a standard notation that can be used by IDEs for code completion and refactoring.

There's been a big push for better static analysis in Python over the last few years, and there have been attempts at this before (see Cython). A language-level standard for Type Hints would bring the benefits to all the various Python implementations.

    # An example of the proposed type hint syntax.
    def greeting(name: str) -> str:
        return 'Hello ' + name

I admit, the new syntax looks very Rust/Swift-like, and that's probably by design. One thing that worries me, and which isn't obvious from that code sample, is that Python Type Hints will (must) include generics and blocks (i.e. lambdas, closures, etc.). When those get into the mix, the Type Hint system starts to look a little messy.

    from typing import Mapping, Set

    def notify_by_email(employees: Set[Employee], overrides: Mapping[str, str]) -> None: ...

Even though that code isn't particularly pretty, the Type Hints can help the static analyzer find errors that could potentially be very hard to track down. As I said, I'm completely in favor of this addition to the Python syntax.
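To make the generics-plus-callables case concrete, here's a sketch of what it looks like under the proposed syntax; apply_all and its names are my own invention, not from the PEP:

```python
from typing import Callable, List, TypeVar

T = TypeVar('T')

def apply_all(funcs: List[Callable[[T], T]], value: T) -> List[T]:
    # Apply each function independently to the same starting value.
    return [f(value) for f in funcs]

print(apply_all([lambda x: x + 1, lambda x: x * 2], 3))  # [4, 6]
```

Nested brackets like List[Callable[[T], T]] are where the syntax starts to get noisy, but they're also exactly where a static analyzer earns its keep.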

As a final note, for those of you worried that Python might be changing to a statically typed language, fear not.

It should also be emphasized that Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

PEP 484 - Type Hints →

* According to the PEP, the goal of Type Hints will not be performance based, but they do go on to say, "Using type hints for performance optimizations is left as an exercise for the reader," which keeps me hopeful that PyPy or maybe even CPython could use them for that purpose as an added benefit.


Posted on Tue, 31 Mar 2015

This week I signed up for a Patreon account and finally started supporting my favorite video producers. It's exciting to see them getting due payment for what they create, Patreon makes it really easy to be a patron, and the benefits are awesome.

One of my favorite series, Extra Credits, normally a video-games-oriented show, created a mini-series a year ago called Extra History, in which they told the story of the Punic Wars in their typical educational fashion. That mini-series is easily in my top 5 of the videos they've ever made, but they couldn't justify continuing it because their main sponsor was a gaming magazine. With Patreon, that has changed. Direct funding from their viewers means there's enough interest in Extra History to justify more and more videos, and it's great.

With direct support supplementing ad sales, creators can make judgments based on interest instead of solely on popularity. This means they can make more of the kinds of videos that they want to make, instead of just what will be popular. If you haven't already, take a look at Patreon. The amounts you can sign up for are trivial, but they make a difference.

Microblog Crawler Diary: Signals Follow Up

Posted on Tue, 31 Mar 2015

Well, saner heads have prevailed. Microblogger will not rely on sending OS signals to alert the crawler of new messages (it was a terrible idea anyway). Instead, the crawler will ping the web server on a private URL periodically to get new messages. The implementation details have not been solidified, but I do like this solution a lot more than the terrible option of sending signals. Plus, this method is scalable to additional servers.
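Roughly, the idea is a polling loop: every so often the crawler fetches a private URL on the web server and collects whatever new messages are waiting. A minimal sketch, with the fetch step injected as a callable since the actual endpoint and payload format aren't settled yet:

```python
import time

def poll_for_messages(fetch, rounds=3, interval=0.0):
    # `fetch` stands in for an HTTP GET against the private URL;
    # it returns a list of new messages (format still undecided).
    collected = []
    for _ in range(rounds):
        collected.extend(fetch())
        time.sleep(interval)
    return collected

print(poll_for_messages(lambda: ['new message'], rounds=2))
```

Because the polling is pull-based, adding more web servers just means more URLs to poll, which is what makes this approach scale where signals couldn't.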

Thanks to Micha for the suggestion.

Priming the Pump

Posted on Tue, 31 Mar 2015

C.G.P. Grey recently announced that he will be posting his videos to iTunes and RSS in addition to putting them on YouTube. His move is part of a growing trend as more and more creators move away from relying solely on proprietary platforms like YouTube. Having videos on YouTube is great; having YouTube be the only place the videos live is scary, and it puts a lot of power in YouTube's hands.

More and more I'm hearing bloggers, video producers, and musicians talk about going back to more open, agnostic systems like RSS and blogs. There's been a lot of talk of reinventing blogging (*cough* Dave Winer *cough*) and of creators owning their own content. Microblogger, for one, hopes to be an active participant in the push to open up the internet.

Together, we seem to be priming the pump for a resurgence of a more open internet, and that's exciting. With Facebook pushing harder than ever to trap other people's content inside its walled garden, and Twitter closing the doors on what developers can do with its API, a resurgence of the open internet is exactly what we need. We need to push for companies to use open standards and open access to data, and we need to use them ourselves. It's time for the pendulum to swing back this way.


Subscribe to the RSS Feed. Check out my code on GitHub
BiteofanApple is licensed under a Creative Commons Attribution 4.0 International License.