Simeon Franklin

Blog :: What Generators are for

22 May 2012

I teach Python classes and enjoy exploring language features from the perspective of newcomers to the language. Usually I can explain the rationale for a Python language feature by showing a compelling use case.

For example a common programming task is to iterate over a container of known length and do something for each item in the container. In a C-like language this might look like:

index = 0; // A
while(index < container.length){ // B
  item = container[index]; //C
  do_something(item);
  index++; // D
}

Of course this pattern is so common that C-like languages have a looping construct just to handle this situation. Notice that the counter initialization (A), the bounds checking (B), and the incrementing (D) are all combined into a single syntactical pattern:

for(index=0;index<container.length;index++){ // A, B, D
    item = container[index];
    do_something(item);
}

Python goes one step further and provides a simpler iterating construct. If what we want is item, why bother with index and container.length?

for item in container:  # E - implicit assignment without temp counter variables
    do_something(item)
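
Behind the scenes, the for statement drives the container's iterator for us. A rough (and slightly simplified) sketch of what it does, using the built-in iter() and next():

# roughly what "for item in container: do_something(item)" expands to
iterator = iter(container)      # ask the container for an iterator
while True:
    try:
        item = next(iterator)   # pull the next item
    except StopIteration:       # the iterator signals it's exhausted
        break
    do_something(item)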

But what about generator functions?

I've always been a little stuck on generator functions, though. Generator expressions make sense as a generator version of list comprehensions, and I suppose if you had iteration you couldn't wrap up in a list comprehension you'd want to be able to do it in a function... Oh, and David Beazley does wicked cool things with generators, but that seems kinda advanced...

So I've been without a real use case for generator functions until I recently ran into a problem for which they turned out to be exactly the right solution.

Consider a remote API with a record limit. For any given query there may be 1,000 records that match, but only PAGE_SIZE records will be returned. The API does provide the ability to do offset queries, so I could manually check the number of records and then repeatedly issue offset queries to get all the records. Of course this means I can't simply loop over the list of records - now my (pseudo)code looks like:

pos = 0
page_index = 0
request = session.CreateSearchRequest(*query)
results = session.Search(request)
while results.HasNext(): # by side effect advances to the next record
    do_something(results)
    pos += 1
    page_index += 1
    if page_index == PAGE_SIZE: # regenerate result set
        request.SetOffset(pos)
        results = session.Search(request)
        page_index = 0 # start the next paged set

This is pretty ugly - I've got way more code managing state in my loop than actually processing my results. I'd like to be able to ignore the paging limits and the non-Pythonic API (librets - I'm looking at you) where .HasNext() returns True or False and has the side effect of advancing the result set. Ideally I'd like to write:

request = session.CreateSearchRequest(*query)
for results in magically_loop_over_my_search(request):
    do_something(results)

instead.

My breakthrough came thanks to a Ned Batchelder article on iteration I read recently. Ned pointed out that generators can be used to reshape iteration. As an example: looping through a two-dimensional data structure like a grid (assuming a .get_cell(x, y) kind of API) is done most easily with nested for loops to get every cell.

To get every cell in a single simple iteration would require some math and extra variables: a row counter, a column counter (reset to 0 every row_length iterations), plus a cell = grid[r][c] kind of assignment to actually retrieve the cell.

Hey - that description looks a lot like my problem with the paged data set!

Ned's solution is simply to do the nested for loops inside a function and yield each cell's coordinates, allowing the following simple iteration:

for (x, y) in twod_range(width, height):
    do_something(spreadsheet.get_cell(x, y))
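
The function itself (a minimal sketch - the name and signature just follow the usage above, and Ned's version may differ in its details) is nothing more than the nested loops with a yield in the middle:

def twod_range(width, height):
    # The nested loops live here instead of at every call site.
    # Each yield hands back one (x, y) pair and suspends the function
    # until the caller asks for the next pair.
    for x in range(width):
        for y in range(height):
            yield (x, y)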

It's important to note that without the yield statement our iterator would be more complicated. We could write a regular function that returns a list containing all the coordinate pairs, but that means more code (list initialization, appending to the list, and returning the list). It would also be less efficient: I want to process 20,000 good-sized records, and I don't want to build a single list containing all 20,000 of them. Generator functions to the rescue!
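
For contrast, here's a sketch of that list-returning version - it carries all the bookkeeping the generator avoids and builds the entire list in memory before the caller sees the first pair:

def twod_range_as_list(width, height):
    coords = []                    # list initialization
    for x in range(width):
        for y in range(height):
            coords.append((x, y))  # append to the list
    return coords                  # return the whole list at once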

Back to my problem: by moving the page handling code into a generator function:

PAGE_SIZE = 250   
def paged_results(session, request, offset=0):
    request.SetLimit(PAGE_SIZE)
    request.SetOffset(offset)
    return session.Search(request)

def get_results(session, request):
    pos = current_loop = 0
    results = paged_results(session, request)
    num_records = results.GetCount()
    while pos < num_records and results.HasNext(): # .HasNext() advances result set
        yield results
        pos += 1
        current_loop += 1
        if current_loop == PAGE_SIZE:
            results = paged_results(session, request, offset=pos)
            current_loop = 0 # reset paging

I'm able, at various places in my code, to consume my paged result set with the desired Pythonic API:

request = session.CreateSearchRequest(*query) 
for results in get_results(session, request):
    do_something(results)

I'm now realizing that while loops may often be a sign of un-Pythonic code. Would I rather say

while not file.EOF():
    line = file.get_line()
    do_something(line)

or

for line in file:
    do_something(line)
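
And if file weren't already iterable, the same reshaping trick would apply - wrap the while loop in a generator once and get the for loop everywhere else. A sketch, assuming the hypothetical .EOF()/.get_line() API from the example above:

def lines(file):
    # hide the EOF-checking while loop behind a generator so
    # callers can just write: for line in lines(file)
    while not file.EOF():
        yield file.get_line()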

My new compelling use case for simple generator functions is:

Generator functions give me a conceptually easy tool to reshape complicated iteration into simple for loops.

