Simeon Franklin
Blog :: What Generators are for
22 May 2012
I teach Python classes and enjoy exploring language features from the perspective of newbies to the language. Usually I can explain the rationale for Python language features by showing a compelling use case.
For example a common programming task is to iterate over a container of known length and do something for each item in the container. In a C-like language this might look like:
index = 0;                        // A
while(index < container.length){  // B
    item = container[index];      // C
    do_something(item);
    index++;                      // D
}
Of course this pattern is so common that C-like languages have a looping construct just to handle this situation. Notice that the counter initialization (A), the bounds checking (B), and the incrementing (D) are all combined into a single syntactical pattern:
for(index=0; index<container.length; index++){  // A, B, D
    item = container[index];
    do_something(item);
}
Python goes one step further and provides a simpler iterating construct. If what we want is item, why bother with index and container.length?
for item in container:  # E - implicit assignment without temp counter variables
    do_something(item)
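It's worth seeing what that for loop is doing for us. Here's a sketch of roughly what Python runs under the hood: it asks the container for an iterator and calls next() until the iterator is exhausted (the `seen` list here is just a stand-in for do_something):

```python
# A sketch of what Python's for loop does behind the scenes:
# call iter() on the container, then next() until StopIteration.
container = ["a", "b", "c"]
seen = []

iterator = iter(container)     # ask the container for an iterator
while True:
    try:
        item = next(iterator)  # the implicit assignment in (E)
    except StopIteration:      # raised when the iterator is exhausted
        break
    seen.append(item)          # stand-in for do_something(item)

# seen is now ["a", "b", "c"]
```

Generator functions plug into exactly this protocol, which is why they drop into a for loop so naturally.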
But what about generator functions?
I've always been a little stuck with generator functions though - generator expressions make sense as a generator version of list comprehensions and I suppose if you had iteration you can't wrap up in a list comprehension you'd want to be able to do it in a function... Oh and David Beazley does wicked cool things with generators, but that seems kinda advanced...
So I've been stuck with a use case for generators functions till I recently ran into a problem for which they turned out to be exactly the right solution.
Consider a remote API with a record limit. For any given query there may be 1,000 records that match but only PAGE_SIZE records will be returned. The API does provide the ability to do offset queries so I could manually check the number of records and then repeatedly issue offset queries to get all the records. Of course this means I can't simply loop over the list of records - now my (pseudo)code looks like:
pos = 0
page_index = 0
request = session.CreateSearchRequest(*query)
results = session.Search(request)
while results.HasNext():  # by side effect advances to the next record
    do_something(results)
    pos += 1
    page_index += 1
    if page_index == PAGE_SIZE:  # regenerate result set
        request.SetOffset(pos)
        results = session.Search(request)
        page_index = 0  # start the next paged set
This is pretty ugly - I've got way more code managing state in my loop than actually processing my results. Ideally I'd like to be able to ignore the paging limits and the non-Pythonic API (librets - I'm looking at you) where .HasNext() returns True or False and has the side effect of advancing the result set. Ideally I'd like to write:
request = session.CreateSearchRequest(*query)
for results in magically_loop_over_my_search(request):
    do_something(results)
instead.
My breakthrough came thanks to a Ned Batchelder article on iteration I read recently. Ned pointed out that generators can be used to reshape iteration. As an example: looping through a two dimensional data structure like a grid (assuming a .getcell(x,y) kind of API) is done most easily with nested for loops to get every cell.
To get every cell in a single simple iteration would require some math and extra variables: a row counter, a column counter (reset to 0 every row_length iterations), plus a cell = grid[r][c] kind of assignment to actually retrieve the cell.
Hey - that description looks a lot like my problem with the paged data set!
Ned's solution is to simply do the nested for loops inside a function and yield each cell's coordinates, allowing the following simple iteration:
for (x, y) in twod_range(width, height):
    do_something(spreadsheet.get_cell(x, y))
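The generator itself is only a few lines. Here's a sketch of what twod_range might look like, assuming the nested-loop, row-major shape described above:

```python
def twod_range(width, height):
    """Yield every (x, y) coordinate pair in a width-by-height grid."""
    for x in range(width):
        for y in range(height):
            yield (x, y)  # pause here; resume on the caller's next iteration

# The nested loops live inside the function;
# the caller sees one flat sequence of pairs.
```

So list(twod_range(2, 2)) produces [(0, 0), (0, 1), (1, 0), (1, 1)], but the generator never holds more than one pair at a time.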
It's important to note that without the yield statement our iteration would be more complicated. We could write a regular function that returns a list containing all the coordinate pairs - but it would take more code (initialize a list, append each pair, return the list). It would also be less efficient: I want to process 20,000 good-sized records, and I don't want to build a single list containing all 20,000 of them. Generator functions to the rescue!
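For contrast, here's a sketch of that list-building alternative (twod_range_list is my hypothetical name for it) - same coordinates, three extra bookkeeping steps, and the whole grid held in memory at once:

```python
def twod_range_list(width, height):
    """List-building version: same pairs, but all in memory at once."""
    coords = []                    # list initialization
    for x in range(width):
        for y in range(height):
            coords.append((x, y))  # append each pair to the list
    return coords                  # return the whole list
```

With yield, all three of those bookkeeping steps disappear, and memory use drops from the whole grid to a single pair.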
Back to my problem: I moved the page-handling code into a generator function:
PAGE_SIZE = 250

def paged_results(session, request, offset=0):
    request.SetLimit(PAGE_SIZE)
    request.SetOffset(offset)
    return session.Search(request)

def get_results(session, request):
    pos = current_loop = 0
    results = paged_results(session, request)
    num_records = results.GetCount()
    while pos < num_records and results.HasNext():  # .HasNext() advances result set
        yield results
        pos += 1
        current_loop += 1
        if current_loop == PAGE_SIZE:  # fetch the next page
            results = paged_results(session, request, offset=pos)
            current_loop = 0  # reset paging
Now, at various places in my code, I can consume my paged result set with the desired Pythonic API:
request = session.CreateSearchRequest(*query)
for results in get_results(session, request):
    do_something(results)
I'm now realizing that while loops are often a marker of un-Pythonic code. Would I rather write
while not file.EOF():
    line = file.get_line()
    do_something(line)
or
for line in file:
    do_something(line)
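And even when I'm stuck with an EOF-style API, a three-line generator recovers the for loop. A sketch, where .EOF() and .get_line() are the hypothetical interface from the example above:

```python
def lines_of(file):
    """Adapt an .EOF()/.get_line() style object to Python iteration."""
    while not file.EOF():
        yield file.get_line()  # the ugly while loop lives here, once

# Now every caller gets to write:
#     for line in lines_of(file):
#         do_something(line)
```

The un-Pythonic while loop still exists, but it's written exactly once, inside the generator, instead of at every call site.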
My new compelling use case for simple generator functions is:
Generator functions give me a conceptually easy tool to reshape complicated iteration into simple for loops.