The Art of Parsing – Python – Removing Duplicates

When processing large amount of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself the question: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method to remove duplicates from a list:

def uniq(inlist):

if not inlist:
return inlist
inlist.sort()
outlist = [inlist[0]]
for i in range(1,len(inlist)):
if inlist[i]!=inlist[i-1]:
outlist.append(outlist[i])

return outlist

(ok, indentation doesn’t really work with this free version of wordpress AFAIK). But then I decided to try

from sets import Set
def uniq(inlist):

return list(Set((item for item in inlist)))

which turned out to be a significant speedup. And the code is much cleaner too 🙂 The graphs below show the speed up:

This graph compares two Python methods for removing duplicates from a list

The graph above shows the processing time for removing duplicates from a list as function of list size for the two method described above (“Method 1” is the second method, using the sets module). The graph below shows the relative speedup:

This graph shows how much faster Method 1 is (the method using the sets module)

Advertisements

~ by anopisthographs on July 13, 2010.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: