The Art of Parsing – Python – Removing Duplicates
When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of, say, words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method for removing duplicates from a list:
def uniq(inlist):
    # An empty list has no duplicates to remove
    if not inlist:
        return inlist
    # Sort so that equal elements end up next to each other
    inlist.sort()
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        # Keep an element only if it differs from its predecessor
        if inlist[i] != inlist[i-1]:
            outlist.append(inlist[i])
    return outlist
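Note that this version sorts the list in place, so it mutates its argument and returns the unique elements in sorted order. For example (my own toy input):

>>> uniq([3, 1, 2, 3, 1])
[1, 2, 3]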
But then I decided to try the sets module:
from sets import Set  # the sets module is gone in Python 3; the built-in set does the same

def uniq(inlist):
    # Constructing a set discards duplicates (elements must be hashable)
    return list(Set(inlist))
which turned out to give a significant speedup, and the code is much cleaner too 🙂 The speedup also makes sense: building a set hashes each element once, which is O(n) on average, while the first method is dominated by its O(n log n) sort. The graphs below show the speedup:
The graph above shows the processing time for removing duplicates from a list as a function of list size for the two methods described above (in the legend, "Method 1" is the second method, the one using the sets module). The graph below shows the relative speedup:
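For readers who want to reproduce the comparison themselves, here is a minimal benchmark sketch. The list sizes, the random-integer test data, and the repetition count are my own choices, not the ones behind the graphs above; it uses the built-in set, since the sets module no longer exists in Python 3:

import random
import timeit

def uniq_sort(inlist):
    # Sort-based method: duplicates become adjacent after sorting
    if not inlist:
        return inlist
    inlist.sort()
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        if inlist[i] != inlist[i-1]:
            outlist.append(inlist[i])
    return outlist

def uniq_set(inlist):
    # Set-based method: constructing a set discards duplicates
    return list(set(inlist))

for n in (1000, 10000, 100000):
    data = [random.randint(0, n // 2) for _ in range(n)]
    # Pass a fresh copy each run so the in-place sort doesn't skew the timings
    t_sort = timeit.timeit(lambda: uniq_sort(list(data)), number=10)
    t_set = timeit.timeit(lambda: uniq_set(list(data)), number=10)
    print(f"n={n}: sort {t_sort:.4f}s, set {t_set:.4f}s, speedup {t_sort / t_set:.1f}x")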