Wednesday, May 6, 2009

How to process a large text file efficiently in Python

Abstract

Three different ways of processing a text file line by line are given, in order of increasing efficiency.

I have to handle a large text file of space-separated data in Python, and the data looks like this:

tag1 tag2 tag3
12 34 12
123 345 12

The first line holds the tags for each column, and the remaining lines hold the data. Since the tags are fixed, I can code them directly into my script, which is to say the first line should be skipped. My first script goes like this:

file = open('foo.txt', 'r')
# readlines() loads the whole file into a list of lines;
# the [1:] slice drops the tag line
for line in file.readlines()[1:]:
    # do something
    pass

This script requires a vast amount of RAM, since readlines() has to build a list holding every line of the file! So it is wiser to use the file's iterator:

file = open('foo.txt', 'r')
first = True
for line in file:
    if first:
        # skip the tag line
        first = False
    else:
        # do something
        pass

The second script works much better than the first one, because the lines are read one at a time through the file's iterator. However, the first flag is not a neat way to skip the first line: the flag is tested on every iteration even though it only matters once. The third script solves this problem:

file = open('foo.txt', 'r')
# consume the tag line before iterating
file.readline()
for line in file:
    # do something
    pass

The file.readline() call moves the file position one line forward, so the iterator then starts from the second line. :)
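As a closing note, here is a minimal sketch of the same idea in a slightly more modern form, assuming Python 2.6 or later and the foo.txt layout above. The built-in next() consumes the tag line just as readline() does, and the with block closes the file automatically; the unpacking into v1, v2, v3 is only a placeholder for whatever "do something" stands for:

with open('foo.txt', 'r') as file:
    next(file)  # consume the tag line, like readline() above
    for line in file:
        # each data line holds one value per fixed tag
        v1, v2, v3 = line.split()
        # do something with the values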
