python 按行读文件

Doing it the usual way

The standard idiom consists of a an ‘endless’ while loop, in which we repeatedly call the file’s readline method. Here’s an example:

# File: readline-example-1.pyfile = open("sample.txt")while 1: line = file.readline() if not line: break pass # do somethingThis snippet reads the file line by line. If readline reaches the end of the file, it returns an empty string. Otherwise, it returns the line of text, including the trailing newline character.

On my test machine, using a 10 megabyte sample text file, this script reads about 32,000 lines per second.

Using the fileinput module

If you think the while loop is ugly, you can hide the readline call in a wrapper class. The standard fileinput module contains an input class which does exactly that.

# File: readline-example-2.pyimport fileinputfor line in fileinput.input("sample.txt"): passHowever, adding more layers of Python code doesn’t exactly help. For the same test setup, performance drops to 13,000 lines per second. That’s nearly two and half times slower!

Speeding up line reading

To speed things up, we obviously need to make sure we spend as little time on in Python code (running under the interpreter) as possible.

One way to do this is to tell the file object to read larger chunks of data. For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method. Or you could even use the read method to read the entire file into a single memory block, and then use string.split to chop it up into individual lines.

However, if you’re processing really large files, it would be nice if you could limit the chunk size to something reasonable. For example, if you read a few thousand lines at a time, you probably won’t use up more than 100 kilobytes or so.

The following script uses a nested loop. The outer loop uses readlines to read about 100,000 bytes of text, and the inner loop processes those lines using a simple for-in loop:

# File: readline-example-3.pyfile = open("sample.txt")while 1: lines = file.readlines(100000) if not lines: break for line in lines: pass # do somethingCan this really be faster? You bet. With the same test data, we can now process 96,900 lines of text per second!

Or to put it another way, this solution is three times as fast as the standard solution, and over seven times faster than the fileinput version.

In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better:

# File: readline-example-5.pyfile = open("sample.txt")for line in file: pass # do somethingIn Python 2.1, you have to use the xreadlines iterator factory instead:

# File: readline-example-4.pyfile = open("sample.txt")for line in file.xreadlines(): pass # do something

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。