string – Reading a text file and splitting it into single words in python

Given this file:

$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6

If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):

with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

Prints:

line1
word1
word2
line2
...
word6 
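
The loop works because split() with no argument splits on any run of whitespace and discards the trailing newline. Here is a minimal sketch of that behaviour, using a made-up sample string in place of a line read from the file:

```python
# Illustrative line standing in for one line read from words.txt.
line = "line1  word1\tword2\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores leading/trailing whitespace, including the newline.
words = line.split()
print(words)

# With an explicit separator, empty strings and the newline are kept.
pieces = line.split(" ")
print(pieces)
```

This is why the first form is usually the one you want when you only care about the words themselves.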

Similarly, if you want to flatten the file into a single list of words, you might do something like this:

with open('words.txt') as f:
    flat_list=[word for line in f for word in line.split()]

>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']

This can produce the same output as the first example with print('\n'.join(flat_list)).

Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):

with open('words.txt') as f:
    matrix=[line.split() for line in f]

>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
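
Once you have the nested list, rows and columns can be pulled out as in the sketch below. It hard-codes the sample data instead of reading the file, and it assumes every row has the same number of words (zip stops at the shortest row):

```python
# Nested list shaped like the result above (sample data from words.txt).
matrix = [
    ["line1", "word1", "word2"],
    ["line2", "word3", "word4"],
    ["line3", "word5", "word6"],
]

# Row access is plain indexing.
second_row = matrix[1]

# zip(*matrix) transposes rows into columns; this assumes equal-length rows.
columns = [list(col) for col in zip(*matrix)]
first_column = columns[0]

print(second_row)
print(first_column)
```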

If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:

import re
with open('words.txt') as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            # wordN by wordN with no lineN
            print(word)
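
To try the filter without a file on disk, here is a small sketch that feeds the same three sample lines through io.StringIO. The names sample, pattern and matches are only for illustration:

```python
import io
import re

# In-memory stand-in for words.txt so the sketch runs without a real file.
sample = io.StringIO(
    "line1 word1 word2\n"
    "line2 word3 word4\n"
    "line3 word5 word6\n"
)

# \b anchors at a word boundary, 'word' matches literally, \d+ needs digits,
# so tokens such as 'line1' are skipped.
pattern = re.compile(r"\bword\d+")

matches = []
for line in sample:
    matches.extend(pattern.findall(line))

print(matches)
```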

Or, if you want that to be a line by line generator with a regex:

with open('words.txt') as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
    for word in words:  # consume the generator while the file is still open
        print(word)

Alternatively, read the whole file at once and split it:

with open('words.txt') as f:
    for word in f.read().split():
        print(word)

In addition, if you are reading a very large file and don't want to read all of its content into memory at once, you might consider reading it through a buffer and returning each word with yield:

def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break

            # make sure we end on a space (word boundary)
            while not str.isspace(buf[-1]):
                ch = f.read(1)
                if not ch:
                    break
                buf += ch

            words = buf.split()
            for word in words:
                yield word
    # An empty file yields nothing, so the caller's loop simply does not run.

if __name__ == '__main__':
    for word in read_words('./very_large_file.txt'):
        print(word)  # replace with your own per-word processing
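
A variant of the same idea, sketched here under my own assumptions, holds back the last partial word of each chunk instead of reading one character at a time until whitespace. The names iter_words, chunk_size and leftover are hypothetical and not part of the answer above; the temporary file is only there to make the demo self-contained:

```python
import os
import tempfile


def iter_words(path, chunk_size=10240):
    """Yield whitespace-separated words while holding one chunk in memory."""
    leftover = ""  # partial word cut off at the end of the previous chunk
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk does not end in whitespace, its last word may
            # continue in the next chunk, so hold it back.
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            yield from words
    # Emit the final word if the file did not end in whitespace.
    if leftover:
        yield leftover


# Demo with a deliberately tiny chunk size so words straddle chunk boundaries.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line1 word1 word2\nline2 word3 word4\n")
    demo_path = tmp.name

result = list(iter_words(demo_path, chunk_size=4))
os.remove(demo_path)
print(result)
```

Note that a single "word" longer than the chunk size still has to be held in memory in full, with either approach.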
