题意是统计一个文本中单词出现个数。可以肯定的是涉及到文件操作以及正则匹配。

这里只是单纯地匹配单词,还是很简单的。我们只要提取出单词的特征就好了(当然我们默认给定的都是合法的单词

另外对正则表达式不太熟悉的朋友可以参考:正则表达式学习

下面上代码,7行代码就可以解决了:

1
2
3
4
5
6
7
import re
f = open("in.txt", "r")
regx = re.compile(r"\b[\w']*[^ ]\b")
regx = regx.findall(f.read())
print("单词总数为:", len(regx))
print(regx)

说明:

我们把单词文本放在in.txt目录下,正则表达式\b[\w']*[^ ]\b用来匹配所有的单词,这里我们提取的单词特征为:字母、数字、下划线或'出现多次,且不含空格

f.read()读取文本所有内容,findall进行正则匹配。

测试内容如下:

To ease these woes, public transport app Citymapper has released a new feature that offers advice on which part of the train to get on to shave time off your journey. The Boarding Strategy feature advises users whether they are better to get on the front, middle or back of their train in preparation for the next leg of their journey.

The new feature could also make journeys easier for visitors in unfamiliar cities by helping them make transfers between trains quickly and efficiently.

According to a post on Citymapper’s blog: ‘In peak hours this will knock minutes off your journey time. We have collected this data as much as we’ve been able to in all our cities.’

The service works when a user searches for a journey and also when using the Citymapper Go mode. A logo appears in the journey advice menu, suggesting which part of the train is best to get on. It also gives Citymapper a new edge when facing competition from search giant Google and Apple Maps.

However, some users have pointed out that it may lead to already packed carriages becoming even fuller if they are close to exits at popular stations. Plus, the app has still to provide advice that many commuters will be wanting, namely which carriage to get on to increase the chances of getting a seat.

Citymapper is available in 29 cities providing journey advice on trains, trams, buses, underground services and metro lines.

放到word中可以发现以上共244个单词,下面我们运行python程序python 0004.py > out.txt,比较结果:

截图

发现结果匹配成功!

0006题和0004题比较像,要求是统计每个文件中出现次数最多的单词

代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import re
def get_word(diary):
regx = re.compile(r"\b[\w']*[^ ]\b")
content = open(diary, "r").read()
regx = regx.findall(content)
mp = {}
for cnt in range(0, len(regx)):
if mp.get(regx[cnt]) != None:
mp[regx[cnt]] = mp[regx[cnt]] + 1
else:
mp[regx[cnt]] = 1
ls = sorted(mp.items(), key=lambda d:d[1], reverse=True)#按值从大到小对map排序
print("The most important word in", diary, "is \"" + ls[0][0] + "\" and it shows", ls[0][1], "times.")
path = ("in1.txt", "in2.txt")
for diary in path:
get_word(diary)