I had previously read some Python crawler tips on 码农网, but I felt that the write-up was not comprehensive enough, and while writing my own crawlers I have also settled into some routines of my own.

So I am sharing them here, partly of course so that I have something to refer back to once I have forgotten them.

1. Basic web page fetching

  • Spoofing a browser User-Agent (to avoid 403 errors)

  • Using a proxy so that long crawling sessions do not get your own IP banned

  • Handling gzip-compressed responses

  • Handling HTTPError exceptions

import urllib.request
import urllib.error
import zlib

# Fetch the HTML source for the given url
def getHtml(url):
    # Pretend to be a regular browser
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.116 Chrome/48.0.2564.116 Safari/537.36',
        'Accept-Encoding': 'gzip',
    }
    request = urllib.request.Request(url, headers=header)
    try:
        # Route the request through the proxy 117.135.252.227:80
        proxy_support = urllib.request.ProxyHandler({'http': '117.135.252.227:80'})
        opener = urllib.request.build_opener(proxy_support)
        urllib.request.install_opener(opener)
        page = urllib.request.urlopen(request)
        # print(page.headers.get('Content-Encoding'))   # 'gzip'
        # print(page.headers.get_content_charset())     # 'utf8'
        # If the response is gzip-compressed, decompress it before decoding
        if page.headers.get('Content-Encoding') == 'gzip':
            return zlib.decompress(page.read(), 16 + zlib.MAX_WBITS).decode('utf8')
        else:
            return page.read().decode(page.headers.get_content_charset())
    except urllib.error.HTTPError as e:
        print('HTTPError:', str(e))
        return None
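
For reference, a minimal call might look like the sketch below; the URL is just a placeholder.

# Hypothetical usage: replace the URL with the page you actually want to fetch
html = getHtml('http://www.example.com/')
if html is not None:
    print(html[:200])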

2. MySQL database operations

The database connection is usually opened in the class constructor (__init__) and closed in the destructor (__del__).

Here is an example:

import mysql.connector

class Example():
    # Constructor: open the database connection
    def __init__(self):
        self.conn = mysql.connector.connect(user='root', password='root', host='localhost', database='test')
        self.cursor = self.conn.cursor(buffered=True)

    # Destructor: close the cursor and the connection
    def __del__(self):
        self.cursor.close()
        self.conn.close()

    # Database operation:
    # count the rows in table where id = 1
    def doSomething(self):
        query = 'SELECT COUNT(*) FROM table WHERE id = %s'
        data = (1, )
        # Run the query
        self.cursor.execute(query, data)
        # Make sure the transaction is committed
        self.conn.commit()
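
Note that doSomething() runs the COUNT query but never reads the result back; a minimal sketch of reading it with cursor.fetchone() follows (the table and column names are the same placeholders as above).

# Hypothetical usage of the Example class above
e = Example()
e.cursor.execute('SELECT COUNT(*) FROM table WHERE id = %s', (1, ))
count = e.cursor.fetchone()[0]   # COUNT(*) is the first (and only) column
print(count)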

3. Inserting JSON-style data into a table

First use the toJson() function to turn the fields we want to insert into a JSON-style dict, then use the jsonINTOMysql() function to insert that data into MySQL.

import sys

def toJson(**kwargs):
    return kwargs

# Insert the JSON-style dict rowdict into table
# (intended as a method of the Example class above, hence self)
def jsonINTOMysql(self, table, rowdict):
    self.cursor.execute('DESCRIBE %s' % table)
    allowedKeys = set(row[0] for row in self.cursor.fetchall())
    keys = allowedKeys.intersection(rowdict)
    if len(rowdict) > len(keys):
        unknownKeys = set(rowdict) - allowedKeys
        print("skipping keys:", ", ".join(unknownKeys), file=sys.stderr)
    columns = ", ".join(keys)
    values_template = ", ".join(["%s"] * len(keys))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, columns, values_template)
    values = tuple(rowdict[key] for key in keys)
    self.cursor.execute(sql, values)
    self.conn.commit()

def example(self):
    # Suppose the table has the columns id, name and sex;
    # id, name and sex below hold the values to be inserted
    row = toJson(id=id, name=name, sex=sex)
    self.jsonINTOMysql('table', row)

4. Decoding garbled characters in crawled data

Sometimes what we need to crawl is a standalone piece of JSON data (see the earlier post on using a crawler to fetch data generated by JS), and we may find that the values inside the JSON are encoded. For example, when I crawled detailed vehicle specifications from 汽车之家, the JSON data looked like this:

{"SIP_C_119":"2645","SIP_C_250":"%2d","SIP_C_306":"%2d","SIP_T_LOGO":"http://i1.itc.cn/20130624/83e_01580269_5a5f_1526_74cc_c6bf0064c28e_1.jpg","SIP_C_114":"%2d%2d%2d","SIP_C_117":"7005","SIP_C_305":"%2d","SIP_C_304":"%2d%2d%2d","SIP_C_118":"2040","SIP_C_303":"%u67f4%u6cb9%u673a","SIP_C_115":"%u6574%u8f663%u5e74%2f6%u4e07%u516c%u91cc","SIP_C_116":"%2d%2d%2d","model_engine_type":2,"SIP_C_120":"3935","overseas":false,"SIP_T_PRICE":32.0,"SIP_C_261":"%u25cf","SIP_T_ID":127870,"SIP_T_GEAR":"M","SIP_C_124":"%u5ba2%u8f66","SIP_C_125":"2","SIP_C_126":"10%2d23","SIP_C_127":"90","SIP_C_170":"%2d","SIP_C_329":"120","SIP_C_171":"%u673a%u68b0%u6db2%u538b%u52a9%u529b%u8f6c%u5411","SIP_C_322":"%2d","SIP_T_MODELNAME":"%u5b89%u51ef%u5ba2%u8f66%20%u5b9d%u65af%u901a","SIP_C_321":"%2d","SIP_C_320":"%u624b%u52a8","SIP_C_185":"%u25cf","SIP_C_283":"%u25cf","SIP_C_318":"%u5364%u7d20","SIP_C_108":"5%u6863%u624b%u52a8","SIP_C_317":"%2d","SIP_C_285":"%u25cf","SIP_C_314":"%2d","SIP_C_104":"%u6c5f%u6dee%u6c7d%u8f66","SIP_C_313":"%2d","SIP_C_105":"%u5176%u4ed6%u8f66%u578b","SIP_C_316":"%2d","SIP_C_106":"2%u95e810%2d23%u5ea7%u5ba2%u8f66","SIP_C_315":"%2d","SIP_C_107":"3%2e0T%20163%u9a6c%u529bL4","SIP_C_310":"%2d","SIP_C_312":"%2d","SIP_C_102":"32%2e0%u4e07%u5143","SIP_C_103":"32%2e0%7e32%2e0%u4e07%u5143","SIP_C_150":"%u7f38%u5185%u76f4%u55b7","SIP_C_294":"%2d%2d%2f%2d%2d%2f%2d%2d","SIP_C_297":"163","SIP_C_291":"%2d%2d%2d","SIP_C_151":"%u94dd%u5408%u91d1","SIP_C_293":"7005x2040x2645","SIP_C_152":"%u94f8%u94c1","SIP_C_292":"%2d%2d%2d","SIP_T_DISPL":3.0,"SIP_C_158":"%u624b%u52a8","SIP_C_157":"5","SIP_C_156":"5%u6863%u624b%u52a8","SIP_C_155":"%u56fdIV","SIP_C_298":"120%2f3800","SIP_C_299":"362%2f1600%2d2200","SIP_C_159":"%u4e2d%u7f6e%u540e%u9a71","SIP_C_224":"%u771f%u76ae","SIP_C_160":"%u9ea6%u5f17%u900a%u5f0f%u72ec%u7acb%u60ac%u6302","SIP_C_161":"%u94a2%u677f%u5f39%u7c27%u7ed3%u6784","SIP_C_162":"%u627f%u8f7d%u5f0f%u8f66%u8eab","SIP_C_163":"7%2e00%20R16","SIP_C_164":"7%2e00%20R16","SIP_C_165":"%u94a2%u5236","nameDomain":"4094","SIP_C_167":"%u901a%u98ce%u76d8%u5f0f","SIP_C_166":"%u5168%u5c3a%u5bf8%u5907%u80ce","SIP_C_169":"%u624b%u5239%u5f0f%u5236%u52a8","SIP_C_168":"%u9f13%u5f0f","SIP_T_MODELID":4094,"SIP_T_STA":1,"SIP_C_335":"6%u4e07","SIP_C_333":"1600","SIP_C_334":"2200","SIP_C_332":"362","SIP_T_YEAR":2014,"SIP_C_330":"3800","SIP_C_139":"4","SIP_C_138":"%u6da1%u8f6e%u589e%u538b","SIP_C_137":"2968","SIP_C_136":"3%2e0","SIP_C_135":"NGD3%2e0%2dC3HA","SIP_C_134":"3%2e0T%20163%u9a6c%u529bL4","SIP_C_249":"0","SIP_C_140":"%u76f4%u5217","SIP_C_141":"4","SIP_C_347":"%u5364%u7d20","SIP_C_142":"%u53cc%u9876%u7f6e","brandNameDomain":"ak-2171","SIP_C_149":"%u67f4%u6cb9","SIP_C_148":"40%2e0","SIP_T_NAME":"3%2e0T%20VIP%u7248","SIP_C_241":"%u25cf"}

This kind of encoding should look familiar: it is just like what you get when you copy and paste a URL that contains Chinese characters. So how do we decode this data?

Use the following approach:

from urllib.parse import unquote

# First url-decode the string with unquote, then eval it to decode the %uXXXX unicode escapes.
# When processing the data, simply pass each value of the key-value pairs to deUnicode in turn.
def deUnicode(s):
    try:
        ans = eval('"%s"' % unquote(s).replace('%', '\\'))
        return ans
    except SyntaxError:
        return s
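
For illustration, applying it to two of the values from the JSON sample above:

# Values taken from the JSON sample above
print(deUnicode('%u67f4%u6cb9%u673a'))  # -> 柴油机
print(deUnicode('%2d%2d%2d'))           # -> ---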

5. Submitting form data via POST with requests (typically for simulating a login)

The POST approach with requests relies on maintaining a session; in other words, as long as the session exists we can fetch, as a logged-in user, the source of other pages that are only available after logging in.

Simple usage:

import urllib.parse
import requests

url = ''
datas = urllib.parse.urlencode({
    'data-key': 'data-value'
})
headers = {
    'header-key': 'header-value'
}
session = requests.Session()
session.post(url, datas, headers=headers)
# other_url: a page that is only accessible after logging in
res = session.get(other_url).text
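
Note that the Session object keeps the login cookies automatically, and requests also accepts the form fields as a plain dict, so the urlencode step can be skipped. A minimal sketch with placeholder URLs and field names:

# Placeholder URLs, form fields and headers, for illustration only
session = requests.Session()
session.post('http://example.com/login',
             data={'username': 'user', 'password': 'pass'},
             headers={'User-Agent': 'Mozilla/5.0'})
res = session.get('http://example.com/profile').text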