The urllib Library and URLError Exception Handling

1. Changes to the urllib library

Python 2.X -> Python 3.X
import urllib2 -> import urllib.request, urllib.error
import urllib -> import urllib.request, urllib.error, urllib.parse
import urlparse -> import urllib.parse
urllib2.urlopen -> urllib.request.urlopen
urllib.urlencode -> urllib.parse.urlencode
urllib.quote -> urllib.parse.quote
cookielib.CookieJar -> http.cookiejar.CookieJar
urllib2.Request -> urllib.request.Request
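
As a quick sanity check of the table above, the following sketch imports the Python 3 modules and prints where each name lives. The dictionary name python3_names is only for illustration.

```python
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

# Python 2 name -> the object that replaces it in Python 3
python3_names = {
    "urllib2.urlopen": urllib.request.urlopen,
    "urllib2.Request": urllib.request.Request,
    "urllib.urlencode": urllib.parse.urlencode,
    "urllib.quote": urllib.parse.quote,
    "cookielib.CookieJar": http.cookiejar.CookieJar,
}

for old_name, new_obj in python3_names.items():
    # __module__ shows which Python 3 module actually defines the object
    print(old_name, "->", new_obj.__module__ + "." + new_obj.__name__)
```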

2. Quickly fetching a web page with urllib

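import urllib.request

file = urllib.request.urlopen("http://www.baidu.com")
# Read the first line before read(); once read() has consumed the response, readline() returns b""
dataline = file.readline()
print(dataline)
# read() returns the remainder; prepend the first line to get the full page
data = dataline + file.read()

f = open("../爬取数据存储/01_4.2.html","wb")
f.write(data)
f.close()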

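Note: there are three common ways to read the content:
1. file.read() reads the entire content. Unlike readlines(), it returns everything as a single value (a bytes object when the source is a urlopen() response).
2. file.readlines() also reads the entire content. Unlike read(), it returns a list with one element per line. It is the recommended way to read the full content.
3. file.readline() reads a single line of the content.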
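
The difference between the three calls can be tried offline. This is a minimal sketch that uses io.BytesIO as a stand-in for the response object (an assumption made so that no network is needed; a urlopen() response is read the same way). It also shows that a stream can only be consumed once.

```python
import io

raw = b"line 1\nline 2\nline 3\n"

whole = io.BytesIO(raw).read()        # everything, as one bytes object
lines = io.BytesIO(raw).readlines()   # everything, as a list with one item per line
first = io.BytesIO(raw).readline()    # only the first line

# A stream is consumed as it is read: after read(), nothing is left for readline()
stream = io.BytesIO(raw)
stream.read()
leftover = stream.readline()

print(whole)
print(lines)
print(first)
print(leftover)
```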

Besides this, the urlretrieve() function in urllib.request writes the fetched content directly to a local file. Its form is: urllib.request.urlretrieve(url, filename="local file path")

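import urllib.request

file = urllib.request.urlopen("http://www.baidu.com")
# urlretrieve() downloads the URL straight into a local file and returns (filename, headers)
filename = urllib.request.urlretrieve("http://edu.51cto.com", filename="../爬取数据存储/02_.html")
# Clear the cache that urlretrieve() may leave behind
urllib.request.urlcleanup()
print(file.info())     # response headers
print(file.getcode())  # HTTP status code, e.g. 200
print(file.geturl())   # URL that was actually fetched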
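
urlretrieve() returns a tuple of the local path and the response headers, which is easy to overlook. The helper below is a small sketch that unpacks the tuple; the name download and its parameters are illustrative, not part of urllib.

```python
import urllib.request


def download(url, dest):
    """Save the content at url into the file dest and return (local_path, headers)."""
    local_path, headers = urllib.request.urlretrieve(url, filename=dest)
    # Remove any temporary files that earlier urlretrieve() calls may have left behind
    urllib.request.urlcleanup()
    return local_path, headers
```

For example, download("http://edu.51cto.com", "page.html") would save the page to page.html and return that path together with the headers.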

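Generally speaking, the URL standard allows only a subset of ASCII characters, such as digits, letters, and some symbols. Other characters, for example Chinese characters, do not conform to the standard and must be URL-encoded before use. Encoded text can likewise be decoded.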

import urllib.parse

urllib.parse.quote("http://www.sina.com.cn")      # 'http%3A//www.sina.com.cn'
urllib.parse.unquote("http%3A//www.sina.com.cn")  # 'http://www.sina.com.cn'
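
The case that really needs encoding is non-ASCII text. The sketch below encodes a Chinese keyword and decodes it again. It also shows the safe parameter of quote(), which defaults to "/" and therefore leaves slashes untouched unless told otherwise.

```python
import urllib.parse

keyword = "王者荣耀"

# Each character is encoded as its UTF-8 bytes, written as %XX
encoded = urllib.parse.quote(keyword)
decoded = urllib.parse.unquote(encoded)
print(encoded)  # %E7%8E%8B%E8%80%85%E8%8D%A3%E8%80%80
print(decoded)  # 王者荣耀

# With safe="" the slashes are escaped too
fully_encoded = urllib.parse.quote("http://www.sina.com.cn", safe="")
print(fully_encoded)  # http%3A%2F%2Fwww.sina.com.cn
```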

3. Simulating a browser: the Headers attribute

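When a site uses the request headers to block crawlers, we can set some Headers fields so that the request looks like it comes from a browser, which gets around the block.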

3.1 Modifying the headers with build_opener()

urlopen() does not support some of the more advanced HTTP features, so to modify the headers we can use urllib.request.build_opener().

import urllib.request

url = "https://blog.csdn.net/red_stone1/article/details/80999551"

headers = ("User-Agent","Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)")
# Create a custom opener object
opener = urllib.request.build_opener()
# Attach the header; every request made through this opener will carry it
opener.addheaders = [headers]
data = opener.open(url).read().decode("utf-8")

print(data)

3.2 Adding headers with add_header()

Besides the method above, the add_header() method of a urllib.request.Request object can also be used to simulate a browser.

import urllib.request

url = "https://blog.csdn.net/red_stone1/article/details/80999551"

# Add the header with add_header()
req = urllib.request.Request(url)
req.add_header('User-Agent','Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)')
data = urllib.request.urlopen(req).read()

print(data)
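
A Request object can be inspected before it is sent, which is handy for checking what add_header() stored. One detail worth knowing is that Request capitalizes header names, so "User-Agent" is stored as "User-agent". The sketch below runs without any network access; the User-Agent value is a made-up placeholder.

```python
import urllib.request

req = urllib.request.Request("http://www.baidu.com")
req.add_header("User-Agent", "Mozilla/5.0 (example)")

# Header names are stored capitalized, so look them up as "User-agent"
stored = req.get_header("User-agent")
exact_case = req.get_header("User-Agent")  # None: the lookup is case-sensitive
items = req.header_items()

print(stored)
print(exact_case)
print(items)
```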

4. Timeout settings

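Sometimes when we visit a page the site does not respond for a long time, and the system decides that the request has timed out, meaning the page cannot be opened.
Sometimes we also want to choose the timeout value ourselves. That is what the timeout parameter of urlopen() is for (the value is in seconds).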

import urllib.request

for i in range(1,100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=3)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred-->" + str(e))

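It is best to choose a reasonable timeout value: that way we obtain as much of the wanted information as possible while also reducing the load on the server being crawled.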
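
Catching the bare Exception hides what actually went wrong. As a hedged sketch, the helper below separates the two forms a timeout can take in urllib: a socket.timeout raised directly while waiting for the response, and a URLError whose reason is a socket.timeout when the connection itself timed out. The function name fetch_length is illustrative.

```python
import socket
import urllib.error
import urllib.request


def fetch_length(url, timeout=3.0):
    """Return the size of the response body in bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return len(resp.read())
    except socket.timeout:
        # Raised directly when the server is too slow to answer
        print("timed out while waiting for the response")
        return None
    except urllib.error.URLError as e:
        # Connection-phase timeouts arrive wrapped inside URLError.reason
        if isinstance(e.reason, socket.timeout):
            print("timed out while connecting")
        else:
            print("request failed:", e.reason)
        return None
```

A call such as fetch_length("http://yum.iqianyue.com", timeout=3) then returns either a length or None, and the loop around it no longer needs its own try block.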

5. HTTP requests in practice

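HTTP requests fall into several types. The six most commonly used ones and their main purposes are:
1) GET request: passes information through the URL. The information can be written directly in the URL or submitted from a form; form data is automatically converted into parameters in the URL.
2) POST request: submits data to the server. It is a mainstream way of passing data and, because the data travels in the request body rather than in the URL, a somewhat safer one. Logins, for example, usually send their data with POST.
3) PUT request: asks the server to store a resource, usually at a specified location.
4) DELETE request: asks the server to delete a resource.
5) HEAD request: asks only for the HTTP headers of the corresponding response.
6) OPTIONS request: returns the request types that the current URL supports.
Besides these six, there are also:
7) TRACE request: mainly used for testing or diagnostics.
8) CONNECT request: asks a proxy to open a tunnel to the target server, typically for HTTPS.

Because TRACE is used very rarely, it is not discussed further. GET and POST are used the most.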
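
In urllib the request type is decided by the Request object: with no data it is GET, with data it is POST, and the method parameter overrides both. A minimal sketch that builds the requests without sending them (the request bodies are placeholders):

```python
import urllib.request

url = "http://www.iqianyue.com/mypost/"

requests_by_type = [
    urllib.request.Request(url),                                 # no data: GET
    urllib.request.Request(url, data=b"name=hahaha"),            # data present: POST
    urllib.request.Request(url, method="HEAD"),                  # explicit method
    urllib.request.Request(url, data=b"content", method="PUT"),  # method overrides the POST default
    urllib.request.Request(url, method="DELETE"),
    urllib.request.Request(url, method="OPTIONS"),
]

methods = [r.get_method() for r in requests_by_type]
print(methods)  # ['GET', 'POST', 'HEAD', 'PUT', 'DELETE', 'OPTIONS']
```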

5.1 GET request example

import urllib.request

url = "http://www.baidu.com/s?wd="
key = "王者荣耀"
key = urllib.request.quote(key)
fullurl = url+key

request = urllib.request.Request(fullurl)
data = urllib.request.urlopen(request).read()
# print(data)

with open("../爬取数据存储/05_百度接口.html", "wb") as f:
    f.write(data)
    print("Saved.")

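The example above shows the approach for a GET request:
1) Build the URL. It contains the field names and field values of the GET request and follows the GET format, i.e. "http://address?field1=value1&field2=value2"
2) Build a Request object with that URL as the argument.
3) Open the Request object with urlopen().
4) Do any follow-up processing as needed, such as reading the page content or writing it to a file.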
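
When there is more than one field, urllib.parse.urlencode() builds the whole query string and takes care of the encoding. The sketch below covers steps 1 and 2 only and sends nothing. The helper name build_get_request and the second field "pn" are assumptions for illustration.

```python
import urllib.parse
import urllib.request


def build_get_request(base_url, params):
    """Build a GET Request for base_url with params encoded into the query string."""
    query = urllib.parse.urlencode(params)  # field1=value1&field2=value2, percent-encoded
    return urllib.request.Request(base_url + "?" + query)


# "pn" is a hypothetical second field, used only to show the & separator
req = build_get_request("http://www.baidu.com/s", {"wd": "王者荣耀", "pn": 10})
print(req.full_url)
print(req.get_method())  # GET
```

Passing req to urllib.request.urlopen() would then perform steps 3 and 4.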

5.2 POST request example

import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost/"

postdata = {
    "name":"hahaha",
    "pass":"xixixiix"}
postdata = urllib.parse.urlencode(postdata).encode("utf-8")

request = urllib.request.Request(url,postdata)
request.add_header("User-Agent","Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11")
data = urllib.request.urlopen(request).read()
with open("../爬取数据存储/07_post请求.html","wb") as f:
    f.write(data)
    print("Save successful.")

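To construct a POST request, the approach is:
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode.
3) Build a Request object whose arguments are the URL and the data to send.
4) Add header information to simulate a browser.
5) Open the Request object with urllib.request.urlopen() to send the data.
6) Do any follow-up processing, such as reading the page content or writing it to a file.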
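
Steps 1 to 4 can be wrapped in a small helper and checked without sending anything. This is a sketch; the function name build_post_request is illustrative. Note that the data must be bytes, which is why the urlencode() result is encoded.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form, user_agent):
    """Build a POST Request carrying form as URL-encoded bytes."""
    body = urllib.parse.urlencode(form).encode("utf-8")  # str -> bytes, as Request requires
    req = urllib.request.Request(url, data=body)         # data present, so the method is POST
    req.add_header("User-Agent", user_agent)
    return req


req = build_post_request(
    "http://www.iqianyue.com/mypost/",
    {"name": "hahaha", "pass": "xixixiix"},
    "Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11",
)
print(req.get_method())  # POST
print(req.data)          # b'name=hahaha&pass=xixixiix'
```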

6. Setting up a proxy server

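If the same IP address keeps crawling pages on the same site, the site's server will eventually block it. How do we get around this?
The solution is simple: hide the real origin of the requests by going through a proxy server.
Proxy IPs can be found online; the one I use most is http://www.xicidaili.com/ . Prefer entries with a short verification time, which are more likely to work. Entries with a long verification time may no longer be valid.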

import urllib.request

def use_proxy(proxy_addr, url):
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    # Use build_opener() to create a custom opener object
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # Use install_opener() to make it the global default opener, so that urlopen() also uses it
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data

proxy_addr = "120.27.14.125:80"
data = use_proxy(proxy_addr, "http://www.baidu.com")
# print(len(data))
# data is a str after decode(), so the file is opened in text mode
with open("../爬取数据存储/08_代理服务器设置.html", "w", encoding="utf-8") as f:
    f.write(data)
    print("Saved.")
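
install_opener() changes the behaviour of every later urlopen() call in the program. When only some requests should go through the proxy, one option is to keep the opener local and call its open() method directly. A minimal sketch, with make_proxy_opener as an illustrative name:

```python
import urllib.request


def make_proxy_opener(proxy_addr):
    """Return an opener that sends http requests through proxy_addr, without installing it globally."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(proxy)


opener = make_proxy_opener("120.27.14.125:80")

# The opener carries exactly the proxy mapping it was given
proxy_handlers = [h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

A request is then made with opener.open(url), and plain urlopen() calls elsewhere stay unaffected. Switching proxies is a matter of building another opener with a different address.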

7. DebugLog in practice

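Sometimes we want to print debug logs while the program runs. To do that, enable DebugLog.
The steps are:
1) Create urllib.request.HTTPHandler() and urllib.request.HTTPSHandler() with debuglevel set to 1.
2) Create an opener object with urllib.request.build_opener(), passing the two handlers from the previous step as arguments.
3) Install it as the global default opener with urllib.request.install_opener(), so that urlopen() automatically uses it.
4) Carry on with the rest of the work, such as calling urlopen().

The code is as follows: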

import urllib.request

httphd = urllib.request.HTTPHandler(debuglevel=1)
httpshd = urllib.request.HTTPSHandler(debuglevel=1)

opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)

data = urllib.request.urlopen("http://www.baidu.com").read()
print(data)

8. URLError in practice: a powerful tool for exception handling

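This section introduces two classes: URLError, and its subclass HTTPError.
Generally, a URLError is raised for one of the following reasons:
1) The server cannot be reached
2) The remote URL does not exist
3) There is no network connection
4) An HTTPError was triggered
Initial version, handling each case explicitly (HTTPError is caught before URLError because it is the subclass):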

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.blog.csdn.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
except Exception as e:
    print("An error occurred: " + str(e))

Optimized version, catching URLError once and checking which attributes are present:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
except Exception as e:
    print("An error occurred: " + str(e))
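
The relationship between the two classes can be checked without a network. In this sketch the exceptions are built by hand purely for illustration (in real code they are raised by urlopen()), and describe is a hypothetical helper name.

```python
import urllib.error


def describe(exc):
    """Summarize an exception, testing the more specific class first."""
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    return "other error: %s" % exc


# Hand-built exceptions, for illustration only
http_err = urllib.error.HTTPError("http://www.blog.csdn.net", 404, "Not Found", None, None)
url_err = urllib.error.URLError("no network")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe(http_err))  # HTTP 404: Not Found
print(describe(url_err))   # URL error: no network
print(hasattr(url_err, "code"))  # False: only HTTPError carries a status code
```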

Using cookies

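Even if we encode and submit the username and password, without cookies we would have to log in again when crawling a second page after login. HTTP is a stateless protocol, so the session information is lost as soon as we request a new page. To stay logged in, handle cookies as follows:
1) Import the cookie handling module http.cookiejar
2) Create a CookieJar object with http.cookiejar.CookieJar()
3) Create a cookie processor with HTTPCookieProcessor and pass it to build_opener() to build an opener object
4) Install it as the global default opener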

import urllib.request
import urllib.parse
import http.cookiejar

url = "http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LzIe6"
postdata = {"username":"weisuen", "password":"aA123456"}
postdata = urllib.parse.urlencode(postdata).encode("utf-8")
request = urllib.request.Request(url, postdata)
request.add_header('User-Agent', 'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11')

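# The CookieJar collects the cookies that the server sets during login
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
# Install globally so that later urlopen() calls send the same cookies
urllib.request.install_opener(opener)

data = opener.open(request).read()
with open("../爬取数据存储/11_cookie的使用.html", "wb") as f:
    f.write(data)
    print("Download complete!")

# Second page: fetched with plain urlopen(), which now carries the login cookies
url2 = "http://bbs.chinaunix.net/"
data2 = urllib.request.urlopen(url2).read()
with open("../爬取数据存储/11_cookie的使用2.html", "wb") as f:
    f.write(data2)
    print("Download complete!")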
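
To see what the site actually stored, the CookieJar can be iterated after the requests have been made. The sketch below wraps the setup in a helper and lists cookie names; make_cookie_opener and cookie_names are illustrative names. Nothing is sent here, so the jar starts out empty.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); cookies received through opener are collected in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def cookie_names(jar):
    """List the names of the cookies currently held in jar."""
    return [c.name for c in jar]


opener, jar = make_cookie_opener()
print(cookie_names(jar))  # [] until a response sets a cookie
```

After a login request through opener.open(), cookie_names(jar) would list whatever cookies the server set. An empty list at that point suggests the login did not succeed.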

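Summary:
1) Session information is commonly kept in one of two ways: in cookies, or in a session.
2) With a session, the information is stored on the server, but the server still sends the client a SessionID or similar, which is usually stored in a client-side cookie. When the user visits other pages, this value is read from the cookie and the server uses it to look up the client's session information and perform session control. In other words, even a session relies on cookies. At present, in most cases the information is still stored in cookies.
3) Cookie: the user's identity is determined from information recorded on the client.
Session: the user's identity is determined from information recorded on the server.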

Browser disguise techniques for crawlers

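To guard against malicious crawler traffic, some websites set up anti-crawling mechanisms, and their servers block crawlers. The common anti-crawling mechanisms are:
1. Analysing the Headers of the request
2. Detecting user behaviour, for example checking whether the same IP visits the site too frequently within a short time
3. Using dynamic pages to make crawling harder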

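Solutions:
1. The first mechanism is currently the most widely used. Most such sites check the "User-Agent" field of the request headers to identify the client, and some also check the "Referer" field. We can construct these headers in the crawler to disguise it as a browser. A simple disguise only needs the "User-Agent" field; for a close imitation of a browser, set all the common header fields in the crawler.
2. For the second mechanism, use proxy servers as covered earlier and switch between them frequently. This is usually enough to get past the restriction.
3. For the third mechanism, use tools such as selenium+phantomJS to get past the restriction.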

import urllib.request
import http.cookiejar
url = "https://stock.tuchong.com/"

headers = {
    # Host should match the host part of url
    "Host": "stock.tuchong.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive"}
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# addheaders expects a list of (name, value) tuples
opener.addheaders = list(headers.items())
urllib.request.install_opener(opener)

data = urllib.request.urlopen(url).read()
with open("../爬取数据存储/爬虫的浏览器伪装技术.html", "wb") as f:
    f.write(data)
    print("Download complete!")
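
The "Referer" field mentioned above usually differs from page to page, so it fits better on the individual Request than on the opener. The following sketch builds such a request without sending it. The names BROWSER_HEADERS and browser_request are illustrative, and using the site's own home page as the Referer is only an assumption for the example.

```python
import urllib.request

# Headers shared by every request; values taken from a desktop Chrome browser
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}


def browser_request(url, referer=None):
    """Build a Request that carries browser-like headers and, optionally, a Referer."""
    headers = dict(BROWSER_HEADERS)  # copy, so the shared dict is never modified
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


# Assumed Referer: the page a browser would plausibly have come from
req = browser_request("https://stock.tuchong.com/", referer="https://stock.tuchong.com/")
print(req.header_items())
```

The request can then be opened with urllib.request.urlopen(req) or with an opener that also handles cookies.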