HTTP

URI和URL

URI:资源标志符
URL:统一资源定位符

URI包括URL。

请求

请求方式
请求URL
请求头
请求体

get请求参数包含在URL里，数据可以在URL看到，post不会，以表单显示传输。
get提交数据最多1024字节，post无限制。

响应

响应码
响应头
响应体

100需要继续提交请求
200成功
305需要使用代理
404未找到
500服务器错误

urllib库

请求库的4个模块：

request：模拟发送请求
error：异常处理
parse：拆分解析合并url
robotparser：识别网站的robot.txt文件，判断哪些网站可以爬取，使用较少

发送请求

urlopen

API：

urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,cadefault=False,context=None)

data：请求参数
timeout：设置超时时间
context：ssl类型
cafile和capath：CA证书和证书路径
cadefault：已经弃用，默认false

详细信息参考：https://docs.python.org/3/library/urllib.request.html

url参数

得到的是一个HTTPResponse对象，包含read()，readinto()，getheader(name)，getheaders()，fileno()等方法。

#显示网站的源代码
import urllib.request
reponse = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

返回结果的状态码，响应头以及获取响应头的Server值：

print(response.status)
#404

print(response.getheaders())
#一系列参数

print(response.getheader('Server'))
#获取Server参数对应的值nginx

data参数

如果data参数内容为字节流编码格式的内容，即bytes类型，则需要通过bytes方法转化。如果传递了data参数，则请求方式不再是get，而是post。urlencode将字节流转为str类型，用httpbin.org提供http请求测试。

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response=urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

运行结果如下，form表单内容为”word”:”hello”。

{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "110.184.21.66, 110.184.21.66", \n  "url": "https://httpbin.org/post"\n}\n'

timeout参数

用于设置超时时间，单位秒。如果超过这个时间，还没有得到相应，就会抛出异常。如果不指定，则为默认时间。

response=urllib.request.urlopen('http://httpbin.org/get',timeout=1)

通过try,except语句来获取异常，并跳过该网页的抓取。

import socket
import urllib.request
import urllib.error
try:
    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print(e.reason)

Request

urlopen方法可以实现基本请求发起，但不足以构建一个完整的请求。还需要加入Header等信息，需要使用request来构建。

API：

class urllib.request.Request(url,data=None,header={},origin_req_host=None,unverifiable=False,method=None)

url：请求链接
data：请求参数，字节流类型
headers：请求头，常常用来伪装浏览器类型
origin_req_host：请求方的host名或IP地址
unverifiable：表示这个请求是否是无法验证的，默认false。如果请求图片，我们没有自动抓取图像的权限，这时设置为true。
method：请求方法：post,get,put.

import urllib.request
request=urllib.request.Request('https://python.org')
response=urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

传入参数：

from urllib import request,parse
url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host':'httpbin.org'
}
dict={
    'name':'Germey'
}
data=bytes(parse.urlencode(dict),encoding='utf-8')
req=request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))

结果为：

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
  }, 
  "json": null, 
  "origin": "110.184.21.66, 110.184.21.66", 
  "url": "https://httpbin.org/post"
}

headers也可以用add_header方法添加。

req=request.Request(url=url,data=data,headers=headers,method='POST')
req.add_header('User-Agent','Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

高级用法Handler

可用于登录验证，cookie处理，代理设置。

HTTPDefaultErrorHandler:用于处理HTTP响应错误，抛出异常
HTTPRedirectHandler:用于处理重定向
HTTPCookieProcessor:用于处理Cookies
ProxyHandler:用于设置代理，默认代理为空
HTTPPasswordMgr:用于管理密码，它维护了用户名和密码的表
HTTPBasicAuthHandler:用于管理认证

验证

from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError
username='username'
password='password'
url='http://localhost:5000/'
p=HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler=HTTPBasicAuthHandler(p)
opener=build_opener(auth_handler)
try:
    result=opener.open(url)
    html=result.read().decode('utd-8')
    print(html)
except URLError as e:
    print(e.reason)

代理

from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener
proxy_handler=ProxyHandler(
{
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener=builder_opener(proxy_handler)
try:
    response=opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

获取网站的cookie:

# -*- coding: utf-8 -*-
import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
#保存为文件
filename='cookies.txt'
cookie=http.cookiejar.MozillaCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
#打开cookie文件,输出网页源码
cookie=http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt',ignore_discard=True,ignore_expires=True)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

异常处理

URLError

from urllib import request,error
try:
    response=request.urlopen('https://www.baidu.org')
except error.URLError as e:
    print(e.reason)

HTTPError

code:返回状态码
reason:错误原因
headers:请求头

from urllib import request,error
try:
    response=request.urlopen('https://joeyos.github.io/index1.html')
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)

解析链接

urllib库提供了parse模块，用于处理URL的标准接口，实现url的抽取、合并和链接转换。

urlparse

实现url的识别和分段：

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result)

url被拆分为6部分：协议，域名，路径，参数和查询条件，结果为：

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

urlunparse

将拆分的url进行合并:

from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
#输出：http://www.baidu.com/index.html;user?a=6#comment

urlsplit

这个方法和urlparse方法类似，只不过不在单独解析params部分，返回五个结果，params合并到path中：

from urllib.parse import urlsplit
result=urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

返回某个属性：

#返回scheme属性值
print(result.scheme)

urlunsplit

合并url。

urljoin

字符串拼接

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','about.html'))

urlencode序列化

#输出http://www.baidu.com?name=germey&age=22
from urllib.parse import urlencode
params={
    'name':'germey',
    'age':22
}
base_url='http://www.baidu.com?'
url=base_url+urlencode(params)
print(url)

parse_qs反序列化

将序列还原为字典:

#{'name': ['germey'], 'age': ['22']}
from urllib.parse import parse_qs
query='name=germey&age=22'
print(parse_qs(query))

parse_qsl类似，转化为元组：

[('name', 'germey'), ('age', '22')]

quote转化为url编码

#https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
from urllib.parse import quote
keyword='壁纸'
url='https://www.baidu.com/s?wd='+quote(keyword)
print(url)

unquote解码

from urllib.parse import unquote
url='https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url)

Robots协议

Robots协议也称为爬虫协议，机器人协议，告诉爬虫和搜索引擎哪些页面可以抓取，哪些不可以抓取。通常是一个叫做robots.txt的文本文件，一般放在网站的根目录下。

当搜索爬虫访问一个站点时，首先检测是否存在robots.txt，如果存在，搜索爬虫会根据其中定义的爬取范围来爬取。如果没有，搜索爬虫会访问所有的可访问页面。

robots.txt样例：

User-agent为*则指明对任何爬虫有效，Baiduspider则只对百度爬虫有效
Disallow指定不允许爬取的目录
Allow指定可以爬取目录

User-agent:*
Disallow:/
Allow:/public/

robotparser解析

urllib.robotparser.RobotFileParser(url='')

判断网页是否可以抓取：

from urllib.robotparser import RobotFileParser
rp=RobotFileParser('http://www.jianshu.com/robots.txt')
#rp.set_url('http://www.jianshu.com/robots.txt')
#rp.read()
print(rp.can_fetch('*','https://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*','https://www.jianshu.com/search?q=python&page=1&type=collections'))

使用requests

urllib可以处理网页验证和cookie时，需要写opener和handler来处理。为了更加方便地实现这些操作，就有了更为强大的库requests，可以处理cookies、登录验证和代理设置。

urllib的urlopen方法实际是get方式，对应requests的get方法。

import requests
r=requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

get请求附加信息：

import requests
r=requests.get("http://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))

抓取知乎网页

import requests
import re
headers={
    'User-Agent':'Mozilla/5.0(Macintosh;Tntel Mac OS X 10_11_4)AppleWebKit/537.36(KHTML,like Gecko)Chrome/52.0.2743.116 Safari/537.36'
}
r=requests.get("https://www.zhihu.com/explore",headers=headers)
pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)
titles=re.findall(pattern,r.text)
print(titles)

输出：

['\n如何看待张云雷粉丝称呼张云雷为钢铁侠？\n', '\n为什么现在博士后工资比讲师工资高?\n', '\n《复仇者联盟 4》之后，小辣椒和摩根该怎么继续过？\n', '\n有哪些“这也能用数学证明”的事件？\n', '\n不接待生客的南京日料店柒本味老板手艺大概什么水平？\n', '\n战机的后掠翼，变后掠翼，三角翼，鸭翼各自的特点是什么？\n', '\n民国的军阀真的如课本所说的那么十恶不赦吗？\n', '\n为什么说迦勒底对玛修的实验是反人道的？\n', '\n朱一龙真的很帅吗，为什么我get不到他的颜?\n', '\n癌症真的治不好吗？既然一些癌症治不好，我们为什么还要去治呢？\n']

抓取二进制数据（图片音频）

import requests
r=requests.get("https://github.com/favicon.ico")
with open('favicon.ico','wb') as f:
    f.write(r.content)

添加headers

与urllib.request一样，我们也可以通过headers参数传递头信息。

import requests
r=requests.get("https://www.zhihu.com/explore")
print(r.text)
#报错400
#添加headers信息
headers={
    'User-Agent':'Mozilla/5.0(Macintosh;Intel Mac OS X 10_11_4)AppleWebKit/537.36(KHTML,like Gecko)Chrome/52.0.2743.116 Safari/537.36'
}
r=requests.get("https://www.zhihu.com/explore",headers=headers)
print(r.text)

POST请求

import requests
data={'name':'germey','age':'22'}
r=requests.post("http://httpbin.org/post",data=data)
print(r.text)

输出：

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "22", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "json": null, 
  "origin": "110.184.21.66, 110.184.21.66", 
  "url": "https://httpbin.org/post"
}

响应

import requests
headers={
    'User-Agent':'Mozilla/5.0(Macintosh;Intel Mac OS X 10_11_4)AppleWebKit/537.36(KHTML,like Gecko)Chrome/52.0.2743.116 Safari/537.36'
}
r=requests.get('http://www.jianshu.com',headers=headers)
print(type(r.status_code),r.status_code)
print(type(r.headers),r.headers)
print(type(r.cookies),r.cookies)
print(type(r.url),r.url)
print(type(r.history),r.history)
exit() if not r.status_code == requests.codes.ok else print('Request successfully...')

输出：

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'Tengine', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Date': 'Sat, 04 May 2019 13:42:15 GMT', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'ETag': 'W/"1a7ba402c73b43a9553b6d3c8e1a1421"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Set-Cookie': 'locale=zh-CN; path=/', 'X-Request-Id': '2cfaeafe-724f-46dd-bf6d-b0e8c8e66268', 'X-Runtime': '0.009006', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'Via': 'cache18.l2cm12[20,0], cache6.cn389[71,0]', 'Timing-Allow-Origin': '*', 'EagleId': '7d412b4615569773358857835e'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie locale=zh-CN for www.jianshu.com/>]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]

文件上传

import requests
files = {'file':open('favicon.ico','rb')}
r = requests.post("https://httpbin.org/post",files=files)
print(r.text)

获取cookies并登录

相比urllib处理cookies，requests只需要一步：

import requests
r=requests.get("https://www.baidu.com")
print(r.cookies)
for key,value in r.cookies.items():
    print(key+'='+value)

输出:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

输入cookies用于登录网站：

Network/all/Request Headers
注意浏览器版本User-Agent对应一致

import requests
headers={
    'Cookie':'_zap=733828ac-690c-4683-91c2-495692901352; __DAYU_PP=rJFUIBJrfV3mEeE3YUUE2ba59e567729; d_c0="AHDvOUvAYg2PThC78RHrST8Ih6H00jS2I_I=|1522745376"; __utma=51854390.1865320984.1527746480.1527746480.1527746480.1; __utmv=51854390.100-1|2=registration_date=20160404=1^3=entry_date=20160404=1; _xsrf=m5Ad8kONH2cqJf8JBspvv9SOwtTkWn06; q_c1=732964db340943c082141363766ca584|1556973839000|1520334961000; l_n_c=1; l_cap_id="MWVlNWNkM2VlODk4NDRmZjg3ZTVlZWQwZmRkMmE1NGM=|1556976273|81a6aafd6382829d0369b15f53e6b61b798232f1"; r_cap_id="ZjNiZjlkYWZkNzMzNGQ2NmI2NDEzZGYxNzAzZGYyZDI=|1556976273|775239ffb70f2dd82ad7f90a557cd830d452410c"; cap_id="MmRmYTVjM2I5ZGM5NDE2YTkzMTA4MmU3Yjk0M2FkMTk=|1556976273|8eba0ec1ef91d3a9e41c1c94a7212f429a7a3ddd"; n_c=1; tgw_l7_route=73af20938a97f63d9b695ad561c4c10c; capsion_ticket="2|1:0|10:1556978974|14:capsion_ticket|44:ODE4ZWI3NWIyN2U5NGE5NWFhYzVmNTRhYmY3NjQ4MjI=|c32f5c899f064403b630e52dbb90bd95858e70e2046966b9f9627169e71269a6"; z_c0="2|1:0|10:1556978977|4:z_c0|92:Mi4xR2VIVEFnQUFBQUFBY084NVM4QmlEU1lBQUFCZ0FsVk5JZXU2WFFBYnpVZk9NcUZxRmxlODF6RG5IVHVndWlBN2N3|b40b2b3e64f7a602e4a0276d5358b2ceef0e6314b45dd7ed6487e9af84f5e264"; tst=r',
    'Host':'www.zhihu.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
r=requests.get('https://www.zhihu.com',headers=headers)
print(r.text)

会话维持session

get和post可用于网页请求，但属于两个不同的会话。

第一个请求post用于登录，第二次想get获取登录后的个人信息，两个不相关的会话，不能成功获取信息。可以用session维持会话。

import requests
s = requests.Session()
s.get('https://httpbin.org/cookies/set/number/123456')
r=s.get('http://httpbin.org/cookies')
print(r.text)

输出：

{
  "cookies": {
    "number": "123456"
  }
}

证书验证

import requests
from requests.packages import urllib3
#屏蔽警告
urllib3.disable_warnings()
#输出状态码
r = requests.get('https://www.12306.cn',verify=False)
print(r.status_code)

代理设置

对于大规模爬取，会跳出登录认证界面或者直接进行IP封禁，一段时间无法访问，这里需要进行代理。

import requests
#代理地址无效，仅供示例参考
proxies={
    "http":"http:10.10.1.10:3228",
    "http":"http://10.10.1.10:1080",
}
requests.get("https://www.taobao.com",proxies=proxies)

还支持socks协议代理，需要安装socks库：

pip3 install 'requests[socks]'

import requests
proxies={
    'http':'socks5://user:password@host:port',
    'https':'sock5://user:password@host:port'
}
requests.get("https://taobao.com",proxies=proxies)

超时设置

import requests
r = requests.get("https://www.taobao.com",timeout=1)
print(r.status_code)

身份认证

import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://localhost:5000',auto=HTTPBasicAuth('username','password'))
print(r.status_code)

requests还提供了其他认证方式，如OAuth认证，需要安装oauth包：

pip3 install requests_oauthlib

Prepared Request

urllib可以将请求表示为数据结构，在requests同样可以。

from requests import Request,Session
url='http://httpbin.org/post'
data={
    'name':'germey'
}
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
s=Session()
req=Request('POST',url,data=data,headers=headers)
prepped=s.prepare_request(req)
r=s.send(prepped)
print(r.text)

输出:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
  }, 
  "json": null, 
  "origin": "110.184.21.66, 110.184.21.66", 
  "url": "https://httpbin.org/post"
}

有了Request对象，就可以将请求当做独立的对象来看待，这样在进行队列调度时会非常方便。

正则表达式

http://tool.oschina.net/regex/

\w 匹配字母、数字以及下划线
\W 匹配不是字母、数字下划线
\s 匹配任意空白字符，等价于[\t\n\r\f]
\S 匹配任意非空字符
\d 匹配任意数字，等价于[0-9]
\D 匹配任意非数字的字符
\A 匹配字符串开头
\Z 匹配字符串结尾，只匹配换行前的字符串
\z 匹配字符串结尾，同时匹配换行符
\G 匹配最后匹配完成的位置
\n 匹配换行符
\t 匹配制表符
^  匹配一行字符串的开头
$  匹配一行字符串的结尾
.  匹配任意字符，除了换行符
[...] 用来表示一组字符，单独列出。[abc]表示匹配a,b,c
[^...] 匹配不在[]中的字符，即除了abc之外的字符
*  匹配0个或多个表达式
+  匹配1个或多个表达式
?  匹配0个或1个前面的表达式
{n} 精确匹配n个前面的表达式
{n,m} 匹配n到m次前面的表达式
a|b 匹配a或b
() 匹配表达式
.* 贪婪匹配
.*? 非贪婪匹配

match匹配

只能从开头开始匹配。

import re
content = 'joeyos@qq.com,abc@qq.com'
res = re.match('[\w!#$%&\'*+/=?^_`{|}~-]+(?:\.[\w!#$%&\'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?',content)
print(res)   

排除网页换行符的影响：

res = re.search(rule,content,re.S)

search匹配

任意地方匹配：

import re
content = 'https://joeys.githu.io,joeyos@qq.com,abc@qq.com'
res = re.search('[\w!#$%&\'*+/=?^_`{|}~-]+(?:\.[\w!#$%&\'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?',content)
print(res)