Once the goal is clear, the next step is deciding how to get there. The idea behind a web crawler is simple: visit a page, pull down the site's source code, and locate the target strings through XPath or HTML node traversal.
Based on the previous post, [Python][教學] 網路爬蟲(crawler)實務(上)--網頁元件解析分析, our crawling strategy will roughly be:
- Go to the search page > find the shop URLs > go to each shop page > extract the data
Breaking that flow down into logic that maps more directly onto a crawler program:
- Go to the search results page
- The search results span multiple pages, so use a URL parameter to fetch n of them in one pass (n search pages)
- Parse the shop URLs out of each search page; each page yields m shop URLs (m shop pages per search page)
- Visit each shop URL and parse out the fields we need (n * m shop pages in total)
Enough talk; here is the code:
```python
## import the required packages
import requests
from bs4 import BeautifulSoup
import time
from random import randint
import sys
from IPython.display import clear_output

## grab the shop URLs from the search pages
## (the phone numbers on the search page are images, which are hard to scrape)
links = ['http://www.ipeen.com.tw/search/all/000/1-100-0-0/?p=' + str(i + 1) + '&adkw=東區&so=commno' for i in range(10)]
shop_links = []
for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text.encode("utf-8"))
    shop_table = soup.findAll('h3', {'class': 'name'})
    ## pull out the URLs wrapped inside the <a> tags
    for shop_link in shop_table:
        link = 'http://www.ipeen.com.tw' + [tag['href'] for tag in shop_link.findAll('a', {'href': True})][0]
        shop_links.append(link)
    ## take a short nap so we don't get blocked
    time.sleep(1)

## build the header row for the output file
title = "shop" + "," + "category" + "," + "tel" + "," + "addr" + "," + "cost" + "," + "rank" + "," + "counts" + "," + "share" + "," + "collect"
shop_list = open('shop_list.txt', 'w')
## write the header first
shop_list.write(title.encode('utf-8') + "\n")

## visit each shop page and pull out the fields we need
for i in range(len(shop_links)):
    res = requests.get(shop_links[i])
    soup = BeautifulSoup(res.text.encode("utf-8"))
    header = soup.find('div', {'class': 'info'})
    shop = header.h1.string.strip()
    ## exception handling: not every page provides every field
    try:
        category = header.find('p', {'class': 'cate i'}).a.string
    except Exception as e:
        category = ""
    try:
        tel = header.find('p', {'class': 'tel i'}).a.string.replace("-", "")
    except Exception as e:
        tel = ""
    try:
        addr = header.find('p', {'class': 'addr i'}).a.string.strip()
    except Exception as e:
        addr = ""
    try:
        cost = header.find('p', {'class': 'cost i'}).string.split()[1]
    except Exception as e:
        cost = ""
    try:
        rank = header.find('span', {'itemprop': 'average'}).string
    except Exception as e:
        rank = ""
    try:
        counts = header.find_all('em')[0].string.replace(',', '')
    except Exception as e:
        counts = ""
    try:
        share = header.find_all('em')[1].string.replace(',', '')
    except Exception as e:
        share = ""
    try:
        collect = header.find_all('em')[2].string.replace(',', '')
    except Exception as e:
        collect = ""
    ## join the fields with commas (there is probably a better way, but this will do for now)
    result = shop + "," + category + "," + tel + "," + addr + "," + cost + "," + rank + "," + counts + "," + share + "," + collect
    shop_list.write(result.encode('utf-8') + "\n")
    ## sleep for a random interval
    time.sleep(randint(1, 5))
    ## show progress in the notebook (Python 2 print statement)
    clear_output()
    print i
    sys.stdout.flush()

shop_list.close()
```
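Once the run finishes, it is worth spot-checking `shop_list.txt` before doing anything else with it. A minimal sketch (nothing iPeen-specific here; it only assumes the file written above sits in the working directory):

```python
# Print the first few rows of the naive comma-separated output.
# A field that itself contains a comma would break this simple split,
# which is one reason to consider the csv module (see the sketch after the notes below).
with open('shop_list.txt') as f:
    for row in list(f)[:5]:
        print(row.strip().split(','))
```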
The crawl flow this time is simple, but there are still a few things worth noting:
- time.sleep: this run fetches n * m pages in total. Firing off a large number of requests in a short time eats up the target site's resources and can affect its operation, so a well-mannered crawler sets a sleep interval to avoid putting extra load on the other side's server.
- try/except: when you scrape a large number of fields automatically, you have to assume that not every page provides every field you want. Wrapping each extraction in exception handling keeps one missing field from crashing the whole program (see the first sketch after this list).
- More general XPath (selector) settings: if you only scrape one or two pages, almost any XPath will do, and you can even grab a tag simply by counting its position. When you scrape many pages, though, the number of nodes can differ from page to page, so read the source of several pages, find where each tag sits in the structure, and anchor on that fixed position to avoid pulling the wrong field (see the second sketch after this list).
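On the try/except point, and on the "there is probably a better way" remark next to the comma-joining in the script: one common pattern is a small helper that falls back to an empty string when a node is missing, paired with the standard `csv` module so a comma inside a field can no longer corrupt a row. This is only a minimal sketch, not the original script: it assumes Python 3, reuses the `shop_links` list and the class names from the code above, keeps just a few of the fields, and `safe_text` is an illustrative name rather than a library function. Note that `get_text` on the whole `<p>` can also pick up label text, so you may still want to drill into the inner `<a>` as the original does.

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

def safe_text(node, default=""):
    # Return a tag's stripped text, or the default when the tag was not found (None).
    return node.get_text(strip=True) if node is not None else default

with open('shop_list.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['shop', 'category', 'tel', 'addr', 'rank'])
    for url in shop_links:  # the list collected by the script above
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        header = soup.find('div', {'class': 'info'})
        if header is None:  # page without the expected layout: skip it
            continue
        writer.writerow([
            safe_text(header.h1),
            safe_text(header.find('p', {'class': 'cate i'})),
            safe_text(header.find('p', {'class': 'tel i'})).replace('-', ''),
            safe_text(header.find('p', {'class': 'addr i'})),
            safe_text(header.find('span', {'itemprop': 'average'})),
        ])
        time.sleep(1)  # stay polite, as in the main script
```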
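On the selector point: the counters in the script are picked out positionally with `find_all('em')[0]`, `find_all('em')[1]`, and so on, which silently grabs the wrong value whenever a page has a different number of `<em>` tags. Below is a sketch of the more anchored style, assuming (hypothetically; check the real markup on a few shop pages) that each counter's `<em>` sits inside a parent element with its own class:

```python
# Positional: grabs whatever <em> happens to come first; wrong if the count or order differs.
counts = header.find_all('em')[0].string

# Anchored: only matches the <em> inside the element whose class identifies the counter.
# 'span.counter-comment' is a made-up selector for illustration; replace it with the
# stable parent class you actually find when inspecting several shop pages.
counts_tag = header.select_one('span.counter-comment em')
counts = counts_tag.get_text(strip=True) if counts_tag is not None else ""
```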
Source: Bryan的行銷研究及資料分析筆記