A simple crawler exercise using requests + BeautifulSoup
This is the 17th original article in my daily Python learning series.
After covering the BeautifulSoup library in the last post, this post puts that knowledge to use by crawling today's target: the Maoyan Movies Top 100. The site is fairly easy, so you can try crawling it yourself first and come back to this article if you run into problems, haha.
This post is just a practice exercise, nothing more, so experts can feel free to skip it!
1. Libraries and website used in this post
2. Analysis of the target website
The information we want is easy to find: the five arrows in the screenshot point at everything we need about each movie, namely the poster image URL, the title, the starring actors, the release time, and the rating. With the content located, the next step is to get the link to the next page.
There are two ways to do this: collect the links to all pages from the first page, or get the next-page link from each page in turn. Since the pager only shows links to some of the pages, we follow the next-page link, which is easier.
OK, analysis done; on to the code.
3. Writing the code
Without worrying about anything else, start by sending a GET request.
    import requests
    from bs4 import BeautifulSoup

    url_start = 'http://maoyan.com/board/4'
    response = requests.get(url_start)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
    print(response.text)
Output:
Surprised? If you have written many crawlers, this will not surprise you: we have been blocked by the site's anti-crawling measures. Let's try adding a request header.
    url_start = 'http://maoyan.com/board/4'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
    response = requests.get(url_start, headers=headers)
Now the page comes back normally. Many sites base their anti-crawling checks on the request headers, so when you get blocked, don't panic; try adding request headers first.
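To see why the header makes a difference, it can help to look at what requests sends by default. The sketch below builds both requests without sending them and prints the User-Agent each one would carry. That the site rejects the default value is an assumption based on the behaviour described above, not something this snippet proves.

```python
import requests

URL = 'http://maoyan.com/board/4'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}

# How requests identifies itself when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Prepare the requests without sending them, so nothing touches the network.
session = requests.Session()
bare = session.prepare_request(requests.Request('GET', URL))
browser_like = session.prepare_request(requests.Request('GET', URL, headers=HEADERS))

print(bare.headers['User-Agent'])          # the library's own identifier
print(browser_like.headers['User-Agent'])  # the browser string from HEADERS
```

The first line printed starts with python-requests/, which is an easy value for a server to filter on.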
Next, use BeautifulSoup to extract the content.
    imgs = soup.select('dd .board-img')  # the image tags that hold the poster links
    titles = soup.select('dd .board-item-main .name')  # the movie titles
    starses = soup.select('dd .board-item-main .movie-item-info .star')  # the starring actors
    times = soup.select('dd .board-item-main .movie-item-info .releasetime')  # the release times
    scores = soup.select('dd .board-item-main .score-num')  # the ratings
Each select statement returns one kind of information for all the movies on the page, so unlike with a regular expression, you do not get all the details of one movie in a single match. For example, the image selector returns the image links for every movie on this page, and we have to pick them out one by one when storing them. Here I use a for loop from 0 to 9 and store the items at the same index in the same dictionary.
    films = []  # information about every film on one page
    for x in range(0, 10):
        # the image link is stored in an attribute
        img = imgs[x]['data-src']
        # the rest read the tag's text and strip whitespace at both ends
        title = titles[x].get_text().strip()
        stars = starses[x].get_text().strip()[3:]  # the slice removes the leading "starring" label
        time = times[x].get_text().strip()[5:]  # the slice removes the leading "release time" label
        score = scores[x].get_text().strip()
        film = {'title': title, 'img': img, 'stars': stars, 'time': time, 'score': score}
        films.append(film)
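An alternative to indexing five parallel lists is to loop over each dd element and select inside it, so that one movie's fields stay together. The sketch below is not the code from this post: the HTML fragment is made up to mirror the class names used in the selectors above (the real page markup may differ), parse_films is a hypothetical helper name, and it uses 'html.parser' instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mirrors the class names used in this post.
SAMPLE_HTML = """
<dl>
  <dd>
    <img class="board-img" data-src="http://example.com/a.jpg">
    <div class="board-item-main">
      <div class="movie-item-info">
        <p class="name">Film A</p>
        <p class="star">主演：Actor One,Actor Two</p>
        <p class="releasetime">上映时间：1993-01-01</p>
      </div>
      <p class="score-num"><i class="integer">9.</i><i class="fraction">5</i></p>
    </div>
  </dd>
  <dd>
    <img class="board-img" data-src="http://example.com/b.jpg">
    <div class="board-item-main">
      <div class="movie-item-info">
        <p class="name">Film B</p>
        <p class="star">主演：Actor Three</p>
        <p class="releasetime">上映时间：1994-09-10</p>
      </div>
      <p class="score-num"><i class="integer">9.</i><i class="fraction">4</i></p>
    </div>
  </dd>
</dl>
"""


def parse_films(html):
    """Return one dictionary per <dd>, with the same keys as the loop above."""
    soup = BeautifulSoup(html, 'html.parser')
    films = []
    for dd in soup.select('dd'):
        films.append({
            'title': dd.select_one('.name').get_text().strip(),
            'img': dd.select_one('.board-img')['data-src'],
            # same slices as above: drop the 3- and 5-character labels
            'stars': dd.select_one('.star').get_text().strip()[3:],
            'time': dd.select_one('.releasetime').get_text().strip()[5:],
            'score': dd.select_one('.score-num').get_text().strip(),
        })
    return films


for film in parse_films(SAMPLE_HTML):
    print(film)
```

Because every lookup happens inside a single dd, this version does not depend on there being exactly ten movies on a page.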
The next step is to get the link to the next page.
    pages = soup.select('.list-pager li a')  # the link to the next page is in the last <a> tag
    page = pages[-1]['href']
The rest is simple: loop over the pages and extract the content from each one in the same way, so I won't post that code here.
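For anyone who wants a starting point, here is a rough sketch of what such a loop might look like. It is not the code from the repository linked below. The names next_page_url and crawl are hypothetical, the two fake pages exist only so the loop can run offline, and the hrefs being relative (for example ?offset=10) is an assumption about the markup. The cap of 10 pages follows from 100 movies at 10 per page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL behind the last pager link, or None if there is no pager."""
    soup = BeautifulSoup(html, 'html.parser')
    pages = soup.select('.list-pager li a')
    if not pages:
        return None
    # The href may be relative, so resolve it against the current URL.
    return urljoin(current_url, pages[-1]['href'])


def crawl(start_url, fetch, max_pages=10):
    """Follow pager links from start_url; fetch is any callable mapping a URL to HTML text."""
    visited = []
    url = start_url
    # Stop when there is no link, when a link leads back to a seen page, or at the cap.
    while url and url not in visited and len(visited) < max_pages:
        html = fetch(url)
        visited.append(url)
        # ... extract the film details from html here ...
        url = next_page_url(html, url)
    return visited


# Fake two-page site, used only to exercise the loop without network access.
FAKE_PAGES = {
    'http://example.com/board/4':
        '<ul class="list-pager"><li><a href="?offset=10">Next</a></li></ul>',
    'http://example.com/board/4?offset=10':
        '<ul class="list-pager"><li><a href="?offset=10">2</a></li></ul>',
}

print(crawl('http://example.com/board/4', FAKE_PAGES.__getitem__))
```

Against the real site, fetch could be something like lambda u: requests.get(u, headers=headers).text. The visited check matters because on the last page the final pager link may no longer point forward.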
Final words
This was a small exercise with the BeautifulSoup library. It did not use much of yesterday's material, only the selectors and the parts for getting text content and attributes. I still feel regular expressions are a bit handier, haha: a single regex can capture all the details of each movie, as follows:
<dd>.*?board-index.*?>([\d]{1,3})</i>.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>
You also need to pass the re.S matching flag so that . matches newlines as well; then it works. So I recommend using regular expressions, haha.
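As a small illustration of that regex and the re.S flag, the sketch below runs the pattern against a made-up fragment. The sample HTML is an assumption shaped to fit the pattern, not a copy of the real page, and the leading labels are sliced off the same way as in the BeautifulSoup version.

```python
import re

# The pattern from above, split across lines for readability; re.S lets . match newlines.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d{1,3})</i>'
    r'.*?title="(.*?)"'
    r'.*?class="star">(.*?)</p>'
    r'.*?class="releasetime">(.*?)</p>'
    r'.*?class="integer">(.*?)</i>'
    r'.*?class="fraction">(.*?)</i>',
    re.S,
)

# Made-up fragment shaped to fit the pattern.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Film A"></a>
  <p class="star">
    主演：Actor One,Actor Two
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <a href="/films/2" title="Film B"></a>
  <p class="star">
    主演：Actor Three
  </p>
  <p class="releasetime">上映时间：1994-09-10</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">4</i></p>
</dd>
"""

films = []
for index, title, stars, time, integer, fraction in PATTERN.findall(SAMPLE_HTML):
    films.append({
        'index': index,
        'title': title,
        'stars': stars.strip()[3:],  # drop the 3-character "starring" label
        'time': time.strip()[5:],    # drop the 5-character "release time" label
        'score': integer + fraction,
    })

for film in films:
    print(film)
```

Without re.S the same pattern finds nothing in this fragment, because the pieces of each movie sit on different lines.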
Check out my GitHub if you need the full code, haha!
GitHub: https://github.com/SergioJune/gongzhonghao_code/blob/master/python3_spider/index.py
If this article was useful to you, how about a like and a retweet?