A simple crawler exercise using requests + BeautifulSoup


This is the 17th original article in my daily Python learning series.

Having covered the BeautifulSoup library in the last post, this post puts that knowledge to use by crawling today's target: the Maoyan Movies Top 100 board. The site is fairly easy, so you can try crawling it yourself first and come back to this article if you run into problems, haha.

This post is just an exercise, nothing more, so experts, feel free to skip it!

1. Libraries and website used in this article

  • requests
  • BeautifulSoup
  • Target website: http://maoyan.com/board/4

2. Analysis of the target website

The information we want is easy to find. The five items marked with arrows are everything we need for each movie: the poster image URL, the title, the starring actors, the release date, and the rating. With the content located, the next step is to find the link to the next page.

There are two ways to do this: collect the links to all pages from the first page, or take the link to the next page from each page in turn. Since the pager only shows links to some of the pages, we take the next-page link from each page, which is easier.

OK, analysis done; on to the code.

3. Writing the code

Without worrying about anything else, send a GET request first.

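import requests
from bs4 import BeautifulSoup

url_start = 'http://maoyan.com/board/4'
response = requests.get(url_start)
if response.status_code == 200:
    # Parse the page only when the request succeeded
    soup = BeautifulSoup(response.text, 'lxml')
# Print the raw HTML to see what the server actually sent back
print(response.text)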

The output:

Surprised? If you have written a lot of crawlers, this is no surprise: we have run into an anti-crawling measure. Let's try adding a request header.

url_start = 'http://maoyan.com/board/4'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response = requests.get(url_start, headers=headers)

Now the page comes back normally. Many sites check the request headers as a basic anti-crawling measure, so don't panic when you get blocked; try adding a request header first.
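As a minimal sketch of the same idea, the header can be set once on a requests.Session so that every later request carries it. The helper names make_session and fetch_html are hypothetical, and the 10-second timeout is an assumption rather than something from the original code.

```python
import requests

# Same User-Agent string as above; any realistic browser string should do.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36')


def make_session():
    """Return a Session whose requests all carry a browser-like User-Agent."""
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url and return its HTML, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling fetch_html(make_session(), 'http://maoyan.com/board/4') should then return the page HTML, or raise an exception instead of quietly handing back an error page.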

Next, use BeautifulSoup to get the content

imgs = soup.select('dd .board-img')  # The image tags, which hold the image links
titles = soup.select('dd .board-item-main .name')  # The movie titles
starses = soup.select('dd .board-item-main .movie-item-info .star')  # The starring actors
times = soup.select('dd .board-item-main .movie-item-info .releasetime')  # The release dates
scores = soup.select('dd .board-item-main .score-num')  # The ratings

Each select call here returns one field for every movie on the page, so unlike with a regular expression you can't capture all of one movie's fields in a single match. For example, the image selector returns the image links for all movies on the page, and we have to pull them apart when storing them. Here I use a for loop over 0 to 9 and store the values at the same index in the same dictionary.

films = []  # Information about every film on this page
for x in range(0, 10):
    # Read the image link from the tag's data-src attribute
    img = imgs[x]['data-src']
    # The rest take the tag's text and strip whitespace from both ends
    title = titles[x].get_text().strip()
    stars = starses[x].get_text().strip()[3:]  # The slice drops the leading "starring" label
    time = times[x].get_text().strip()[5:]  # The slice drops the leading "release time" label
    score = scores[x].get_text().strip()
    film = {'title': title, 'img': img, 'stars': stars, 'time': time, 'score': score}
    films.append(film)
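As an alternative to indexing from 0 to 9, the parallel lists can be walked together with zip. This is only a sketch: combine_fields is a hypothetical helper, and the sample values are made up to stand in for the cleaned-up strings taken from each tag.

```python
def combine_fields(imgs, titles, stars, times, scores):
    """Build one dict per film from five parallel lists of extracted values."""
    keys = ('img', 'title', 'stars', 'time', 'score')
    # zip() yields the i-th item of every list together and stops at the
    # shortest list, so a page with fewer than 10 films cannot raise IndexError.
    return [dict(zip(keys, row))
            for row in zip(imgs, titles, stars, times, scores)]


# Made-up sample values, not real site data.
films = combine_fields(
    imgs=['http://example.com/1.jpg', 'http://example.com/2.jpg'],
    titles=['Film A', 'Film B'],
    stars=['Actor 1, Actor 2', 'Actor 3'],
    times=['1993-01-01', '1994-09-10'],
    scores=['9.6', '9.5'],
)
print(films[0]['title'], films[0]['score'])
```

The trade-off is that zip silently drops trailing items if one list is shorter than the others, so it is worth checking the list lengths when a field may be missing for some films.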

The next step is to get the link to the next page

pages = soup.select('.list-pager li a')  # The last a tag holds the link to the next page
page = pages[-1]['href']

The rest is simple: use a loop to pull the content from every page, so I won't post that code here.
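For readers who want a starting point anyway, here is one possible sketch of that loop; it is not the code from the repository. It assumes a hypothetical parse_page callable that takes a URL and returns the films on that page plus the href of the last pager link, and it assumes the href is relative (for example '?offset=10'), so it is resolved with urljoin. The visited set is there because on the final page the last pager link presumably no longer points forward.

```python
from urllib.parse import urljoin


def crawl_all(start_url, parse_page, max_pages=20):
    """Follow "next page" links from start_url and collect every film.

    parse_page is a hypothetical callable: given a URL, it returns
    (films_on_that_page, href_of_last_pager_link_or_None).
    """
    all_films = []
    visited = set()
    url = start_url
    # Stop when there is no link, when a link leads back to a page we
    # have already seen, or when the safety cap on pages is reached.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        films, next_href = parse_page(url)
        all_films.extend(films)
        # Resolve a relative href such as '?offset=10' against the current URL.
        url = urljoin(url, next_href) if next_href else None
    return all_films
```

Wrapping the request, the select calls, and the pager lookup from the earlier snippets into one function would give a parse_page that fits this loop.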

Final thoughts

This was a small exercise with the BeautifulSoup library. It didn't use much of yesterday's material, only the selectors and the parts for getting text content and attributes. I still feel that regular expressions are a bit handier, haha: a single regular expression can capture all the details of each movie, as follows:

<dd>.*?board-index.*?>([\d]{1,3})</i>.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>.*?class="integer">(.*?)</i>.*?class="fraction">(.*?)</i>

You also need to pass the re.S matching flag so that the dot matches newlines, and then it works. So I recommend using regular expressions, haha.
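To show how that pattern might be applied, here is a small sketch that runs it over a hand-written fragment. The fragment only imitates the class names mentioned in this post and is not real output from the site; parse_films is a hypothetical helper, and the rank group is written with \d so that it matches digits.

```python
import re

# The pattern from above, split across lines for readability.
FILM_PATTERN = re.compile(
    r'<dd>.*?board-index.*?>([\d]{1,3})</i>'
    r'.*?title="(.*?)"'
    r'.*?class="star">(.*?)</p>'
    r'.*?class="releasetime">(.*?)</p>'
    r'.*?class="integer">(.*?)</i>'
    r'.*?class="fraction">(.*?)</i>',
    re.S,  # let "." match newlines, since each <dd> spans several lines
)

# Hand-written sample using the article's class names; not real site data.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Film A"><img data-src="http://example.com/1.jpg" class="board-img"></a>
  <p class="star">
    Starring: Actor 1, Actor 2
  </p>
  <p class="releasetime">Release: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <a href="/films/2" title="Film B"><img data-src="http://example.com/2.jpg" class="board-img"></a>
  <p class="star">
    Starring: Actor 3
  </p>
  <p class="releasetime">Release: 1994-09-10</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_films(html):
    """Return one dict per <dd> block matched by FILM_PATTERN."""
    films = []
    for rank, title, stars, released, integer, fraction in FILM_PATTERN.findall(html):
        films.append({
            'rank': rank,
            'title': title,
            'stars': stars.strip(),
            'time': released.strip(),
            # The rating is split into an integer part and a fraction part.
            'score': integer + fraction,
        })
    return films


films = parse_films(SAMPLE_HTML)
print(films)
```

Each match yields all six groups for one movie at once, which is the convenience over the per-field selectors used earlier.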

Check out my github if you need the full code haha!

github:https://github.com/SergioJune/gongzhonghao_code/blob/master/python3_spider/index.py

If this article was useful to you, how about a like and a share?
