Python Crawler for Beginners, Lesson 2: Crawling Douban Movie Information

This is a free introductory course on Python crawlers for beginners. In only 7 lessons it takes you from zero background to a basic understanding of crawlers, so that by following along you can crawl resources on your own. Read the article, turn on your computer and get hands-on: each lesson takes about 45 minutes on average, and if you like, you can start crawling today~

Well, let's officially start our second lesson on Crawling Douban Movie Info! La-la-la-la, look at the blackboard!

1. Crawler principle

1.1 Crawler fundamentals

After hearing so much about crawlers, what exactly is a crawler? And how do crawlers work? Let's start with the "crawler principle".

A crawler, also known as a web spider, is a program or script. What makes it special is that it follows a set of rules and automatically fetches information from web pages. The generic framework for crawlers is as follows.

1. Select the seed URLs.

2. Place these URLs in the queue of URLs to be crawled.

3. Take a URL from the to-be-crawled queue, download the page and store it in the library of downloaded pages. Then move that URL to the queue of crawled URLs.

4. Analyze the downloaded pages of the crawled URLs, extract the new URLs they link to, and place those in the to-be-crawled queue for the next loop.
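The four steps above can be sketched as a small loop. This is only a minimal illustration under stated assumptions: FAKE_WEB is a made-up in-memory "web" that stands in for real downloading and link extraction, and the names crawl, pending, seen and crawled are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download the page and extract its links instead.
FAKE_WEB = {
    "page/a": ["page/b", "page/c"],
    "page/b": ["page/c", "page/d"],
    "page/c": ["page/a"],
    "page/d": [],
}


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl over FAKE_WEB; returns URLs in the order visited."""
    pending = deque(seed_urls)   # queue of URLs waiting to be crawled
    seen = set(seed_urls)        # every URL ever queued, to avoid repeats
    crawled = []                 # URLs already fetched, in visit order

    while pending and len(crawled) < max_pages:
        url = pending.popleft()
        links = FAKE_WEB.get(url, [])   # stands in for "download and parse"
        crawled.append(url)
        for link in links:
            if link not in seen:        # only queue URLs we have not met yet
                seen.add(link)
                pending.append(link)
    return crawled


order = crawl(["page/a"])
print(order)
```

The seen set is what keeps the loop from revisiting "page/a" when "page/c" links back to it.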


Let's make this concrete with a specific example!

1.2 An example of a crawler

For example, suppose we want to get the "rating" of a movie.

Manual procedure:

1. Open the movie information page

2. Locate the rating information on the page

3. Copy and save the rating data we want

Steps for the crawler:

1. Request and download the movie page

2. Parse the page and locate the rating information

3. Save the rating data

Quite similar, aren't they?

1.3 The basic process of crawling

In simple terms, after we send a request to the server, we get the returned page, and after parsing the page, we can extract the part of information we want and store it in a specified document or database. This way, the information we want is "crawled" down to us!
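As a rough sketch of this request, parse, store flow, the example below uses a hard-coded page so it runs without a network. Everything in it is an assumption made for illustration: SAMPLE_HTML, the fetch/parse/save function names, and the "Example Movie" data are hypothetical, and lxml must already be installed.

```python
import csv
import io

from lxml import etree

# Hypothetical page content standing in for a server response.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Example Movie</h1>
  <span class="rating">9.0</span>
</body></html>
"""


def fetch(url):
    """Stand-in for sending a request; a real crawler would download the page here."""
    return SAMPLE_HTML


def parse(html):
    """Extract the parts we want from the returned page."""
    tree = etree.HTML(html)
    title = tree.xpath('//h1[@id="title"]/text()')[0]
    rating = tree.xpath('//span[@class="rating"]/text()')[0]
    return {"title": title, "rating": rating}


def save(record, out):
    """Store the extracted record as CSV in any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["title", "rating"])
    writer.writeheader()
    writer.writerow(record)


record = parse(fetch("http://example.invalid/movie"))
buffer = io.StringIO()   # swap for open("movies.csv", "w", newline="") to write a file
save(record, buffer)
print(buffer.getvalue())
```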

2. Crawling Douban Movies with Requests + XPath

There are many crawler-related packages in Python: urllib, requests, bs4, and so on. We'll start with requests + XPath, because it's so easy to get started! Once you've learned it, you may find BeautifulSoup slightly harder by comparison.

Here's how we crawl Douban movies using requests + XPath.

2.1 Install the Python packages: requests and lxml

If you are using requests + XPath for the first time, you first need to install two packages, requests and lxml, by typing the following two commands in the terminal, one at a time (the installation method is described in lesson 1).

    pip install requests
    pip install lxml

2.2 Import the Python modules we need

We write our code in Jupyter, first importing the two modules we need.

    import requests
    from lxml import etree

In Python, "import library_name" imports a whole library, while "from library_name import name" imports one specific part of it. Here we need requests to download the page and lxml.etree to parse it.
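To see the two import forms side by side without installing anything, here is a small sketch using the standard library's os module as a stand-in. The choice of os.path is only for illustration; it plays the same role that etree plays inside lxml.

```python
import os             # import the whole library; use it with the "os." prefix
from os import path   # import one part of it; use it without the prefix

joined_with_prefix = os.path.join("movies", "shawshank.html")
joined_without_prefix = path.join("movies", "shawshank.html")

# Both names refer to the very same module object.
print(path is os.path)
print(joined_with_prefix == joined_without_prefix)
```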

2.3 Get the Douban movie target page and parse it

We're going to crawl some information from the Douban page for the movie "The Shawshank Redemption". Its page address is assigned to url in the code.

Given the url, use the requests.get() method to get the text of the page, then use etree.HTML() to parse the downloaded page data "data".

    url = ''
    data = requests.get(url).text
    s = etree.HTML(data)
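In practice a request can fail, and some sites reject requests that do not send a browser-like User-Agent header. The sketch below is one hedged way to make the download step more defensive. The function name fetch_tree, the header string and the 10-second timeout are assumptions for illustration, not part of the lesson's code.

```python
import requests
from lxml import etree

# Assumed values: this header string is illustrative, not taken from the lesson.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-crawler)"}


def fetch_tree(url, timeout=10):
    """Download a page and return it parsed as an lxml tree."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()   # raise an error on 4xx/5xx responses
    return etree.HTML(response.text)
```

With this helper, s = fetch_tree(url) would replace the last two lines of the lesson's code, and a blocked or failed request raises an error instead of silently producing an empty result.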

2.4 Get the movie title

Get the XPath of the element and extract its text.

    film = s.xpath('XPath of the element/text()')

Here, "XPath of the element" is something we need to obtain manually, by locating the target element: on the web page, right click > Inspect.

Alternatively, use the shortcut "shift+ctrl+c"; when you move the mouse over an element, you can see the corresponding page code.

On the code corresponding to the movie title, right click > Copy > Copy XPath to get the XPath of the movie title.

This way we have copied the XPath of the element:

    //*[@id="content"]/h1/span[1]


Put it into the code and print the result.

    film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
    print(film)
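Note that xpath() returns a list of matches rather than a single string. The sketch below shows this on a small hand-written snippet shaped like the title area; the snippet is an assumption for illustration and is not the real page source.

```python
from lxml import etree

# Hypothetical snippet shaped like the title area of the page; not real page data.
html = '''
<div id="content">
  <h1>
    <span>The Shawshank Redemption</span>
    <span>(1994)</span>
  </h1>
</div>
'''
s = etree.HTML(html)

film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)   # a list of matching text strings

# Take the first match, guarding against an empty result.
title = film[0].strip() if film else None
print(title)

# An XPath that matches nothing gives an empty list, not an error.
missing = s.xpath('//*[@id="no-such-id"]/text()')
print(missing)
```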

2.5 The code and the results of the run

The full code so far is as follows.

    import requests
    from lxml import etree

    url = ''
    data = requests.get(url).text
    s = etree.HTML(data)

    film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
    print(film)

This completes our code for crawling the "movie title" of the Douban movie "The Shawshank Redemption". Run it in Jupyter to see the result.

2.6 Getting information about other elements

Besides the movie title, we can also get the director, the lead actors, the film length and other information in a similar way. The code is as follows:

    director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')    # director
    actor1 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[1]/text()')   # lead actor 1
    actor2 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[2]/text()')   # lead actor 2
    actor3 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[3]/text()')   # lead actor 3
    time = s.xpath('//*[@id="info"]/span[13]/text()')                 # film length

Looking at the code above, we find that when getting the different "lead actor" entries, the only difference is the number "x" in "a[x]". In fact, to get all the "lead actor" entries at once, use "a" without a number. The code is as follows:

    actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')   # all lead actors
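The difference between "a[1]" and plain "a" can be checked offline on a small snippet. The HTML and the names below are placeholders invented for illustration, with a simpler structure than the real page.

```python
from lxml import etree

# Hypothetical snippet; the names are placeholders, not scraped data.
html = '''
<div id="info">
  <span class="actor">
    <span>Starring</span>
    <span><a>Actor One</a> / <a>Actor Two</a> / <a>Actor Three</a></span>
  </span>
</div>
'''
s = etree.HTML(html)

# a[1] selects only the first <a> inside the second inner <span>.
first = s.xpath('//*[@id="info"]/span/span[2]/a[1]/text()')

# a without an index selects every <a> at that position.
everyone = s.xpath('//*[@id="info"]/span/span[2]/a/text()')

print(first)
print(everyone)
```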

The full code is as follows.

    import requests
    from lxml import etree

    url = ''
    data = requests.get(url).text
    s = etree.HTML(data)

    film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
    director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
    actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
    time = s.xpath('//*[@id="info"]/span[13]/text()')

    print('Film title:', film)
    print('Director:', director)
    print('Lead actors:', actor)
    print('Film length:', time)

Run the complete code in Jupyter to see the results.

3. About Requests

The official description of the Requests library has this to say: Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

This is a bold, direct declaration that Requests is the best HTTP library for Python. What gives it such confidence? If you are interested, please read the official Requests documentation.

Requests has seven commonly used methods.
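The list of methods did not survive in this copy of the lesson. Assuming it refers to the seven top-level request functions that the library exposes, they can be enumerated as in the sketch below; the one-line descriptions are a summary written for this example, not quoted from the lesson.

```python
import requests

# Assumption: these are the seven functions the lesson refers to.
METHODS = {
    "request": "general-purpose call that the other six build on",
    "get": "retrieve a page (HTTP GET)",
    "head": "retrieve only the response headers (HTTP HEAD)",
    "post": "submit data to a page (HTTP POST)",
    "put": "replace a resource (HTTP PUT)",
    "patch": "partially update a resource (HTTP PATCH)",
    "delete": "delete a resource (HTTP DELETE)",
}

for name, description in METHODS.items():
    func = getattr(requests, name)   # e.g. requests.get
    print(f"requests.{name}(): {description}")
```

This lesson only uses requests.get().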

4. About XPath, the parsing wizard

XPath is the XML Path Language, a language used to address parts of an XML document.

XPath finds nodes by navigating the tree structure of an XML document. It was originally proposed as a common syntax model shared by XPointer and XSL, but developers quickly adopted it as a small query language.

You can read this document to learn more about XPath.

The process of parsing web pages with XPath:

1. First fetch the web page data through the Requests library

2. Parse the page to get the desired data or new links

3. The parsing can be done with XPath or other parsing tools; XPath is a very good tool for web parsing

Comparison of common web parsing methods:

Regular expressions are harder to use and take more effort to learn

BeautifulSoup is slower and harder than XPath, but useful in some specific scenarios

XPath is easy to use and fast (XPath support is part of lxml), making it the best choice for getting started
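To make the comparison a little more concrete, here is the same value extracted once with a regular expression and once with XPath. The snippet and the value "9.0" are placeholders made up for this sketch; BeautifulSoup is left out so that the example only needs lxml.

```python
import re

from lxml import etree

# Hypothetical snippet; the rating value is a placeholder.
html = '<div id="info"><span class="rating">9.0</span></div>'

# Regular expression: works on the raw text, so the pattern must mirror the markup exactly.
match = re.search(r'<span class="rating">([^<]+)</span>', html)
by_regex = match.group(1) if match else None

# XPath: works on the parsed tree, so it only describes where the node sits.
by_xpath = etree.HTML(html).xpath('//span[@class="rating"]/text()')[0]

print(by_regex, by_xpath)
```

If the page later adds another attribute to the span, the regular expression stops matching, while the XPath still works.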

Well, that's it for this lesson!
