Python Crawler for Beginners 2: Crawling Douban Movie Information
This is a free instructional course on Python crawlers for beginners. In just seven sessions it gives you an initial understanding of crawlers from zero, and by following along you'll be able to crawl resources on your own. Read the article, turn on your computer, and get hands-on: each section takes about 45 minutes on average, and if you want, you can start crawling today~
Well, let's officially start our second lesson on Crawling Douban Movie Info! La-la-la-la, look at the blackboard!
1. Crawler principle
1.1 Crawler fundamentals
After hearing so much about crawlers, what exactly is a crawler? And how do crawlers work? Let's start with the "crawler principle".
A crawler, also known as a web spider, is a program or script that can automatically fetch information from web pages according to certain rules. The generic framework of a crawler is as follows:
1. Select the seed URLs.
2. Put these URLs into the queue of URLs to be crawled.
3. Take a URL from the queue, download its page, and store the page in the library of downloaded pages; move the URL itself into the set of crawled URLs.
4. Analyze the URLs extracted from the downloaded pages, put the new ones into the pending-URL queue, and move on to the next loop.
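The loop described above can be sketched in a few lines of Python. The tiny in-memory "web" below (`TOY_WEB`, with made-up example.com URLs) is an assumption standing in for real HTTP downloads, so the queue logic itself is easy to see:

```python
from collections import deque

# A toy "web": page URL -> (page content, links found on that page).
# This stands in for real HTTP downloads so only the loop logic remains.
TOY_WEB = {
    "https://example.com/a": ("page A", ["https://example.com/b", "https://example.com/c"]),
    "https://example.com/b": ("page B", ["https://example.com/a"]),
    "https://example.com/c": ("page C", []),
}

def crawl(seed_urls):
    to_crawl = deque(seed_urls)  # queue of URLs to be crawled
    crawled = set()              # URLs already downloaded
    pages = {}                   # "library" of downloaded pages
    while to_crawl:
        url = to_crawl.popleft()
        if url in crawled or url not in TOY_WEB:
            continue
        content, links = TOY_WEB[url]  # stand-in for an HTTP request
        pages[url] = content           # store in the downloaded-page library
        crawled.add(url)               # move URL into the crawled set
        for link in links:             # new URLs join the pending queue
            if link not in crawled:
                to_crawl.append(link)
    return pages

pages = crawl(["https://example.com/a"])
print(sorted(pages))
```

Starting from the single seed page A, the loop discovers and downloads pages B and C as well, which is exactly steps 1-4 above in miniature.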
Ahem~
Or better yet, let's look at a concrete example!
1.2 An example of a crawler
For example, if we want to get the "rating" information of a movie.
The manual steps:
Open the movie's information page
Locate (find) the rating information
Copy and save the rating data we want
The crawler's steps:
Request and download the movie page
Parse the page and locate the rating information
Save the rating data
Feels pretty similar, doesn't it?
1.3 The basic process of crawling
In simple terms, we send a request to the server, get the returned page, parse it, extract the part of the information we want, and store it in a specified document or database. Just like that, the information we want has been "crawled" down!
2. Requests+Xpath Crawl Douban Movies
There are many crawler-related packages in Python: urllib, requests, bs4... We'll start with requests + XPath, because it's so easy to get started with! After learning it, you'll find BeautifulSoup just slightly harder by comparison.
Here's how we crawl Douban movies using requests+xpath.
2.1 Install Python application packages: requests, lxml
If this is your first time using Requests + XPath, you need to install two packages: requests and lxml. Type the following two lines in the terminal, one at a time (installation was covered in section 1).
pip install requests
pip install lxml
2.2 Import the Python modules we need
We write our code in Jupyter, first importing the two modules we need.
import requests
from lxml import etree
In Python, you import a whole library with "import + library name", and a specific method from a library with "from + library name + import + method name". Here we need requests to download the page and lxml.etree to parse it.
2.3 Get the Douban movie target page and parse it
We're going to crawl some information from the Douban page for the movie "The Shawshank Redemption", at the following address:
https://movie.douban.com/subject/1292052/
Given the url, use the requests.get() method to fetch the page's text, then use etree.HTML() to parse the downloaded page data "data".
url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
2.4 Get the movie title
Get the element's XPath and extract its text:
file = s.xpath('[XPath of the element]/text()')
Here, the "XPath of the element" is something we need to obtain manually. To do so, locate the target element on the page: right-click and choose "Inspect" (or use the shortcut Shift+Ctrl+C); as you move the mouse over an element, the corresponding page code is highlighted.
Right-click the highlighted code corresponding to the movie title, then choose Copy > Copy XPath. This copies the element's XPath:
//*[@id="content"]/h1/span[1]
Put it into the code and print the result:
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)
2.5 The code and the results of the run
The full code so far is as follows.
import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)
This completes our code for crawling the "movie title" information from the Douban page for "The Shawshank Redemption"; you can run it in Jupyter to see the results.
2.6 Getting information about other elements
Besides the movie title, we can also get the director, lead actors, runtime, and other information in a similar way. The code is as follows:
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')   # director
actor1 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[1]/text()')  # lead actor 1
actor2 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[2]/text()')  # lead actor 2
actor3 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[3]/text()')  # lead actor 3
time = s.xpath('//*[@id="info"]/span[13]/text()')                # runtime
Looking at the code above, the only difference between fetching the various "lead actor" entries is the number x in "a[x]". In fact, to get all the lead actors at once, simply use "a" without an index. The code is as follows:
actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')  # lead actors
The full code is as follows.
import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
time = s.xpath('//*[@id="info"]/span[13]/text()')
print('Film title:', film)
print('Director:', director)
print('Lead actors:', actor)
print('Runtime:', time)
The complete code and its results, run in Jupyter, are as follows.
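One practical caveat: Douban may reject requests that carry Requests' default User-Agent, returning an error page instead of the movie page. A common workaround, a sketch under the assumption that a browser-like User-Agent string is accepted, is to pass a headers dict to requests.get() (the network call is commented out here so the snippet runs offline):

```python
import requests
from lxml import etree

# Douban may block the default python-requests User-Agent; sending a
# browser-like one (an assumption, not an official requirement) usually helps.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
url = "https://movie.douban.com/subject/1292052/"

# Uncomment to actually fetch and parse the page:
# data = requests.get(url, headers=headers).text
# s = etree.HTML(data)
print(headers["User-Agent"])
```

If you see empty lists or an unexpected status code from the earlier examples, this is the first thing to try.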
3. About Requests
The official description of the Requests library says: "Requests is the only Non-GMO HTTP library for Python, safe for human consumption." This is a direct, swaggering declaration that Requests is the best HTTP library for Python. Why such confidence? If you're interested, read the official Requests documentation.
Requests has seven commonly used methods.
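The seven methods are requests.request(), get(), head(), post(), put(), patch(), and delete(); requests.request(method, url) is the general form, and the other six are shortcuts for the corresponding HTTP verbs. A minimal sketch using a prepared request, so we can inspect what would be sent without touching the network (the ?from=tutorial query parameter is just an illustrative assumption):

```python
import requests

# Build a GET request with a query parameter, but don't send it:
# Request.prepare() returns a PreparedRequest we can inspect.
req = requests.Request(
    "GET",
    "https://movie.douban.com/subject/1292052/",
    params={"from": "tutorial"},  # hypothetical parameter, for illustration
)
prepared = req.prepare()
print(prepared.method)  # the HTTP verb
print(prepared.url)     # the URL with the query string appended
```

In everyday crawling you'll rarely need more than requests.get(); the rest matter when you start talking to APIs that expect POST, PUT, and so on.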
4. About the parsing wizard Xpath
XPath, the XML Path Language, is a language for locating parts of an XML document.
XPath finds nodes in the data structure tree based on the document's tree structure. It was originally proposed as a general syntax model shared between XPointer and XSL, but was quickly adopted by developers as a small query language.
You can read this document to learn more about Xpath.
XPath's place in the page-parsing process:
1. First fetch the page data with the Requests library.
2. Parse the page to extract the desired data or new links.
3. Parsing can be done with XPath or other tools; XPath is an excellent tool for the job.
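To see XPath in action without any network access, here is a minimal sketch run against a small literal HTML snippet. The snippet is an assumption that mimics the structure of the Douban title element used earlier in this lesson, so the same XPath expression applies:

```python
from lxml import etree

# A miniature page mimicking the Douban movie-page structure,
# so the XPath call below runs entirely offline.
html = """<div id="content">
<h1><span property="v:itemreviewed">The Shawshank Redemption</span> <span class="year">(1994)</span></h1>
</div>"""

s = etree.HTML(html)  # parse the HTML string into an element tree
# Same expression as in section 2.4: first <span> inside the <h1>
# under the element whose id is "content".
title = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(title)  # a list of matching text nodes
```

Note that xpath() always returns a list, even for a single match, which is why the earlier examples print bracketed results like ['The Shawshank Redemption'].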
A comparison of common page-parsing methods:
Regular expressions: difficult to use, high learning cost
BeautifulSoup: slower, and clumsier than XPath, though useful in some specific scenarios
XPath: easy to use, fast (XPath is part of lxml), and the best choice for getting started
Well, that's it for this lesson!
Bye for now~