When using the shell tool, you can try using CSS on the response object to select elements:
Running response.css('title') returns a list-like SelectorList object, which represents a list of Selector objects wrapping XML/HTML elements and allows you to run further queries to refine the selection or extract data.
To extract the text from the title above, you would type:
Two things to note here. The first is that we've added ::text to the CSS query, which means we only want to select the text inside the element. If we omit ::text, we get the full title element, including its tags.
The other thing is that calling .extract() returns a list, because we are dealing with an instance of SelectorList. If you only want the first element of the list, you can write it like this.
Alternatively, you could have written:
However, using .extract_first() avoids an IndexError: it returns None if no matching element is found.
There is a lesson here: because some data may be missing from a page, you generally want most of your scraping code to be resilient to such errors, so that even when some parts fail to be extracted you still get at least some data.
Besides the extract() and extract_first() methods, you can also extract data with regular expressions using the re() method.
To find the right CSS selectors, the following can help: use view(response) to open the response page downloaded by the shell in your browser. You can then use the browser's developer tools, or an extension such as Firebug (see the chapters "Scraping with Firebug" and "Scraping with Firefox").
Selector Gadget, which works in most browsers, is also a nice tool for quickly finding the CSS selector for visually selected elements.
In addition to CSS, the Scrapy selector also supports XPath expressions.
XPath expressions are very powerful, and they are also the foundation of Scrapy selectors. In fact, if you read the textual representation of selector objects in the shell carefully, you'll see that CSS selectors are converted to XPath internally.
XPath expressions may not be as popular as CSS selectors, but they offer much more power. Besides navigating the document structure, XPath can also match on content: for example, it can select the link that contains the text "Next Page". This makes XPath very well suited to scraping tasks, so even if you already know how to use CSS selectors, we encourage you to learn XPath; it will make scraping much easier.
We won't cover much of XPath here, but you can use this link to learn how to use XPath with Scrapy selectors. To learn XPath further, we recommend these two tutorials: the first teaches XPath through examples, and the second teaches how to "think in XPath".
Extracting famous quotes
Now that you understand a little about selection and extraction, let's complete our spider by writing the code to extract the quotes data from the web page.
Let's open the Scrapy shell and try to extract the data we want:
We can get a list of selectors corresponding to the quote HTML elements with:
Each of the selectors returned by this query allows us to run further queries over its child elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular div element of the quote class.
Now, let's use the quote object we just created to extract the text, author and tags from the quote.
Since the tags are a list of strings, we can use the .extract() method to get all of them.
Now that we know how to extract each piece of data, we can iterate over all the quote elements and put them together into Python dictionaries.
Extracting data in a crawler
Back to the spider we wrote. So far, it hasn't extracted any specific data; it has just downloaded whole HTML pages to local files. Now let's integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from a page. To do this, we use the Python keyword yield in the callback method, as follows.
If you run this spider, the log will show the extracted data as output.
Storing the crawled data
To store the scraped data, the simplest way is to use Feed exports, by running the following command.
scrapy crawl quotes -o quotes.json
Then a quotes.json file is generated that contains all the crawled fields and is serialized in JSON format.
You can also use other formats, such as JSON Lines.
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it is stream-like: you can easily append new records to it, and running the crawl twice doesn't cause the same problem as with JSON. Moreover, since each record is on a separate line, you don't have to load the whole file into memory before processing it, and there are tools like JQ that can help do this from the command line.
In small projects (like the one in this tutorial), using Feed exports should be sufficient. But if you want to perform more complex operations on the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines, tutorial/pipelines.py, was created in the project folder when the project was generated. However, if you only want to store the scraped items, you don't need to implement any item pipelines.
Following links

Now that you know how to extract data from pages, let's see how to follow links from them.
First, let's extract the link we want to follow. Examining the page, we can see a link to the next page with the following markup.
Let's try extracting it in the shell.
This gets the anchor element, but we want the href attribute. For that, Scrapy supports a CSS extension that lets you select attribute contents, like this.
Now look at our modified spider, which recursively follows the link to the next page and extracts data from each page:
After extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (since the extracted link may be relative), then yields a new request for the next page, registering itself as the callback to extract that page's data, and so on in a loop until the crawl reaches the last page.
This is Scrapy's mechanism for following links: when you yield a request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.
In this way, you can build a sophisticated crawler to follow links according to rules you define and extract different data from the pages visited.
The spider we've written above creates a loop that keeps following next-page links until the last page, which makes it suitable for crawling blogs, forums and other paginated sites.
A shortcut to creating requests
You can use response.follow as a shortcut for creating Request objects:
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that request.
In addition to passing a string, you can also pass a selector to response.follow; this selector should extract the necessary attributes.
For <a> elements, response.follow automatically uses their href attribute, so the code can be shortened further:
response.follow(response.css('li.next a')) cannot be used, because response.css returns a list-like object containing selectors for all results. You can either use a for loop, as in the example above, or use response.follow(response.css('li.next a')[0]).
More examples and patterns
Here is another spider, which scrapes author information and illustrates callbacks and following links.
This spider starts from the home page, follows all the links to author pages, calling the parse_author callback for each of them, and also follows the next-page links, calling the parse callback again (as we saw in the previous subsection).
Here we pass the callbacks to response.follow as positional arguments to make the code more concise; this also works for scrapy.Request.
The parse_author callback defines a helper method to extract and clean up the data from a CSS query, and returns a Python dictionary with the author data.
Another interesting thing this spider demonstrates is that, even though there are many quotes from the same author, we don't need to worry about visiting the same author page over and over. By default, Scrapy filters out requests to URLs already visited, avoiding the problem of hammering servers because of a programming error. This behavior can be configured via the DUPEFILTER_CLASS setting.
Hopefully by now you have a good understanding of Scrapy's mechanism for following links and its callbacks.
The spider example above makes full use of the link-following mechanism. You can also take a look at the CrawlSpider class, a generic spider that implements a small rules engine on top of which you can write your own crawlers.
There is also a common pattern of building an item with data from more than one page, using a trick to pass additional data to the callback methods.
Using crawler parameters
By using the -a option, you can pass command-line arguments to your spider when you run it:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In the following example, the value provided for the tag argument will be available via self.tag. You can use this approach to build the URL based on the tag argument and make your spider fetch only quotes with a specific tag: