
Scrapy Tutorial


Before reading this tutorial, we assume that you already have Scrapy installed on your system. If you haven't, read the installation guide first.

This tutorial will take you through the following tasks:

1. Create a new Scrapy project

2. Write a crawler to crawl a website and extract data

3. Export the crawled data using the command line

4. Recursively follow links

5. Use crawler arguments

Scrapy is written in Python. If you are not familiar with the language, you may want to get a feel for it first so that you can get the most out of Scrapy.

If you are already familiar with other programming languages and want to learn Python quickly, we recommend reading Dive Into Python 3 or the official Python Tutorial.

If you don't know anything about programming but want to learn Python, this online book Learning Python the Hard Way is useful. You can also check out these Python resources for non-programmers.

Create a project

Before you can start crawling, you need to create a new Scrapy project. Go to a directory where you want to store your code and run the following command.

scrapy startproject tutorial

This will create a tutorial directory with the following contents.
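
The exact file list depends on your Scrapy version, but a freshly created project typically looks roughly like this:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module; you'll import your code from here
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the directory where you'll put your crawlers
            __init__.py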

Our first crawler.

A crawler is a class that you define and that Scrapy uses to scrape information from a website (or a group of websites). The class must inherit from scrapy.Spider and define the initial requests to make, how to follow links on the pages (optional), and how to parse the downloaded pages to extract data.

Here is the code for our first crawler. Save it in a file named quotes_spider.py under the tutorial/spiders directory:
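
The code block itself appears to have been lost in this copy; a sketch along the lines of the official Scrapy tutorial (crawling quotes.toscrape.com and saving each downloaded page to disk) looks like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # the first batch of requests the crawler will make
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save the downloaded page to a local file, e.g. quotes-1.html
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)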

As you can see, our crawler inherits from scrapy.Spider and defines some attributes and methods:

· name: identifies the crawler. It must be unique within a project, that is, each crawler must have a different name.

· start_requests(): must return an iterable of requests (you can return a list of requests or write a generator function) from which the crawler will begin to crawl. Subsequent requests will be generated successively from these initial requests.

· parse(): a method that is called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and provides further helpful methods for handling it.

The parse() method usually parses the response, extracts the crawled data as dictionaries, finds new URLs to follow and creates new requests from them.

How to run our crawler

To run our crawler, go into the top-level directory of the project and run the following command.

scrapy crawl quotes

Now look at the current directory. You will notice that two new files have been generated: quotes-1.html and quotes-2.html, each containing the content of the corresponding URL, just as our parse method specifies.

Caution.

If you're wondering why we haven't gone to parsing HTML yet, don't worry, we'll get to that in a minute.

What's going on inside?

Scrapy schedules the scrapy.Request objects returned by the crawler's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.

A shortcut

Instead of implementing a start_requests() method that generates scrapy.Request objects, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your crawler:
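
A sketch of the shortened crawler (the code block is missing from this copy; this assumes the same quotes.toscrape.com pages as before):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # same page-saving logic as before; parse() is used as the default callback
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)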

The parse() method will then be called to handle the response for each of those URLs. Even though we haven't explicitly told Scrapy to do so, this works because parse() is the default callback Scrapy uses for requests without an explicitly assigned callback.

Extracting data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell. Run the following command:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Caution.

When running the Scrapy shell from the command line, remember to wrap the URL in single quotes; otherwise, URLs containing certain characters (such as &) will not work.

On the Windows platform, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"

The shell will then start up and show the Scrapy objects available for working with the downloaded response.

Using the shell, you can try selecting elements with CSS on the response object:
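
For example (the exact selector representation may vary slightly between Scrapy versions):

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]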

Running response.css('title') returns a list-like SelectorList object, which represents a list of Selector objects wrapped around XML/HTML elements and lets you run further queries to refine the selection or extract data.

To extract the text from the title above, you would type.
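
Something like this (assuming the title of quotes.toscrape.com as loaded above):

>>> response.css('title::text').extract()
['Quotes to Scrape']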

Two things to note here. The first is that we've added ::text to the CSS query, which means we want to select only the text inside the <title> element. If we don't specify ::text, we get the full title element, including its tags.

The other thing is that you get a list after calling .extract(), and that's because we're now dealing with an instance of SelectorList. If you only want the first element in the list, you can write it like this.
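
For example (same page as above):

>>> response.css('title::text').extract_first()
'Quotes to Scrape'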

Alternatively, you could have written:
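
That is, indexing the SelectorList directly:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'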

However, using .extract_first() avoids an IndexError: it returns None if no matching element is found.

There's a lesson here: for most crawler code, you want it to be resilient to errors caused by elements not being found on a page, so that even if some parts fail to be extracted, you can at least get some data.

Besides the extract() and extract_first() methods, you can also use the re() method to extract data with regular expressions.
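
A few illustrative calls against the page title (the patterns here are just examples):

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']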

To find the right CSS selectors, it helps to use view(response) to open the page downloaded in the shell in your browser. There you can use the browser's developer tools, or an extension such as Firebug (see the chapters "Using Firebug for scraping" and "Using Firefox for scraping").

The Selector Gadget, supported by most browsers, is also a good tool for quickly determining CSS selectors to visually select elements.

XPath Introduction.

In addition to CSS, the Scrapy selector also supports XPath expressions.
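
For example, the title query from before can also be written with XPath:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'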

XPath expressions are very powerful and it is also the basis for Scrapy selectors. In fact, if you read the textual representation of the selector object in the shell carefully, you'll see that the CSS selector is converted internally to XPath.

XPath expressions may not be as popular as CSS selectors, but they are more powerful. Besides navigating the document structure, XPath can also look at the content: for example, it lets you select the link that contains the text "Next Page". This makes XPath very well suited to crawling tasks, and even if you already know how to use CSS selectors, we encourage you to learn XPath; it will make crawling much easier.

We won't cover much of XPath here, but you can read about how to use XPath with Scrapy selectors in the Scrapy documentation. To learn XPath further, we recommend two tutorials: the first teaches XPath through examples, and the second teaches how to "think in XPath".

Extracting famous quotes

Now that you know a little about selection and extraction, let's complete our crawler by writing the code to extract the quotes from the web page.

Let's open the Scrapy shell and play around with extracting the data we want:
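
For instance, loading the first page of the quotes site used throughout this tutorial:

scrapy shell 'http://quotes.toscrape.com/page/1/'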

We can get a list of selectors corresponding to the quote HTML elements with:
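
Assuming the quote blocks are div elements with class "quote", as on quotes.toscrape.com:

>>> response.css("div.quote")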

Each of the selectors returned by the query above lets us run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote div:
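
For example:

>>> quote = response.css("div.quote")[0]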

Now, let's use the quote object we just created to extract the text, author and tags from that quote:
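
A sketch, assuming the markup used on quotes.toscrape.com (quote text in span.text, author in small.author):

>>> text = quote.css("span.text::text").extract_first()
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'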

Since tags is a list of strings, we can use the .extract() method to get all of them:
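
Again assuming the markup of the quotes site (tag links carry the class "tag"):

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']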

Having figured out how to extract each piece from a quote element, we can now iterate over all the quote elements and put the extracted data together into Python dictionaries:
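
Putting it together in the shell, each iteration prints one dictionary (a sketch; the selectors are the same as above):

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))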

Extracting data in a crawler

Back to the crawler we wrote. So far, the crawler hasn't crawled any specific data, it's just downloaded an entire HTML page into a local file. Now let's integrate the above extraction logic into our crawler.

A Scrapy crawler typically generates many dictionaries containing the data extracted from the page. To do that, we use the Python yield keyword in the callback, as follows:
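
A sketch of the resulting crawler (the code block is missing from this copy; the selectors are those used in the shell session above):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # yield one dictionary per quote found on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }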

If you run this crawler, it will output the extracted data along with the log.

Storing the crawled data

To store the crawled data, the easiest way is to use Feed exports by running the following command.

scrapy crawl quotes -o quotes.json

That will generate a quotes.json file containing all the crawled items, serialized in JSON.

You can also use other formats, such as JSON Lines.

scrapy crawl quotes -o quotes.jl

The JSON Lines format is useful because it is stream-like: you can easily append new records to it, and running the crawl twice doesn't leave you with the broken-file problem you get with plain JSON. Moreover, since each record is on a separate line, you can process big files without having to load everything into memory first; there are tools like JQ that help do that from the command line.

In small projects (like the one in this tutorial), Feed exports should be enough. But if you want to perform more complex operations on the crawled items, you can write an Item Pipeline. A placeholder file for Item Pipelines, tutorial/pipelines.py, was created along with the project. However, if you only want to store the crawled items, you don't need to implement any pipelines.

Following links

Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing to do is extract the link to the page we want to follow. Examining our page, we can see the link to the next page has the following markup:
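
On quotes.toscrape.com, the pagination markup looks roughly like this (simplified):

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>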

Let's try extracting it in the shell:
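
For example (output abbreviated):

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">...</span></a>'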

This gets the anchor element, but we want the href attribute. For that, Scrapy supports a CSS extension that lets you select attribute contents, like this:
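
For example:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'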

Now look at our modified crawler, which recursively follows the link to the next page and extracts data from each page:
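
A sketch of the full crawler (the code block is missing from this copy; it mirrors the official tutorial):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # follow the link to the next page, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)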

After extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the extracted link can be relative), then yields a new request for the next page, registering itself as the callback to extract that page's data; the crawl thus loops through the pages until the last one.

This is Scrapy's mechanism for following links: when you yield a request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when it finishes.

In this way, you can build a sophisticated crawler to follow links according to rules you define and extract different data from the pages visited.

The crawler we've written above creates a loop to keep following the links on the next page until the last page - suitable for crawling blogs, forums and other sites with page numbers.

A shortcut to creating requests

You can use response.follow as a shortcut for creating Request objects:
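
For example, the end of the parse() method above could be rewritten as:

    def parse(self, response):
        # ... yield the quote dictionaries as before ...

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.follow accepts relative URLs, so no urljoin() is needed
            yield response.follow(next_page, callback=self.parse)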

Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that request.

Besides a string, you can also pass a selector to response.follow; the selector should be able to extract the necessary attribute.
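
A sketch, reusing the pagination selector from above:

    def parse(self, response):
        # ... yield the quote dictionaries as before ...

        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, callback=self.parse)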

For <a> elements, response.follow automatically uses their href attribute, so the code can be shortened further:
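
For example:

    def parse(self, response):
        # ... yield the quote dictionaries as before ...

        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)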

Caution.

response.follow(response.css('li.next a')) will not work, because response.css returns a list-like object containing selectors for all results. Either use a for loop as in the examples above, or use response.follow(response.css('li.next a')[0]).

More examples and patterns

Here is another crawler that scrapes author information; it illustrates callbacks and following links.
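
A sketch of such a crawler, assuming the author-page markup of quotes.toscrape.com (author name in h3.author-title, birth date in .author-born-date, bio in .author-description):

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # helper: run a CSS query and clean up the result
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }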

This crawler starts from the home page and follows all the links to author pages, calling the parse_author callback for each of them; it also follows the pagination links with the parse callback, as we saw in the previous section.

Here we pass the callbacks to response.follow as positional arguments to make the code more concise; this also works for scrapy.Request.

The parse_author callback defines a helper function to extract and clean the data from a CSS query, and yields a Python dictionary with the author data.

Another interesting thing this crawler demonstrates is that, even though there are many quotes by the same author, we don't need to worry about visiting the same author page over and over again. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hammering servers because of a programming mistake. This behavior can be configured with the DUPEFILTER_CLASS setting.

Hopefully you now have a good understanding of Scrapy's link-following and callback mechanisms.

The crawler above makes full use of the link-following mechanism. You may also want to look at the CrawlSpider class, a generic crawler that implements a small rules engine on top of which you can write your own crawlers.

Another common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callback method, as sketched below.
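
A minimal sketch of that pattern, with hypothetical URLs and selectors, using the request's meta dict to carry a partially built item to the next callback:

import scrapy


class ItemAcrossPagesSpider(scrapy.Spider):
    name = 'multi_page_item'
    start_urls = ['http://example.com/products/']  # hypothetical site

    def parse(self, response):
        for product in response.css('div.product'):           # hypothetical selector
            item = {'name': product.css('h2::text').extract_first()}
            detail_url = product.css('a::attr(href)').extract_first()
            if detail_url:
                # carry the partially filled item over to the detail-page callback
                yield response.follow(detail_url, callback=self.parse_detail,
                                      meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        item['description'] = response.css('p.description::text').extract_first()
        yield item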

Using crawler arguments

You can give your crawler command-line arguments by using the -a option when running it:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

These arguments are passed to the crawler's __init__ method and become attributes of the crawler by default.

In the following example, the value provided for the tag argument is available via self.tag. You can use this to build the URL from the tag argument and have your crawler fetch only quotes with a specific tag:
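
A sketch of such a crawler, following the official tutorial:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)   # set from the -a tag=... argument
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)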

If you pass the tag=humor parameter to this crawler, you will find that it will only visit URLs with the humor tag, such as http://quotes.toscrape.com/tag/humor.


