Building a web crawler using Scrapy



I remember that, some years ago, a project needed a flexible crawler tool, so we put together a small team to implement a crawler framework in Java. Given a target site's structure, address and the content we needed, a simple configuration could produce a crawler for that specific site. Because of all the special cases that had to be handled, it still took a lot of manpower to develop. Then I discovered the Scrapy tool in the Python world and instantly felt that all the work we had done before was wasted. For common web crawling tasks, Scrapy is perfectly competent and hides a great deal of the complexity. This article covers how to build a simple web crawler with Scrapy.

A basic crawler tool should have the following features (a deliberately naive version of such a crawl loop is sketched after the list):

Download web page content via HTTP(S) requests

Parse web pages to extract the needed content

Save the extracted content

Find valid links on the current page and continue crawling to the next pages
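To make these responsibilities concrete, here is what a hand-rolled crawl loop might look like. This is a deliberately naive sketch (Python 2, urllib2 plus regular expressions; the start URL and the link pattern are illustrative assumptions), included only to show the four steps above that Scrapy will take off your hands:

import re
import urllib2
from collections import deque

start_url = 'http://www.bjhee.com'    # illustrative start URL
queue = deque([start_url])            # URLs waiting to be crawled
seen = set(queue)                     # URLs already queued
while queue:
    url = queue.popleft()
    html = urllib2.urlopen(url).read()                  # 1. download the page over HTTP(S)
    match = re.search(r'<title>(.*?)</title>', html)    # 2. parse out the needed content
    if match:
        print url, match.group(1)                       # 3. "save" the content (here: just print it)
    for link in re.findall(r'href="(http://www\.bjhee\.com/[^"]*)"', html):
        if link not in seen:                            # 4. follow new links found on the page
            seen.add(link)
            queue.append(link)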

Let's look at how Scrapy does all of this. First, prepare the Scrapy environment: you need to install Python (this article uses v2.7) and pip, then use pip to install lxml and scrapy. I highly recommend using virtualenv to set up the environment so that different projects do not conflict with each other; the detailed steps are not covered here. Mac users should be aware that when installing lxml with pip, an error similar to the following may occur:

Error: #include “xml/xmlversion.h” not found

To solve this problem, you first need to install Xcode's command line tools, which can be done by executing the following command:

$ xcode-select --install
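For reference, a typical setup on a clean machine might look like the following (a minimal sketch; it assumes pip and virtualenv are already available, and the environment name is arbitrary):

$ virtualenv scrapy-env
$ source scrapy-env/bin/activate
$ pip install lxml scrapy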

Once the environment is installed, let's implement a simple crawler using Scrapy to crawl this blog site for article titles, addresses and summaries.

1. Create Project

$ scrapy startproject my_crawler

This command will create a project named "my_crawler" in the current directory, with the following directory structure:

my_crawler

|-my_crawler

| |-spiders

| | |-__init__.py

| |-items.py

| |-pipelines.py

| |-settings.py

|-scrapy.cfg

2. Define the fields to be crawled; in this case, the title, address and summary of each article

Modify the "items.py" file and add the following code to the "MyCrawlerItem" class.

# -*- coding: utf-8 -*-
import scrapy

class MyCrawlerItem(scrapy.Item):
    title = scrapy.Field()      # article title
    url = scrapy.Field()        # article address
    summary = scrapy.Field()    # article summary
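An Item behaves much like a Python dictionary, which makes it easy to check interactively. A quick, purely illustrative sketch (run from the project root; the values are made up):

from my_crawler.items import MyCrawlerItem

item = MyCrawlerItem()
item['title'] = u'Hello Scrapy'                 # made-up title
item['url'] = 'http://www.bjhee.com/hello'      # hypothetical article address
item['summary'] = u'Just a test item'
print item['title'], item['url']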

3. Write the web page parsing code

In the "my_crawler/spiders" directory, create a file called "crawl_spider.py" (you can name it whatever you want). The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from my_crawler.items import MyCrawlerItem

class MyCrawlSpider(CrawlSpider):
    name = 'my_crawler'                  # spider name; must be unique, used when running the crawl command
    allowed_domains = ['bjhee.com']      # restrict the domains allowed to be crawled; multiple entries are possible
    start_urls = [
        'http://www.bjhee.com',          # start crawling from the blog's front page
    ]
    rules = (                            # set the parsing function for specific URLs; multiple rules are possible
        Rule(LinkExtractor(allow=r'/page/[0-9]+'),   # URL pattern (regular expression) that is allowed to be followed
             callback='parse_item',                  # name of the callback function used to parse the page
             follow=True),
    )

    def parse_item(self, response):
        # get the DOM elements by XPath
        articles = response.xpath('//*[@id="main"]/ul/li')
        for article in articles:
            item = MyCrawlerItem()
            item['title'] = article.xpath('h3[@class="entry-title"]/a/text()').extract()[0]
            item['url'] = article.xpath('h3[@class="entry-title"]/a/@href').extract()[0]
            item['summary'] = article.xpath('div[2]/p/text()').extract()[0]
            yield item

For those who are not familiar with XPath, you can get the XPath of an element through Chrome's developer tools.
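If you want to try out an XPath expression before putting it into the spider, Scrapy's interactive shell is convenient. A minimal sketch (the URL is assumed to be the blog's front page, matching the start URL above):

$ scrapy shell "http://www.bjhee.com"
>>> response.xpath('//*[@id="main"]/ul/li/h3[@class="entry-title"]/a/text()').extract()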

4. Test the crawler

At the command line, type:

$ scrapy crawl my_crawler

Note that "my_crawler" is the name of the Spider you gave in the "crawl_spider.py" file.

Within a few seconds, you'll see the contents of the fields to be crawled printed on the console. It's that simple! Scrapy encapsulates the handling of HTTP(S) requests, content downloading, and the queues of URLs to be crawled and already crawled. Your main job is basically to set up the URL rules and write the parsing functions.

To save the crawl results as a JSON file, run:

$ scrapy crawl my_crawler -o my_crawler.json -t json

You will find the file "my_crawler.json" in the current directory, which holds the crawled field values. (The parameter "-t json" can be omitted.)
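Scrapy's feed exports support other formats as well; for example, the same crawl can be written to a CSV file instead (a minimal variation of the command above):

$ scrapy crawl my_crawler -o my_crawler.csv -t csv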

5. Save the results to the database

Here we use MongoDB; you need to install the Python MongoDB driver "pymongo" first. Edit the "pipelines.py" file in the "my_crawler" directory and add the following code to the "MyCrawlerPipeline" class.

# -*- coding: utf-8 -*-
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MyCrawlerPipeline(object):
    def __init__(self):
        # set up the MongoDB connection
        connection = pymongo.Connection(
            settings['MONGO_SERVER'],
            settings['MONGO_PORT']
        )
        db = connection[settings['MONGO_DB']]
        self.collection = db[settings['MONGO_COLLECTION']]

    # process each MyCrawlerItem that is crawled
    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not item.get(data):    # filter out items with empty fields
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))    # save the item to MongoDB
            return item

Then open the "settings.py" file in the "my_crawler" directory and add the pipeline settings at the end of the file:

ITEM_PIPELINES = {
    'my_crawler.pipelines.MyCrawlerPipeline': 300,    # register the pipeline; multiple entries are allowed, the value is the execution priority
}

# MongoDB connection information
MONGO_SERVER = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'bjhee'
MONGO_COLLECTION = 'articles'

DOWNLOAD_DELAY = 2    # if the network is slow, add some delay (in seconds) as appropriate

6. Execute the crawler

$ scrapy crawl my_crawler

Don't forget to start MongoDB first and create the "bjhee" database. Now you can look up the records in MongoDB.
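To double-check from Python that the articles were actually saved, here is a minimal sketch using the same connection settings as the pipeline above (pymongo 2.x API):

import pymongo

connection = pymongo.Connection('localhost', 27017)
db = connection['bjhee']
for article in db['articles'].find():    # iterate over the saved articles
    print article['title'], article['url']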

To summarize, to build a web crawler using Scrapy, all you need to do is:

Define the fields to crawl in "items.py"

Create your crawler in the "spiders" directory and write the parsing functions and rules

Process the crawl results in "pipelines.py"

Set the necessary parameters in "settings.py"

Scrapy does everything else for you. The diagram below shows the exact flow of how Scrapy works. How's that? Start writing a crawler of your own.

The code in this example can be downloaded here (http://www.bjhee.com/downloads/201511/my_crawler.tar.gz).

