
How to get started with Python crawlers?

The Daily Life of a Program Ape

"Getting started" is good motivation, but it can be slow. If you have a project in your hands, or in your head, then by practicing on it you will be driven by a goal, instead of learning slowly, module by module.

Also, if each knowledge point in a knowledge system is a node in a graph and the dependencies are edges, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B, and vice versa. Therefore, you don't need to learn how to "get started", because such a starting point doesn't exist! What you need to learn is how to build something bigger, and along the way you'll pick up what you need very quickly. Of course, you could argue that you need to know Python first, otherwise how would you learn Python for crawling? But in fact, you can perfectly well learn Python while writing this crawler :D

Since many of the previous answers cover the "technique" (what software to crawl with, and how), I'll talk about the "way" as well as the "technique": how a crawler works and how to implement one in Python.

You need to learn:

A basic HTTP scraping tool: scrapy

Bloom Filter: Bloom Filters by Example

If you need large-scale web crawling, you need to learn the concept of distributed crawlers. It's not as arcane as it sounds: you just have to learn how to maintain a distributed queue that all the machines in the cluster can share efficiently. The simplest implementation is python-rq.

Combination of rq and Scrapy: darkrho/scrapy-redis - GitHub

Post-processing: web page parsing (grangier/python-goose - GitHub) and storage (MongoDB)

What follows is a short story told long.

Let me share the experience of crawling all of Douban with a cluster I once wrote.

1) First you need to understand how a crawler works:

Imagine you are a spider, placed on the interconnected "web". Now you need to visit all the pages. What to do? No problem, you just start somewhere, for example the front page of the People's Daily. This is called the initial page; let's denote it $.

On the front page of the People's Daily, you see the various links that page leads to. So you happily crawl from there to the "National News" page. Great, you've now crawled two pages (the home page and the national news page)! Ignore for the moment what you do with the pages you crawl down; just imagine you copy each page in its entirety as HTML and save it somewhere on you.

Suddenly you find a link back to the "Home" page on the National News page. As a smart spider, you know you don't have to crawl back, because you've already seen it. So you need to use your brain and save the addresses of the pages you've already seen. That way, every time you see a new link you might need to crawl, you first check whether that page's address is already in your head. If you've already been there, don't go.

Well, theoretically, if all pages are reachable from the initial page, then it can be proven that you will eventually crawl every page.

So how do you implement it in python?

import queue

initial_page = ""

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until the sea runs dry
    if url_queue.qsize() > 0:
        current_url = url_queue.get()  # take the first url in the queue
        store(current_url)             # store the page this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break

This is written as pseudo-code, more or less.
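To make the idea concrete, here is a minimal runnable sketch of the same breadth-first crawl. An in-memory link graph stands in for the real web, and `extract_urls` and `store` are toy stand-ins for actual downloading, parsing, and storage; the page names are made up for illustration.

```python
from queue import Queue

# A toy "web": each page maps to the links it contains.
WEB = {
    "people.com.cn": ["people.com.cn/world", "people.com.cn/china"],
    "people.com.cn/world": ["people.com.cn"],  # links back home
    "people.com.cn/china": ["people.com.cn", "people.com.cn/china/bj"],
    "people.com.cn/china/bj": [],
}

def extract_urls(url):
    """Stand-in for downloading a page and parsing out its links."""
    return WEB.get(url, [])

def crawl(initial_page):
    url_queue = Queue()
    seen = {initial_page}  # addresses we have already met
    stored = []            # stand-in for "store the page"
    url_queue.put(initial_page)
    while not url_queue.empty():           # until the sea runs dry
        current_url = url_queue.get()      # take the next page from the queue
        stored.append(current_url)         # "store" it
        for next_url in extract_urls(current_url):
            if next_url not in seen:       # only queue pages we haven't seen
                seen.add(next_url)
                url_queue.put(next_url)
    return stored
```

Because every discovered link goes through the `seen` check before being queued, each reachable page is visited exactly once, in breadth-first order.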

The backbone of every crawler is here. Now let's break down why crawling is, in fact, a very complicated business: search engine companies usually have an entire team just to maintain and develop their crawlers.

2) Efficiency:

If you polish the above code a bit and run it directly, you'd need a whole year to crawl all of Douban, never mind a search engine like Google, which needs to crawl the entire web.

What's the problem? There are too many pages to crawl, and the code above is far too slow. Suppose there are N pages on the whole net. Then the total cost of the deduplication checks is N*log(N), because every page is visited once and each membership check costs log(N) if a tree-based set is used. OK, OK, I know Python's set is implemented with hashing, but even so it's too slow, or at least not memory-efficient: its memory usage grows with the number of URLs.

What is the usual approach to deduplication? A Bloom Filter. Simply put, it is still a hashing approach, but its special feature is that it can use a fixed amount of memory (which does not grow with the number of URLs) to determine whether a URL is already in the set, with O(1) efficiency. Unfortunately there is no free lunch; the one catch is this: if the URL is not in the set, the BF is 100% sure the URL has not been seen. But if the URL is reported as in the set, it tells you: this URL should have appeared already, though I have, say, 2% uncertainty. Note that this uncertainty becomes very small when you allocate enough memory. A simple tutorial: Bloom Filters by Example

Note the asymmetry: if a URL has been seen, it may, with small probability, be crawled again (that's okay, looking at a page one extra time won't wear it out). But if it hasn't been seen, it will definitely be crawled (this is important, or we'd miss some pages!). [Note: there is a problem with this paragraph; please skip it for now.]
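To show the fixed-memory, no-false-negative behavior described above, here is a minimal pure-Python Bloom filter sketch. The bit-array size and number of hash functions are illustrative choices, not tuned values; a real crawler would use a hardened library and size the filter from the expected URL count and target error rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fixed memory, no false negatives,
    and a small false-positive rate that shrinks as `size` grows."""

    def __init__(self, size=8192, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # fixed memory, independent of URL count

    def _positions(self, url):
        # Derive num_hashes bit positions from salted digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos] = 1

    def __contains__(self, url):
        # True means "probably seen"; False means "definitely never seen".
        return all(self.bits[pos] for pos in self._positions(url))
```

A URL that was added always tests positive (no false negatives), which is exactly the property the crawler relies on to never skip an unseen page.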

Okay, we're now close to the fastest way to handle deduplication. On to the next bottleneck: you only have one machine. No matter how much bandwidth you have, if your machine's page-download speed is the bottleneck, then that is what you have to speed up. If one machine isn't enough, use many! Of course, we assume each machine is already running at maximum efficiency, i.e. using multi-threading (or, in Python's case, perhaps multi-processing).
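Since downloading pages is I/O-bound, plain threads already overlap the network waiting time well in Python. A hedged sketch using the standard-library thread pool, with a fake `fetch` that sleeps to simulate network latency (a real one would use an HTTP library):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    """Stand-in for an HTTP download; simulates network latency."""
    time.sleep(0.1)
    return f"<html>{url}</html>"

def fetch_all(urls, workers=10):
    # Threads overlap the waiting time of I/O-bound downloads:
    # 20 pages at 0.1 s each finish in roughly 0.2 s, not 2 s.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

urls = [f"example.com/page/{i}" for i in range(20)]
```

Results come back in input order because `map` preserves ordering, which keeps the download stage easy to wire into the rest of the pipeline.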

3) Clustered crawling:

In total, I used over 100 machines running around the clock for a month when crawling Douban. Imagine doing it with just one machine: you'd have to run it for 100 months...

So, assuming you have 100 machines available right now, how do you implement a distributed crawling algorithm in python?

Call 99 of the 100 machines, the ones with less computing power, slaves, and the one bigger machine the master. Now revisit the url_queue in the code above: if we put this queue on the master, then all the slaves can talk to the master over the network. Whenever a slave finishes downloading a page, it requests a new page to crawl from the master; and every time a slave crawls a new page, it sends all the links on that page to the master's queue. Likewise, the Bloom Filter lives on the master, and the master only sends a slave URLs it is sure have not been visited. The Bloom Filter sits in the master's memory, and the visited URLs go into a Redis instance running on the master, so that all operations are O(1). (At least the deduplication check is amortized O(1); for Redis access efficiency, see: LINSERT - Redis)

Consider how to implement this in Python: install scrapy on each slave, so that each machine becomes a slave capable of scraping, and install Redis and rq on the master to serve as the distributed queue.

The code is then written, roughly, as:

#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()

initial_pages = ""

while True:
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':
        bf.put(request.url)
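The master/slave protocol can be simulated in a single process to check the logic before any networking is involved. In this hedged sketch, a plain queue and a plain set stand in for Redis and the Bloom Filter; the class and function names are invented for illustration.

```python
from queue import Queue

class Master:
    """Holds the shared URL queue and the seen-set (a Bloom filter in
    the real design; a plain set here for clarity)."""

    def __init__(self, initial_pages):
        self.url_queue = Queue()
        self.seen = set()
        self.submit(initial_pages)

    def submit(self, urls):
        # Slaves POST newly discovered links; only unseen ones are queued.
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.url_queue.put(url)

    def next_url(self):
        # Slaves GET their next assignment from here.
        return None if self.url_queue.empty() else self.url_queue.get()

def slave_step(master, extract_urls):
    """One unit of slave work: take an assignment, report the links found."""
    url = master.next_url()
    if url is None:
        return None
    master.submit(extract_urls(url))
    return url
```

Because all queueing and deduplication happen on the master, adding more slaves never causes two machines to crawl the same page.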

Well, actually, as you might imagine, someone has already written what you need: darkrho/scrapy-redis - GitHub

4) Outlook and post-processing:

Although the above uses a lot of "simply", actually implementing a crawler usable at commercial scale is far from easy. The code above can crawl a whole site with few problems.

But if you also need follow-up steps such as:

Efficient storage (how should the database be organized?)

Effective deduplication (here, page-level deduplication: we don't want to crawl both the People's Daily and every site that copied it)

Effective information extraction (for example, how to extract all the street addresses on a page, like "Fenjin Road, Chaoyang District"); search engines usually don't need to store all the information, e.g. why would I save images...

Timely updating (predicting how often a page will be updated). As you can imagine, every point here could keep many researchers busy for ten years or more. Even so, "the road ahead is long and far; I will seek, high and low."
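The page-level deduplication point above, detecting that one page is a near-copy of another, is often approached with similarity measures over text shingles. A minimal sketch using Jaccard similarity on word shingles (real systems scale this up with MinHash or SimHash; the threshold here is an illustrative choice):

```python
def shingles(text, k=3):
    """Break a document into its set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(doc_a, doc_b, threshold=0.8):
    """Flag two documents as near-copies if their shingle sets overlap heavily."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

Using k-word shingles instead of single words makes the comparison sensitive to phrasing, so two unrelated articles that happen to share common words still score near zero.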

So, don't ask how to get started, just get on the road :)


Source: Zhihu

Copyright belongs to the author. For commercial reprints, please contact the author for permission, and for non-commercial reprints, please cite the source.
