Friends, I heard you are short on books. Are more than 100,000 books enough? Crawl them with Python in a few steps


This time I used the MongoDB database, since MySQL felt like too much trouble. The screenshot below shows the site I chose to traverse, found through Yisou.

First, take a look at the diagram of the code structure.

The first step, of course, is to extract the link of each category from the leaderboard and then follow those links to crawl. Let's look at the all_theme file first.
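Here is a minimal sketch of that link-extraction step, not the actual all_theme file. It uses only the standard library, and the sample HTML, the `/category/` path filter, the `example.com` domain, and the `CategoryLinkParser` name are all assumptions made up for illustration; the real leaderboard markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CategoryLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href looks like a category link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: category links contain "/category/" in their path.
        if href and "/category/" in href:
            self.links.append(urljoin(self.base_url, href))


# Hypothetical leaderboard markup standing in for the downloaded page.
SAMPLE_HTML = """
<ul class="rank">
  <li><a href="/category/books">Books</a></li>
  <li><a href="/category/comics">Comics</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

parser = CategoryLinkParser("https://example.com")
parser.feed(SAMPLE_HTML)
category_links = parser.links
print(category_links)
```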

Here are the results of the run, in this case for the books category.

These are the constructed links to every page inside each category. They are the entry points for our crawler: more than 5,000 pages in total.
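Constructing those page links could look roughly like the sketch below. The `?page=N` query scheme, the category URLs, and the page counts are assumptions for illustration only; the real site has its own URL pattern and the counts come from the crawled category pages.

```python
def build_page_urls(category_url, page_count):
    """Return one URL per listing page of a category (hypothetical '?page=N' scheme)."""
    return [f"{category_url}?page={n}" for n in range(1, page_count + 1)]


# Hypothetical categories mapped to their number of listing pages.
categories = {
    "https://example.com/category/books": 3,
    "https://example.com/category/comics": 2,
}

# Flatten every category's pages into a single list of crawler entry points.
start_urls = [
    url
    for category_url, page_count in categories.items()
    for url in build_page_urls(category_url, page_count)
]
print(len(start_urls), start_urls[0])
```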

Next come the encapsulated database operations. Because the crawler uses multiple processes, each running multiple threads, they all need to know which URLs have already been crawled and which still need to be crawled. Let's give each URL one of three states.

outstanding: a URL waiting to be crawled

complete: a URL whose crawl has finished

processing: a URL that is currently being crawled

Right! Every URL starts out as outstanding. When a crawl of the URL starts, its status changes to processing; when the crawl finishes, it changes to complete; and a URL that fails is reset to outstanding.
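The state transitions above can be sketched with an in-memory stand-in. This is not the MongoDB-backed class from the post: the `UrlQueue` name and its methods are hypothetical, and a plain dict guarded by a `threading.Lock` only works inside one process. Sharing the state between processes is exactly what the database is for, where the claim step would typically be a single atomic update such as pymongo's `find_one_and_update`.

```python
import threading

OUTSTANDING, PROCESSING, COMPLETE = "outstanding", "processing", "complete"


class UrlQueue:
    """In-memory stand-in for the URL status collection (illustration only)."""

    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()

    def push(self, url):
        # New URLs start as outstanding; already known URLs are left untouched.
        with self._lock:
            self._status.setdefault(url, OUTSTANDING)

    def pop(self):
        # Claim one outstanding URL and mark it as processing, or return None.
        with self._lock:
            for url, status in self._status.items():
                if status == OUTSTANDING:
                    self._status[url] = PROCESSING
                    return url
        return None

    def complete(self, url):
        with self._lock:
            self._status[url] = COMPLETE

    def fail(self, url):
        # A failed crawl goes back to outstanding so it can be retried.
        with self._lock:
            self._status[url] = OUTSTANDING

    def status(self, url):
        with self._lock:
            return self._status[url]


url_queue = UrlQueue()
url_queue.push("https://example.com/a")
url_queue.push("https://example.com/b")

first = url_queue.pop()    # claims /a, now processing
url_queue.complete(first)  # /a is done
second = url_queue.pop()   # claims /b, now processing
url_queue.fail(second)     # /b failed and is outstanding again
```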

To handle the case where a process is terminated while it is working on a URL, we set a timeout parameter: a URL that has been in the processing state for longer than this value is reset to outstanding.
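A rough sketch of that timeout reset follows, again on a plain dict instead of the database. The record layout (`status` plus a `timestamp` taken when the URL was claimed), the `reset_stale` name, and the 300-second timeout are assumptions, not values from the original code.

```python
from datetime import datetime, timedelta


def reset_stale(records, timeout_seconds, now=None):
    """Reset URLs stuck in 'processing' for longer than the timeout.

    Returns the list of URLs that were put back to 'outstanding'.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(seconds=timeout_seconds)
    reset = []
    for url, record in records.items():
        if record["status"] == "processing" and record["timestamp"] < cutoff:
            record["status"] = "outstanding"
            reset.append(url)
    return reset


# Fixed clock so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, 0)
records = {
    "https://example.com/a": {"status": "processing", "timestamp": now - timedelta(seconds=600)},
    "https://example.com/b": {"status": "processing", "timestamp": now - timedelta(seconds=10)},
    "https://example.com/c": {"status": "complete", "timestamp": now - timedelta(seconds=900)},
}

stale = reset_stale(records, timeout_seconds=300, now=now)
print(stale)
```

Only the URL that was claimed 600 seconds ago is reset; the recently claimed one and the completed one are left alone.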

Next is the main crawler program.
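As a simplified illustration of the threaded part only, the sketch below has several worker threads drain a shared task queue. It is not the original program: `fetch` is a placeholder for the real download-and-parse step, the URLs are made up, and the multi-process layer is left out (each process would presumably run a loop like this one against the shared database queue).

```python
import queue
import threading


def fetch(url):
    """Placeholder for the real download-and-parse step (assumption)."""
    return f"parsed:{url}"


def worker(tasks, results, lock):
    # Keep taking URLs until the task queue is empty, then exit.
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        item = fetch(url)
        with lock:
            results.append(item)


def crawl(urls, thread_count=4):
    """Crawl all URLs with a fixed number of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


pages = crawl([f"https://example.com/page/{n}" for n in range(10)])
print(len(pages))
```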

Let's see the results.

Only about a hundred thousand books are left in there because so many of them were duplicates. A bit disappointing after all the de-duplication...

But with over 100,000 books, that's enough!
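The post does not show how the de-duplication was done, so the following is only one possible approach: treat two records as the same book when their normalized title and author match. The field names and sample records are assumptions.

```python
def dedupe_books(books):
    """Keep the first record for each (title, author) pair, ignoring case and padding."""
    seen = set()
    unique = []
    for book in books:
        key = (book["title"].strip().lower(), book["author"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(book)
    return unique


# Hypothetical crawl results containing one duplicate.
books = [
    {"title": "Book One", "author": "Author A"},
    {"title": " book one ", "author": "author a"},
    {"title": "Book Two", "author": "Author B"},
]

unique_books = dedupe_books(books)
print(len(unique_books))
```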

