Friends, I hear you're short on books. Would more than 100,000 be enough? Here's how to crawl them in Python in just a few steps.
This time I'm using the MongoDB database, since MySQL feels too troublesome. The chart below shows the Yisou site I chose to traverse.
First, let's look at the code framework diagram.
Step one is, of course, to extract the links for each category on the leaderboard, then follow those links to crawl. Let's look at the all_theme file first.
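Since the all_theme listing isn't reproduced here, here is a minimal sketch of what that step could look like. The leaderboard URL and the XPath expressions are placeholders, not the site's real markup.

```python
import requests
from lxml import etree

# Placeholder for the Yisou leaderboard page; the real URL isn't shown here.
LEADERBOARD_URL = 'http://www.example.com/top/'

def get_category_links():
    """Fetch the leaderboard page and return (category name, link) pairs."""
    resp = requests.get(LEADERBOARD_URL, timeout=10)
    resp.encoding = 'utf-8'
    html = etree.HTML(resp.text)
    # Placeholder XPath expressions; adjust them to the site's actual markup.
    names = html.xpath('//div[@class="category"]/a/text()')
    links = html.xpath('//div[@class="category"]/a/@href')
    return list(zip(names, links))

if __name__ == '__main__':
    for name, link in get_category_links():
        print(name, link)
```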
Here are the results of the run, showing the book categories.
These are the page URLs constructed for every page inside each category, and they are the entry point of our crawler: more than 5,000 pages in total.
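As a rough illustration of how those page URLs might be built, here is a small sketch; the pagination pattern, category names, and page counts are made up.

```python
def build_page_urls(categories):
    """Build the full list of page URLs for every category."""
    urls = []
    for name, base_link, page_count in categories:
        for page in range(1, page_count + 1):
            # Assumed pagination pattern; the real site may use a different one.
            urls.append('{}?page={}'.format(base_link, page))
    return urls

# Made-up categories and page counts, just to show the shape of the data.
categories = [
    ('fiction', 'http://www.example.com/list/1/', 300),
    ('history', 'http://www.example.com/list/2/', 250),
]
all_pages = build_page_urls(categories)
print(len(all_pages), 'page URLs generated')
```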
Next up are the encapsulated database operations. Since we use multiple processes, and multiple threads within each process, they all need to know which URLs have already been crawled and which still need crawling! So let's give each URL one of three states:
outstanding: URL waiting to be crawled
complete: URL that has been crawled
processing: URL currently being crawled
Mmm! All URLs start with the status OUTSTANDING; when a crawl begins, the status changes to PROCESSING; when the crawl finishes, it changes to COMPLETE; and failed URLs are reset to OUTSTANDING.
To handle the case where a process is killed while working on a URL, we also set a timeout parameter: if a URL has been PROCESSING longer than this value, its state is reset to OUTSTANDING. A sketch of such a queue is shown below.
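The actual implementation isn't shown here, but a MongoDB-backed queue along these lines would cover the three states and the timeout reset. The database and collection names are assumptions, not the original ones.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, errors

class MongoQueue:
    OUTSTANDING, PROCESSING, COMPLETE = 1, 2, 3

    def __init__(self, db='crawler', collection='url_queue', timeout=300):
        self.client = MongoClient('localhost', 27017)
        self.db = self.client[db][collection]
        self.timeout = timeout  # seconds before a PROCESSING URL is considered dead

    def push(self, url):
        """Insert a new URL as OUTSTANDING; ignore it if it is already queued."""
        try:
            self.db.insert_one({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            pass  # URL already known

    def pop(self):
        """Atomically claim one OUTSTANDING URL and mark it PROCESSING."""
        record = self.db.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}})
        if record is None:
            self.repair()          # maybe some PROCESSING URLs timed out
            raise KeyError('queue is empty')
        return record['_id']

    def complete(self, url):
        """Mark a URL as successfully crawled."""
        self.db.update_one({'_id': url}, {'$set': {'status': self.COMPLETE}})

    def repair(self):
        """Reset URLs stuck in PROCESSING longer than the timeout back to OUTSTANDING."""
        deadline = datetime.now() - timedelta(seconds=self.timeout)
        self.db.update_many(
            {'status': self.PROCESSING, 'timestamp': {'$lt': deadline}},
            {'$set': {'status': self.OUTSTANDING}})
```

Using the URL itself as `_id` gives de-duplication for free: pushing the same URL twice just raises a DuplicateKeyError that we swallow.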
Next comes the main crawler program.
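The original listing isn't reproduced here, so here is a condensed sketch of how a multi-process, multi-threaded crawler could pull URLs from the queue above. The `parse_books` function, the module name `mongo_queue`, and the `crawler`/`books` collection names are placeholders.

```python
import multiprocessing
import threading
import requests
from pymongo import MongoClient

# Assumes the MongoQueue class above was saved as mongo_queue.py.
from mongo_queue import MongoQueue

def parse_books(html):
    """Placeholder parser; the real XPath extraction depends on the site's markup."""
    return []

def worker(queue):
    # Each thread gets its own client; 'crawler'/'books' are assumed names.
    books = MongoClient('localhost', 27017)['crawler']['books']
    while True:
        try:
            url = queue.pop()        # claim an OUTSTANDING URL (marks it PROCESSING)
        except KeyError:
            break                    # queue drained, this thread is done
        try:
            resp = requests.get(url, timeout=10)
            for book in parse_books(resp.text):
                books.insert_one(book)
            queue.complete(url)      # mark COMPLETE on success
        except requests.RequestException:
            pass                     # leave it PROCESSING; the timeout will requeue it

def process_crawler(thread_count=5):
    queue = MongoQueue()
    threads = [threading.Thread(target=worker, args=(queue,)) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=process_crawler) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

In a real run you would probably have a thread sleep and retry when the queue is momentarily empty rather than exit right away, since other workers may still be filling it.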
Let's see the results.
There are only about a hundred thousand books in there, because a lot of them were duplicates; a bit disappointing after all that de-duplication...
But with over 100,000 copies, that's enough!