A multi-threaded crawler can be several times faster than a single-threaded one. A single-threaded crawler spends most of each request waiting for the website to respond, and nothing else happens during that wait. With multiple threads, the crawler can send a request to another link while it waits for the first response, and yet another while it waits for the second. If we ignore the small gaps between the moments the requests are issued, the threads are effectively running in parallel, so the total crawl time drops sharply.
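To make the waiting-time argument concrete, here is a minimal sketch. It does not touch the network: `fake_request` is a hypothetical stand-in that simply sleeps to imitate the wait for a response, and the example.com URLs are placeholders, so the numbers only illustrate the effect.

```python
import threading
import time


def fake_request(url, delay=0.05):
    """Hypothetical stand-in for an HTTP request: sleep to imitate the wait."""
    time.sleep(delay)
    return f"response from {url}"


def crawl_sequential(urls):
    """Request the links one after another, waiting for each response."""
    start = time.perf_counter()
    results = [fake_request(url) for url in urls]
    return results, time.perf_counter() - start


def crawl_threaded(urls):
    """Request every link in its own thread so the waits overlap."""
    results = [None] * len(urls)

    def worker(index, url):
        results[index] = fake_request(url)

    start = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(i, url))
        for i, url in enumerate(urls)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start


# Placeholder links, not real course pages.
urls = [f"https://example.com/course/{i}" for i in range(8)]
seq_results, seq_time = crawl_sequential(urls)
thr_results, thr_time = crawl_threaded(urls)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The sequential version pays for every wait in turn, while the threaded version pays roughly for the longest single wait.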
Multi-threading uses the Thread class from the threading module. Several threads are created to carry out the operations related to URL requests, and these threads act as downloaders; only one thread is needed to extract and store the page data, and it acts as the storage thread. Data is passed between threads through a queue, using the Queue class from the queue module.
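A minimal sketch of this downloader/storage arrangement might look like the following. The function names, the `None` stop signal, and the placeholder `fetch_page` are assumptions made for illustration, not the article's actual code.

```python
import threading
from queue import Queue


def fetch_page(url):
    # Hypothetical placeholder: a real crawler would make an HTTP request here.
    return f"<html>page for {url}</html>"


def downloader(url, page_queue):
    """Download one page and hand it to the storage thread via the queue."""
    page_queue.put((url, fetch_page(url)))


def storage(page_queue, stored):
    """Take pages off the queue until the None stop signal arrives."""
    while True:
        item = page_queue.get()
        if item is None:
            break
        stored.append(item)


def run(urls):
    page_queue = Queue()
    stored = []

    # One storage thread consumes everything the downloaders produce.
    storage_thread = threading.Thread(target=storage, args=(page_queue, stored))
    storage_thread.start()

    # One downloader thread per URL.
    downloaders = [
        threading.Thread(target=downloader, args=(url, page_queue))
        for url in urls
    ]
    for thread in downloaders:
        thread.start()
    for thread in downloaders:
        thread.join()

    # All downloads are done, so tell the storage thread to stop.
    page_queue.put(None)
    storage_thread.join()
    return stored
```

Queue does its own locking, so the downloaders and the storage thread can share it without any extra synchronization.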
Each category contains a dozen or even dozens of courses. So the crawler first visits each category page and collects all the course links on it, then creates one downloader thread per course link to request that course's detail page. The returned page data is passed to the storage thread, which extracts the information we want from the page and finally saves it to a local CSV file.
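The whole flow, from category pages to a CSV file, can be sketched roughly as below. To keep the sketch runnable without network access, the website is replaced by two invented dictionaries; the URLs, field names, and output file name are all assumptions, and a real crawler would request and parse actual pages instead.

```python
import csv
import os
import tempfile
import threading
from queue import Queue

# Invented stand-in for the website: each category page lists course links,
# and each course link maps to the fields found on its detail page.
FAKE_CATEGORY_PAGES = {
    "/category/a": ["/course/1", "/course/2"],
    "/category/b": ["/course/3"],
}
FAKE_DETAIL_PAGES = {
    "/course/1": {"title": "Course 1", "learners": "120"},
    "/course/2": {"title": "Course 2", "learners": "950"},
    "/course/3": {"title": "Course 3", "learners": "430"},
}


def get_course_links(category_url):
    """Step 1: collect all course links on one category page."""
    return FAKE_CATEGORY_PAGES[category_url]


def download_detail(course_url, page_queue):
    """Step 2 (downloader thread): fetch a detail page and queue it."""
    page_queue.put((course_url, FAKE_DETAIL_PAGES[course_url]))


def store_pages(page_queue, csv_path):
    """Step 3 (storage thread): extract fields and write them to the CSV."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "learners"])
        while True:
            item = page_queue.get()
            if item is None:  # stop signal from the main thread
                break
            url, page = item
            writer.writerow([url, page["title"], page["learners"]])


def crawl(category_urls, csv_path):
    page_queue = Queue()
    storage_thread = threading.Thread(
        target=store_pages, args=(page_queue, csv_path)
    )
    storage_thread.start()

    downloaders = []
    for category_url in category_urls:
        for course_url in get_course_links(category_url):
            thread = threading.Thread(
                target=download_detail, args=(course_url, page_queue)
            )
            thread.start()
            downloaders.append(thread)

    for thread in downloaders:
        thread.join()
    page_queue.put(None)
    storage_thread.join()


csv_path = os.path.join(tempfile.gettempdir(), "courses_sketch.csv")
crawl(list(FAKE_CATEGORY_PAGES), csv_path)
```

Because only the storage thread touches the file, the CSV writer needs no locking; the row order depends on which downloader finishes first.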
You can see how it runs.
The following data was crawled.
The code is placed at the end of the article.
Next, the downloaded data is visualized.
Rank all hands-on courses by number of learners
And so this nice tree diagram was obtained:
As you can see, the introductory hands-on WeChat applet course is quite popular, which is right in line with how hot WeChat applets are these days. Some of the courses behind it, such as Vue.js and Python for the front end, are also hot in the community right now.
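The ranking step behind a chart like this can be sketched with the standard library alone. The column names `title` and `learners` and the sample rows are invented for illustration; the real CSV may use different headers, and the charting itself is left out.

```python
import csv
import io

# Invented sample rows standing in for the crawled CSV file.
SAMPLE_CSV = """title,learners
Course A,120
Course B,950
Course C,430
"""


def rank_by_learners(csv_file, top_n=None):
    """Return the rows sorted by learner count, highest first."""
    rows = list(csv.DictReader(csv_file))
    ranked = sorted(rows, key=lambda row: int(row["learners"]), reverse=True)
    return ranked[:top_n]  # top_n=None keeps every row


for row in rank_by_learners(io.StringIO(SAMPLE_CSV)):
    print(row["title"], row["learners"])
```

Converting `learners` with `int` matters here, since sorting the raw strings would put "950" after "1200".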
Next is a ranking of the instructors of the MU online hands-on courses by the number of courses each one has.
The instructor with the most courses is Michael__PK, and we can see what all his courses are.
It is possible to show the
His courses are all in the cloud computing and big data category and the back-end development category, which I think are also very hot technology directions right now. Although this is said to be the age of artificial intelligence, the web development craze has not faded, and data, as the basis for training machine learning models, is important to the field of artificial intelligence.
This is followed by a ranking of the number of courses in each category, to see which categories on MU have more courses, visualized with a pie chart.
Here is the nice pie chart.
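The counts that feed a pie chart like this can be sketched with `collections.Counter`. The rows, column names, and instructor names below are invented placeholders; the same helper would also give the course count per instructor.

```python
from collections import Counter

# Invented sample rows standing in for the crawled data.
sample_rows = [
    {"title": "Course A", "category": "back end", "instructor": "teacher_1"},
    {"title": "Course B", "category": "back end", "instructor": "teacher_1"},
    {"title": "Course C", "category": "front end", "instructor": "teacher_2"},
    {"title": "Course D", "category": "big data", "instructor": "teacher_1"},
]


def count_by(rows, column):
    """Count how many courses share each value of the given column."""
    return Counter(row[column] for row in rows)


def shares(counts):
    """Turn counts into fractions of the total, i.e. the pie slices."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


category_counts = count_by(sample_rows, "category")
category_shares = shares(category_counts)
instructor_counts = count_by(sample_rows, "instructor")
print(category_counts.most_common())
```

`most_common()` returns the categories from largest to smallest, which is also a convenient order for drawing the slices.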