A multi-threaded Python crawler for imooc.com + data visualization
Life is short, I use Python.
Why do so many people use Python these days? Many say it's because life is short, and Python handles a lot of the world's repetitive, tedious tasks efficiently and easily.
That's not an unreasonable statement.
But I'd rather say it's because life is long, and Python makes a lot of the tedious, boring things you run into along the way interesting.
Main text
MU (imooc.com) is an IT video-learning site with many high-quality video courses, and I have benefited from them on my own learning journey. This time I want to write a crawler that collects the information for all of its paid hands-on courses, display the crawled information visually, and do a brief analysis.
The site has many kinds of videos: free courses, career-path courses, hands-on (project-based) courses, and so on. This time I only crawl the hands-on courses.
The target information is: course name, category, price, instructor, number of learners, course rating, course link, and course length.
As you can see, there are a lot of hands-on courses, many of them very interesting projects, and they fall into 8 broad categories: frontier technology, front-end development, back-end development, mobile development, cloud computing & big data, operations and testing, databases, and UI design.
The crawl idea is to first get the links for these 8 category tabs, then collect the links to all the courses under each category, and finally visit each course's detail page in turn and extract the data we want.
The initial link url is: https://coding.imooc.com/
This link opens the page listing all the hands-on courses. Press F12 in the browser to inspect the page elements and find the CSS path of each element you need. Using CSS paths, first get the links of all the category tabs, then open each category page and get the link of every course on it, and finally follow each course link to its detail page, shown below, where CSS paths again give us the information we want.
The first screenshot shows the course name, difficulty, length, number of learners, course rating and price. The second shows the name of the course's instructor.
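Here is a minimal sketch of the link-extraction step. The HTML snippet, the class name category-link and the query strings are all made up for illustration; the real page's markup has to be checked with F12. The standard library has no CSS selector engine, so this sketch uses a small HTMLParser subclass that matches on a class name instead. With a library such as BeautifulSoup, the same step would be a call to soup.select() with the real CSS path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://coding.imooc.com/"

# Hypothetical markup: the real tags, class names and hrefs may differ.
SAMPLE_HTML = """
<div class="category-nav">
  <a class="category-link" href="/?c=fe">Front-end Development</a>
  <a class="category-link" href="/?c=be">Back-end Development</a>
  <a class="other" href="/about">About</a>
</div>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            # Turn relative links into absolute ones based on the start URL.
            self.links.append(urljoin(BASE_URL, href))


def extract_links(html, css_class):
    """Return absolute links of all <a class="css_class"> tags in html."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links


category_links = extract_links(SAMPLE_HTML, "category-link")
print(category_links)
```

The same function can be reused on each category page to collect course links, only with a different class name.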
A multi-threaded crawler can be several times faster than a single-threaded one. An operation such as sending a request to a website spends most of its time waiting for the response, so a single-threaded crawler wastes a lot of time. With multiple threads, while one thread waits for its response another thread can send a request to a different link, and while that one waits a third can send yet another. Ignoring the small time gaps between requests, you can think of the threads as running at the same time, so the total crawl time drops sharply.
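To make the waiting-time argument concrete, here is a small sketch that compares the two approaches. No real requests are sent: fake_request is a hypothetical stand-in that only sleeps for a fixed delay, and the example.com URLs are placeholders.

```python
import threading
import time


def fake_request(url, delay=0.2):
    """Stand-in for a network request: just wait, as if for a response."""
    time.sleep(delay)
    return url


# Placeholder URLs, not real course pages.
URLS = [f"https://example.com/page/{i}" for i in range(5)]


def run_sequential(urls):
    """Request the URLs one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        fake_request(url)
    return time.perf_counter() - start


def run_threaded(urls):
    """Request every URL in its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every request has finished
    return time.perf_counter() - start


sequential_time = run_sequential(URLS)
threaded_time = run_threaded(URLS)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The sequential run takes roughly the sum of all the delays, while the threaded run takes roughly one delay, because the waits overlap.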
Multi-threading uses the Thread class from the threading module. Several threads can be created to carry out the URL requests, acting as downloaders, while a single thread is enough to extract and store the page data, acting as the storage thread. Data is passed between threads through a queue, using the Queue class from the queue module.
Each category holds a dozen or even dozens of courses. So first open each category page and get all the course links on it, then create a downloader thread for each course link to fetch its detail page and pass the returned page data to the storage thread, which extracts the information we want and finally saves it to a local csv file.
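The downloader/storage design described above could be sketched roughly like this. It is not the article's actual crawler: fake_download is a hypothetical stand-in for the real request and parsing, the example.com links are placeholders, and the rows go to an in-memory buffer rather than a real csv file.

```python
import csv
import io
import queue
import threading


def fake_download(url):
    """Hypothetical stand-in for fetching and parsing a course detail page."""
    course_id = url.rstrip("/").split("/")[-1]
    return {"url": url, "name": f"Course {course_id}", "learners": int(course_id) * 100}


def downloader(url, page_queue):
    """Downloader thread: fetch one page and hand it to the storage thread."""
    page_queue.put(fake_download(url))


def storage(page_queue, writer):
    """Storage thread: write rows until the None sentinel arrives."""
    while True:
        page = page_queue.get()
        if page is None:
            break
        writer.writerow([page["name"], page["learners"], page["url"]])


# Placeholder links; a real run would use the links collected per category.
course_links = [f"https://example.com/class/{i}" for i in range(1, 6)]

page_queue = queue.Queue()
output = io.StringIO()  # a real run would open a local csv file instead
writer = csv.writer(output)
writer.writerow(["name", "learners", "link"])

storage_thread = threading.Thread(target=storage, args=(page_queue, writer))
storage_thread.start()

downloaders = [
    threading.Thread(target=downloader, args=(link, page_queue))
    for link in course_links
]
for t in downloaders:
    t.start()
for t in downloaders:
    t.join()

page_queue.put(None)  # tell the storage thread that no more pages will come
storage_thread.join()

rows = list(csv.reader(io.StringIO(output.getvalue())))
print(len(rows) - 1, "courses stored")
```

Only the storage thread touches the csv writer, so no lock is needed around the writes; the queue is the only object shared between threads.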
You can see how it runs.
The following data was crawled.
The code is at the end of the article.
Next, the downloaded data is visualized.
Rank all hands-on courses by number of learners
And so this nice tree diagram was obtained:
As you can see, the introductory hands-on course on WeChat mini programs is quite popular, which matches how hot WeChat mini programs are these days. Some of the courses behind it, such as the front-end Vue.js course and the Python courses, also cover things that are hot in the community right now.
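The ranking behind a chart like this could be prepared as in the sketch below. The rows are made-up sample data, not the crawled results, and the column names name and learners are assumptions about how the csv file is laid out. Plotting is left out to keep the example to the standard library.

```python
import csv
import io

# Made-up sample rows; the real csv file comes from the crawler.
SAMPLE_CSV = """name,learners
Course A,1200
Course B,5400
Course C,300
Course D,2750
"""


def rank_by_learners(csv_text, top_n=3):
    """Return the top_n rows sorted by number of learners, highest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["learners"] = int(row["learners"])  # csv values are read as strings
    return sorted(rows, key=lambda r: r["learners"], reverse=True)[:top_n]


top_courses = rank_by_learners(SAMPLE_CSV)
for rank, course in enumerate(top_courses, start=1):
    print(rank, course["name"], course["learners"])
```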
Next, the instructors of MU's hands-on courses are ranked by the number of courses they teach.
The instructor with the most courses is Michael__PK, and running some code shows what all of his courses are.
His courses all fall under the cloud computing & big data and back-end development categories, which I think are also very hot technology directions right now. Although this is said to be the age of artificial intelligence, the web development craze has not faded, and data, the basis for training machine learning models, is important to the field of artificial intelligence as well.
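Counting courses per instructor and then listing the top instructor's courses could look like the sketch below. The records, instructor names and field names are invented for illustration and do not reproduce the crawled data.

```python
from collections import Counter

# Invented records; real ones would be read from the crawled csv file.
courses = [
    {"name": "Course A", "category": "Back-end Development", "instructor": "teacher_x"},
    {"name": "Course B", "category": "Cloud Computing & Big Data", "instructor": "teacher_x"},
    {"name": "Course C", "category": "Front-end Development", "instructor": "teacher_y"},
    {"name": "Course D", "category": "Cloud Computing & Big Data", "instructor": "teacher_x"},
    {"name": "Course E", "category": "Databases", "instructor": "teacher_z"},
]

# How many courses each instructor teaches, most first.
instructor_counts = Counter(course["instructor"] for course in courses)
top_instructor, top_count = instructor_counts.most_common(1)[0]

# All courses taught by the top instructor, and the categories they fall into.
top_courses_by_instructor = [
    course["name"] for course in courses if course["instructor"] == top_instructor
]
top_categories = sorted(
    {course["category"] for course in courses if course["instructor"] == top_instructor}
)

print(top_instructor, top_count)
print(top_courses_by_instructor)
print(top_categories)
```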
Then the categories are ranked by the number of courses each contains, to see which directions MU has the most courses in, visualized with a pie chart.
Here is the nice pie chart.
The pie chart shows clearly and intuitively that MU's hands-on courses are mainly front-end and back-end development. Courses related to artificial intelligence are gradually increasing, and I believe they will eventually catch up with the front-end and back-end courses in number. Time will tell.
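The numbers that feed a pie chart like this are just per-category counts and their shares. A small sketch, again on invented data rather than the crawled results:

```python
from collections import Counter

# Invented category column; a real run would read it from the csv file.
categories = (
    ["Front-end Development"] * 4
    + ["Back-end Development"] * 3
    + ["Databases"] * 2
    + ["UI Design"] * 1
)

counts = Counter(categories)
total = sum(counts.values())

# Labels and sizes ordered from the largest slice to the smallest.
labels = [category for category, _ in counts.most_common()]
sizes = [counts[category] for category in labels]

# Share of each category as a percentage of all courses.
shares = {category: round(100 * counts[category] / total, 1) for category in labels}

for category in labels:
    print(f"{category}: {counts[category]} courses ({shares[category]}%)")
```

The labels and sizes lists are the kind of input a pie function such as matplotlib's pyplot.pie takes, so drawing the chart is one further call once a plotting library is installed.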
Finally, here is the code for the crawler part.
Thanks for watching and keep learning.