Preface:
A few months ago, at a friend's request, I wrote a LinkedIn crawler. It is not particularly difficult, but the functionality is fun, so I cleaned it up to share. The code is on GitHub: linkedinspider.
Crawler function: given a company name, crawl the LinkedIn data of that company's employees; see the screenshot below for the fields.
Main text:
Let's start with LinkedIn's restrictions. Without logging in, you cannot search, but you can view a user's LinkedIn profile (not the complete version). After logging in, you can search for users (up to 100 pages of results) and for companies, but you cannot see the employees under a company: they show up only as "LinkedIn Member", with no right to view details or send a connection request, as shown below. A paid LinkedIn premium account might be able to view them, but I have not verified this.
So if you want to crawl the LinkedIn profiles of a company's employees, what can you do?
Method one: spend money on a premium account, which might allow viewing.
Method two: use LinkedIn's user search and try to crawl as many LinkedIn users as possible, then filter out the employees of the target company. (The difficulty is how to search for users, and because of the 100-page limit it is nearly impossible to crawl them all.)
Method three: use a third-party platform. I have not found any site that exposes usable LinkedIn data, but it occurred to me that Baidu indexes LinkedIn. We can search Baidu for the company name restricted to the linkedin.com domain (for example, if the target is Baidu, search "Baidu site:linkedin.com"), filter the LinkedIn user IDs out of the results, and with those user IDs go directly to LinkedIn to grab the employee information.
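As an illustration of this search step, here is a minimal sketch in Python using the requests library. The query parameters and result handling are my own assumptions for illustration, not necessarily identical to the linkedinspider code.

```python
import requests

def baidu_search(company, page=0):
    """Fetch one Baidu result page for `<company> site:linkedin.com`."""
    params = {
        "wd": f"{company} site:linkedin.com",  # e.g. "Baidu site:linkedin.com"
        "pn": page * 10,                       # Baidu paginates 10 results per page
    }
    headers = {"User-Agent": "Mozilla/5.0"}    # an ordinary browser user agent
    resp = requests.get("https://www.baidu.com/s", params=params,
                        headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.text  # HTML containing Baidu's jump links to LinkedIn profiles
```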
We use method three here. The crawling process is as follows:
First log in to LinkedIn, then, carrying the LinkedIn cookie, run the Baidu search, filter out the jump links that redirect to LinkedIn, and then crawl and parse them.
Note: to get the latest data, do not crawl the content Baidu has cached; Baidu is only used to obtain the user IDs. Also, keep carrying the LinkedIn cookie when opening the links found in the search, otherwise you will be redirected to the LinkedIn login page, or the information you crawl will be incomplete. A sketch of this step follows.
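Here is a minimal sketch of that crawl step, again using requests. The cookie name li_at, the jump-link regex, and the parsing are illustrative assumptions rather than the actual linkedinspider implementation.

```python
import re
import requests

def crawl_profiles(baidu_result_html, li_at_cookie):
    """Follow Baidu's jump links with the LinkedIn session cookie attached,
    so each redirect lands on the full profile rather than the login page."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    # Attach the LinkedIn session cookie copied from a logged-in browser,
    # scoped to linkedin.com so it is only sent to LinkedIn.
    session.cookies.set("li_at", li_at_cookie, domain=".linkedin.com")

    # Baidu wraps each result in a jump link like www.baidu.com/link?url=...
    jump_links = re.findall(r'https?://www\.baidu\.com/link\?url=[^"\s]+',
                            baidu_result_html)

    profiles = {}
    for link in jump_links:
        resp = session.get(link, timeout=10, allow_redirects=True)
        match = re.search(r"linkedin\.com/in/([^/?#]+)", resp.url)
        if match:  # keep only links that actually land on a LinkedIn profile
            profiles[match.group(1)] = resp.text  # user ID -> profile HTML to parse
    return profiles
```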
Conclusion:
The code is on GitHub; the link is given above. This article is mainly a note for reference.
This is just a small crawler. What I want to share is not only the LinkedIn login and the capture and parsing of LinkedIn data, but, more importantly, the approach of reaching the target data through Baidu's index.
For crawler developers, and for anyone learning to write crawlers, keep your options open: as long as the data is accurate and complete, you should sniff out and crawl it from every available channel; the lower the crawling difficulty and the faster the crawl, the better.
Please credit the source when reposting, thank you. (Original link: http://blog.csdn.net/bone_ace/article/details/71055153)