I recently left my job. In my spare time, besides getting some rest, I have been wondering what to do next. I remembered that cnblogs has a job board sub-site, so I went to have a look. But a problem showed up right away. Yes, as you may have guessed: there are far too many job postings. Although Dudu provides plenty of tags for filtering by category, the filtered list is still large. On top of that, it has been more than two years since I last paid attention to job postings, so I am not quite sure what employers are asking for these days or which direction I should lean toward.
Why not write a tool to summarize the information? Well, that is the original starting point of this article. It sounded like fun, so I decided to just do it. Let's first think through roughly what the program should look like.
The core process is like this:
Capture the page data -> convert it into raw data (PickItem objects are produced in this first pass) -> traverse the converted data and turn it into more specific data objects (ParseItem objects are produced in this second pass) -> filter and count the data after the second conversion -> display the result.
The core business objects include:
A Picker that captures the web pages, a Parser that parses them, a Filter that filters the data, and a Counter that produces the final statistics.
You could design this in more detail and split out a few more business objects to push the OO design further, and system stability and scalability could also be considered in depth; but haha, keep thinking along those lines and the program will never get written. Let's just write down these core pieces first.
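Before the core classes, here is a rough sketch of the two data objects. The property names below are inferred from how the objects are used later in the driver code; they are assumptions, not the original definitions:

using System.Collections.Generic;

// Sketch only: property names are inferred from the driver code further down.

/// <summary>Raw fragment captured from a page in the first pass.</summary>
public class PickItem
{
    public string RawHtml { get; set; }   // assumed: the captured HTML fragment
}

/// <summary>Structured job entry produced in the second pass.</summary>
public class ParseItem
{
    public string PositionCategory { get; set; }              // e.g. ".net programmer"
    public IEnumerable<string> PositionRequire { get; set; }  // extracted skill keywords
    public string City { get; set; }                          // assumed: used for the city comparison later
}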
Here I will take the Picker as an example and briefly walk through the code:
/// <summary>
/// Captures web pages
/// </summary>
public class Picker
{
    public IEnumerable<PickItem> PickPage(PickRule rule)
    {
        return InnerGet(rule).SelectMany(p =>
        {
            var items = rule.DoPick(p);
            return items == null ? Enumerable.Empty<PickItem>() : items;
        }).ToList();
    }

    private IEnumerable<HtmlDocument> InnerGet(PickRule rule)
    {
        var currentUrl = rule.StartUrl;
        do
        {
            // Download and parse the current page, then compute the next page's URL.
            yield return HtmlHelper.GetHtmlDocument(currentUrl, rule.PageEncode);
            currentUrl = rule.CalcCurrentUrl(currentUrl);
        } while (currentUrl != string.Empty);
    }
}
The code is very simple. The actual fetching is delegated to an HtmlHelper object (a thin wrapper around the third-party library HtmlAgilityPack), and the specific crawling rules are encapsulated in an object called PickRule.
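HtmlHelper itself is not listed here; a minimal sketch of what GetHtmlDocument might do, assuming the page-encoding parameter is an encoding name such as "utf-8" and that the helper simply downloads the page and hands it to HtmlAgilityPack, could look like this:

using System.Net;
using System.Text;
using HtmlAgilityPack;

// Assumed implementation of the HtmlHelper wrapper; the post only states
// that it is a thin wrapper around HtmlAgilityPack.
public static class HtmlHelper
{
    public static HtmlDocument GetHtmlDocument(string url, string pageEncode)
    {
        using (var client = new WebClient { Encoding = Encoding.GetEncoding(pageEncode) })
        {
            var html = client.DownloadString(url);   // download the raw page
            var doc = new HtmlDocument();
            doc.LoadHtml(html);                       // let HtmlAgilityPack parse it
            return doc;
        }
    }
}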
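The PickRule listing is not reproduced here; based on how the Picker calls it, it must expose at least a StartUrl, a PageEncode, a DoPick method, and a CalcCurrentUrl method. A rough sketch under those assumptions (the cnblogs-specific URL, selector, and paging logic below are illustrative, not the original rule):

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Sketch of a crawling rule; selectors and paging logic are illustrative only.
public abstract class PickRule
{
    public string StartUrl { get; protected set; }
    public string PageEncode { get; protected set; }

    // Extract the raw job entries from one downloaded page.
    public abstract IEnumerable<PickItem> DoPick(HtmlDocument page);

    // Compute the next page's URL; return string.Empty when there are no more pages.
    public abstract string CalcCurrentUrl(string currentUrl);
}

public class PickRule_Cnblogs : PickRule
{
    public PickRule_Cnblogs()
    {
        StartUrl = "http://job.cnblogs.com/";   // assumed start page
        PageEncode = "utf-8";                   // assumed encoding
    }

    public override IEnumerable<PickItem> DoPick(HtmlDocument page)
    {
        // Assumed selector: wrap each job entry node into a PickItem.
        var nodes = page.DocumentNode.SelectNodes("//div[@class='job-item']");
        return nodes == null
            ? Enumerable.Empty<PickItem>()
            : nodes.Select(n => new PickItem { RawHtml = n.OuterHtml });
    }

    public override string CalcCurrentUrl(string currentUrl)
    {
        // Assumed paging: stop after the first page in this sketch.
        return string.Empty;
    }
}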
Most of the other parts of the program follow the same design idea. Finally, let's look at the driver code, which plays the role of the "main" function:
protected void Run_Click(object sender, EventArgs e)
{
    Picker picker = new Picker();
    var pickRule = new PickRule_Cnblogs();
    var pages = picker.PickPage(pickRule);

    var parser = new Parser();
    var parseRule = new ParseRule_Cnblogs();
    var parsedItems = parser.ParsePage(pages, parseRule, 500);

    Filter filter = new Filter(p => p != null && p.PositionCategory.ToLower() == ".net programmer");
    var jobs = filter.Filting(parsedItems).ToList();

    var counter = new Counter();
    var result = counter.Counting(jobs);

    // Print report
    foreach (var item in result.OrderByDescending(p => p.Item2).Take(10))
    {
        pieData.Append(string.Format("['{0}', {1}],", item.Item1, item.Item2.ToString()));
    }
}
We can clearly see the workflow mentioned above :)
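The Filter and Counter classes are not listed in the post; from the driver above, Filter apparently wraps a predicate and Counter turns the filtered jobs into (keyword, count) tuples. A minimal sketch under those assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketches of Filter and Counter, inferred from how the driver uses them.
public class Filter
{
    private readonly Func<ParseItem, bool> _predicate;
    public Filter(Func<ParseItem, bool> predicate) { _predicate = predicate; }

    // Keep only the items that satisfy the predicate.
    public IEnumerable<ParseItem> Filting(IEnumerable<ParseItem> items)
    {
        return items.Where(_predicate);
    }
}

public class Counter
{
    // Count how often each requirement keyword appears across all jobs.
    public IEnumerable<Tuple<string, int>> Counting(IEnumerable<ParseItem> jobs)
    {
        return jobs.SelectMany(j => j.PositionRequire ?? Enumerable.Empty<string>())
                   .GroupBy(k => k)
                   .Select(g => Tuple.Create(g.Key, g.Count()));
    }
}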
Now let's look at the actual results. First, the overall statistics for .NET programmers:
Next, let's see whether the skill requirements for .NET programmers differ between Shanghai and Chengdu:
Pretty intuitive, isn't it? :)
PS:
1. Simple as it is, the program ran into quite a bit of trouble during actual implementation. For example, in the second extraction/transformation pass, the ParseItem object's PositionRequire property holds the "job requirements". The initial design was to run that text through a word-segmentation component, extract the useful keywords, and assign them to PositionRequire. In code it looked roughly like this:
parseItem1.PositionRequire = new PanGu().Segment("the job-requirement description scraped from the HTML page");
After that processing, PositionRequire might look like: ".net", "asp.net", "mvc", and so on. However, this approach depends heavily on the quality of the segmenter's output (after reading the relevant documentation, it seemed I would need quite a bit of extra helper code to get the results I wanted), so in the end I fell back to a simple keyword dictionary plus a keyword mapping table, processed by traversing them in order.
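A minimal sketch of that dictionary-plus-mapping approach (the keyword lists and method names below are illustrative, not the actual tables used in the program):

using System.Collections.Generic;
using System.Linq;

// Illustrative keyword extraction: a fixed skill dictionary plus a synonym
// mapping table, scanned in order over the raw requirement text.
public static class RequirementExtractor
{
    private static readonly string[] Keywords = { "asp.net", "mvc", "wcf", "sql server", "javascript" };

    private static readonly Dictionary<string, string> Synonyms =
        new Dictionary<string, string>
        {
            { "ms sql", "sql server" },
            { "js", "javascript" }
        };

    public static IEnumerable<string> Extract(string requirementText)
    {
        var text = (requirementText ?? string.Empty).ToLower();

        // Normalize synonyms first so they map onto the canonical keywords.
        foreach (var pair in Synonyms)
            text = text.Replace(pair.Key, pair.Value);

        // Then simply check each dictionary entry in order.
        return Keywords.Where(k => text.Contains(k)).Distinct();
    }
}

With something along these lines, the assignment above becomes parseItem1.PositionRequire = RequirementExtractor.Extract(...) instead of a call into the segmenter.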
2. I think Dudu could consider adding this kind of statistics and analysis feature to the job sub-site. Solving the problem at the source would be better and more elegant: for example, job-related information could be defined as structured data up front, and further analysis and processing could then be done on that structured data.
3. Well... according to the statistics, I should brush up on my ASP.NET MVC :)