A couple of days ago I talked with Huo Ju about software development, and we got onto the so-called dirty-but-fast approach: to get a feature finished quickly, you don't hesitate to use an "inelegant" (non-design-pattern) structure or implementation. Huo Ju gave this an English name, the "dirty case." Strictly speaking, "dirty but fast approach" is the more accurate term, but "dbfa" is too long; besides, sometimes the dirty trick is not meant to be "fast" at all, but to plug holes that conventional methods cannot easily plug, such as the remote root logon case in a Unix system that Huo Ju described. So let's just call it the dirty case, or the dirty approach.
Writing code, like literary or artistic creation, has its levels of mastery. A beginner has no repertoire to draw on and starts with clumsy methods; with some polish he moves on to proper, formal methods; an expert dares to use dirty methods; and a true master returns to having no fixed method at all. I am neither a master nor an expert, but I am lazy, so I often choose a dirty method that nevertheless turns out to be an effective implementation. Still, nothing should be pushed too far: used badly, dirty but fast can turn into dirty and slow. The following two examples both come from developing the blog e-newspaper feature.
Example 1: counterexample
The blog e-newspaper is similar to RSS aggregation services like Bloglines, but leans more toward "editing your own newspaper and presenting the result as a whole." In the storage layer, blog_groups stores the list of e-newspapers, blog_group_rssurl stores the list of RSS feed addresses, blog_group_grouprss stores the many-to-many relationship between groups and RSS URLs, and blog_group_rsscontent stores the entries of each article in a feed (title, author, summary, link, and so on).
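To make the relationships concrete, here is a minimal sketch of those four tables as Python data classes; the field names (everything except the table names mentioned above) are my own assumptions for illustration.

```python
from dataclasses import dataclass

# Rough sketch of the storage structure described above; field names are hypothetical.

@dataclass
class BlogGroup:            # blog_groups: one row per e-newspaper
    group_id: int
    title: str

@dataclass
class BlogGroupRssUrl:      # blog_group_rssurl: one row per RSS feed address
    rss_id: int
    url: str

@dataclass
class BlogGroupGroupRss:    # blog_group_grouprss: many-to-many link
    group_id: int           # -> BlogGroup.group_id
    rss_id: int             # -> BlogGroupRssUrl.rss_id

@dataclass
class BlogGroupRssContent:  # blog_group_rsscontent: one row per article entry
    rss_id: int             # -> BlogGroupRssUrl.rss_id
    title: str
    author: str
    summary: str
    link: str
    pub_date: str           # publication time of the entry
```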
When it came to paging the e-newspaper, I was lazy: a proper paging query felt like too much trouble, so I looked for a shortcut. After all, an e-newspaper is like a daily newspaper, a daily newspaper is organized by date, and readers presumably read by date! Then paging hardly needs any thought; just select the entries of a particular date. As a result, the "browse by date" feature was finished in half a day.
After it went live for testing, two problems appeared: one solvable, one not. The solvable one first: the selected date may have no articles at all. An e-newspaper is an RSS aggregation; if none of the feed authors of a given e-newspaper publishes anything on a certain day, the page for that day is blank. That was a bit depressing, so I thought of another trick: find the most recent day that does have articles and show that day's list instead. Before long this was implemented too. It feels a little odd (for example, visiting http://blog.csdn.net/group/experts/20050403.aspx may actually show articles published on a different day), but at least the page is no longer blank.
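A minimal sketch of the two lookups just described, written against a hypothetical SQLite version of the schema above (the real system was not SQLite, and the column names are assumptions):

```python
import sqlite3

def _articles_on(conn, group_id, day):
    # Entries of one e-newspaper published on a given day (YYYY-MM-DD).
    return conn.execute(
        "SELECT c.title, c.link FROM blog_group_rsscontent c "
        "JOIN blog_group_grouprss g ON g.rss_id = c.rss_id "
        "WHERE g.group_id = ? AND date(c.pub_date) = ?",
        (group_id, day),
    ).fetchall()

def articles_for_day(conn, group_id, day):
    """Entries published on `day`; if that day is empty, fall back to the
    most recent day that has any articles."""
    rows = _articles_on(conn, group_id, day)
    if rows:
        return day, rows
    # Fallback: the latest day on which anything was published at all.
    latest = conn.execute(
        "SELECT date(MAX(c.pub_date)) FROM blog_group_rsscontent c "
        "JOIN blog_group_grouprss g ON g.rss_id = c.rss_id "
        "WHERE g.group_id = ?",
        (group_id,),
    ).fetchone()[0]
    if latest is None:
        return day, []  # this e-newspaper has no articles yet
    return latest, _articles_on(conn, group_id, latest)
```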
The second problem followed immediately. If there are no articles on a given day, we fall back to the most recent day that has any; but what if that day has very few articles? Say only one... Miserable: the page is nearly blank again. In fact, the blank page itself is not the most important thing; what matters is that this approach makes readers uncomfortable. I had misjudged what e-newspapers and daily newspapers do and do not have in common. The key point is this: although a daily newspaper is published day by day, the amount of content in each issue is roughly fixed; the articles may have been written on different dates, but they are published together, and the paper guarantees a certain amount of content on its pages. "Navigation by date" violates exactly this principle.
Of course, navigation by date still has its uses. It adds another entry point and makes it possible to "find the articles of a particular day." In the end I kept this navigation, but only as the secondary method; the primary one is the traditional fixed number of entries per page.
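For comparison, the "fixed number of entries per page" navigation that became the primary method is just an ordinary offset/limit query; a sketch under the same hypothetical schema:

```python
PAGE_SIZE = 20  # hypothetical page size

def articles_for_page(conn, group_id, page):
    """Pages 1, 2, 3, ..., each holding at most PAGE_SIZE entries, newest first."""
    return conn.execute(
        "SELECT c.title, c.link, c.pub_date FROM blog_group_rsscontent c "
        "JOIN blog_group_grouprss g ON g.rss_id = c.rss_id "
        "WHERE g.group_id = ? "
        "ORDER BY c.pub_date DESC LIMIT ? OFFSET ?",
        (group_id, PAGE_SIZE, (page - 1) * PAGE_SIZE),
    ).fetchall()
```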
Example 2: positive example
An e-newspaper is an RSS aggregation service, so "fetching external RSS feeds, parsing them, and saving the content" is naturally its most basic feature. As the number of feeds grows, efficiency can become a problem. Setting multithreading and distribution aside, if hardware and bandwidth resources are limited, how do we solve the efficiency problem?
The answer is to minimize the number of fetches, parses, and writes. The contradiction is that the more RSS addresses there are, the more often the crawler has to run to keep content up to date. How do you resolve this? My approach is to look at the last update time of each feed and the URL/update time of each article.
Each RSS feed has two key time points: 1. the last update time of the feed as a whole; 2. the update time of each article. The "Not Modified" response returned over HTTP cannot serve as the key factor, unless it is the first time this feed is being fetched.
When a feed is fetched for the first time, I record its lastModified (an attribute in the RSS specification) along with the feed's entries in the database. The next time it is this feed's turn to be fetched, I compare the stored value with the remote one, and only parse the feed when remoteRss.lastModified > localRssInfo.lastModified. RSS is XML; a single parse does not take much time, but it adds up to a considerable amount.
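Roughly, the check looks like the sketch below. The original system was not written in Python; the regular-expression peek at the channel-level date, the use of lastBuildDate/pubDate as the "lastModified" field, and all the names here are assumptions for illustration.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def fetch_if_newer(url, stored_last_modified):
    """Fetch a feed, but only run the full parse when the feed's own
    channel-level date is newer than the value stored last time.
    `stored_last_modified` is a datetime, or None on the first fetch."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        raw = resp.read()

    # Cheap peek at the channel-level date without a full XML parse.
    # RSS 2.0 calls it lastBuildDate (or pubDate); the text above calls it lastModified.
    head = raw[:2048].decode("utf-8", errors="replace")
    m = re.search(r"<(lastBuildDate|pubDate)>([^<]+)</\1>", head)
    remote_last_modified = parsedate_to_datetime(m.group(2).strip()) if m else None

    if (stored_last_modified is not None and remote_last_modified is not None
            and remote_last_modified <= stored_last_modified):
        return stored_last_modified, None   # not updated since last time: skip parsing

    # The feed looks newer (or we have no record yet): parse the items.
    root = ET.fromstring(raw)
    entries = [
        {"title": item.findtext("title"),
         "link": item.findtext("link"),
         "pub_date": item.findtext("pubDate")}
        for item in root.findall("channel/item")
    ]
    return remote_last_modified, entries
```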
Note the word "turn" in the paragraph above. Readers will have guessed that I do not pull every RSS address out of the database and fetch it remotely on every run; the cost of the network round trips is too high. Instead, I use a dirty algorithm to decide which feeds should be fetched in the current run. Here is that "algorithm" (a code sketch follows the list):
1. While parsing a feed, record the interval between the publication times of consecutive articles: x1, x2, x3, ..., xn.
2. After the whole feed has been parsed, compute the average of x1, x2, x3, ..., xn; call it y.
3. remoteRss.lastModified + y is the date on which the next new article may appear.
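A minimal sketch of steps 1-3, assuming the article publication times are available as datetime objects; the window of recent intervals anticipates the next paragraph and is an assumption:

```python
from datetime import timedelta

def predict_next_fetch(pub_dates, last_modified, window=15):
    """Estimate when the feed's next article may appear.

    pub_dates     -- publication times of the feed's articles, in any order
    last_modified -- remoteRss.lastModified from the last successful fetch
    window        -- only the most recent `window` intervals are averaged,
                     since a blog's posting frequency drifts over time
    """
    dates = sorted(pub_dates)[-(window + 1):]
    if len(dates) < 2:
        # Not enough history: check again after one day (arbitrary fallback).
        return last_modified + timedelta(days=1)

    # Step 1: intervals x1, x2, ..., xn between consecutive articles.
    intervals = [b - a for a, b in zip(dates, dates[1:])]
    # Step 2: their average, y.
    y = sum(intervals, timedelta()) / len(intervals)
    # Step 3: lastModified + y is when the next article is expected.
    return last_modified + y
```

Presumably, a feed is then fetched on a given run only when its predicted date has already arrived.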
I will skip the details and the reasoning; they are simple enough to work out on your own. Some readers may suggest that storing the posting intervals of every article of a feed in the database and averaging all of them would get closer to the date of the next update. It only seems that way: a blog's posting frequency is never stable over the long run and shifts from one period to another, so the average interval of, say, the last 15 articles is more likely to be close to the date of the next article.
Of course, there is another case: the next article arrives much earlier than predicted. That runs against our rule, and when it happens the unlucky feed may not be fetched in time. Missing it can have serious consequences: once an article drops out of a feed, you may never be able to get it through RSS again. So even when system resources are limited, you can set aside a few random "lucky" feeds on each run and fetch them whether or not it is their turn.
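Finally, a sketch of how the due feeds and a few random "lucky" ones could be combined on each run; everything here (the attribute name next_fetch_at, the count of five) is hypothetical:

```python
import random
from datetime import datetime

LUCKY_COUNT = 5  # hypothetical number of extra feeds per run

def feeds_to_fetch(all_feeds, now=None):
    """`all_feeds` is a list of objects with a .next_fetch_at attribute
    (the prediction from predict_next_fetch). Returns the feeds that are
    due, plus a few randomly chosen 'lucky' ones that are not yet due."""
    now = now or datetime.now()
    due = [f for f in all_feeds if f.next_fetch_at <= now]
    not_due = [f for f in all_feeds if f.next_fetch_at > now]
    lucky = random.sample(not_due, min(LUCKY_COUNT, len(not_due)))
    return due + lucky
```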
(To be continued)