My random placeholder-text generator is now officially live. You can reach it at: http://bugunow.com/lipsum
How it started...
Two days ago I was designing a style draft for my website. When I needed some text to fill the empty space on the page, I spent a lot of time hunting for material. Then I remembered a placeholder-text generation tool I had first seen on the blog of a Taiwanese compatriot (I don't remember the site), so I searched for it. I found Lorem Ipsum, which has no Chinese characters at all, and one placeholder-text generator that only produces traditional Chinese characters. Maybe such tools go by some other name in mainland China that I don't know. Either way, once the enthusiasm caught fire there was no putting it out, and I completely forgot that Google Translate is the best tool for generating placeholder text...
What is a placeholder-text generator?
If I have to give a definition: a random placeholder-text generator is a tool that produces meaningless text of a given length that looks like natural writing at first glance.
Getting started.
Without analyzing any of the differences between Chinese and English, I started coding, and because I was short on time I did not think much about how to implement the features.
For these reasons I planted two problems for this website that would surface later, one of them concerning performance. More on that below.
1. Where do the Chinese characters come from before anything is generated?
I figured these characters should live in a "Chinese character library", and the generator would randomly draw characters from that library to produce a piece of placeholder text. So I decided to create a library (a "font") that stores a sizable set of non-repeated Chinese characters, and to build this library before building the generator. I thought it would be a small job, but it turned out to be really troublesome to implement!
Note: I actually made a mistake here, which I have only just realized. I will reveal later the feature that gave me so much grief and is still not implemented.
The library program took the most time, because I normally do web development and rarely touch WinForms. When I started on the WinForms app I dragged in a lot of controls I never used and put a lot of effort into making it look nice. On top of that, I care a great deal about OO when writing code and try to keep every method within 10 lines. I added many checks to prevent misoperation and even played with an event mechanism. In the end I realized I had built a "product" rather than a small tool for temporary use.
Hands-on 1: the idea for implementing the character library is as follows:
1. First, write a regular expression that matches English letters and spaces, plus an array of commonly used punctuation marks.
2. Because the library only needs one attribute, its Chinese characters, define a List<char> to hold them.
3. Read the text of an external TXT file as the source.
4. Remove all "invalid" characters, white space, and English words from the source.
5. Loop over every Chinese character in the source and check whether the library already contains it. If it does not, add it; otherwise continue the loop.
6. When the loop ends, serialize the library to a binary file for storage.
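The six steps above can be sketched roughly as follows. This is only an illustration in Python, not the original WinForms tool: the function names, the exact regular expression, and the punctuation set are all assumptions, and pickle stands in for whatever binary serialization the original used.

```python
import pickle
import re

# Assumed filters: Latin letters, digits and white space are removed by regex,
# punctuation by an explicit set. Neither is the author's original list.
NON_CHINESE = re.compile(r"[A-Za-z0-9\s]+")
PUNCTUATION = set(",.!?;:'\"()[]<>-，。！？；：、“”‘’（）《》…")


def build_library(source):
    """Return the unique characters of `source`, in first-seen order."""
    cleaned = NON_CHINESE.sub("", source)
    library = []
    seen = set()  # a set keeps the "already in the library?" check cheap
    for ch in cleaned:
        if ch in PUNCTUATION or ch in seen:
            continue
        seen.add(ch)
        library.append(ch)
    return library


def save_library(library, path):
    """Serialize the library to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(library, f)


def load_library(path):
    """Read a library back from a binary file written by save_library."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, build_library("我爱你, I love you。我爱他!") keeps only the four distinct characters 我, 爱, 你 and 他. Tracking seen characters in a set rather than scanning the list on every character is a design choice of this sketch; it matters once the source text runs to millions of characters.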
These six steps can be finished within an hour. The time beyond that went into thinking, and as I said, I was chasing perfection...
Hands-on 2: importing the character library.
I downloaded a 6 MB TXT novel and threw it into the "factory". After processing, the number of Chinese characters in the library went from 0 to more than 3500. However, it took more than a minute...
At this point I moved on to coding the generator and smoothly created a new project (a Class Library). I decided to make it general enough.
Thinking 2: the generator process:
1. Construct a lorem ipsum class, giving it the paths of the Chinese character library and the English word library, plus the line-break symbol (on the web a line break is <br/>, while in a Windows program it is \n).
2. Call the generation method, passing in a loremipsummodel parameter (an entity class) that holds the various generation options.
3. Calculate the number of paragraphs in the placeholder text and the number of characters in each paragraph from the parameter.
4. Randomly draw Chinese characters and pile them up into text, adding the ending punctuation mark and a line break at the end of each paragraph.
5. Insert scattered English words and punctuation marks according to the parameter.
6. Return the placeholder text.
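A minimal sketch of this flow, under stated assumptions: the class and option names below (LoremIpsum, LoremIpsumOptions, english_ratio, comma_ratio) are hypothetical stand-ins for the real classes, the libraries are passed in as lists instead of file paths, and the punctuation constants are illustrative.

```python
import random
from dataclasses import dataclass

END_MARKS = "。！？"  # assumed sentence-ending punctuation
MID_MARKS = "，、；"  # assumed in-sentence punctuation


@dataclass
class LoremIpsumOptions:
    """Hypothetical stand-in for the options entity class."""
    total_chars: int = 500
    paragraphs: int = 3
    english_ratio: float = 0.02  # chance of an English word after a character
    comma_ratio: float = 0.08    # chance of in-sentence punctuation


class LoremIpsum:
    def __init__(self, chinese, english, line_break="\n", seed=None):
        self.chinese = chinese        # list of Chinese characters
        self.english = english        # list of English words, may be empty
        self.line_break = line_break  # e.g. "<br/>" for web output
        self.rng = random.Random(seed)

    def generate(self, opts):
        # Spread the characters as evenly as possible over the paragraphs.
        base, extra = divmod(opts.total_chars, opts.paragraphs)
        paragraphs = []
        for i in range(opts.paragraphs):
            length = base + (1 if i < extra else 0)
            paragraphs.append(self._paragraph(length, opts))
        return self.line_break.join(paragraphs)

    def _paragraph(self, length, opts):
        parts = []
        for i in range(length):
            parts.append(self.rng.choice(self.chinese))
            if i == length - 1:
                break  # the paragraph ends with an end mark, added below
            roll = self.rng.random()
            if self.english and roll < opts.english_ratio:
                parts.append(" " + self.rng.choice(self.english) + " ")
            elif roll < opts.english_ratio + opts.comma_ratio:
                parts.append(self.rng.choice(MID_MARKS))
        parts.append(self.rng.choice(END_MARKS))
        return "".join(parts)
```

A call such as LoremIpsum(list("的一是在不了"), ["web", "blog"], line_break="<br/>").generate(LoremIpsumOptions(total_chars=100, paragraphs=3)) returns three paragraphs joined by <br/>, each ending in an end mark. Unlike step 4 as written, this sketch puts the line break between paragraphs rather than after the last one.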
From the process above, the following things need to be built:
1. Punctuation constants for Chinese and English, with ending punctuation kept separate.
2. An English word library (this is where the tragedy began).
3. The paragraph-splitting algorithm, plus the algorithms that calculate the number of English words and the number of punctuation marks (including how many Chinese characters fall between punctuation marks).
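The post does not describe the splitting algorithm itself, so the following is just one possible approach: split a total character count into paragraph lengths that vary a little but always sum to the total. The function name split_lengths and the jitter parameter are assumptions made for this sketch.

```python
import random


def split_lengths(total, parts, jitter=0.3, rng=None):
    """Split `total` into `parts` positive lengths that sum to `total`.

    Each part gets one character up front; the rest is shared out in
    proportion to random weights in [1 - jitter, 1 + jitter].
    """
    if parts <= 0 or total < parts:
        raise ValueError("need at least one character per part")
    if not 0 <= jitter < 1:
        raise ValueError("jitter must be in [0, 1)")
    rng = rng or random.Random()
    weights = [1 + rng.uniform(-jitter, jitter) for _ in range(parts)]
    remaining = total - parts
    weight_sum = sum(weights)
    lengths = [1 + int(w * remaining / weight_sum) for w in weights]
    # Rounding down leaves a small remainder; give it to the last part.
    lengths[-1] += total - sum(lengths)
    return lengths
```

The same idea extends to the other counts: the number of English words or punctuation marks can be derived from the paragraph length and a ratio taken from the generation options.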
Tragedy: figuring that English words would better show off the full power of a library (both Chinese and English), I stopped writing the generator and started on the dictionary.
The dictionary tool was much easier than the Chinese character library: a single regular expression pulls out all the English words. To save time I wrote the tool in half an hour and deleted it as soon as it had done its job.
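The original tool was deleted, so this is only a guess at what "a single regular expression" might look like. The pattern, the lower-casing, and the minimum word length are assumptions.

```python
import re

WORD = re.compile(r"[A-Za-z]+")  # assumed pattern: runs of Latin letters


def extract_words(text, min_len=2):
    """Return the unique English words in `text`, lower-cased, in first-seen order."""
    seen = set()
    words = []
    for match in WORD.findall(text):
        word = match.lower()
        if len(word) >= min_len and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

For instance, extract_words("Hello, world! hello again; a B2B site.") yields hello, world, again and site: the duplicate "hello" and the single letters are dropped.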
Finally, the generator itself: apart from the versatility considerations, I quickly wrote one concrete generation method. It can be used in both WinForms and web projects, which made me happy: cross-platform at last! Haha...
Hands-on 3: testing the placeholder-text generator.
After several rounds of debugging to get rid of endless-loop errors, the first test was very unsatisfying. It did generate a fake article of 500 characters in 3 natural paragraphs as I expected, and the punctuation was normal. But the text was full of obscure characters: common ones like "you" and "me" did not appear anywhere in the article, and many of the characters had far too many strokes. Even a tianshu (an unreadable "heavenly book") does not look this extreme.
At this point I began to think about the differences between Chinese and English. English is an alphabetic script: N letters make up a word, and words are separated by spaces. There are many spaces, but the result is quite elegant, and at first glance generic English filler looks fairly authentic because it is presented in the form of words (or maybe my English reading is too poor to tell good from bad).
Chinese is different. Even though simplified Chinese characters do not have that many strokes, a single character can carry many meanings and still be rarely used. The novel I imported probably contained too much ghost-story material. Although the library held more than 3500 characters, the overall effect was poor. I asked myself how many characters are truly common: probably about 1000, which means the remaining 2500 or more are used far less often. So I decided to delete the Chinese character library and import another one.
My next targets were the works of two old masters, Lu Xun and Zhu Ziqing. In practice the result was either not plain enough or simply unsatisfying! Finally I turned to my own blog: I copied about 1200 words from my blog posts and imported them. The effect was good!
Foreshadowed problem 1: I had not analyzed the differences between Chinese characters and English words, so I lost a lot of time adjusting the character library.
Foreshadowed problem 2: performance. I ran a very simple test: generating a 10000-character placeholder text 1000 times in a loop took more than 7 seconds, which is still rather long, and CPU usage exceeded 70%. Fortunately this tool will not be used by many people and will not get much traffic, so I am not worried.
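A test of this kind can be reproduced with a simple timing loop. The sketch below is a stand-in, not the original test: it times a bare random-character generator (make_text is a hypothetical simplification with no punctuation or English words), so its numbers are not comparable to the 7 seconds measured above.

```python
import random
import time


def make_text(library, length, rng):
    """Hypothetical stand-in generator: `length` random characters, nothing else."""
    return "".join(rng.choice(library) for _ in range(length))


def benchmark(library, length=10_000, runs=1_000, seed=0):
    """Generate a `length`-character text `runs` times and return (seconds, last_length)."""
    rng = random.Random(seed)
    text = ""
    start = time.perf_counter()
    for _ in range(runs):
        text = make_text(library, length, rng)
    elapsed = time.perf_counter() - start
    return elapsed, len(text)
```

Calling benchmark(library) with the defaults mirrors the 10000 characters times 1000 runs used in the test above. time.perf_counter measures wall-clock time only; CPU usage would have to be observed separately.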
Finally, I spent some time fine-tuning the punctuation and paragraph algorithms and then started on the website. The website has only simple functions, but designing it took a lot of time, and it is now on its way to the host.
Disadvantages:
1. The text is not made more authentic by weighting characters according to how frequently they are used.
I considered building key-value pairs, with the Chinese character as the key and its frequency as the value: each time a character repeats, increment its value by 1, then pick out the common characters with some algorithm. This is the practice of trading time for space. However, I now think that is not necessary at all (it only occurred to me while writing this article). First, how could the library ever exceed 10 MB? A 10 MB TXT file holds millions of characters. For this use case, even a library of 50000 characters produces a file of less than 50 K. In addition, fetching an array element by index is O(1), so a large library does not cause performance problems. And the more often a character appears in the library, the higher its chance of being picked. The code is easy to write, and memory and disk space are no problem at all. So why did I spend so much time on the Chinese character library processing tool? Orz....
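The two options discussed here can be compared in a short sketch. Everything in it is an assumption made for illustration: the function names, the use of the basic CJK block U+4E00 to U+9FFF as the test for "Chinese character", and Python's Counter and random.choices in place of whatever the real tool would use.

```python
import random
from collections import Counter


def is_chinese(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def frequency_table(text):
    """Option 1: key = character, value = how often it appears in the source."""
    return Counter(ch for ch in text if is_chinese(ch))


def pick_weighted(freq, k, rng):
    """Draw k characters, each with probability proportional to its frequency."""
    chars = list(freq)
    weights = [freq[c] for c in chars]
    return rng.choices(chars, weights=weights, k=k)


def pick_from_pool(text, k, rng):
    """Option 2: keep the duplicates and index into the pool at random.

    A character that appears more often in the pool is picked more often,
    with no frequency bookkeeping at all.
    """
    pool = [ch for ch in text if is_chinese(ch)]
    return [rng.choice(pool) for _ in range(k)]
```

Both functions give common characters a proportionally higher chance. The second needs more memory for the pool, but as argued above, at this scale that is not a concern.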
Updated 2010/5/16 10:00.
2. No readability.
I would also like the text to seem readable, that is, for each sentence to have at least the three key components: subject, predicate, and object. However, that requires tagging every entry in the library with these attributes, and for the moment I have not thought of a good way to batch-process them. So this feature will have to wait until I find a less time-consuming way to implement it.
3. Not many features.
True. Foreign generators can produce N paragraphs, N words, N list items, and so on. These features are simple and will be added later.
At the end, after publishing to the server, a System.Security SecurityPermission error appeared. Judging from the stack trace, the problem occurs during deserialization. After a long time I still do not know why, so I have decided to modify the program tomorrow so that it needs fewer permissions. After all, customers on GoDaddy are lambs to be slaughtered...
My independent blog: http://bugunow.com/blog