I saw a problem on Weibo yesterday.
Among 1 million usernames, find the ones that were automatically generated by a machine.
It is essentially a simple anti-spam problem.
Some people suggested searching Google or Baidu for each username to see whether it has left any traces online. Leaving aside how unreliable that is, the person who posed the question obviously wants to solve it algorithmically rather than through social engineering.
My first idea was to run word segmentation over the 1 million usernames and then count how many times each word appears across them, that is, the word frequency. Sort by word frequency in descending order and take the top n. Next, find the usernames that contain any of the top n words. These were probably created by machines.
However, this is not rigorous, and a large number of normal usernames could be flagged by mistake. Every period has its hot words, and many people like to use them as part of a username. Some classic words are also used by a great many people.
So I think that unless someone manually identifies the hot words and excludes them from the top n, this method does not work well at all.
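The word-frequency idea above, including the manual hot-word exclusion, can be sketched roughly as follows. This is only an illustration: `tokenize` is a crude stand-in for real word segmentation (it just splits a name into runs of letters and runs of digits), and the names `flag_by_top_words` and `hot_words` are made up for this example.

```python
import re
from collections import Counter


def tokenize(username):
    # Stand-in for real word segmentation (an assumption of this sketch):
    # split the name into runs of letters and runs of digits.
    return re.findall(r"[^\W\d_]+|\d+", username.lower())


def flag_by_top_words(usernames, n, hot_words=frozenset()):
    # Word frequency over all usernames.
    counts = Counter(tok for name in usernames for tok in tokenize(name))
    # Manually supplied hot words are removed before taking the top n.
    for word in hot_words:
        counts.pop(word, None)
    top = {word for word, _ in counts.most_common(n)}
    # A username is flagged if it contains any of the top-n words.
    return [name for name in usernames if top & set(tokenize(name))]


names = ["seo_2013", "seo_shop", "seo-king", "alice", "love_bob", "love_cat"]
print(flag_by_top_words(names, n=1))
print(flag_by_top_words(names, n=1, hot_words={"seo"}))
```

Here the second call treats "seo" as a known hot word, so the next most frequent word is used instead. The weakness described above is visible too: whichever word ends up in the top n, every name containing it is flagged, legitimate or not.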
Please share your ideas so we can discuss them together. Note: the problem is restricted to the username alone, not the user's posts or registration date.
------ Solution --------------------
1. From past registration experience, machine-generated usernames are derived from the registration information the user submitted. Another pattern is a fixed prefix with something appended.
2. Checking for usernames that share the same prefix is the simplest method.
With data on hand, one could explore an algorithm. Unfortunately, I have none.
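A minimal sketch of the same-prefix check, assuming a fixed prefix length and a minimum group size. Both `prefix_len` and `min_group` are arbitrary parameters chosen for illustration and would have to be tuned on real data.

```python
from collections import defaultdict


def group_by_prefix(usernames, prefix_len=3, min_group=3):
    # Bucket usernames by their first `prefix_len` characters (case-insensitive).
    groups = defaultdict(list)
    for name in usernames:
        if len(name) > prefix_len:
            groups[name[:prefix_len].lower()].append(name)
    # Keep only prefixes shared by at least `min_group` usernames.
    return {p: g for p, g in groups.items() if len(g) >= min_group}


names = ["tomabcd", "tomxyzw", "tomqqrs", "alice", "bob123"]
print(group_by_prefix(names))
```

A shared prefix alone is weak evidence, since common given names produce large groups as well, so the output is better treated as a list of candidates than as a verdict.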
------ Solution --------------------
I'm following this too. Haha, although as a beginner I'm not very familiar with it.
------ Solution --------------------
Reference:
1. From past registration experience, machine-generated usernames are derived from the registration information the user submitted. Another pattern is a fixed prefix with something appended.
2. Checking for usernames that share the same prefix is the simplest method.
With data on hand, one could explore an algorithm. Unfortunately, I have none.
Try it on the CSDN user database... I also have another database of 100M+ on hand....
So far, the pattern "some characters + a number" looks like a reliable signal, with the number incrementing continuously.
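The "characters + incrementing number" pattern can be checked roughly like this. It is a sketch under the assumption that a machine batch shares the same stem and uses consecutive trailing numbers; the function name and the `min_run` threshold are invented for the example.

```python
import re
from collections import defaultdict

# A non-empty stem followed by trailing digits, e.g. "user101" -> ("user", "101").
SUFFIX = re.compile(r"^(.*?)(\d+)$")


def sequential_accounts(usernames, min_run=3):
    by_stem = defaultdict(set)
    for name in usernames:
        m = SUFFIX.match(name)
        if m and m.group(1):  # skip names that are digits only
            by_stem[m.group(1).lower()].add(int(m.group(2)))

    flagged = {}
    for stem, numbers in by_stem.items():
        ordered = sorted(numbers)
        runs, run = [], [ordered[0]]
        for x in ordered[1:]:
            if x == run[-1] + 1:
                run.append(x)  # number continues the current run
            else:
                if len(run) >= min_run:
                    runs.append(run)
                run = [x]
        if len(run) >= min_run:
            runs.append(run)
        if runs:
            flagged[stem] = runs
    return flagged


names = ["user101", "user102", "user103", "user250", "Ci169", "Ci1699", "alice"]
print(sequential_accounts(names))
```

Note that converting the suffix to an integer discards zero padding, so "user001" and "user1" are treated as the same number in this sketch.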
------ Solution --------------------
If I were the machine, I wouldn't need to use simple words or English. I would use Japanese, Korean, or Malay. Do you have a dictionary database large enough to tell real from fake?
So the real safeguard is still the verification code (CAPTCHA).
------ Solution --------------------
This problem cannot be solved with an algorithm...
Ci169
Ci1699
Ci16999
Ci169999
Ci1699999
For example, which of the CSDN accounts above can be determined to be machine-registered?
------ Solution --------------------
Why would a hot word count as machine-generated????
------ Solution --------------------
An interesting question. Does anyone have free LAMP hosting space? Upload a copy of the data.
echo 'tom' . substr(str_shuffle("abcdefghijklmnopqrstuvwxyz"), 0, 4); // fixed prefix plus 4 random letters
------ Solution --------------------
Bayesian classification should be workable, but how to organize the raw data is a problem.
With so many things still undetermined, it is a bit premature to talk about specific algorithms.
I recommend trying Weka (a Java data mining package) for experiments.
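To make the Bayesian suggestion concrete, here is a small naive Bayes classifier over character bigrams, written in plain Python rather than Weka. Everything in it is an assumption for illustration: the bigram features, the add-one smoothing, and above all the labeled training names, which are toy data. As the reply says, obtaining and organizing real labeled data is the hard part.

```python
import math
from collections import Counter


def bigrams(name):
    # Character bigrams with start/end markers, e.g. "ab" -> ["^a", "ab", "b$"].
    padded = f"^{name.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class NaiveBayes:
    def __init__(self):
        self.feature_counts = {}     # label -> Counter of bigrams
        self.doc_counts = Counter()  # label -> number of training names
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.doc_counts[label] += 1
            counter = self.feature_counts.setdefault(label, Counter())
            for gram in bigrams(name):
                counter[gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, name):
        total = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.feature_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total)
            denom = sum(counter.values()) + len(self.vocab)
            for gram in bigrams(name):
                score += math.log((counter[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-made training data (hypothetical labels).
train = ["alice", "bobby", "charlie", "david", "xqzjk", "qwzxv", "zxqkj", "jkqzx"]
labels = ["human"] * 4 + ["machine"] * 4
model = NaiveBayes().fit(train, labels)
print(model.predict("alicia"), model.predict("zxqzj"))
```

With only eight training names the result means little; the sketch is meant to show the shape of the approach, not its accuracy.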
------ Solution --------------------
A username chosen by a person must have some logic to it so that it is easy to remember, whereas a machine registering automatically has no such need.
I think we can use a dictionary for a first round of screening.
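One way to sketch such a dictionary screen is to test whether the letters of a username can be split entirely into dictionary words. The tiny `dictionary` below is a placeholder, and the rule itself (letters only, digits ignored, names without letters left undecided) is an assumption made for this example.

```python
def splits_into_words(text, dictionary):
    # Classic word-break dynamic programming: reachable[i] is True when
    # text[:i] can be written as a sequence of dictionary words.
    reachable = [True] + [False] * len(text)
    for end in range(1, len(text) + 1):
        reachable[end] = any(
            reachable[start] and text[start:end] in dictionary
            for start in range(end)
        )
    return reachable[len(text)]


def dictionary_screen(usernames, dictionary):
    # Return the names whose letters cannot be explained by the dictionary.
    suspicious = []
    for name in usernames:
        letters = "".join(ch for ch in name.lower() if ch.isalpha())
        # Names without letters are skipped: the dictionary cannot judge them.
        if letters and not splits_into_words(letters, dictionary):
            suspicious.append(name)
    return suspicious


dictionary = {"love", "cat", "tom", "king"}  # placeholder word list
names = ["lovecat88", "tom_king", "xqzjk", "12345"]
print(dictionary_screen(names, dictionary))
```

A name that fails the screen is only a candidate: pinyin, foreign-language words, and personal abbreviations would all fail against an English word list.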
The question only asks to find as many as possible.
In fact, even a username made of randomly ordered letters cannot be proven to be machine-registered.
Unless there is auxiliary information such as login behavior and registration intervals, I really think this approach is meaningless.