1. Background

With the mid-term code hosted on the CSDN platform, ospaf (an open-source project maturity analysis tool) already has a small prototype, which is of course far from enough. First of all, I would like to thank CSDN, the organizer of this event, which feels like a Chinese version of Google Summer of Code. I would also like to thank my summer camp mentor David, who gave me a great deal of guidance and help; our offline discussions also gave me great insights. Now let's talk about the ospaf project itself (if you are interested, you can check the project description and address). According to the earlier plan, before the interim defense we were to understand the relevant GitHub APIs, clone some GitHub data into a local database, and use some machine learning algorithms to train a model that could then be used to evaluate other projects. So far these functions have been implemented, but all in very basic versions (see the code address). The project process is described below.
2. Project Process

Step 1: GitHub API survey. There are three GitHub-related data sources: the official GitHub API, GitHub Archive, and GHTorrent. Among them, GHTorrent provides the most comprehensive data (including commits and other information), but its data volume is too large, so it had to be set aside until a server is available. The remaining two sources provide essentially the same data, but the official GitHub API has a rate limit; in the end the official API was chosen. The first step is to get the API address of each project on GitHub, which involves some JSON parsing and a bit of regular-expression matching. What is stored in the database looks roughly like the following:
Figure 2-1 url
Reading each URL address yields detailed information about the corresponding project, which is then saved to the database (a minimal sketch of this step follows Figure 2-2).
Figure 2-2 repo info
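As a rough illustration of this step, the sketch below fetches a repository's metadata from the official GitHub API with the requests library and stores a few fields in a local SQLite table. The table name, field selection, and token handling are assumptions made for illustration, not the actual ospaf code.

```python
# Minimal sketch of step 1: fetch repository metadata from the official
# GitHub API and store a few fields locally. Illustrative only -- the table
# name and chosen fields are assumptions, not the ospaf schema.
import sqlite3
import requests

def fetch_repo_info(owner, name, token=None):
    """Call the official GitHub API for one repository and return its JSON."""
    headers = {"Authorization": "token " + token} if token else {}
    resp = requests.get("https://api.github.com/repos/{}/{}".format(owner, name),
                        headers=headers)
    resp.raise_for_status()
    return resp.json()

def save_repo_info(db_path, info):
    """Persist a handful of fields that later serve as features."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS repo_info
                    (full_name TEXT PRIMARY KEY, stars INTEGER,
                     forks INTEGER, open_issues INTEGER, created_at TEXT)""")
    conn.execute("INSERT OR REPLACE INTO repo_info VALUES (?, ?, ?, ?, ?)",
                 (info["full_name"], info["stargazers_count"], info["forks_count"],
                  info["open_issues_count"], info["created_at"]))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    save_repo_info("ospaf.db", fetch_repo_info("torvalds", "linux"))
```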
Step 2: Process the data and build the training set. Once the data is in the database, the rest is machine learning work. Because of the rate limit, only 43 projects can be cloned per hour, so the training set is small, and there is no feature scaling yet (this will be addressed in the next phase). As for features, only the time field has been adjusted so far: the year-month-day format is converted into the number of days from today; for example, created_at = 500 means the project was created 500 days ago. Because the algorithm is supervised learning, a target vector is needed. To build it, some GitHub showcase projects are taken as positive samples and other projects as negative samples, which gives a simple training set of just over 60 samples in total (rather poor). A sketch of the feature transformation and labelling follows.
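The sketch below shows how the time conversion and showcase-based labelling described above could look in Python; the showcase list is assumed to be a plain set of repository names collected beforehand, and the function names are hypothetical.

```python
# Sketch of step 2: convert created_at into "days before today" and attach
# a 0/1 label. The showcase set and function names are illustrative only.
from datetime import datetime, timezone

def days_since(iso_timestamp):
    """Convert a GitHub ISO-8601 timestamp into the day difference from today."""
    created = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - created).days

def label_sample(full_name, showcase_names):
    """Positive sample (1) if the repo is in a GitHub showcase, else negative (0)."""
    return 1 if full_name in showcase_names else 0

# Example: a project created 500 days ago gets the feature value created_at = 500.
```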
Step 3: Machine learning. Because the training set is small, only a few algorithms are used: normalization and sampling are applied to preprocess the data, and the model is logistic regression. Figure 2-3 shows the regression coefficient of each feature. A feature with a coefficient greater than zero has a positive impact on the sample, while one with a coefficient smaller than zero has a negative impact.
Figure 2-3 feature
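A minimal training sketch in this spirit is given below, using scikit-learn's StandardScaler and LogisticRegression and then printing the per-feature coefficients as visualised in Figure 2-3. The feature names and CSV file names are placeholders, not the ospaf data layout.

```python
# Sketch of step 3: normalize the features, fit logistic regression, and
# inspect the per-feature coefficients. File names and feature names are
# placeholders chosen for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

feature_names = ["created_at", "stars", "forks", "open_issues"]  # assumed feature set
X = np.loadtxt("features.csv", delimiter=",")   # one row per project
y = np.loadtxt("labels.csv", delimiter=",")     # 1 = showcase project, 0 = other

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)

# A coefficient above zero pushes a project toward "mature", below zero the opposite.
for name, coef in zip(feature_names, model.coef_[0]):
    print("{:>12}: {:+.3f}".format(name, coef))
```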
Step 4: Evaluation (scoring a project). The following four projects are used to test the model: the first three are popular projects on GitHub, and the fourth is one of my own projects (see Figure 2-4).
Figure 2-4 result chart
A score greater than zero indicates a project with high maturity; the higher the score, the more mature the project. A minimal scoring sketch follows.
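For illustration, the sketch below scores a new project with the model and scaler from the step-3 sketch, taking the signed decision-function value as the maturity score so that values above zero correspond to mature projects. This matches the description above but is an assumption, not necessarily the exact ospaf scoring formula.

```python
# Sketch of step 4: score a project with the trained model. The score is the
# signed distance from the decision boundary (an assumption for illustration).
import numpy as np

def score_project(model, scaler, feature_vector):
    """Return a maturity score; larger positive values mean higher maturity."""
    x = scaler.transform(np.asarray(feature_vector, dtype=float).reshape(1, -1))
    return float(model.decision_function(x)[0])

# Usage, continuing from the step-3 sketch (feature values are made up):
# print(score_project(model, scaler, [500, 12000, 3000, 150]))
```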
Project address: https://code.csdn.net/davidmain/ospaf
/********************************
 * This article is from the blog "Li bogarvin"
 * Please indicate the source when reprinting: http://blog.csdn.net/buptgshengod
 ********************************/