360 Internship Logbook 2015.10 ~ 2015.12

Source: Internet
Author: User
Tags: hadoop fs

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-13
1. Used the computing platform to compute statistics on the cloud logs with spe_num=502306 for 2015-09-28 ~ 2015-10-11, a total of 14 days. Each day yields roughly 100,000-200,000 records, so the two weeks total about 2 million records.
Due to the large data volume, the results cannot be downloaded directly from the computing platform; the download fails with a memory-overflow error.
2. Started learning MongoDB.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-14
1. Computed UV/PV for the cloud Avira log with spe_num=502306 over the two weeks 2015-09-28 ~ 2015-10-11.
2. Tried to verify the correspondence between the wid and IMEI fields in the call show log and the V5 log.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-16
1. Learned to run Hive programs using cloud images.
2. Used the computing platform to extract one month of cloud logs with spe_num=501833.
3. Learned how Hive partitioning works and how to optimize Hive.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-19
1. Completed the UV/PV statistics for one month of cloud Avira logs with spe_num=501833.
2. Applied for a MongoDB test environment and ran a simple test of insert, delete, update, and query statements.
3. Deduplicated the IMEIs in the 2015-10-15 call show log, in preparation for verifying their correspondence with the wids in the V5 log.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-20
Computed the intersection of IMEIs between one day of the call show log and the V5 log. The steps are:
1. Deduplicate the IMEIs in the call show log and in the V5 log.
2. Sort the two logs separately.
3. Walk through the two sorted logs and compare the IMEI-wid correspondence: for an IMEI that appears in both logs, count how many wids it corresponds to, and output the IMEI-wid pairs that do not correspond (a minimal sketch follows below). Time complexity O(max(n, m)), where n is the number of V5 log records and m the number of call show log records.
Note: processing is slow because each log has nearly 70 million rows.
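A minimal sketch of step 3, simplified to one wid per IMEI after deduplication; the file names and the tab-separated "imei<TAB>wid" layout are assumptions for illustration, not the original job's paths:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sequentially compare two IMEI-sorted logs of "imei\twid" lines.
public class ImeiWidCompare {
    public static void main(String[] args) throws IOException {
        try (BufferedReader v5 = new BufferedReader(new FileReader("v5_sorted.tsv"));
             BufferedReader callShow = new BufferedReader(new FileReader("callshow_sorted.tsv"))) {
            String a = v5.readLine(), b = callShow.readLine();
            long matched = 0, mismatched = 0;
            while (a != null && b != null) {
                String[] fa = a.split("\t");   // [imei, wid] from the V5 log
                String[] fb = b.split("\t");   // [imei, wid] from the call show log
                int cmp = fa[0].compareTo(fb[0]);
                if (cmp < 0) {                 // IMEI only in the V5 log
                    a = v5.readLine();
                } else if (cmp > 0) {          // IMEI only in the call show log
                    b = callShow.readLine();
                } else {                       // IMEI in both logs: compare the wids
                    if (fa[1].equals(fb[1])) {
                        matched++;
                    } else {
                        mismatched++;
                        System.out.println(fa[0] + "\t" + fa[1] + "\t" + fb[1]);
                    }
                    a = v5.readLine();
                    b = callShow.readLine();
                }
            }
            System.err.println("matched=" + matched + " mismatched=" + mismatched);
        }
    }
}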

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-22
1. Completed the call show / V5 log wid correspondence statistics task.
2. Studied Elasticsearch.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-23
1. Set up dlc_datamining/phone_info/{wid} documents on the small ES cluster.
Example: http://eseng1.safe.lycc.qihoo.net:9200/dlc_datamining/phone_info/0ff24eec0b6a6153522be24dabe422f0
The current document content is {"imei": "xxxx", "imsi": "xxxx", "phone_num": "xxxx"}, to be extended later.

The loader is still written in PHP and is slow; it has inserted more than 400,000 records so far, with the main bottleneck being the HTTP requests.
Later I read the documentation and found that records can be packed and inserted in bulk mode. I have configured the official Python SDK and am rewriting the loader in Python.

2. Studied Python and Elasticsearch.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-26
The program that inserts phone_info into ES has been rewritten in Java, because the official Java API documentation is detailed and Java supports multithreading.
More than 65 million records have been inserted so far; statistics can be viewed at http://eseng1.safe.lycc.qihoo.net:9200/_cat/indices?v
The Java program currently bulk-inserts in batches of 1,000 documents, but a few documents appear to be missed and not inserted; the cause is still to be found. Based on references, the batch size may be too large. A minimal bulk-insert sketch follows below.
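This is a sketch only, assuming the ES 1.x Java TransportClient API that was current at the time; the record reader, record class, and exact source fields are hypothetical:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Minimal sketch: bulk-index phone_info documents in batches of 1,000.
public class PhoneInfoBulkLoader {
    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("eseng1.safe.lycc.qihoo.net", 9300));
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Record r : readRecords()) {               // readRecords() is a hypothetical input source
            bulk.add(client.prepareIndex("dlc_datamining", "phone_info", r.wid)
                           .setSource(r.json));        // JSON document built upstream on Hadoop
            if (bulk.numberOfActions() >= 1000) {      // flush every 1,000 documents
                BulkResponse resp = bulk.execute().actionGet();
                if (resp.hasFailures()) {
                    System.err.println(resp.buildFailureMessage());  // log failures so nothing is silently dropped
                }
                bulk = client.prepareBulk();
            }
        }
        if (bulk.numberOfActions() > 0) {              // flush the final partial batch
            bulk.execute().actionGet();
        }
        client.close();
    }

    // Hypothetical record holder and reader, just to make the sketch self-contained.
    static class Record { String wid; String json; }
    static Iterable<Record> readRecords() { return java.util.Collections.emptyList(); }
}

Checking hasFailures() on every bulk response is also one way to track down the occasionally missed documents mentioned above.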

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-27
1. Used the hi.adstag field of the IPC log to count, for the 16th, which software has been installed by users who have the SE / speed guard installed.
2. Computed the full ipc_mid_adstag for 2015-10-14 ~ 2015-10-18, approximately 170 million records.
3. Used Hadoop to generate, from the 2015-10-15 V5 log, each wid and its corresponding IMEI, MAC, IP, brand, and other information (in JSON format) for later insertion into ES.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-10-28
In my own directory on Build17:
1. Given how fast development is in PHP, configured the official ES PHP SDK environment.
2. Compiled and installed a newer version of PHP, and installed the SSH and multithreading extensions.
3. The official ES PHP SDK documentation is relatively brief, so I implemented PHP-based insert, update, and upsert operations against the ES cluster myself to get familiar with the official ES PHP API.
4. The JSON information for the 2015-10-15 V5 log has been generated on the Hadoop cluster.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-2
1. Merged the wid and corresponding IMEI, IMSI, MAC, IP, brand, and other information from the call show and V5 logs on Hadoop, and generated the JSON there, to avoid the inefficiency of merging/updating JSON inside ES.
2. Learned how to use Hadoop's MultipleOutputs to write MapReduce results to specified folders. For example: wids that correspond to exactly one mid are written to the one2one directory, while wids that correspond to multiple mids are written to the one2many directory. A minimal sketch follows below.
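A minimal reducer sketch of the one2one/one2many split, assuming Text keys and values; the class name and value layout are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Minimal sketch: route wid -> mid results to one2one/ or one2many/ under the job output directory.
public class WidMidReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text wid, Iterable<Text> mids, Context context)
            throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        int count = 0;
        for (Text mid : mids) {
            if (count > 0) joined.append(',');
            joined.append(mid.toString());
            count++;
        }
        // The third argument is the base output path, relative to the job's output directory.
        String base = (count == 1) ? "one2one/part" : "one2many/part";
        mos.write(wid, new Text(joined.toString()), base);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // flush the extra outputs
    }
}

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is typically used alongside this so that empty default part-r-xxxxx files are not created.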

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-3
1. Following the method in http://stackoverflow.com/questions/18541503/multiple-output-files-for-hadoop-streaming-with-python-mapper, used MultipleTextOutputFormat to write wids corresponding to one mid and wids corresponding to multiple mids to the one2one and one2many directories respectively.
2. Inserted phone information into ES via PHP's multithreading libraries; the main performance bottleneck is the HTTP requests.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-5
1. Continued learning the features and usage of the ES platform.
2. Studied total ordering on the Hadoop platform and the use of the TotalOrderPartitioner class to speed up the merge of the two logs (a minimal sketch follows after this list).
3. Tried filtering on Hadoop first and inserting only the new records into ES, to improve efficiency.
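A minimal driver sketch of total ordering with TotalOrderPartitioner and InputSampler, following the pattern in "Hadoop: The Definitive Guide" and the Apache Hadoop 2.x API; the paths, the Text-keyed sequence-file input, and the reducer count are assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Minimal sketch: produce globally sorted output keyed by wid (Text).
public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wid total sort");
        job.setJarByClass(TotalSortDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);   // assumes Text keys in the sequence files
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        job.setNumReduceTasks(30);
        FileInputFormat.addInputPath(job, new Path("/data/wid_seq"));        // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/wid_sorted"));

        // Sample 1% of keys (at most 10,000, from at most 10 splits) to build the partition file.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
        String partitionFile = TotalOrderPartitioner.getPartitionFile(job.getConfiguration());
        job.addCacheFile(new URI(partitionFile));                 // ship the partition file to all tasks

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}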

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-6
Given the large amount of data that must be processed each day to insert information into ES (distinct wid ≈ 1.2 billion), the following approach is being implemented to improve the efficiency of inserting phone information into ES:
1. Store the already-inserted data (the full set) in HDFS; the full set is totally ordered by wid and grouped into files by the first two characters of the wid as a prefix.
2. Each time incremental data is to be inserted, read the corresponding files from HDFS according to the first two characters of the wid, merge-sort them against the increment, and filter out the wids that have not been inserted yet or whose data has changed; time complexity O(n).
3. Bulk-insert the filtered data that needs to go into ES in batches of 1,000 via the ES Java API.
4. Write the merged, sorted data back to the full-set files.
------------------------------------
The above process requires the Java APIs of both Hadoop and ES. Since I previously wrote MapReduce programs in PHP, I am now working through "Hadoop: The Definitive Guide" to get familiar with writing MapReduce in Java, as well as Hadoop's advanced features such as file grouping, total ordering, binary input/output, data compression, and cache files. A minimal sketch of step 2's merge filtering follows below.
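A minimal sketch of step 2, assuming the full-set file and the day's increment for one 2-character wid prefix are both sorted by wid and stored as "wid<TAB>json" lines; the paths and layout are illustrative only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Merge one sorted full-set file with one sorted increment file (same wid prefix),
// emit only new or changed records for ES, and write the merged result back. O(n).
public class PrefixMergeFilter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader full = new BufferedReader(new FileReader("full/ab.tsv"));       // hypothetical paths
             BufferedReader inc = new BufferedReader(new FileReader("increment/ab.tsv"));
             PrintWriter toEs = new PrintWriter("to_es/ab.tsv");
             PrintWriter newFull = new PrintWriter("full_new/ab.tsv")) {
            String f = full.readLine(), i = inc.readLine();
            while (f != null || i != null) {
                String[] ff = f == null ? null : f.split("\t", 2);   // [wid, json]
                String[] ii = i == null ? null : i.split("\t", 2);
                int cmp = (ff == null) ? 1 : (ii == null) ? -1 : ff[0].compareTo(ii[0]);
                if (cmp < 0) {                      // only in the full set: keep as is
                    newFull.println(f);
                    f = full.readLine();
                } else if (cmp > 0) {               // new wid: insert into ES, add to full set
                    toEs.println(i);
                    newFull.println(i);
                    i = inc.readLine();
                } else {                            // wid already present: insert only if the data changed
                    if (!ff[1].equals(ii[1])) {
                        toEs.println(i);
                    }
                    newFull.println(i);             // keep the newest version in the full set
                    f = full.readLine();
                    i = inc.readLine();
                }
            }
        }
    }
}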

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-9
Continued writing the Java program that inserts data into ES.
1. Tried serializing the data with Hadoop's built-in MapWritable and ArrayWritable, and found parsing inefficient, with a large number of boxing/unboxing operations.
So I switched to a third-party JSON library to serialize the data instead, which works better; however, since the strings are stored directly, the generated files are large unless compressed.
2. Because the Java program is not easy to debug, some detail problems took extra time to solve; fortunately the program can now basically run end to end.
3. The next step is to incorporate the ES insert API and run scheduled tasks on the computing platform; this is being implemented.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-10
1. Completed the Java program that inserts data into ES via Hadoop, and inserted the day's data.
2. Due to a parsing problem with the original log format, wrong data was inserted into the phone-number field. The first plan was: use update to delete the phone-number field, then update with the correct data.
However, after looking into it I found that the Lucene segments underlying ES are read-only, i.e. an update is internally a get, then merge, then delete, then insert, which is very inefficient.
Therefore the only option was to delete the data and reinsert it.
-----------------------------
The above is a lesson:
Because the data volume is large and writing to ES is expensive, every step of the program should be written carefully and tested on a small sample first.
The problem that Hadoop+ES programs written in Java are hard to debug is still unresolved.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-11
1. Learned how to use MRUnit to unit-test MapReduce programs (a minimal sketch follows after this list).
2. Used MRUnit to test the algorithm that merges data from another HDFS file inside the reducer.
3. Applied the successfully tested algorithm to the actual MapReduce program.
4. Out of curiosity about how MRUnit simulates the MR environment locally, reviewed some of the MRUnit source code, learned about the Mockito Java mocking library, and studied mock-based testing.
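A minimal MRUnit sketch; the reducer here is a toy stand-in that joins values, not the real merge logic:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MergeReducerTest {

    // Toy reducer: joins all values for a wid with ';'.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text wid, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                if (sb.length() > 0) sb.append(';');
                sb.append(v.toString());
            }
            ctx.write(wid, new Text(sb.toString()));
        }
    }

    @Test
    public void joinsValuesForOneWid() throws IOException {
        ReduceDriver.newReduceDriver(new JoinReducer())
            .withInput(new Text("wid1"), Arrays.asList(new Text("imei=1"), new Text("imsi=2")))
            .withOutput(new Text("wid1"), new Text("imei=1;imsi=2"))
            .runTest();   // fails if the reducer's actual output differs from the expected output
    }
}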

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-13
1. One file (part-r-00339) of the ES insert data previously saved in Hadoop was missing, for unknown reasons. Re-ran the MapReduce job that feeds the ES insert, and the results were normal.
2. The ES bulk operation in the reducer's cleanup() skipped a boundary check, which caused the MR program to fail; after fixing it the job ran successfully, but due to the large data volume this took a lot of time (see the sketch after this list).
3. The periodic task that inserts ES data is finally running on the computing platform.
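A minimal sketch of the cleanup() fix, again assuming the ES 1.x Java API: the final flush must check that the batch is non-empty before submitting it (class names, index/type, and the value layout are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Minimal sketch of the reducer-side bulk flush.
public class EsInsertReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
    private Client client;
    private BulkRequestBuilder bulk;

    @Override
    protected void setup(Context context) {
        client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("eseng1.safe.lycc.qihoo.net", 9300));
        bulk = client.prepareBulk();
    }

    @Override
    protected void reduce(Text wid, Iterable<Text> jsons, Context context) {
        for (Text json : jsons) {
            bulk.add(client.prepareIndex("dlc_datamining", "phone_info", wid.toString())
                           .setSource(json.toString()));
        }
        if (bulk.numberOfActions() >= 1000) {   // flush full batches as they accumulate
            bulk.execute().actionGet();
            bulk = client.prepareBulk();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (bulk.numberOfActions() > 0) {       // the missing boundary check: never submit an empty bulk
            bulk.execute().actionGet();
        }
        client.close();
    }
}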

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-16
1. Ran the ES phone-information insert task on the computing platform; there were some file-access issues, which have been fixed.
2. Originally the MR program saved the increment plus the previous day's full set (i.e. today's full set) and deleted yesterday's full set; it now saves both the increment and the full set every day, for later inspection.
3. Read some of the source code of the ES Java API, especially the network-communication part, and found that the Java API communicates over raw sockets rather than HTTP; the protocol is not simpler than HTTP, though, and the implementation is more complex.
4. Since Java development is more efficient with an IDE and this is inconvenient on Windows, I installed a CentOS virtual machine with a basic configuration; however, the VM's performance is worrying and the graphical interface is slow.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-17
1. Solved the timeout problem caused by overloading ES with batch submissions. The workaround (see the sketch after this list):
① Add catch (ElasticsearchTimeoutException) error handling.
② Increase the timeout parameter to 60 seconds.
Running again no longer throws the exception, and avoids the earlier problem of about 50% of reduce tasks needing to retry twice.

2. Studied the security/authentication mechanism of Hadoop and HDFS along with the Hadoop source code.
3. Studied Hadoop's RPC communication protocol and how it is implemented.
4. Studied ES's search functionality in depth.
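A minimal sketch of the two fixes in item 1, assuming the ES 1.x Java API: a 60-second timeout on the bulk call plus catch-and-retry on ElasticsearchTimeoutException (the retry policy is illustrative):

import org.elasticsearch.ElasticsearchTimeoutException;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.common.unit.TimeValue;

// Submit a bulk request with a 60 s timeout and retry once on timeout.
public class BulkWithTimeout {
    static BulkResponse submit(BulkRequestBuilder bulk) {
        bulk.setTimeout(TimeValue.timeValueSeconds(60));           // ② raise the bulk timeout to 60 s
        for (int attempt = 0; ; attempt++) {
            try {
                return bulk.execute().actionGet(TimeValue.timeValueSeconds(60));
            } catch (ElasticsearchTimeoutException e) {            // ① handle the timeout instead of failing the task
                if (attempt >= 1) {
                    throw e;                                        // give up after one retry
                }
                try {
                    Thread.sleep(5000);                             // brief back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}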

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-19 ~ 2015-11-20
1. Learned more about the ES query API and how to query data in ES from Java.
Special attention:
By default ES analyzes (tokenizes) inserted data, which affects exact-match queries; the default analyzer strips symbols and splits on whitespace. So if you insert a whole string and do not want ES to analyze it, you need to add the following mapping option before indexing; this has a large impact on later queries and searches.
PUT /my_store
{
  "mappings": {
    "products": {
      "properties": {
        "productID": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
2. Continued studying Hadoop's communication mechanism through the source code.
3. Studied the principle and implementation of the Java dynamic proxy used in Hadoop's RPC protocol.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-23 ~ 2015-11-25
1. Helped Li Wei use Hive to query data for specified mids in the cloud log.
2. The mid values previously in ES were extracted from the V5 log; while loading mids into ES, a large number of duplicates were found. After consulting Chen Ming, I learned that the reason is that some V5 mids are not real values but hard-coded.
The ES index was rebuilt from the backup data with the mid field removed.
3. The mid data was instead taken from the wid2mid log and loaded into ES.
4. Solved the problem of the Hadoop ES-loading program running too slowly: the V5 log consists of a huge number of small files, and without CombineTextInputFormat the number of maps exceeds 190,000 and one MapReduce run takes 3 hours (one day of data), because by default a map cannot span files.
After switching to the CombineTextInputFormat input format, the number of maps dropped to 3,000+ and the runtime dropped to half an hour (a minimal sketch follows after this list).
This led to a deeper study of Hadoop's splitting, partitioning, and grouping mechanisms with "Hadoop: The Definitive Guide".
5. Found that when using MultipleInputs, the input path cannot contain commas. Reading the source shows this is a bug in Hadoop itself: the configuration value is formatted as "<path 1>,<mapper class 1>;<path 2>,<mapper class 2>", with path and mapper separated by a comma,
so a comma inside a path causes an exception. This means you cannot use a wildcard containing commas, such as /xxx/2015{11,12}/, when setting paths with MultipleInputs.
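A minimal driver fragment showing the same fix in a Java job, assuming the new-API CombineTextInputFormat (Apache Hadoop 2.x package); the split size and input path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Minimal sketch: let one map read many small V5 log files instead of one map per file.
public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "es loader");
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setInputFormatClass(CombineTextInputFormat.class);                       // pack many small files per split
        CombineTextInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024 * 1024);  // ~10 GB per split (illustrative)
        FileInputFormat.addInputPath(job, new Path("/logs/v5/2015-10-15"));          // hypothetical path
        // mapper/reducer/output settings omitted
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}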

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-11-27
1. To avoid having to upload third-party jars (packaged together with the MR program into one jar) every time an MR task is run, I intend to use Hadoop's distributed cache to put jars that already exist on HDFS onto the classpath.
No success so far; it still throws a class-not-found error.
2. Studied Hadoop's distributed cache and the Java ClassLoader mechanism in depth along with the source code.
The problem may be in how the path is referenced;
it is being investigated step by step.

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-11
1. The HDFS block size on the Hadoop computing platform is 256 MB rather than the default 64 MB, so map split control needs special attention.
2. HDFS blocks cannot span files; even a 1 KB file occupies a block. However, the block size (e.g. 256 MB) is a logical unit (the smallest unit Hadoop uses when processing files); a 1 KB file still occupies only 1 KB of actual HDFS disk space.
3. Some of the business's raw logs consist of many small files (a few MB each), e.g. the V5 upgrade and cloud Avira logs; the usual fix is to set the input format to CombineTextInputFormat to control the split granularity.
However, when Hadoop processes the input it calls the List<InputSplit> getSplits() method implemented by the InputFormat subclass, and each InputSplit contains its input path, so with a huge number of small files as input
an OutOfMemory exception occurs (for example, with 2 days of cloud logs as input, even a 2 GB JVM heap is not enough). Hive runs into the same problem.
As a result, even though the data volume is not large, the work still has to be split into several jobs because the data consists of huge numbers of small files, which greatly limits Hadoop's computing power.
The solution to this problem is:
① The bottleneck is that the List<InputSplit> container consumes a lot of memory; it is not necessary to keep all input files in the list.
The source shows that Hadoop eventually writes the split information to a job.split file and uploads it to the job's temporary directory, and the map tasks rely only on that file for their input splits.
So you can generate this file yourself (the source of the generating method is very clear); FileSystem's listLocatedStatus method returns a RemoteIterator<LocatedFileStatus>, so the input file information can be written to job.split in a streaming fashion, avoiding the memory overflow (see the sketch after this list).
② For a multithreaded version of listLocatedStatus, refer to the latest (2.7) Hadoop source; it is slightly more complex but speeds up listing directories with huge numbers of small files.
③ The getSplits method of the CombineFileInputFormat class can be consulted to implement the small-file combining as needed.
4. When writing MapReduce input paths, avoid (or use with care) wildcards. For example, if the dir directory contains 20,000 files (no subdirectories), a Java MR program using input=/dir/* has the same effect as input=/dir/, but the latter is nearly 10 times faster than the former; this can be verified with hadoop fs -ls.
It is related to how Hadoop resolves file paths.
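A minimal sketch of the core idea in ①: enumerate the input files with FileSystem.listLocatedStatus instead of materializing them all in a list. Writing the actual job.split file is omitted; only the streaming enumeration is shown, and the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Enumerate a directory of many small files without holding them all in memory.
public class StreamingFileListing {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        RemoteIterator<LocatedFileStatus> it =
                fs.listLocatedStatus(new Path("/logs/cloud/2015-12-10"));   // hypothetical input directory
        long files = 0, bytes = 0;
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();   // one file at a time; block locations included
            files++;
            bytes += status.getLen();
            // here each file's split information could be appended to job.split in a streaming fashion
        }
        System.out.println(files + " files, " + bytes + " bytes");
    }
}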

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-15
Summary of Hadoop notes (continued from 12-11)
5. The Hadoop platform uses a Facebook-based version, and some APIs exist or behave differently from the Apache version.
① The distributed cache requires the DistributedCache.addSharedXxx methods, such as addSharedCacheArchive and addSharedArchiveToClassPath, instead of the usual addCacheArchive; without the "shared" variants the caching mechanism cannot be used (verified by experiment).
② The Facebook version uses CoronaJobTracker, similar to YARN in MapReduce v2.
CoronaJobTracker has three modes:
- In-process: in this mode the CJT performs its entire functionality in the same process as the JobClient.
- Forwarding: the CJT just forwards the calls to a remote CJT.
- Standalone: this is the remote CJT that serves the calls from the forwarding CJT.
The Hadoop client machine (05V) is configured with mapred.coronajobtracker.forceremote=true by default, which forces CoronaJobTracker to start in remote (forwarding) mode. This setting is easy to manage centrally, but it wastes resources for small MR tasks (by default those with fewer than 1,000 maps).
For small tasks you can use in-process mode by setting forceremote=false and ensuring the number of maps is below 1,000 (the threshold is configurable); a minimal sketch follows below.
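A minimal illustration of switching a small job to in-process mode. The property name comes from the platform configuration above; overriding it per job in the driver, and the standard Job API shown here, are assumptions about how it would be done on this cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Override the client default for a small job (< 1000 maps).
public class SmallJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.coronajobtracker.forceremote", false);  // run the CJT in-process with the JobClient
        Job job = Job.getInstance(conf, "small job");
        // ... input/output/mapper settings omitted ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}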

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-21
1. When using PHP's curl module to send data to ES, set the option curl_setopt($ch, CURLOPT_HTTPHEADER, array("Expect:")); otherwise curl first sends an Expect: 100-continue handshake, waits for ES to respond with HTTP code 100, and only then transmits the data. Suppressing the handshake is more efficient.

2. In Hadoop streaming mode, combining the -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat option with reasonable settings of
-D mapred.max.split.size=$[10*1024*1024*1024]
-D mapred.max.num.blocks.per.split=99999999
can significantly improve the efficiency of MR jobs whose input is large and spread over huge numbers of files (such as the cloud logs).
Java programs can do the same.

3. In Hadoop streaming mode you can read Hadoop configuration values through environment variables: just replace the "." in the jobconf variable name with "_". For example, to get the value of mapred.input.dir at run time in PHP, use $var = getenv('mapred_input_dir');
This greatly simplifies developing MR programs with the streaming model.
