360 Internship Logbook 2015.10 ~ 2015.12

Source: Internet
Author: User
Tags: hadoop fs

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-13
1. Used the computing platform to compute statistics on the cloud logs with spe_num=502306 for 2015-09-28 ~ 2015-10-11, a total of 14 days. Each day yields roughly 100,000-200,000 records, so the two weeks total about 2 million records.
Due to the large data volume, the results cannot be downloaded directly from the computing platform; the download fails with a memory-overflow error.
2. Started learning MongoDB.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-14
1. Computed UV/PV for the cloud Avira log with spe_num=502306 over the two weeks 2015-09-28 ~ 2015-10-11.
2. Tried to verify the correspondence between the wid and IMEI fields in the call show log and the V5 log.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-16
1. Learned to run Hive programs using cloud images.
2. Used the computing platform to extract one month of cloud logs with spe_num=501833.
3. Learned how Hive partitioning works and how to optimize Hive.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-19
1. Completed the UV/PV statistics for one month of cloud Avira logs with spe_num=501833.
2. Applied for a MongoDB test environment and ran a simple test of insert, delete, update, and query statements.
3. Deduplicated the IMEIs in the 2015-10-15 call show log, in preparation for verifying their correspondence with the wids in the V5 log.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-20
Computed the intersection of IMEIs between one day of the call show log and the V5 log. The steps are:
1. Deduplicate the IMEIs in the call show log and in the V5 log.
2. Sort the two logs separately.
3. Walk through the two sorted logs and compare the IMEI-wid correspondence: for an IMEI that appears in both logs, count how many wids it corresponds to, and output the IMEI-wid pairs that do not correspond (a minimal sketch follows below). Time complexity O(max(n, m)), where n is the number of V5 log records and m the number of call show log records.
Note: processing is slow because each log has nearly 70 million rows.
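A minimal sketch of step 3, simplified to one wid per IMEI after deduplication; the file names and the tab-separated "imei<TAB>wid" layout are assumptions for illustration, not the original job's paths:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sequentially compare two IMEI-sorted logs of "imei\twid" lines.
public class ImeiWidCompare {
    public static void main(String[] args) throws IOException {
        try (BufferedReader v5 = new BufferedReader(new FileReader("v5_sorted.tsv"));
             BufferedReader callShow = new BufferedReader(new FileReader("callshow_sorted.tsv"))) {
            String a = v5.readLine(), b = callShow.readLine();
            long matched = 0, mismatched = 0;
            while (a != null && b != null) {
                String[] fa = a.split("\t");   // [imei, wid] from the V5 log
                String[] fb = b.split("\t");   // [imei, wid] from the call show log
                int cmp = fa[0].compareTo(fb[0]);
                if (cmp < 0) {                 // IMEI only in the V5 log
                    a = v5.readLine();
                } else if (cmp > 0) {          // IMEI only in the call show log
                    b = callShow.readLine();
                } else {                       // IMEI in both logs: compare the wids
                    if (fa[1].equals(fb[1])) {
                        matched++;
                    } else {
                        mismatched++;
                        System.out.println(fa[0] + "\t" + fa[1] + "\t" + fb[1]);
                    }
                    a = v5.readLine();
                    b = callShow.readLine();
                }
            }
            System.err.println("matched=" + matched + " mismatched=" + mismatched);
        }
    }
}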

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-22
1. Completed the call show / V5 log wid correspondence statistics task.
2. Studied Elasticsearch.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-23
1. Set up dlc_datamining/phone_info/{wid} documents on the small ES cluster.
Example: http://eseng1.safe.lycc.qihoo.net:9200/dlc_datamining/phone_info/0ff24eec0b6a6153522be24dabe422f0
The current document content is {"imei": "xxxx", "imsi": "xxxx", "phone_num": "xxxx"}, to be extended later.

The loader is still written in PHP and is slow; it has inserted more than 400,000 records so far, with the main bottleneck being the HTTP requests.
Later I read the documentation and found that records can be packed and inserted in bulk mode. I have configured the official Python SDK and am rewriting the loader in Python.

2. Studied Python and Elasticsearch.

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-26
The program that inserts phone_info into ES has been rewritten in Java, because the official Java API documentation is detailed and Java supports multithreading.
More than 65 million records have been inserted so far; statistics can be viewed at http://eseng1.safe.lycc.qihoo.net:9200/_cat/indices?v
The Java program currently bulk-inserts in batches of 1,000 documents, but a few documents appear to be missed and not inserted; the cause is still to be found. Based on references, the batch size may be too large. A minimal bulk-insert sketch follows below.
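This is a sketch only, assuming the ES 1.x Java TransportClient API that was current at the time; the record reader, record class, and exact source fields are hypothetical:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Minimal sketch: bulk-index phone_info documents in batches of 1,000.
public class PhoneInfoBulkLoader {
    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("eseng1.safe.lycc.qihoo.net", 9300));
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Record r : readRecords()) {               // readRecords() is a hypothetical input source
            bulk.add(client.prepareIndex("dlc_datamining", "phone_info", r.wid)
                           .setSource(r.json));        // JSON document built upstream on Hadoop
            if (bulk.numberOfActions() >= 1000) {      // flush every 1,000 documents
                BulkResponse resp = bulk.execute().actionGet();
                if (resp.hasFailures()) {
                    System.err.println(resp.buildFailureMessage());  // log failures so nothing is silently dropped
                }
                bulk = client.prepareBulk();
            }
        }
        if (bulk.numberOfActions() > 0) {              // flush the final partial batch
            bulk.execute().actionGet();
        }
        client.close();
    }

    // Hypothetical record holder and reader, just to make the sketch self-contained.
    static class Record { String wid; String json; }
    static Iterable<Record> readRecords() { return java.util.Collections.emptyList(); }
}

Checking hasFailures() on every bulk response is also one way to track down the occasionally missed documents mentioned above.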

RE: Gio Ching ~ Daily-by Gio Ching added 5 months ago

2015-10-27
1. Used the hi.adstag field of the IPC log to count, for the 16th, which software has been installed by users who have the SE / speed guard installed.
2. Computed the full ipc_mid_adstag for 2015-10-14 ~ 2015-10-18, approximately 170 million records.
3. Used Hadoop to generate, from the 2015-10-15 V5 log, each wid and its corresponding IMEI, MAC, IP, brand, and other information (in JSON format) for later insertion into ES.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-10-28
In my own directory on Build17:
1. Given how fast development is in PHP, configured the official ES PHP SDK environment.
2. Compiled and installed a newer version of PHP, and installed the SSH and multithreading extensions.
3. The official ES PHP SDK documentation is relatively brief, so I implemented PHP-based insert, update, and upsert operations against the ES cluster myself to get familiar with the official ES PHP API.
4. The JSON information for the 2015-10-15 V5 log has been generated on the Hadoop cluster.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-2
1. Merged the wid and corresponding IMEI, IMSI, MAC, IP, brand, and other information from the call show and V5 logs on Hadoop, and generated the JSON there, to avoid the inefficiency of merging/updating JSON inside ES.
2. Learned how to use Hadoop's MultipleOutputs to write MapReduce results to specified folders. For example: wids that correspond to exactly one mid are written to the one2one directory, while wids that correspond to multiple mids are written to the one2many directory. A minimal sketch follows below.
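A minimal reducer sketch of the one2one/one2many split, assuming Text keys and values; the class name and value layout are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Minimal sketch: route wid -> mid results to one2one/ or one2many/ under the job output directory.
public class WidMidReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text wid, Iterable<Text> mids, Context context)
            throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        int count = 0;
        for (Text mid : mids) {
            if (count > 0) joined.append(',');
            joined.append(mid.toString());
            count++;
        }
        // The third argument is the base output path, relative to the job's output directory.
        String base = (count == 1) ? "one2one/part" : "one2many/part";
        mos.write(wid, new Text(joined.toString()), base);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // flush the extra outputs
    }
}

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is typically used alongside this so that empty default part-r-xxxxx files are not created.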

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-3
1. Following the method in http://stackoverflow.com/questions/18541503/multiple-output-files-for-hadoop-streaming-with-python-mapper, used MultipleTextOutputFormat to write wids corresponding to one mid and wids corresponding to multiple mids to the one2one and one2many directories respectively.
2. Inserted phone information into ES via PHP's multithreading libraries; the main performance bottleneck is the HTTP requests.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-5
1. Continued learning the features and usage of the ES platform.
2. Studied total ordering on the Hadoop platform and the use of the TotalOrderPartitioner class to speed up the merge of the two logs (a minimal sketch follows after this list).
3. Tried filtering on Hadoop first and inserting only the new records into ES, to improve efficiency.
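A minimal driver sketch of total ordering with TotalOrderPartitioner and InputSampler, following the pattern in "Hadoop: The Definitive Guide" and the Apache Hadoop 2.x API; the paths, the Text-keyed sequence-file input, and the reducer count are assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Minimal sketch: produce globally sorted output keyed by wid (Text).
public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wid total sort");
        job.setJarByClass(TotalSortDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);   // assumes Text keys in the sequence files
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        job.setNumReduceTasks(30);
        FileInputFormat.addInputPath(job, new Path("/data/wid_seq"));        // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/wid_sorted"));

        // Sample 1% of keys (at most 10,000, from at most 10 splits) to build the partition file.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
        String partitionFile = TotalOrderPartitioner.getPartitionFile(job.getConfiguration());
        job.addCacheFile(new URI(partitionFile));                 // ship the partition file to all tasks

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}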

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-6
Given the large amount of data that must be processed each day to insert information into ES (distinct wid ≈ 1.2 billion), the following approach is being implemented to improve the efficiency of inserting phone information into ES:
1. Store the already-inserted data (the full set) in HDFS; the full set is totally ordered by wid and grouped into files by the first two characters of the wid as a prefix.
2. Each time incremental data is to be inserted, read the corresponding files from HDFS according to the first two characters of the wid, merge-sort them against the increment, and filter out the wids that have not been inserted yet or whose data has changed; time complexity O(n).
3. Bulk-insert the filtered data that needs to go into ES in batches of 1,000 via the ES Java API.
4. Write the merged, sorted data back to the full-set files.
------------------------------------
The above process requires the Java APIs of both Hadoop and ES. Since I previously wrote MapReduce programs in PHP, I am now working through "Hadoop: The Definitive Guide" to get familiar with writing MapReduce in Java, as well as Hadoop's advanced features such as file grouping, total ordering, binary input/output, data compression, and cache files. A minimal sketch of step 2's merge filtering follows below.
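A minimal sketch of step 2, assuming the full-set file and the day's increment for one 2-character wid prefix are both sorted by wid and stored as "wid<TAB>json" lines; the paths and layout are illustrative only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Merge one sorted full-set file with one sorted increment file (same wid prefix),
// emit only new or changed records for ES, and write the merged result back. O(n).
public class PrefixMergeFilter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader full = new BufferedReader(new FileReader("full/ab.tsv"));       // hypothetical paths
             BufferedReader inc = new BufferedReader(new FileReader("increment/ab.tsv"));
             PrintWriter toEs = new PrintWriter("to_es/ab.tsv");
             PrintWriter newFull = new PrintWriter("full_new/ab.tsv")) {
            String f = full.readLine(), i = inc.readLine();
            while (f != null || i != null) {
                String[] ff = f == null ? null : f.split("\t", 2);   // [wid, json]
                String[] ii = i == null ? null : i.split("\t", 2);
                int cmp = (ff == null) ? 1 : (ii == null) ? -1 : ff[0].compareTo(ii[0]);
                if (cmp < 0) {                      // only in the full set: keep as is
                    newFull.println(f);
                    f = full.readLine();
                } else if (cmp > 0) {               // new wid: insert into ES, add to full set
                    toEs.println(i);
                    newFull.println(i);
                    i = inc.readLine();
                } else {                            // wid already present: insert only if the data changed
                    if (!ff[1].equals(ii[1])) {
                        toEs.println(i);
                    }
                    newFull.println(i);             // keep the newest version in the full set
                    f = full.readLine();
                    i = inc.readLine();
                }
            }
        }
    }
}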

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-9
Continued writing the Java program that inserts data into ES.
1. Tried serializing the data with Hadoop's built-in MapWritable and ArrayWritable, and found parsing inefficient, with a large number of boxing/unboxing operations.
So I switched to a third-party JSON library to serialize the data instead, which works better; however, since the strings are stored directly, the generated files are large unless compressed.
2. Because the Java program is not easy to debug, some detail problems took extra time to solve; fortunately the program can now basically run end to end.
3. The next step is to incorporate the ES insert API and run scheduled tasks on the computing platform; this is being implemented.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-10
1. Completed the Java program that inserts data into ES via Hadoop, and inserted the day's data.
2. Due to a parsing problem with the original log format, wrong data was inserted into the phone-number field. The first plan was: use update to delete the phone-number field, then update with the correct data.
However, after looking into it I found that the Lucene segments underlying ES are read-only, i.e. an update is internally a get, then merge, then delete, then insert, which is very inefficient.
Therefore the only option was to delete the data and reinsert it.
-----------------------------
The above is a lesson:
Because the data volume is large and writing to ES is expensive, every step of the program should be written carefully and tested on a small sample first.
The problem that Hadoop+ES programs written in Java are hard to debug is still unresolved.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-11
1. Learned how to use MRUnit to unit-test MapReduce programs (a minimal sketch follows after this list).
2. Used MRUnit to test the algorithm that merges data from another HDFS file inside the reducer.
3. Applied the successfully tested algorithm to the actual MapReduce program.
4. Out of curiosity about how MRUnit simulates the MR environment locally, reviewed some of the MRUnit source code, learned about the Mockito Java mocking library, and studied mock-based testing.
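A minimal MRUnit sketch; the reducer here is a toy stand-in that joins values, not the real merge logic:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MergeReducerTest {

    // Toy reducer: joins all values for a wid with ';'.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text wid, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                if (sb.length() > 0) sb.append(';');
                sb.append(v.toString());
            }
            ctx.write(wid, new Text(sb.toString()));
        }
    }

    @Test
    public void joinsValuesForOneWid() throws IOException {
        ReduceDriver.newReduceDriver(new JoinReducer())
            .withInput(new Text("wid1"), Arrays.asList(new Text("imei=1"), new Text("imsi=2")))
            .withOutput(new Text("wid1"), new Text("imei=1;imsi=2"))
            .runTest();   // fails if the reducer's actual output differs from the expected output
    }
}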

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-13
1. One file (part-r-00339) of the ES insert data previously saved in Hadoop was missing, for unknown reasons. Re-ran the MapReduce job that feeds the ES insert, and the results were normal.
2. The ES bulk operation in the reducer's cleanup() skipped a boundary check, which caused the MR program to fail; after fixing it the job ran successfully, but due to the large data volume this took a lot of time (see the sketch after this list).
3. The periodic task that inserts ES data is finally running on the computing platform.
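A minimal sketch of the cleanup() fix, again assuming the ES 1.x Java API: the final flush must check that the batch is non-empty before submitting it (class names, index/type, and the value layout are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Minimal sketch of the reducer-side bulk flush.
public class EsInsertReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
    private Client client;
    private BulkRequestBuilder bulk;

    @Override
    protected void setup(Context context) {
        client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("eseng1.safe.lycc.qihoo.net", 9300));
        bulk = client.prepareBulk();
    }

    @Override
    protected void reduce(Text wid, Iterable<Text> jsons, Context context) {
        for (Text json : jsons) {
            bulk.add(client.prepareIndex("dlc_datamining", "phone_info", wid.toString())
                           .setSource(json.toString()));
        }
        if (bulk.numberOfActions() >= 1000) {   // flush full batches as they accumulate
            bulk.execute().actionGet();
            bulk = client.prepareBulk();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (bulk.numberOfActions() > 0) {       // the missing boundary check: never submit an empty bulk
            bulk.execute().actionGet();
        }
        client.close();
    }
}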

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-16
1. Ran the ES phone-information insert task on the computing platform; there were some file-access issues, which have been fixed.
2. Originally the MR program saved the increment plus the previous day's full set (i.e. today's full set) and deleted yesterday's full set; it now saves both the increment and the full set every day, for later inspection.
3. Read some of the source code of the ES Java API, especially the network-communication part, and found that the Java API communicates over raw sockets rather than HTTP; the protocol is not simpler than HTTP, though, and the implementation is more complex.
4. Since Java development is more efficient with an IDE and this is inconvenient on Windows, I installed a CentOS virtual machine with a basic configuration; however, the VM's performance is worrying and the graphical interface is slow.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-17
1. Solved the timeout problem caused by overloading ES with batch submissions. The workaround (see the sketch after this list):
① Add catch (ElasticsearchTimeoutException) error handling.
② Increase the timeout parameter to 60 seconds.
Running again no longer throws the exception, and avoids the earlier problem of about 50% of reduce tasks needing to retry twice.

2. Studied the security/authentication mechanism of Hadoop and HDFS along with the Hadoop source code.
3. Studied Hadoop's RPC communication protocol and how it is implemented.
4. Studied ES's search functionality in depth.
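A minimal sketch of the two fixes in item 1, assuming the ES 1.x Java API: a 60-second timeout on the bulk call plus catch-and-retry on ElasticsearchTimeoutException (the retry policy is illustrative):

import org.elasticsearch.ElasticsearchTimeoutException;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.common.unit.TimeValue;

// Submit a bulk request with a 60 s timeout and retry once on timeout.
public class BulkWithTimeout {
    static BulkResponse submit(BulkRequestBuilder bulk) {
        bulk.setTimeout(TimeValue.timeValueSeconds(60));           // ② raise the bulk timeout to 60 s
        for (int attempt = 0; ; attempt++) {
            try {
                return bulk.execute().actionGet(TimeValue.timeValueSeconds(60));
            } catch (ElasticsearchTimeoutException e) {            // ① handle the timeout instead of failing the task
                if (attempt >= 1) {
                    throw e;                                        // give up after one retry
                }
                try {
                    Thread.sleep(5000);                             // brief back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}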

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-19 ~ 2015-11-20
1. Learned more about the ES query API and how to query data in ES from Java.
Special attention:
By default ES analyzes (tokenizes) inserted data, which affects exact-match queries; the default analyzer strips symbols and splits on whitespace. So if you insert a whole string and do not want ES to analyze it, you need to add the following mapping option before indexing; this has a large impact on later queries and searches.
PUT /my_store
{
  "mappings": {
    "products": {
      "properties": {
        "productID": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
2. Continued studying Hadoop's communication mechanism through the source code.
3. Studied the principle and implementation of the Java dynamic proxy used in Hadoop's RPC protocol.

RE: Gio Ching ~ Daily-by Gio Ching added 4 months ago

2015-11-23 ~ 2015-11-25
1. Helped Li Wei use Hive to query data for specified mids in the cloud log.
2. The mid values previously in ES were extracted from the V5 log; while loading mids into ES, a large number of duplicates were found. After consulting Chen Ming, I learned that the reason is that some V5 mids are not real values but hard-coded.
The ES index was rebuilt from the backup data with the mid field removed.
3. The mid data was instead taken from the wid2mid log and loaded into ES.
4. Solved the problem of the Hadoop ES-loading program running too slowly: the V5 log consists of a huge number of small files, and without CombineTextInputFormat the number of maps exceeds 190,000 and one MapReduce run takes 3 hours (one day of data), because by default a map cannot span files.
After switching to the CombineTextInputFormat input format, the number of maps dropped to 3,000+ and the runtime dropped to half an hour (a minimal sketch follows after this list).
This led to a deeper study of Hadoop's splitting, partitioning, and grouping mechanisms with "Hadoop: The Definitive Guide".
5. Found that when using MultipleInputs, the input path cannot contain commas. Reading the source shows this is a bug in Hadoop itself: the configuration value is formatted as "<path 1>,<mapper class 1>;<path 2>,<mapper class 2>", with path and mapper separated by a comma,
so a comma inside a path causes an exception. This means you cannot use a wildcard containing commas, such as /xxx/2015{11,12}/, when setting paths with MultipleInputs.
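A minimal driver fragment showing the same fix in a Java job, assuming the new-API CombineTextInputFormat (Apache Hadoop 2.x package); the split size and input path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Minimal sketch: let one map read many small V5 log files instead of one map per file.
public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "es loader");
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setInputFormatClass(CombineTextInputFormat.class);                       // pack many small files per split
        CombineTextInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024 * 1024);  // ~10 GB per split (illustrative)
        FileInputFormat.addInputPath(job, new Path("/logs/v5/2015-10-15"));          // hypothetical path
        // mapper/reducer/output settings omitted
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}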

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-11-27
1. To avoid having to upload third-party jars (packaged together with the MR program into one jar) every time an MR task is run, I intend to use Hadoop's distributed cache to put jars that already exist on HDFS onto the classpath.
No success so far; it still throws a class-not-found error.
2. Studied Hadoop's distributed cache and the Java ClassLoader mechanism in depth along with the source code.
The problem may be in how the path is referenced;
it is being investigated step by step.

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-11
1. The HDFS block size on the Hadoop computing platform is 256 MB rather than the default 64 MB, so map split control needs special attention.
2. HDFS blocks cannot span files; even a 1 KB file occupies a block. However, the block size (e.g. 256 MB) is a logical unit (the smallest unit Hadoop uses when processing files); a 1 KB file still occupies only 1 KB of actual HDFS disk space.
3. Some of the business's raw logs consist of many small files (a few MB each), e.g. the V5 upgrade and cloud Avira logs; the usual fix is to set the input format to CombineTextInputFormat to control the split granularity.
However, when Hadoop processes the input it calls the List<InputSplit> getSplits() method implemented by the InputFormat subclass, and each InputSplit contains its input path, so with a huge number of small files as input
an OutOfMemory exception occurs (for example, with 2 days of cloud logs as input, even a 2 GB JVM heap is not enough). Hive runs into the same problem.
As a result, even though the data volume is not large, the work still has to be split into several jobs because the data consists of huge numbers of small files, which greatly limits Hadoop's computing power.
The solution to this problem is:
① The bottleneck is that the List<InputSplit> container consumes a lot of memory; it is not necessary to keep all input files in the list.
The source shows that Hadoop eventually writes the split information to a job.split file and uploads it to the job's temporary directory, and the map tasks rely only on that file for their input splits.
So you can generate this file yourself (the source of the generating method is very clear); FileSystem's listLocatedStatus method returns a RemoteIterator<LocatedFileStatus>, so the input file information can be written to job.split in a streaming fashion, avoiding the memory overflow (see the sketch after this list).
② For a multithreaded version of listLocatedStatus, refer to the latest (2.7) Hadoop source; it is slightly more complex but speeds up listing directories with huge numbers of small files.
③ The getSplits method of the CombineFileInputFormat class can be consulted to implement the small-file combining as needed.
4. When writing MapReduce input paths, avoid (or use with care) wildcards. For example, if the dir directory contains 20,000 files (no subdirectories), a Java MR program using input=/dir/* has the same effect as input=/dir/, but the latter is nearly 10 times faster than the former; this can be verified with hadoop fs -ls.
It is related to how Hadoop resolves file paths.
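A minimal sketch of the core idea in ①: enumerate the input files with FileSystem.listLocatedStatus instead of materializing them all in a list. Writing the actual job.split file is omitted; only the streaming enumeration is shown, and the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Enumerate a directory of many small files without holding them all in memory.
public class StreamingFileListing {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        RemoteIterator<LocatedFileStatus> it =
                fs.listLocatedStatus(new Path("/logs/cloud/2015-12-10"));   // hypothetical input directory
        long files = 0, bytes = 0;
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();   // one file at a time; block locations included
            files++;
            bytes += status.getLen();
            // here each file's split information could be appended to job.split in a streaming fashion
        }
        System.out.println(files + " files, " + bytes + " bytes");
    }
}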

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-15
Summary of Hadoop notes (continued from 12-11)
5. The Hadoop platform uses a Facebook-based version, and some APIs exist or behave differently from the Apache version.
① The distributed cache requires the DistributedCache.addSharedXxx methods, such as addSharedCacheArchive and addSharedArchiveToClassPath, instead of the usual addCacheArchive; without the "shared" variants the caching mechanism cannot be used (verified by experiment).
② The Facebook version uses CoronaJobTracker, similar to YARN in MapReduce v2.
CoronaJobTracker has three modes:
- In-process: in this mode the CJT performs its entire functionality in the same process as the JobClient.
- Forwarding: the CJT just forwards the calls to a remote CJT.
- Standalone: this is the remote CJT that serves the calls from the forwarding CJT.
The Hadoop client machine (05V) is configured with mapred.coronajobtracker.forceremote=true by default, which forces CoronaJobTracker to start in remote (forwarding) mode. This setting is easy to manage centrally, but it wastes resources for small MR tasks (by default those with fewer than 1,000 maps).
For small tasks you can use in-process mode by setting forceremote=false and ensuring the number of maps is below 1,000 (the threshold is configurable); a minimal sketch follows below.
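A minimal illustration of switching a small job to in-process mode. The property name comes from the platform configuration above; overriding it per job in the driver, and the standard Job API shown here, are assumptions about how it would be done on this cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Override the client default for a small job (< 1000 maps).
public class SmallJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.coronajobtracker.forceremote", false);  // run the CJT in-process with the JobClient
        Job job = Job.getInstance(conf, "small job");
        // ... input/output/mapper settings omitted ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}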

RE: Gio Ching ~ Daily-by Gio Ching added 3 months ago

2015-12-21
1. When using PHP's curl module to send data to ES, set the option curl_setopt($ch, CURLOPT_HTTPHEADER, array("Expect:")); otherwise curl first sends an Expect: 100-continue handshake, waits for ES to respond with HTTP code 100, and only then transmits the data. Suppressing the handshake is more efficient.

2. In Hadoop streaming mode, combining the -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat option with reasonable settings of
-D mapred.max.split.size=$[10*1024*1024*1024]
-D mapred.max.num.blocks.per.split=99999999
can significantly improve the efficiency of MR jobs whose input is large and spread over huge numbers of files (such as the cloud logs).
Java programs can do the same.

3. In Hadoop streaming mode you can read Hadoop configuration values through environment variables: just replace the "." in the jobconf variable name with "_". For example, to get the value of mapred.input.dir at run time in PHP, use $var = getenv('mapred_input_dir');
This greatly simplifies developing MR programs with the streaming model.
