DockOne WeChat Share (99): Containerizing the HNA EcoTech Public Opinion Big Data Platform

[Editor's note] HNA's public opinion monitoring system tracks public opinion information across the web, raises timely alerts on negative or otherwise important information, follows how a specific topic or public opinion event develops and changes, and generates chart-based reports and various statistics, improving the efficiency of public opinion work and supporting leadership decisions. As the project kept running, however, many problems surfaced. To address them, the whole project was redesigned and migrated to a Hadoop/Spark big data platform, and Docker containers were introduced for continuous deployment and release, significantly improving development and operations efficiency.

I. Introduction to the Public Opinion Platform

The platform was originally built to strengthen the brand of HNA Group and its member enterprises, reduce the cost of disseminating key information, gain timely insight into customer feedback and public opinion, guide public opinion, and speed up responses to emergencies.

The work includes analyzing and predicting the spread of sensitive content across the Internet, social networks, and other channels, covering data collection, sentiment analysis, outbreak forecasting, sensitive-content early warning, and so on.

Current scale:

    • Weibo:
      We seed the crawler with Weibo accounts (some found through search, some being company accounts) and mine their followers recursively. We currently crawl about 200,000 items per day, will gradually add more seed accounts, and are also negotiating to buy access to Sina's open API;

    • News, forums, blogs:
      30 mainstream media sites;
      20 large forums;
      70 technology-industry sites;
      30 financial-industry sites;
      33 tourism-industry sites;
      30 aviation-industry sites;


Together with other sources such as WeChat public accounts, self-media, and fare websites in the same industry, we cover more than 300 sites in total, with over 30 data dimensions, about 1.5 million records per day, and close to 10 GB of data daily.

The main functions are as follows:
    • Data crawling: scheduled daily crawls of designated Weibo accounts and the latest news releases, stored for analysis
    • Data storage: stores Weibo posts, news content, images, etc., as well as intermediate analysis and calculation results
    • Weibo public opinion: statistical analysis, information monitoring, information retrieval
    • News public opinion: statistical analysis, information monitoring, information retrieval
    • Hot-word statistics: statistics on high-frequency hot words
    • Sentiment analysis: text analysis and sentiment orientation based on text content
    • Public opinion monitoring: filters information against specified sensitive words and sends notifications
    • Data interface services: external REST API data services
    • Hot event curation: search, with high-heat news and Weibo records ranked first
    • Image recognition and content analysis: (in progress)
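The hot-word statistics function above boils down to counting token frequencies across collected documents. A minimal sketch, assuming the Chinese text has already been segmented (the production pipeline does this with jieba); the stop-word list and token names are illustrative:

```python
from collections import Counter

# Hypothetical stop-word list; the real system would filter a much larger set.
STOP_WORDS = {"the", "of", "and", "a", "to"}

def hot_words(segmented_docs, top_n=10):
    """Count word frequencies across already-segmented documents.

    Single-character tokens and stop words are dropped, as is common
    for Chinese hot-word statistics after segmentation.
    """
    counter = Counter()
    for tokens in segmented_docs:
        counter.update(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
    return counter.most_common(top_n)

docs = [["flight", "delay", "weather"], ["flight", "refund"], ["weather", "flight"]]
print(hot_words(docs, top_n=2))  # [('flight', 3), ('weather', 2)]
```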


(Screenshots of the platform's display were shown at this point.)

II. Initial Architecture

When I joined the project, its structure was relatively simple, still in a validation phase: a traditional web application built with Spring Web MVC + MySQL, plus a crawler system for data collection and a text analysis model (CNN), with code review done through Git + GitLab.

Crawler:

Implemented in Java, built on top of the webmagic framework. Because every website has a different page layout, developers must write a dedicated crawler for each site to parse its pages. Deployment was manual: the crawlers were compiled into multiple executable jar packages, run according to a schedule, and deployed to multiple nodes separately, with data stored in a MySQL database on a dedicated node. This supported the initial 30 websites plus Weibo, at a volume of just under 200,000 records per day.
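The one-parser-per-site design can be sketched as a registry that maps each site to its own extraction function. The real crawlers are Java on webmagic; this Python schematic with illustrative names only shows the shape of the dispatch:

```python
# Registry mapping a site to its dedicated parse function.
SITE_PARSERS = {}

def parser(site):
    """Register a parse function for one site's page layout."""
    def wrap(fn):
        SITE_PARSERS[site] = fn
        return fn
    return wrap

@parser("news.example.com")
def parse_example_news(html):
    # Each site needs its own extraction rules because layouts differ;
    # naive string splitting stands in for real HTML parsing here.
    title = html.split("<h1>")[1].split("</h1>")[0]
    return {"site": "news.example.com", "title": title}

page = "<html><h1>Flight resumed</h1></html>"
record = SITE_PARSERS["news.example.com"](page)
print(record["title"])  # Flight resumed
```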

Text analysis model:

Implemented in Python, using the jieba word segmentation tool and a CNN (convolutional neural network) model with support for batched matrix operations. It runs as a Python web service (built on Tornado) exposing APIs that the crawler calls; the results are backfilled into the records, adding sentiment, heat, keyword, and other fields, before the records are written to the database.
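The backfill step can be sketched as follows: the crawler sends text to the model service and merges the returned fields into the record before storing it. The lexicon-based scoring below is a toy stand-in for the real jieba + CNN model, and all names are illustrative:

```python
# Toy sentiment lexicons standing in for the CNN model's output.
POSITIVE = {"good", "great", "resumed"}
NEGATIVE = {"delay", "crash", "angry"}

def analyze(text):
    """Return the analysis fields the model service would produce."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": sentiment,
            "keywords": sorted(set(tokens) & (POSITIVE | NEGATIVE))}

def backfill(record):
    """Merge the model's fields into the crawled record (in production,
    this is an HTTP call to the Tornado service)."""
    record.update(analyze(record["text"]))
    return record

rec = backfill({"id": 1, "text": "flight delay made passengers angry"})
print(rec["sentiment"])  # negative
```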

Front end + back end:

A typical Spring MVC application using Spring MVC + MyBatis + MySQL, with ECharts on the front end to generate graphs and reports. Statistics are precomputed and stored in MySQL, with Quartz handling scheduled jobs and data updates.

It soon became clear that MySQL could not cope with the massive growth of data, and the application's response times stretched from a few seconds to several minutes, which was unacceptable.

To summarize, this architecture had several obvious drawbacks (partly because it was built only for validation, partly because of limited initial resources):
    • It could not store large amounts of data while maintaining good performance
    • It handled data in multiple formats poorly
    • Project dependencies (library files) were not managed as code, making updates, upgrades, and packaging cumbersome
    • Deployment was difficult: manual packaging and Tomcat deployment were inconvenient for developers and testers, and unfriendly to newcomers
    • Performance was poor and horizontal scaling was hard


III. Containerizing the Application

To solve the above problems, we first confirmed that we needed to move to a big data platform. At the same time, we did some containerization work, aimed at easier deployment and migration, easier scaling control, and moving toward tool-driven automation.

1) Introducing the Gradle + Jenkins continuous build tools

With the Gradle build tool and its Gretty plugin, we removed the jar packages from the code base and turned them into declared dependencies, configured one-click debugging and running, and used the Jenkins continuous build tool to set up a pipeline for testing, packaging, and deploying each module; deployment is currently implemented as a shell script.

2) Code restructuring

In the crawler code, each site's data crawl is a pipeline, and every pipeline follows the same process, so we extracted the configuration into its own part and rewrote the boot entry point to accept configuration parameters that determine which site pipelines to start. We also changed the Spring web application to a front-end/back-end separated architecture.
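The reworked entry point can be sketched as a single launcher whose configuration parameters decide which site pipelines to start. Pipeline names and the `--sites` flag are illustrative, not the project's actual interface:

```python
import argparse

# Hypothetical set of site pipelines the launcher knows about.
AVAILABLE_PIPELINES = {"weibo", "news", "forum"}

def select_pipelines(argv):
    """Parse configuration and return the pipelines to start.

    With no --sites argument, all pipelines run; unknown names fail fast.
    """
    p = argparse.ArgumentParser()
    p.add_argument("--sites", default="", help="comma-separated pipelines to run")
    args = p.parse_args(argv)
    requested = {s for s in args.sites.split(",") if s}
    unknown = requested - AVAILABLE_PIPELINES
    if unknown:
        raise SystemExit(f"unknown pipelines: {sorted(unknown)}")
    return sorted(requested) or sorted(AVAILABLE_PIPELINES)

print(select_pipelines(["--sites", "weibo,news"]))  # ['news', 'weibo']
```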

3) Containerizing the applications

First we containerized the MySQL database, mounting the default /var/lib/mysql data directory and the configuration directory from the host, and migrated the existing data. Next came the web service: we used the Tomcat image, mounted the webapps directory, and had Gradle copy the war package into the mounted host directory.

Then came the text analysis model. Because it needs a large number of dependencies installed via pip, we rebuilt the image and pushed it to our local registry; the periodic calculation task was packaged into a jar, and at runtime a new container instance is started from the image to run it.

4) Using the Rancher container management and monitoring platform

For container orchestration we use the Rancher platform with its default Cattle orchestration engine. We have about 40+ long-running instances, divided into 3 categories:

Crawler instances: nearly 40 instances scheduled across more than 20 host nodes. Our data lives on the CDH platform; these containers do not communicate with each other, only with the text analysis model, and the final data is sent to Kafka on the CDH cluster. Operations on these instances are limited to code replacement, updates, and start/stop.

Text analysis model: 3 instances are currently deployed, which the crawlers request by name at random.

Batch tasks: run periodically using the cron tool provided by Rancher.

Code updates and deployments can now be finished in under an hour, where they previously took at least half a day.

5) Local image registry

Rancher provides registry management capabilities that make registries easy to manage. To speed up downloads, we deployed an on-premises registry, which also simplifies image updates and application migration.

IV. Technical Framework Migration

As the crawled data grew day by day, the old system simply could not keep up. After discussion, we settled on the basic architecture: HBase + Elasticsearch for data storage and Kafka as the message queue. HBase stores the crawled data and ES handles indexing (our consistency requirements are not high). The distributed crawlers, managed by Rancher, first send an HTTP request to the text analysis model (running in containers) to backfill the corresponding fields, then publish the crawled data to the Kafka cluster. Two Kafka consumers then write the data to HBase and ES respectively, completing the save.
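The fan-out shape of this data path can be sketched as follows. Real clients (for example a Kafka client and an HBase/ES client) are replaced with in-memory stand-ins, and the two consumers are collapsed into one loop purely to keep the sketch runnable:

```python
from queue import Queue

topic = Queue()          # stands in for the Kafka topic
hbase_rows, es_index = {}, {}

def produce(record):
    """Crawler side: publish an already-backfilled record to the topic."""
    topic.put(record)

def consume_all():
    """Consumer side: drain the topic into both sinks.

    In production these are two independent Kafka consumers, one writing
    to HBase (the source of truth) and one indexing into Elasticsearch.
    """
    while not topic.empty():
        rec = topic.get()
        hbase_rows[rec["id"]] = rec          # HBase: full record
        es_index[rec["id"]] = rec["text"]    # ES: text for retrieval only

produce({"id": "r1", "text": "flight delayed", "sentiment": "negative"})
consume_all()
print(sorted(hbase_rows), sorted(es_index))  # ['r1'] ['r1']
```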

The crawlers are now containerized and managed by Rancher.

Statistics are computed by Spark SQL reading from and writing to HBase, and are not yet real-time. Our practice is to store the data in per-day tables and, when a service request arrives, select the calculation range based on the request conditions and compute on demand. This is one step short of real-time, and we will keep moving it toward real-time.
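Selecting the per-day tables for a requested range can be sketched like this; the table-naming convention is an assumption for illustration:

```python
from datetime import date, timedelta

def tables_for_range(start, end, prefix="posts_"):
    """Return the daily table names a query over [start, end] must scan.

    Data is stored one table per day, so a request only touches the
    tables inside its window instead of the whole data set.
    """
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(d)).strftime('%Y%m%d')}"
            for d in range(days + 1)]

print(tables_for_range(date(2016, 12, 11), date(2016, 12, 13)))
# ['posts_20161211', 'posts_20161212', 'posts_20161213']
```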

One detail is worth mentioning. Our data is time-ordered, queries are time-based, and most access targets recent data (the latest day/week/month/year), so we want HBase to store records in reverse order of publish time. We therefore used the timestamp as the first field spliced into the HBase rowkey, but this introduced a new problem: records "pile up" on one node of the HBase cluster. To mitigate the hotspot, we moved the hour of the publish time in front of the timestamp, so records remain sorted by time locally, within each hour bucket, and for now this does not hinder the scaling of HBase nodes.
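The rowkey scheme described above can be sketched as: hour bucket first, then a reversed timestamp so that within each bucket newer records sort first. The exact field layout and separators below are illustrative, not the project's actual encoding:

```python
MAX_TS = 10**13  # an upper bound above any millisecond epoch timestamp

def make_rowkey(publish_ms, record_id):
    """Build a rowkey: hour bucket + reversed timestamp + record id.

    The leading hour (00-23) spreads writes over 24 buckets to ease
    hotspotting; the reversed timestamp makes newer rows sort first
    lexicographically within a bucket.
    """
    hour = (publish_ms // 3600000) % 24
    reverse_ts = MAX_TS - publish_ms
    return f"{hour:02d}_{reverse_ts:013d}_{record_id}"

k1 = make_rowkey(1481587200000, "a")   # 2016-12-13 00:00:00 UTC
k2 = make_rowkey(1481587260000, "b")   # one minute later, same hour bucket
print(k2 < k1)  # True: within a bucket, the newer record sorts first
```

The trade-off, as noted above, is that ordering is only local to each hour bucket, which matches the Q&A answer at the end calling this a compromise.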

The back end uses Spring Data (ES + HBase) to access the data, with no caching layer added yet; the front end is AngularJS, with front and back ends separated. The total data volume is now dozens of times what it was, yet data requests generally return within 1 s, and retrieval queries served from ES take roughly 300 ms to 1 s. Offline batch job execution time dropped from around 8 minutes to an average of 2.5 minutes.

At present the big data platform itself is not containerized; it runs on a CDH cluster configured for high availability. Kafka and ES use the open-source versions (because of Spring Data version constraints), with Supervisord used to improve the reliability of those services.

Our next goal is to split out the computational parts of the big data platform, such as Spark and the model algorithms, and containerize them, so that computing power can auto-scale elastically with demand. We have a test cluster that manages Docker images on Mesos; this part, including Spark applications and distributed machine learning algorithms, is under testing.

V. Continuous Deployment and Release

This part uses GitLab + Gradle + Jenkins (in Docker) + shell scripts:
    • Gradle: runs tests, builds, packages the application, and supports local debugging and running;
    • GitLab: code repository and code review;
    • Jenkins: runs in containers, manages continuous builds, and performs scheduled builds and deployments;


GitLab is configured with a commit trigger; Jenkins receives the trigger and executes the pipeline, calling Gradle and shell commands to run the build. Because code and configuration files are mapped to the host, deployment copies the packaged code to the deployment node to replace the existing code files, then restarts the container instance to complete the release.

Q&A

Q: Isn't running Spark directly on Mesos convenient enough? Are the advantages of containers really that obvious? What were the main considerations?

A: Containerization mainly addresses two points: first, resource orchestration for massive data computation, with future plans for CaaS-based cloud services; second, agility and standardization of the R&D process. We are considering elastic scaling based on computational demand, and containerization is a boost in that direction.
Q: Why Elasticsearch rather than Solr?

A: For our indexing workload ES performs better, it supports distribution natively, and installation and configuration are simple.
Q: Why is the code not packaged into the image?

A: This makes deployment and operation more flexible: I can keep the code locally or upload it into an instance. Code commits are frequent while the execution environment rarely changes, so partial deployments that only replace the changed code are faster. The main purpose is to fit our current deployment model.
Q: How are the crawler containers scheduled? Are they distributed?

A: Yes, they are distributed. They run on a timed schedule: Rancher provides the cron capability, and the crawler exposes an execution entry point.
Q: Doesn't the HBase primary key design still leave the hotspot issue unsolved?

A: It is not completely solved; we have not found a better rowkey design for time-series data. Splitting the data into 24 buckets by hour, each internally sorted by time, is a compromise.
The above content is based on the group share on the night of December 13, 2016. The speaker, Gao Yan, works at the HNA Ecological Technology Research Institute as a big data development engineer, engaged in big data platform application development and responsible for the platform's technology selection, architecture design, and code development. DockOne organizes weekly technology shares; interested readers can add WeChat: Liyingjiesz to join the group and suggest topics to hear or to share.