hadoop+spark+mongodb+mysql+c#

Last Update:2016-08-23 Source: Internet

Author: User

Tags sodium knowledge base hadoop ecosystem

I. Preface

More than 20 years have passed since the digital hospital concept was proposed in the 1990s. Over that time digital hospitals have spread and developed rapidly in major domestic hospitals and have achieved remarkable results. Hospital information systems (HIS), picture archiving and communication systems (PACS), electronic medical record systems (EMR), and regional health service systems (GMIS) have been implemented and popularized successfully. Rapid innovation in computer and network technology has also brought digital hospitals new interaction channels, such as remote medical services and online appointment registration.

With the rapid development of IT, more than 80% of tertiary (level-three) hospitals have established their own hospital information system (HIS), electronic medical record system (EMR), rational drug use system (PASS), laboratory information system (LIS), picture archiving and communication system (PACS), mobile ward-round and mobile nursing systems, and applications integrated through a large number of third-party interfaces. IT in the medical field has entered the big data era: as HIS is applied more widely and its functions keep improving, it collects a large amount of medical data.

Since 2012, big data and the technologies for processing it have been discussed more and more, and the concept of big data has become widely accepted. Big data technology also affects daily life: it is already widely used in the Internet industry, and telecommunications, banking, and other industries are experimenting with it to deliver more robust, higher-quality services.

At present, medical IT systems collect this valuable data, but the large volume of historical medical data does not play its rightful role: it neither gives frontline clinicians diagnostic assistance nor provides the support that hospital management and business decisions need.

In view of this, this article intends to use the hospital's existing historical records, prescriptions, diagnoses, and medical record data to mine valuable, statistics-based medical rules and knowledge, and to build a professional clinical knowledge base on top of them. The knowledge base provides frontline medical personnel with diagnosis, prescription, and drug recommendation functions. Relying on strong association-based recommendation, it aims to improve the quality of medical service and reduce the workload of frontline medical personnel.

II. Hadoop and Spark

There are many frameworks in the field of big data processing at present. From a computation standpoint, the main ones are the MapReduce framework (part of the Hadoop ecosystem) and the Spark framework. Spark is a next-generation computing framework that has emerged over the last two years; because it is memory-based, it is much more efficient than MapReduce for many computations. From a storage standpoint, HDFS from the Hadoop ecosystem is what is mostly used today, and its features make it well suited to storage in big data environments.

2.1 Hadoop

Hadoop is not a single piece of software but a distributed system infrastructure, an open source project developed under the Apache Software Foundation. Hadoop lets users develop distributed programs without understanding the underlying distributed implementation, so they can use the power of a computer cluster for high-speed computing and large-scale data storage. Hadoop is mainly composed of HDFS, MapReduce, HBase, and other sub-projects.

Hadoop is a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable manner. Hadoop assumes that computation and storage can fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. Hadoop speeds up data processing by working in parallel, and it can process petabytes of data, which ordinary data servers cannot do. In addition, Hadoop is backed by an open source community, so problems are usually resolved quickly, which is one of its major advantages. Hadoop runs on clusters of Linux machines, so it is low cost and can be used by anyone. It mainly has the following advantages:

1 High reliability. HDFS keeps three copies of the data by default, and Hadoop has mechanisms for checking and maintaining data, so it provides highly reliable data storage.

2 Strong scalability. Hadoop distributes data across a cluster of commodity PC servers and runs computing tasks in parallel, which makes it easy to add more nodes to the cluster.

3 High efficiency. Hadoop can move data dynamically between nodes in the cluster and keep the nodes balanced, so processing is very fast.

4 High fault tolerance. Hadoop keeps multiple copies of the data, so work can be reassigned when a node or task fails.

As the overall Hadoop architecture shows, the core of Hadoop is two components: MapReduce and HDFS.

Google published the paper "The Google File System", which describes the design of Google's distributed file system (GFS). Apache developed an open source counterpart to GFS and released it as the Hadoop Distributed File System, abbreviated HDFS. The core idea of MapReduce also comes from a Google paper, "MapReduce: Simplified Data Processing on Large Clusters". Put simply, that idea is: decompose a task and execute the pieces, then summarize the execution results.
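The decompose-then-summarize idea can be sketched in a few lines of plain Python. This is only an illustration that runs in one process, not Hadoop code: the function names and the sample input splits are assumptions made for the example, and a real cluster would run each map call on a different node.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: summarize the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def map_reduce(splits):
    # Each split could be mapped on a different node; here they run in turn.
    emitted = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(emitted))

# Two hypothetical input splits, each a list of text lines.
splits = [["fever cough", "cough"], ["fever headache"]]
counts = map_reduce(splits)
print(counts)
```

The map step works on each split independently, which is what makes the task decomposable; only the shuffle and reduce steps need to see results from every split.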

2.2 Spark

Spark is an open source computing framework similar to MapReduce, developed at the AMP Lab at UC Berkeley. It is a memory-based cluster computing system that initially aimed to address the overhead of MapReduce disk reads and writes; the current version is 1.5.0. With its high performance and ease of use, Spark has attracted many big data researchers, and with the efforts of many enthusiasts it has grown its own ecosystem (Spark at the base, with Spark SQL, MLlib, Spark Streaming, and GraphX on top) and has become a top-level Apache project.

The core concept of Spark is the resilient distributed dataset (RDD), Spark's abstraction of distributed memory. It lets the user manipulate an RDD as if it were a local data set and so concentrate on the business logic. In a Spark program, operations on data are based on RDDs. The classic WordCount program, for example, runs as shown in the Spark programming model:

You can see that Spark first abstracts RDD1 from the file system, then applies the flatMap operator to RDD1 to get RDD2, then applies the reduceByKey operator to RDD2 to get RDD3, and finally writes the data in RDD3 back to the file system. All operations are based on RDDs.
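To make the RDD1 to RDD3 chain concrete without a Spark installation, here is a minimal local stand-in written in plain Python. `LocalRDD` is a hypothetical class invented for this sketch, not Spark's API; it only mimics the shape of flatMap and reduceByKey. The intermediate step that pairs each word with 1 is an assumption added here, since reduceByKey needs key-value pairs.

```python
class LocalRDD:
    """Hypothetical stand-in for an RDD: an immutable local collection."""

    def __init__(self, items):
        self._items = tuple(items)

    def flat_map(self, fn):
        # Apply fn to each element and flatten the results into a new RDD.
        return LocalRDD(out for item in self._items for out in fn(item))

    def map(self, fn):
        # Apply fn to each element, producing a new RDD of the same length.
        return LocalRDD(fn(item) for item in self._items)

    def reduce_by_key(self, fn):
        # Merge the values of each key with the binary function fn.
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return LocalRDD(merged.items())

    def collect(self):
        return list(self._items)

rdd1 = LocalRDD(["a b a", "b c"])                       # lines of a file
rdd2 = rdd1.flat_map(str.split).map(lambda w: (w, 1))   # words paired with 1
rdd3 = rdd2.reduce_by_key(lambda x, y: x + y)           # per-word totals
result = dict(rdd3.collect())
print(result)
```

Each operator returns a new collection and leaves its input untouched, which mirrors how every step in the described pipeline produces a new RDD.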

III. Ideas and architecture

After a lot of thought, we decided to build the hospital clinical knowledge base system on Spark. MongoDB/SequoiaDB serves as the big data warehouse and storage center, and Hadoop + Spark form the big data analysis platform. The ETL (data extraction and transformation) tooling is built on the AgileEAS.NET SOA middleware together with Pentaho Kettle. The knowledge base service portal is also built on the AgileEAS.NET SOA middleware and integrates with the HIS system through WCF/WebService. AgileEAS.NET SOA + FineUI is used to build basic dictionary management and the graphical display of analysis results.

Originally we chose SequoiaDB as the big data storage center, and for this I wrote a C# driver for SequoiaDB (see my article on the C# driver I wrote for the SequoiaDB open source NoSQL database; it supports LINQ, is fully open source, and has been submitted to GitHub). However, too few technical staff are familiar with SequoiaDB, which makes maintenance a problem, so after almost 8 months we switched to MongoDB 3.0 as the big data storage center.

Initially we used Hadoop 2.0 + Spark 1.3.1 with Scala 2.10 to build the hospital clinical knowledge base system (see the article on the CentOS + Scala 2.11.4 + Hadoop 2.3 + Spark 1.3.1 environment). Later, when we replaced SequoiaDB with MongoDB, we also upgraded the computing framework from Hadoop 2.0 + Spark 1.3.1 to Hadoop 2.6 + Spark 1.6.2.

Because Spark is deployed on Linux, the output of the Spark analysis is stored in a MySQL 5.6 database, and the dictionary information used by the system is also stored in MySQL.

The Spark data analysis code is written with IntelliJ IDEA 14.1.4, and the rest of the code is written with VS2010.

3.1 Overall architecture

The whole system consists of the data acquisition layer, the storage and analysis layer, and the application logic layer, plus the external data sources the system draws on.

The external data source is mainly the clinical data currently produced by the hospital information systems. For now it is concentrated in the HIS system; later it will also draw on the EMR, LIS, and PACS systems.

The data acquisition layer collects massive historical clinical data from the clinical business systems, in both batch and real-time modes. During collection the raw data is validated, cleaned, and transformed, and the processed data is stored in the big data warehouse.
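A minimal sketch of the validate, clean, and transform step might look like the following. The field names (`patient_id`, `visit_date`, `diagnosis`, `gender`), the date format, and the rejection rules are all assumptions made for illustration; the actual import scripts are specific to each clinical system and are not shown in this article.

```python
import json
from datetime import datetime

REQUIRED = ("patient_id", "visit_date", "diagnosis")  # hypothetical fields

def clean_record(raw):
    """Validate one raw row; return a cleaned dict, or None to reject it."""
    values = {name: str(raw.get(name) or "").strip() for name in REQUIRED}
    if not all(values.values()):
        return None  # a required field is missing or blank
    try:
        visit = datetime.strptime(values["visit_date"], "%Y-%m-%d")
    except ValueError:
        return None  # unparseable date
    gender = str(raw.get("gender") or "").strip().upper()
    return {
        **values,
        "visit_date": visit.date().isoformat(),
        "gender": gender or None,
    }

def to_documents(rows):
    """Keep only valid rows and serialize them as JSON documents."""
    cleaned = (clean_record(row) for row in rows)
    return [json.dumps(doc, ensure_ascii=False) for doc in cleaned if doc]

rows = [
    {"patient_id": " 1001 ", "visit_date": "2016-03-05",
     "diagnosis": " Tonsillitis ", "gender": "f"},
    {"patient_id": "1002", "visit_date": "05/03/2016", "diagnosis": "Cough"},
    {"patient_id": "", "visit_date": "2016-03-05", "diagnosis": "Cough"},
]
docs = to_documents(rows)
print(docs)
```

Rejected rows are simply dropped here; a production import would more likely log them or route them to a review queue.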

The storage and analysis layer handles two major tasks: data storage and data analysis. Valid data that has been cleaned and transformed is stored in the big data cluster in JSON format, using the SequoiaDB database. Data analysis is done by the Hadoop/Spark cluster: Spark imports the stored data, analyzes it, and writes the results to the clinical knowledge database, which is stored in MySQL.

The application logic layer is mainly responsible for human-computer interaction and for feeding analysis results back to the clinical systems. Through a web UI it gives clinicians and business managers list-based and graphical views of the knowledge. It also provides an integration API that clinical systems call for business assistance and recommendation functions. At present the API is offered in two forms: WebService and WebAPI.

3.2 Overall process

Data is collected from the data sources and written to the SequoiaDB big data storage cluster, then analyzed and computed by Spark. The resulting clinical knowledge is written to the MySQL knowledge base and made available for clinical use through the web UI and the standard API.

3.3 Data import Process

In the early stages, historical data acquisition is implemented with AgileEAS.NET SOA scheduled tasks, which coordinate and time the imports. The import code itself is scripted separately for each clinical business system; alternatively, the import can be set up as a configurable job in Pentaho Kettle.

3.4 Physical Structure Design

The clinical data sources supply the data the system analyzes. They come from the clinical HIS and EMR systems. The hospital's HIS currently uses a SQL Server R2 database and the EMR uses an Oracle 11g database, both running on the Windows 2008 operating system.

The SequoiaDB cluster is the big data storage cluster. It currently uses SequoiaDB v2.0 on CentOS 6.5, with 2 to 16 nodes depending on business scale. It stores the massive historical clinical data that has been cleaned and transformed, for analysis by the Spark cluster, and it serves the SOA server for historical data queries and history-based recommendations.

The Hadoop/Spark cluster is the core of the system's analysis and computation. It analyzes the historical data in the SequoiaDB cluster and generates the medical knowledge used to assist clinicians. The cluster has 2 to 16 nodes depending on business scale and runs CentOS 6.5 with the Java 1.7.79 runtime and Scala 2.11.4, using Hadoop 2.3 and Spark 1.3.1 as the analysis framework.

The MySQL knowledge base is the system's knowledge repository. The analysis results produced by the Hadoop/Spark cluster are written to this database, and the SOA server and web server process them for clinical system integration and the web UI. It currently uses MySQL 5.6, installed on Windows 2008 or CentOS 6.

The SOA server is the system's external-interface application server. It provides business logic to the clinical business systems and the web server and exposes service APIs to the clinical business systems. It currently runs on Windows 2008 with the .NET Framework 4.0 and hosts the AgileEAS.NET SOA middleware, whose SOA services provide standard WebService and WebAPI interfaces to external systems.

The web server provides a standard browser-based (B/S) user interface, through which business staff manage the system and query the medical knowledge in the knowledge base. It currently runs on Windows 2008 with the .NET Framework 4.0, hosted in IIS 7.0.

The clinical workstations run the HIS and EMR systems, both developed in C# on an SOA architecture. After integration, they call this system's standard WebService interface and use its API to provide clinical diagnosis and treatment assistance.

IV. Environment, installation, and pitfalls

At present the system runs in a virtualized environment. Three CentOS 6 machines make up the big data storage and compute cluster. Each is allocated 16 CPU cores, 16 GB of memory, and a 2 TB disk, for a total of 48 cores and 48 GB across the three. Each machine has Java 1.8.25, Scala 2.10, Hadoop 2.6, Spark 1.6.2, and MongoDB 3.0 installed, forming a 3-node cluster. Spark runs in standalone cluster mode with a single master node, and each machine allocates 12 cores and 12 GB to its worker. The remaining CPU and memory are reserved for the MongoDB cluster. It runs as follows:

One Windows 2008 machine serves as the SOA/application server. It is allocated 32 cores and 64 GB of memory and hosts MySQL 5.6, IIS, and the AgileEAS.NET SOA services. This server hosts the whole system's SOA services and web management interface: on the one hand it provides web-based management and querying, and on the other hand it provides WebService and WebAPI services to the clinical systems.
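The sizing of the three-node CentOS cluster given above can be checked with a little arithmetic. The variable names below are made up for this sketch; the numbers come from the text. The remainder on each machine is what is left for MongoDB, and in practice the operating system and Hadoop daemons presumably share it as well.

```python
NODES = 3
CORES_PER_NODE = 16
MEMORY_GB_PER_NODE = 16
WORKER_CORES = 12        # given to the Spark worker on each node
WORKER_MEMORY_GB = 12

# Totals across the cluster.
cluster_cores = NODES * CORES_PER_NODE
cluster_memory_gb = NODES * MEMORY_GB_PER_NODE

# What Spark can use in total.
spark_cores = NODES * WORKER_CORES
spark_memory_gb = NODES * WORKER_MEMORY_GB

# What remains on each node for MongoDB and everything else.
reserved_cores_per_node = CORES_PER_NODE - WORKER_CORES
reserved_memory_gb_per_node = MEMORY_GB_PER_NODE - WORKER_MEMORY_GB

print(f"cluster: {cluster_cores} cores, {cluster_memory_gb} GB")
print(f"spark:   {spark_cores} cores, {spark_memory_gb} GB")
print(f"reserved per node: {reserved_cores_per_node} cores, "
      f"{reserved_memory_gb_per_node} GB")
```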

For reasons of length, the installation of the environment is not described in detail here; I will write a separate article to cover it.

This was our first use of Spark, and there was not much reference material, so we ran into many pitfalls during development, especially at the start. Building the environment took a week, and we kept hitting problems while writing code. Some could not be fixed, and we could only work around them by changing approach. One I remember involved custom functions (UDFs) in Spark SQL: not every function is affected, but occasionally a UDF I wrote misbehaved and the cause was hard to find. Reading the Spark source did not explain it either, so in the end I had to rewrite the code and take a different approach.

I particularly like the Scala language. For anyone who uses .NET 4.0 (C#), and especially those fluent in LINQ, Scala is very convenient; I find it basically as convenient as LINQ. It is also remarkably permissive (you can even define classes inside a function), but it really is convenient. I do not like Java very much, but I like Scala.

V. Results

5.1 Outpatient diagnosis ranking

The outpatient diagnosis ranking is a graphical view of outpatient diagnostic knowledge. It shows the top N most common diagnoses for the whole hospital or for a designated specialty, and also shows how each diagnosis correlates with gender, age, and solar term (season).
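The ranking itself is a frequency count. The sketch below shows one way to compute a top N list and a per-gender breakdown in plain Python; the record layout and the sample visits are invented for illustration, and the real system computes this in Spark over far more data.

```python
from collections import Counter

def top_diagnoses(visits, n, department=None):
    """Return the n most frequent diagnoses, optionally for one department."""
    counts = Counter(
        v["diagnosis"] for v in visits
        if department is None or v["department"] == department
    )
    return counts.most_common(n)

def share_by(visits, diagnosis, field):
    """Share of one diagnosis's visits that falls in each value of field."""
    counts = Counter(v[field] for v in visits if v["diagnosis"] == diagnosis)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical visit records: (diagnosis, department, gender).
sample = [
    ("URTI", "pediatrics", "F"), ("URTI", "pediatrics", "M"),
    ("URTI", "internal", "F"), ("URTI", "pediatrics", "F"),
    ("tonsillitis", "pediatrics", "M"),
    ("gastritis", "internal", "F"), ("gastritis", "internal", "M"),
]
visits = [dict(zip(("diagnosis", "department", "gender"), row)) for row in sample]

print(top_diagnoses(visits, 2))
print(top_diagnoses(visits, 1, department="pediatrics"))
print(share_by(visits, "URTI", "gender"))
```

The same `share_by` helper works for any categorical field, so an age band or solar term column could be broken down the same way.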

5.2 Outpatient Diagnostic Query

The outpatient diagnostic query is a knowledge view for complications and co-occurring conditions in outpatient diagnoses. It shows how likely a patient with one disease is to also have another.
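One simple way to estimate that likelihood is the conditional frequency P(B | A): among patients diagnosed with A, the fraction who were also diagnosed with B. The sketch below is an assumption about how such a figure could be computed, not the article's actual algorithm, and the patient data is made up.

```python
from collections import defaultdict
from itertools import permutations

def comorbidity_confidence(patient_diagnoses):
    """Estimate P(other | disease) from each patient's diagnoses."""
    has = defaultdict(int)    # number of patients with each disease
    both = defaultdict(int)   # number of patients with each ordered pair
    for diagnoses in patient_diagnoses.values():
        unique = set(diagnoses)
        for disease in unique:
            has[disease] += 1
        for a, b in permutations(unique, 2):
            both[(a, b)] += 1
    return {(a, b): n / has[a] for (a, b), n in both.items()}

# Hypothetical patients and their recorded diagnoses.
patients = {
    "p1": ["diabetes", "hypertension"],
    "p2": ["diabetes"],
    "p3": ["diabetes", "hypertension", "gout"],
    "p4": ["hypertension"],
}
confidence = comorbidity_confidence(patients)
print(confidence[("diabetes", "hypertension")])
print(confidence[("gout", "diabetes")])
```

The measure is directional: the value for (gout, diabetes) differs from the value for (diabetes, gout), because the two are divided by different patient counts.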

5.3 Outpatient Automatic Group query

The outpatient automatic group query shows knowledge about the most commonly used groups of drugs and treatments. For example, one of the most commonly used groups is 0.9% sodium chloride injection with 1 g of a drug for injection, which is often used for tonsillitis, asthmatic bronchitis, upper respiratory tract infections, and similar diseases, and is given daily by intravenous drip.

5.4 Outpatient Diagnosis Group Inference

Outpatient diagnosis group inference shows the association between clinical diagnoses and the drug and treatment groups commonly used for them. For example, for upper respiratory tract infection it shows combination regimens such as: ammonia phenol hemp dry suspension (1 packet); Four Seasons antiviral agent; 0.9% sodium chloride injection 100 ml + injection with the head of sulfur 1 g; and sterile water for injection 2 ml + recombinant human interferon α1b for injection 10 μg.
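The association between a diagnosis and its usual treatment groups can be approximated by counting how often each drug combination was prescribed for that diagnosis. The following is a minimal sketch under that assumption; the function names and the placeholder drug names are hypothetical, and the article does not state which association algorithm the system actually uses.

```python
from collections import Counter, defaultdict

def build_group_index(prescriptions):
    """Count how often each drug combination was prescribed per diagnosis."""
    index = defaultdict(Counter)
    for diagnosis, drugs in prescriptions:
        # A frozenset ignores the order in which the drugs were entered.
        index[diagnosis][frozenset(drugs)] += 1
    return index

def recommend_groups(index, diagnosis, n=3):
    """Return the n most used combinations for a diagnosis with their share."""
    groups = index.get(diagnosis, Counter())
    total = sum(groups.values())
    return [
        (sorted(group), count / total)
        for group, count in groups.most_common(n)
    ]

# Hypothetical history: (diagnosis, drugs prescribed together).
history = [
    ("URTI", ["saline", "drug_a"]),
    ("URTI", ["drug_a", "saline"]),
    ("URTI", ["drug_b"]),
    ("gastritis", ["drug_c"]),
]
index = build_group_index(history)
print(recommend_groups(index, "URTI", n=1))
```

A diagnosis with no history returns an empty list, which corresponds to the case where the doctor has to enter the prescription manually.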

5.5 Medical Clinical System integration

So that the system both draws on the clinical system and serves it, we integrated it with the outpatient doctor workstation in the hospital's HIS through WebService, as shown in the integrated interface:

After the basic outpatient medical record is completed, the system automatically recommends outpatient diagnoses for the doctor to choose from. In 80%-90% of cases the diagnosis can be selected directly; in the few cases with no suitable recommendation, the doctor enters it manually. This saves the doctor the trouble of typing the diagnosis and also reduces the confusion caused by irregular data entry.

After writing the outpatient medical record, when the clinician orders examinations, tests, drugs, and treatments, the system recommends appropriate treatment options based on the diagnosis. The clinician only needs to double-click a recommended group on the right to prescribe quickly, which greatly simplifies the clinician's work.

For Chinese medicine hospitals, the system integrates more than 30,000 classic Solutionkeys and analyzes historical data in combination with the Solutionkeys dictionary, which greatly facilitates the daily work of TCM doctors:

VI. Implementation details and follow-up articles

Big data technology, its practice in the medical information industry, and the ideas and details of this implementation cannot all be covered in a short article. This article was written after we had implemented the requirements and put them into practice, so what it describes may seem relatively simple. I only hope it can serve as a starting point and a reference for peers doing related work. The ideas can be borrowed, but the article does not explain every detail.

Next, we will write an article on the technical environment used here and the details of its configuration. Please look forward to it.

Friends working on related business are welcome to contact me to discuss.

qq:47920381

Email: [Email protected],

Tel: 18629261335.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
