[Big Data 100 Points]
Keynote speaker: Bai
Moderator: Carey
Organizer: Zhongguancun Big Data Industry Alliance
The Zhongguancun Big Data Industry Alliance is honored to have invited Teacher Bai as the keynote speaker for the first "Big Data 100 Points" forum!
Bai is the chief engineer of the Shanghai Stock Exchange, holds a Ph.D., and is a doctoral supervisor at the Institute of Information Engineering, Chinese Academy of Sciences, as well as an executive director of the Chinese Information Processing Society of China and vice chairman of the Securities Sub-Committee of the National Financial Standardization Technical Committee. With research and work spanning academia, industry, and capital, Bai's study of big data stands at both the practical frontier and the theoretical heights.
The following is the live text of the exchange, interspersed with interactions between the experts and Teacher Bai:
It is an honor to give the first "Big Data 100 Points" exchange. As a former scholar and the current technical director of a financial institution, I will start from our industry's needs, combine them with my professional background, and share some personal reflections on big data.
I. Big data does not mean big data volume; no one is qualified to monopolize the definition of the big data concept
Some people will ask: how much data do you have? Don't talk to me about big data unless you have petabyte-scale data. This view is quite representative. It is held not only by internet companies and telecom operators that actually have petabyte-scale data, but by some scholars as well.
(We haven't reached the petabyte scale yet.)
My view is that big data does not equal large data volume. If the data is large but there are no commensurate processing means, application needs, or even business models, the value of the data cannot be fully realized, and its "largeness" is empty. If the data is large and you do have the means to realize its value, but those means do not radiate out to fields where the data is not so large (say, between 1 TB and 1 PB), then it is mere narcissism. As human society progresses, we certainly need to keep challenging the limits of data processing and, in challenging those limits, develop new technologies for our own use; the people and institutions who do this deserve our admiration, that much is certain. But their value goes far beyond that: the results of their pushing the limits can radiate much farther.
Being able both to keep pushing the limits ("the sky") and to broadly reduce the cost of data processing in non-extreme situations ("the ground") is the real value of big data technology. So today, big data is not just for the petabyte-scale giants to talk about, but for the wider IT application community as well. No one is qualified to monopolize the definition of the big data concept.
II. "Aristocracy" and "de-aristocratization" in the field of data processing
What I feel most deeply in my work is the "aristocracy" of the data processing field. What I have benefited from most in the big data boom is the "de-aristocratization" of data processing. Big data's ability to "broadly reduce the cost of data processing in non-extreme situations" is precisely our weapon for de-aristocratization.
What is "de-aristocratization"? It is a general statement. Many in the IT circle name specific companies to "de-XXX"; that is understandable, but a specific company will change and will also make progress. What we are discarding is in fact a kind of aristocratic solution, so I prefer the term "de-aristocratization".
So, what is an "aristocratic" solution? In my view, it has three main features: heavy, sluggish, and expensive.
First, "heavy". Here "heavy" is not physical weight but refers to bulky bundling. You are given 10,000 features, of which you may use fewer than 100, but those 10,000 features force you onto a road of no return: your software and hardware cannot be separated, storage and computing cannot be separated, real-time processing and historical analysis cannot be separated, and unstructured data can only be processed by first being forced into a structured form and then handled by a structured processing engine.
Interaction: @Yianyang: Inclusive finance, inclusive data; dimensionality-reduction processing
Next, "sluggish". This is mainly about the huge inertia of such a solution's architecture. Faced with changing business needs and changing service patterns, it is difficult to turn around quickly and follow up quickly. On the one hand, users are isolated from one another by licenses, so many common things cannot be accumulated and shared; on the other hand, because the platform is closed and lacks competitive incentives, the resolution of platform-related defects and problems is slow and inefficient.
Interaction: @Carey: Just like the clunky Word
Finally, "expensive": as the name suggests, the purchase cost is expensive, the maintenance cost is expensive, and the platform migration cost is even more expensive. And that is not all: when the solution evolves from the license model to the cloud model, it also encounters stubborn resistance from vested interests. These expensive costs, of course, ultimately fall on the user. But in the past, under great pressure to operate safely, users could only choose between "this" aristocrat and "that" aristocrat; only aristocratic solutions could demonstrate procedural justice.
Interaction: @Yianyang: iOS, for example; Windows, for another
For a single user unit, the political pressure of making a de-aristocratizing technical decision can be imagined.
Now, big data has arrived. The pioneers hit first by data volume were the first to recognize the intolerability of "aristocratic" solutions, setting the precedent of de-aristocratized data processing solutions.
Interaction: @Yianyang: In '08 we first tried Greenplum, taking the low-cost data warehouse road; it was exhausting
They combined lightweight general-purpose hardware platforms, open-source operating systems, and grassroots platform architectures into the core of a de-aristocratized solution, establishing a model of de-aristocratization for us.
What followed was that for wider users, including us, the "de-aristocratization" option could now demonstrate procedural justice through the practice of following the big data pioneers. This is a remarkable advance, and its significance is far-reaching for the financial securities industry, which was formerly dense with "aristocratic" solutions.
Question: @Liu Donghua: Teacher Bai, what are the typical cases of the exchange's use of big data?
The process of de-aristocratization, for the many units already "on the aristocracy", is a painful and long one. They must face not only differences in technical understanding but even restructuring of the organization.
Interaction: @Zhi Gang: The rise of domestic brands requires innovating within imitation and developing within innovation
Our big data applications are mainly in regulation. Simply put: catching the bad guys.
Of course, when we first built the enterprise data warehouse and data mining platform, we spoke of serving supervision, serving innovation, serving investor education, and serving information management. The famous TopView was an application result of the data warehouse.
Interaction: @Liu Donghua: Haha, tell us how you catch the bad guys
Haha, catching "rat trading" (front-running) is certainly one of the most important applications. But I don't really know the details, nor am I authorized to speak about that part. We can also simulate a variety of extreme scenarios before a business innovation is launched. It must be said that in data infrastructure for the business, such as the data warehouse, we are still relatively aristocratic and still have to de-aristocratize. It is big data that gives us the hope and confidence to do so.
Question: @Carey: Teacher Bai, how should we understand cross-domain correlation?
I think the de-aristocratization of the technical framework is only one small facet of the big data trend; the greater impact is on business models.
Take our securities industry as an example. It is an industry that relies closely on information technology and information services. If market data is interrupted for even a few minutes, that is a big deal. Eliminating the information asymmetry between sellers and buyers relies mainly on mandatory information disclosure according to law. Macroeconomic information, and fundamental information directly or indirectly correlated with the capital market, are like air and water: capital market participants cannot do without them for a moment. More advanced still, information itself has become a prop in the game.
Therefore, market quotations and news are the two core areas of securities industry information services. Of course, exchanges and regulators also need to look for clues to irregularities in non-public information, a regulation-oriented information service. In short, the securities industry's reliance on information technology and information services is much deeper and much heavier than in many other industries.
We have been using TD for more than 10 years and are currently facing a selection point. Going light is inevitable, but there are different options for how to go light. Friends engaged in academic research on data mining, machine learning, and business intelligence often ask me: can you give me some data? I have such-and-such technology, and my metrics are so impressive. I admire people who can produce impressive technical metrics, but introducing a generic (domain-independent) technology into an unfamiliar field does not work that way; it is not the case that once you have the technology you have everything and merely lack the data.
In fact, from a global point of view, no mature application area, taken as a whole, is so insensitive to generic technology. Before you raise such a request, many others will have made similar ones, and people will have tried new tricks on their own data countless times. If you want to prove your worth, you must build on that foundation; only then is dialogue and cooperation possible. In fact, it is not just that understanding only the technology and not the field is insufficient; nowadays even understanding the field is not enough. In the current situation, only by crossing boundaries can one go farther. A large part of the value of big data is generated by the "chemical reaction" triggered by cross-boundary data correlation.
What is cross-boundary correlation? My understanding is this: the data that people produce in two relatively independent activity spaces are linked together through some kind of medium. Without this medium, the two datasets are independent; with it, they form new structures, new semantics, and new value. Take antivirus, for example. Confined to antivirus alone, the pattern has been nearly exhausted; even cloud-based scanning has been rolled out. But if cloud-scanning data is treated as a kind of network access log and combined with network traffic, it gives antivirus a whole new meaning.
Interaction: @Carey: Multidimensional!
Another example is e-commerce. If a service only provides a means of payment for e-commerce, it remains at the stage of an appendage to e-commerce. But if e-commerce payment data is used as credit evidence for internet finance, a cross-boundary qualitative change is realized: it is no longer robbing Peter to pay Paul, but both sides thriving, forming a complementary, mutually reinforcing ecology, the so-called "dimensionality-reduction attack".
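The payment-data-as-credit example above can be sketched as a toy join. This is a minimal illustration with invented data and field names, not a real credit model: two datasets produced in independent activity spaces are linked through a shared medium (here, a user ID), and the payment history acquires a new meaning as a credit signal for the loan data.

```python
# Toy sketch of cross-boundary correlation. All records and field
# names are invented for illustration; the "medium" is the user ID.

payments = [  # produced in the e-commerce activity space
    {"user": "u1", "month": "2014-01", "spend": 900},
    {"user": "u1", "month": "2014-02", "spend": 950},
    {"user": "u2", "month": "2014-01", "spend": 40},
]
loans = [     # produced in the internet-finance activity space
    {"user": "u1", "requested": 5000},
    {"user": "u2", "requested": 5000},
]

# Without the medium the two datasets are independent; joining on
# "user" turns payment history into a crude credit signal.
spend_by_user = {}
for p in payments:
    spend_by_user.setdefault(p["user"], []).append(p["spend"])

for app in loans:
    spends = spend_by_user.get(app["user"], [0])
    avg = sum(spends) / len(spends)
    app["credit_hint"] = "strong" if avg > 500 else "weak"

print(loans)  # u1's steady spending supports the loan; u2's sparse history does not
```

The point is not the arithmetic but the structural change: neither dataset alone carries a credit semantics; the join through the shared medium creates it.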
So, for the technical experts looking for big data challenges in the capital markets: our point of cooperation is not "my data plus your technology", but in you helping me find cross-boundary correlation patterns in which data from two areas can produce a chemical reaction and create a new ecology. We are waiting for such experts to emerge. Of course, when I say two areas, they need not be two traditional areas: one can be traditional and the other newly created.
I have been thinking: if there is a service that can pool the capital flows of the capital market, if there is a service that can accumulate the behavioral data of capital market participants, if there is a service that can blaze a new trail beside the traditional quotation and news services and produce a cross-boundary "chemical reaction" with them, then there will be disruptive changes in our industry.
III. "Machine-readable news"
Among the many emerging services for the capital market, I am most interested in "machine-readable news", which combines text mining and sentiment analysis technology. I would like to share my thoughts on this.
In the middle of last year, something remarkable happened in the US stock market: when hackers hijacked the Associated Press's Twitter account and posted fake news of explosions at the White House and of Obama being injured, the US stock market plunged within seconds.
What surprised me was not how clever the hackers were, but how the reaction time could be so short. It is hard to imagine humans reacting that quickly to the news. What plays the key role in this reaction chain is "machine-readable news".
So-called machine-readable news works like this: the raw news text is analyzed automatically, and when certain predefined conditions are met, electronic tag data matching those conditions is generated. Automated program-trading systems can recognize such electronic tag data and respond with action in the capital market. This means machines not only read market data but also, to a certain extent and via electronic tags, read fundamental textual information. Of course, most such systems target English, and their judgment logic is still somewhat simple and crude; otherwise there would not have been such a blunder.
But to be fair, this is a huge opportunity. In particular, nothing of the kind exists yet for Chinese. China's capital market is still in an emerging-and-transitional stage, and information asymmetry is widespread; using machines to replace human readers for screening has high value. So whoever builds machine-readable news first has the opportunity to win decisively.
In particular, inclusive finance, represented by internet finance, will inevitably involve the direct financing needs of more grassroots companies, a field where information asymmetry is very serious. Using machine-readable news to break that asymmetry and help investors better grasp the overall picture of the companies they invest in is even more powerful.
Question: @Grapefruit: What's the difference between that and crawler technology?
Crawlers do not look at content; they are infrastructure. Machine-readable news instantly screens what the crawlers bring back, judging not only what is relevant but also whether the relevant content is positive or negative for investment decisions.
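The tag-generation step described above, relevance plus polarity, can be sketched minimally. This is a hypothetical illustration assuming simple keyword rules; real systems use far richer text mining and sentiment analysis, and all names and keyword lists here are invented:

```python
# Minimal sketch of machine-readable news tagging: relevance + polarity.
# Entity map and keyword sets are illustrative assumptions, not a real system.

RELEVANCE = {"White House": "US_GOV", "Apple": "AAPL"}  # entity -> tag symbol
NEGATIVE = {"explosion", "injured", "fraud", "lawsuit"}
POSITIVE = {"record profit", "beats estimates", "approval"}

def tag_news(headline: str) -> list:
    """Turn raw news text into electronic tag data a trading system can read."""
    text = headline.lower()
    tags = []
    for entity, symbol in RELEVANCE.items():
        if entity.lower() in text:
            # Judge polarity only for relevant headlines.
            if any(w in text for w in NEGATIVE):
                polarity = "negative"
            elif any(w in text for w in POSITIVE):
                polarity = "positive"
            else:
                polarity = "neutral"
            tags.append({"symbol": symbol, "polarity": polarity})
    return tags

print(tag_news("Explosion at the White House, Obama injured"))
# A downstream program-trading system reacts to the tag, not to the text.
```

This also makes the crawler distinction concrete: the crawler fetches the headline; the tagger is what reads it.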
In fact, what a tag implies is often more useful than what it literally says. Today people joke about which stocks to go long and which to short on the news of the Dongguan anti-vice crackdown; this is the tag propagating along the value chain. With a good propagation model, the value of tags will exceed expectations.
Machine-readable news as an information service already has, by itself, the opportunity to look quite different from the traditional information services of the capital market. Who subscribes to which tags, who views which stock quotes, who publishes what substantive evaluations and recommendations of which products at what price... If some medium is used to integrate these cross-boundary data, this internet play will certainly disrupt the existing information service business of our industry.
Interaction: @Carey: Crawlers crawl first, then "intelligent worms" interpret! Tag chains!!
I have noticed that among the jokes circulating today there are both "pornography affects saunas" and "saunas affect the water supply". This is a typical case of a tag traveling along the value chain.
Question: @Rain Drunk Heaven: Teacher Bai, do behavioral finance and big data collection and analysis count as part of this field?
If you mean that many investment decisions formerly quantified by people are now made by computers: some IT companies in China are already trying to enter this field. (Voiceover: as for which ones specifically, use your own wisdom to find out; there may be opportunities in the stock market!)
There are two directions: structured data → news text vs. news text → structured data. The former is data journalism, the latter machine-readable news. Data journalism automates the news-writing process, presenting data; machine-readable news realizes the structuring of the unstructured data that is the article.
Interaction: @Xu Qi: "Machine-readable news" is undoubtedly a direction of human endeavor, but the stock market plunge Teacher Bai mentioned is not related to it.
@Bai:
Three possible scenarios: (1) a human watched the AP Twitter account and triggered automated program-trading software; (2) an automated watchdog built into program-trading software monitored a set of information sources that included the AP Twitter account; (3) a watchdog operated as a third-party service monitored the information sources and converted them into machine-readable news for its customers' automated program-trading software. It cannot be ruled out that (1) and (2) reacted at the same time. Service form (3) is easy to launch but hard to do well, yet it is definitely the direction.
@Xu Qi:
The American trading system still has human "specialist" or "market maker" operation, so the instantaneous "stop" is the root of it all. The core value of big data for the stock market is that participants' instantaneous behavior can be visualized instantly.
Question: @Shong: Teacher Bai, is Wall Street's current social-network-based model a kind of machine-readable news?
@Bai:
In this broad category, text mining and sentiment analysis are the technical points; machine-readable news is the service form.
IV. My view on the mission of the Big Data Alliance
Finally, let me share some personal thoughts on what our Big Data Industry Alliance should do. The basic premise for data exchange, or even for forming a common market in which data is fairly priced, is that the use and dissemination of data are controlled and that the basic environment for use and dissemination is trustworthy. We also hold a lot of data, some of which would be valuable as market services, for example a market-replay environment serving as a testbed for validating algorithms and quantitative trading strategies. Many exchanges around the world offer this service, and we could too. However, when we consider providing it, we run into a dilemma: we do not want our proprietary data to walk out the door with users, and users do not want their core strategies left behind in our environment.
Providing a credible mechanism that accommodates the "reasonable mistrust" among participants while enabling effective data sharing is indeed a common challenge for us.
I don't have a mature solution, but let me make an analogy that is not necessarily apt: Bitcoin's starting point is the premise that every individual "reasonably distrusts" the others, yet as a whole it lets most participants trust the system. So Bitcoin's approach may hold important lessons for realizing trustworthy data sharing.
Raw data can only enter this peer-to-peer network in encrypted form and can only flow within it, with all traces of the flow auditable; only authorized summary data can be decrypted and leave the network... and so on. If these ideas can be realized, they may open a door to data sharing. Of course, "electronic data non-proliferation" is not so simple; the difficulty is certainly considerable.
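One of these ideas, fully traceable flow, can be sketched with a hash-chained audit log, in the spirit of (but far simpler than) Bitcoin's ledger. Everything here is an illustrative assumption rather than a proposed design: each transfer of a data item commits to the previous entry, so any retroactive tampering with the trace is detectable.

```python
# Illustrative sketch: a hash-chained audit log in which every transfer
# of a (presumed encrypted) data item leaves a tamper-evident trace.
import hashlib
import json

def record_transfer(chain, item_id, sender, receiver):
    """Append one data-flow event; each entry commits to the previous one."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    entry = {"item": item_id, "from": sender, "to": receiver, "prev": prev_hash}
    # Hash a canonical serialization of the entry body.
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    chain.append(entry)
    return entry

def verify(chain):
    """Re-derive every hash; any retroactive edit breaks the chain."""
    prev = "0" * 64
    for e in chain:
        body = {k: v for k, v in e.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True

log = []
record_transfer(log, "dataset-7", "exchange", "member-A")
record_transfer(log, "dataset-7", "member-A", "member-B")
assert verify(log)
log[0]["to"] = "outsider"   # tampering with history...
assert not verify(log)      # ...is detected
```

A real system would additionally need encryption, access control, and distributed replication of the log; this sketch only shows the traceability property.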
I hope that people with lofty ideals will contribute vigorously to this endeavor.
Transcript compiled by: @Grapefruit @Grass
(Responsible editor: The good of the Legacy)