1. What are the hottest and most famous high-tech startups in Silicon Valley?
In Silicon Valley, people are enthusiastic about entrepreneurship and will talk about it at any opportunity. Through my own observation and accumulation I have watched many popular startups emerge in recent years. Here I will use the Wall Street Journal's ranking of the most highly valued venture-backed startups, originally titled "The Billion Dollar Startup Club". I shared it in a lecture in China last year, and in less than a year, as of January 17, 2015, the rankings and valuations have changed dramatically.
First, seven companies now have valuations above ten billion dollars, whereas a year ago there was not a single one. Second, the company ranked first is the household name Xiaomi. Third, the vast majority of the top 20 (around 80%) are in the United States, and in particular in California, in Silicon Valley, in San Francisco: for example Uber, Airbnb, Dropbox, and Pinterest. Fourth, many are successes of similar models in different markets, such as Flipkart, the Taobao of the Indian market, or Uber and Airbnb in the sharing economy. So you can still find the next big opportunity in mobile (Uber), big data (Palantir), the consumer Internet, communications (Snapchat), payments (Square), and O2O apps. I have interviewed at many of these companies and gotten a feel for their environments.
2. With so many companies at such high valuations, does that mean there is a big bubble?
Looking at so many highly valued companies, many people feel it is crazy: isn't this a big bubble, and won't the bubble burst? That is a common doubt. I think Silicon Valley is a place for dreams; investors here encourage entrepreneurs to think big, and that also encourages bubbles. Many projects see their valuations double or triple within a few months, and companies like Uber and Snapchat raise financing on a huge scale. The chart to keep in mind is Gartner's "emerging technology hype cycle", which classifies technologies by their maturity and the expectations around them.
The stages are the "Innovation Trigger", the "Peak of Inflated Expectations", the "Trough of Disillusionment", the "Slope of Enlightenment", and the "Plateau of Productivity". The further left a technology sits, the trendier it is and the more it is still in the conceptual phase; the vertical axis represents expectations. A new technology typically emerges with rising expectations and is hyped by the media up to a peak; it then cools down as it hits technical bottlenecks or other problems, but as the technology matures, expectations rise again, users accumulate, and it reaches a healthy track of sustainable growth.
Gartner publishes this hype cycle every year. Comparing this year with last year, the Internet of Things, self-driving cars, consumer-grade 3D printing, and natural-language question answering are at the peak of the hype, big data has slipped down from the peak, and NFC and cloud computing are near the trough.
3. What are the future trends in high-tech entrepreneurship?
Let me start with a recent movie, "The Imitation Game", about Alan Turing, the founder of computer logic (the highest award in computer science is named after him). By breaking the German codes he made an outstanding contribution to victory in the Second World War and saved tens of millions of lives, but he was later convicted for homosexuality, sentenced to chemical castration, and committed suicide, ending his life at 41. One of his great contributions was to the foundations of artificial intelligence: he proposed the Turing test to judge whether a machine exhibits intelligence equivalent to, or indistinguishable from, a human's.
Today artificial intelligence has made great progress, from expert systems to statistical learning, from support vector machines to deep neural networks; each step takes machine intelligence up another rung.
Dr. Jun Wu, a senior scientist at Google and the author of "The Beauty of Mathematics" and "On Top of Tides", identifies three trends in current technology. First, cloud computing and the mobile Internet, which is happening now. Second, machine intelligence, which is just beginning to happen, although many people are not yet aware of its impact on society. Third, the combination of big data and machine intelligence, which is the future: it will certainly happen, and companies are already working on it, though not yet at large scale. He believes that in the future machines will control 98% of people, so we have to choose now: how do we become part of the remaining 2%?
4. Why is the combination of big data and machine intelligence certain to come?
Before the Industrial Revolution (around 1820), world GDP per capita had barely changed for the previous two or three thousand years; in the 180 years from 1820 to 2001, world GDP per capita rose from about 667 to 6,049 U.S. dollars. The income growth brought by the Industrial Revolution was truly earth-shaking, and it is worth thinking about why. Human progress did not stop there: with the invention of electricity, the computer, the Internet, and the mobile Internet, annual global GDP growth went from nearly stagnant to around 2%, and information has grown just as sharply. By one estimate, the information produced in the last two years equals the sum of the previous 30 years, and the last 10 years far exceed the total accumulated before. In the computer age there is the well-known Moore's law: at the same cost, the number of transistors doubles every 18 months, or equivalently, the cost of the same number of transistors halves. This rule has matched the last 30 years of development well, and similar laws can be derived for storage, computing power, bandwidth, and pixels.
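To see what an 18-month doubling means over those 30 years, here is a back-of-the-envelope calculation (my own sketch, not from the original article):

```python
# Back-of-the-envelope Moore's law arithmetic: capacity doubles every 18 months.
years = 30
doublings = years * 12 / 18            # 20 doubling periods in 30 years
growth = 2 ** doublings                # roughly a million-fold improvement at constant cost
print(f"{doublings:.0f} doublings -> {growth:,.0f}x")
```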
John von Neumann was one of the most important mathematicians of the 20th century and one of the great scientific polymaths, with foundational contributions to the modern computer, game theory, and nuclear weapons. He proposed that human history is approaching a singularity beyond which human affairs as we know them cannot continue; this is the famous singularity theory. Exponential growth is now accelerating. The American futurist Ray Kurzweil has said that humans may be able to live digitally by 2045; he founded Singularity University, believing that with exponential growth in information technology, wireless networks, biology, and physics, artificial intelligence will be realized by 2029 and human life expectancy will be greatly extended within the next 15 years.
5. Which big data companies abroad are worth paying attention to? And which in China?
This is the 2014 big data company landscape. We can roughly divide it into infrastructure and applications, with common underlying technologies such as Hadoop, Mahout, HBase, and Cassandra, which I will cover below. To cite a few examples: in the analytics layer, Cloudera, Hortonworks, and MapR are the Hadoop distributions, some focused on operations; MongoDB and Couchbase represent NoSQL; in the as-a-service space, AWS and Google BigQuery are booming; among traditional databases, Oracle has acquired MySQL, DB2 is entrenched in banking, and Teradata has done data warehousing for many years. There are even more applications on top: in the consumer and social space, Google, Amazon, Netflix, and Twitter; in business intelligence, SAP and GoodData; in advertising and media, Turn and RocketFuel; in intelligent operations, Sumo Logic; and so on. Last year's rising star Databricks rode the Spark wave and shook up the Hadoop ecosystem.
For the fast-growing Chinese market, big companies also mean big data, and all three of BAT (Baidu, Alibaba, Tencent) are investing in it.
I worked at Baidu five years ago. In the last two years Baidu has set up a Silicon Valley research institute and recruited Andrew Ng as chief scientist; the flagship project is Baidu Brain, which has greatly improved accuracy and recall in speech and image recognition, and recently it even built a self-driving bicycle, which is very interesting. Tencent, with the largest social applications, is also passionate about big data and has developed a massive C++-based storage system. Taobao's Singles' Day (November 11) last year was the main battlefield: sales broke 1 billion RMB within 2 minutes and the day's turnover exceeded 57.1 billion RMB, with many stories behind it. The engineers who once built Pyramid at Baidu (a three-tier distributed system modeled on Google's three classic papers) went on to create new legends with OceanBase. Aliyun was controversial for a time, and even Jack Ma wondered whether he had been fooled by Wang Jian, but surviving a Singles' Day proved Aliyun's reliability. Lei Jun at Xiaomi also pins high hopes on big data: on one hand the data grows geometrically, on the other hand storage and bandwidth are a huge cost, and a company that extracts no value from it would go bankrupt.
6. Hadoop is today's most popular big data technology. When it emerged, what made it so popular? What are the advantages of its design?
Looking at where Hadoop started, I have to mention how far ahead Google was. More than ten years ago, Google published three papers describing its distributed systems practice: GFS, MapReduce, and BigTable. The systems were very impressive, but nobody outside had seen them, and many people in the industry were itching to imitate the ideas. Doug Cutting, the author of Apache Nutch and Lucene, was one of them; he later joined Yahoo, which set up a dedicated team, and that is where Hadoop began and grew to large scale. As Yahoo's stars moved on to Facebook and Google, and founded big data companies such as Cloudera and Hortonworks, they brought the Hadoop practice to the rest of Silicon Valley. Google did not stop either: it published a new troika, Pregel, Caffeine, and Dremel, followed by much more, starting a new round of the open-source race.
Why is Hadoop well suited to big data? First, it scales out very well: you can increase system capacity simply by adding nodes. An important idea is to move computation rather than data, because moving data across the network is very expensive. Second, it is designed to run on cheap commodity machines whose individual disks may fail, yet it achieves high reliability through fault tolerance and redundancy at the system level. It is also very flexible: it can hold all kinds of data, binary, document-oriented, or record-oriented, and handle structured, semi-structured, and unstructured (so-called schemaless) data, computing on it on demand.
7. What are the companies and products around Hadoop?
When people mention Hadoop they generally do not mean a single thing but an ecosystem in which many components interact, covering I/O, processing, applications, configuration, and workflow. In practice, once several components interact, the maintenance headaches are only beginning. To name a few: the Hadoop core consists of HDFS, MapReduce, and Common; around it are the NoSQL stores Cassandra and HBase, the Hive data warehouse developed at Facebook, the Pig dataflow language driven mainly by Yahoo, the Mahout machine learning library, the Oozie workflow manager, and ZooKeeper, which plays the important role of master election in many distributed systems.
8. Can you explain the workings of Hadoop in a way that ordinary people can understand?
Let's start with HDFS, the Hadoop Distributed File System, which delivers genuinely strong fault tolerance and is optimized for locality and sequential storage: put simply, it allocates data in large blocks and reads them sequentially in one pass. If you were designing your own distributed file system, what would you do when a machine dies? First, you need a master that serves as the directory lookup (the NameNode); the data nodes then hold the data split into blocks, and a backup of a block cannot sit on the same machine as the original, otherwise when that machine dies the backup is lost too. HDFS uses rack awareness: it puts one copy on a machine in the same rack and another copy on a server in a different rack, perhaps even a different data center. If one data node fails, the copy is fetched from the other rack; machines within a rack are connected by a very fast intranet, and only if those fail as well does it have to fetch remotely. That is one approach; there is also erasure coding, a fault-tolerance technique from the communications field that saves space while still tolerating failures, which interested readers can look up.
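To make the rack-aware placement concrete, here is a toy sketch in Python (my own illustration, not HDFS code; the rack layout and node names are hypothetical): one replica stays on the writer's rack and the remaining replicas go to a different rack, so losing a whole rack never loses every copy.

```python
# Toy sketch of rack-aware replica placement, in the spirit of the HDFS policy above.
import random

def place_replicas(racks, local_rack, replication=3):
    """racks: dict mapping rack name -> list of node names (hypothetical layout)."""
    placements = [random.choice(racks[local_rack])]            # replica 1: writer's own rack
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_nodes = random.sample(racks[remote_rack], k=min(2, len(racks[remote_rack])))
    placements.extend(remote_nodes)                            # replicas 2 and 3: a remote rack
    return placements[:replication]

racks = {"rack-A": ["a1", "a2", "a3"], "rack-B": ["b1", "b2", "b3"]}
print(place_replicas(racks, local_rack="rack-A"))
```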
Next, MapReduce. It is first of all a programming paradigm: a batch job is split into two phases. In the map phase, the data is turned into key-value pairs and then sorted; in an intermediate step called the shuffle, pairs with the same key are transported to the same reducer; on the reducer, because all pairs with a given key are guaranteed to arrive together, they can be aggregated directly, for example summed, and the results are finally written back to HDFS. As a developer, all you need to write are the map and reduce functions; the intermediate sorting, the shuffle's network transfers, and fault tolerance are handled by the framework.
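Here is a minimal word-count sketch in that style, written in Python following the Hadoop Streaming convention (my own example; in a real job the framework, not this script, performs the sort and shuffle between the two functions):

```python
# Minimal word count in the map/reduce style: map emits (key, value) pairs,
# reduce aggregates the values of each key after the sort/shuffle step.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                      # map: emit (word, 1) pairs

def reducer(sorted_pairs):
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(v for _, v in group)   # reduce: sum counts per word

if __name__ == "__main__":
    pairs = sorted(mapper(sys.stdin))          # the framework normally does sort + shuffle
    for word, count in reducer(pairs):
        print(f"{word}\t{count}")
```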
9. What problems does the MapReduce model itself have?
First, you end up writing a lot of low-level code, which is inefficient; second, everything must be forced into the two operations map and reduce, which is awkward in itself and cannot express every situation.
10. Where did Spark come from? What advantages does Spark's design have over Hadoop MapReduce?
Spark appeared precisely to solve those problems. A bit of background first: Spark came out of Berkeley's AMPLab in 2010, was published at HotCloud, and is a successful example of academia crossing into industry; it also attracted investment from the top VC Andreessen Horowitz. In 2013 its leading figures (Berkeley faculty and MIT's youngest assistant professor among them) left AMPLab to found Databricks, which made countless Hadoop heavyweights take notice. Written in the functional language Scala, Spark is essentially an in-memory computation framework covering iterative computation, DAG computation, and stream computation. MapReduce was often mocked for its inefficiency, and Spark felt fresh by comparison: Reynold Xin, a core Spark developer, has described Spark beating Hadoop on performance with implementations only 1/10 or even 1/100 the size. In last year's Sort Benchmark, Spark sorted 100 TB in 23 minutes, breaking the world record held by Hadoop.
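For comparison, here is the same word count as a short PySpark sketch (the input path is a placeholder of mine): the map, shuffle, and reduce stages collapse into a few chained operations, and cache() keeps the intermediate result in memory so it can be reused iteratively.

```python
# Word count in Spark: chained, in-memory operations instead of a rigid map/reduce job.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
counts = (sc.textFile("hdfs:///tmp/input.txt")      # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .cache())                                # keep the RDD in memory for reuse
print(counts.take(10))
sc.stop()
```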
11. For someone who wants to work in big data, can you recommend some effective ways to learn? What books would you recommend?
My first suggestion is still to build a good foundation. Hadoop is hot, but its underlying principles come from books that have accumulated over many years: Introduction to Algorithms, the UNIX design philosophy, database principles, Computer Systems: A Programmer's Perspective, and Java design patterns are all weighty references. Hadoop's classic, Hadoop: The Definitive Guide, is one I have also shared before.
Next, choose your target. If you want to be a data scientist, I can recommend the Coursera Data Science course, which is easy to follow, and learning basic tools such as Hive and Pig. If you work on the application layer, the main thing is to become familiar with the Hadoop workflow, including some basic tuning. If you want to do architecture, then beyond building clusters you need to understand the basic software services very well, as well as machine bottlenecks, load management, and Linux performance tools. Finally, practice: big data is learned by doing. Start by writing the examples from a book against the API and getting them to run, then keep accumulating, so that when you hit similar problems you can find the corresponding classic patterns. Beyond that come real problems, where perhaps nobody around you has the answer and you need some inspiration and good online searching skills, and then you make the best choice for your actual situation.
12. The technology most closely related to big data is cloud computing. You worked in Amazon's cloud computing division; can you briefly introduce Amazon's Redshift framework?
I worked in Amazon's cloud computing division, so I know AWS fairly well. Its overall maturity is high, and many startups are built on it, including famous ones such as Netflix, Pinterest, and Coursera. Amazon keeps innovating, and its annual re:Invent conference promotes new cloud products and shares success stories. To name a few: S3 is simple object storage, DynamoDB complements relational databases, Glacier archives cold data, Elastic MapReduce packages MapReduce as a computing service, EC2 provides the basic virtual machines, and Data Pipeline offers a graphical interface for chaining work tasks together.
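As a taste of how these services are driven programmatically, here is a small sketch using boto3, the AWS SDK for Python (the bucket, file, and table names are placeholders of mine, not anything from the article):

```python
# Minimal boto3 sketch: put an object into S3 and a record into DynamoDB.
import boto3

s3 = boto3.client("s3")
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")  # hypothetical bucket

ddb = boto3.resource("dynamodb")
table = ddb.Table("example-table")                                       # hypothetical table
table.put_item(Item={"id": "123", "payload": "hello"})
```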
Redshift uses a massively parallel processing (MPP) architecture and is a very convenient data warehouse solution: it has a SQL interface and connects seamlessly with the other cloud services. Its biggest feature is speed, with very good performance from the terabyte to the petabyte level, and I have used it directly in my work. It also supports different hardware platforms; if you want it even faster you can use SSDs, at the cost of smaller capacity.
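Because Redshift exposes a PostgreSQL-compatible SQL interface, querying it from Python looks like querying any SQL database. Below is a minimal sketch with psycopg2; the cluster endpoint, credentials, and table are all hypothetical:

```python
# Minimal sketch of querying Redshift over its SQL interface with psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="analyst", password="***",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT course_id, COUNT(*) FROM enrollments GROUP BY course_id LIMIT 10;")
    for row in cur.fetchall():
        print(row)
conn.close()
```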
13. What big data open-source technologies does LinkedIn use?
At LinkedIn there are many data products, such as people you may know, jobs you may be interested in, who has viewed your profile, and even mining of your career path. LinkedIn also uses a lot of open-source technology, and the one I would single out as most successful is Kafka, a distributed message queue used for tracking, internal machine metrics, and data transport. Data flows between many front-end and back-end stores and platforms, each with its own format; without a unified log you get a disastrous O(m*n) data-integration complexity, and whenever one format changes, everything connected to it has to be modified. The bridge proposed in the middle is Kafka: everyone agrees on a format as the transport standard, and on the receiving end you subscribe to whatever data sources (topics) you want, achieving linear O(m+n) complexity. For the design details, refer to the design documentation. The main authors, Jay Kreps and Jun Rao, later left to found a company (Confluent) dedicated to developing Kafka.
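As an illustration of the agreed-format, publish/subscribe idea, here is a hedged sketch using the kafka-python client (not LinkedIn's internal tooling; the broker address and topic name are placeholders):

```python
# Producers write one agreed-upon format to a topic; consumers subscribe to the
# topics they need, which is what turns O(m*n) integrations into O(m+n).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "u42", "page": "/jobs"})  # hypothetical topic
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break
```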
At LinkedIn, Hadoop is the mainstay of batch processing and is used heavily across the product lines, for example in the advertising group: on one hand we needed flexible query analysis of advertiser matching and of forecast versus actual ad results, and report generation is also backed by Hadoop. If you want to interview with a LinkedIn back-end group, I recommend looking at the design ideas behind Hive, Pig, Azkaban (a data flow management tool), the Avro data definition format, Kafka, and Voldemort; LinkedIn has a dedicated open-source community and builds its own technology brand.
14. What characterizes Coursera's big data architecture compared with other Silicon Valley startups? What causes and technical choices led to those characteristics?
Coursera is a mission-driven company. We do not pursue technology for its own sake; we serve teachers and students well, solve their pain points, and share their success. That is the biggest difference from other technology companies. We are also still in an early accumulation stage, and truly large-scale computation has not arrived yet, so we keep learning and adapting to change in order to sustain the rapid growth of a startup.
As a startup, Coursera wants to be agile and efficient. Technically, everything is developed on top of AWS; you can spin up cloud services freely and run experiments. We are divided roughly into a product group, an architecture group, and a data analysis group. I have listed all the development technologies we use. Because the company is relatively young there are no legacy migration problems, so we boldly use Scala as the main programming language and Python for scripting and control. The product group builds the course products, using the Play Framework and JavaScript with Backbone as the control center. The architecture group mainly maintains the underlying storage, common services, performance, and stability.
I am on the data team of more than ten people. Part of the team monitors, mines, and improves commercial products and core growth metrics; part builds the data warehouse and keeps data flowing seamlessly between departments. We use a lot of technology, for example writing Hadoop MapReduce jobs with Scalding, and some people build the A/B testing framework and the recommendation system, trying to do as much as possible with the least manpower. Beyond the open-source world we also actively use third-party products, such as Sumo Logic for log and error analysis, Redshift as the big data analysis platform, and Slack for internal communication. All of this is to free up productivity and focus on user experience, product development, and iteration.