CS224D Lecture 1 Notes (Stanford)


I recently started watching Stanford's CS224D, Stanford's newest practical course on deep learning for NLP. Interested readers can find the course on Stanford's website.

This article is just my study notes, so errors and omissions are inevitable. If you spot a mistake or an omission, please don't hesitate to tell me; thanks in advance.

The course materials are quite complete: the videos have been uploaded to YouTube, and the various materials are available for download on the official website. Thanks to Stanford's generosity, anyone in the world interested in deep learning has access to cutting-edge work.

Without further ado, on to the notes.

The course has many prerequisites. If you don't meet them yet, I suggest holding off, since reading on without them won't make much sense.

Lecture 1 is an overview of deep learning for natural language processing. Like the opening of most courses, it establishes a framework; each subsequent lecture fills it in with a detailed explanation of a particular topic.

1. Video Description

The first part of the video covers course logistics: schedule, requirements, and so on.

The second part of the video follows this pipeline: NLP description -> NLP levels -> NLP applications -> NLP in industry -> DL description -> ML vs. DL -> history of DL -> reasons for exploring DL -> DL applications -> deep NLP application examples.


The "NLP levels" are essentially the NLP pipeline:

(a) Getting language into the computer. Human language arrives through two main channels: speech and text. Both first pass through low-level analysis (the slides call this morphological analysis): speech signals -> text inside the computer, and human handwriting / printed books -> text inside the computer.

(b) The next two levels analyze the text input so the computer can "understand" human language: first syntactic analysis (grammar), then semantic analysis (meaning). Looking at the recommended reading, I found that the two steps are usually intertwined in practice, while the chart presents syntactic analysis as preceding semantic analysis.


DL Description

Deep learning (DL) is a branch of machine learning (ML). Its distinctive focus is feature learning: a DL model can automatically learn features from raw data, and ML then uses the learned features for prediction.


2. Linear Algebra Review

This review reads more like a handbook than a lecture. As a reference book I still prefer MIT professor Gilbert Strang's Introduction to Linear Algebra, which is available online; whenever I run into something I've forgotten, looking it up always yields something new. There is also The Matrix Cookbook, a handbook-style reference that is great to pull out when you can't remember a formula.


3. Convex Optimization Overview

It starts with convex sets and then convex functions; the second section is mainly about what "convex" means and how to judge whether a set or a function is convex.

The fourth section is the highlight: a convex problem is guaranteed to have a global optimum and no merely local optima, which is a wonderful property. It means optimization can't be led astray: like a bowl, if you drop a ping-pong ball in, it can only come to rest at the bottom; it can't stay stuck on the wall (unless, of course, you tilt the bowl). A non-convex problem is like the surface of the moon: the ball falls into a crater and can't get out, so the global optimum is hard to find.
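The bowl metaphor can be sketched in a few lines of Python (a toy example of my own, not course code): gradient descent on a convex quadratic reaches the unique global minimum from any starting point.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Take `steps` gradient steps from x0 and return the final point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = ||x||^2 is convex (the "bowl"); its gradient is 2x.
grad = lambda x: 2.0 * x

# From any start, descent converges to the unique global minimum at 0.
for start in ([5.0, -3.0], [-100.0, 42.0]):
    x_final = gradient_descent(grad, start)
    print(np.allclose(x_final, 0.0, atol=1e-6))  # True
```

With a non-convex objective the same loop can stall in whichever "crater" the starting point happens to fall into.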

Section 4.2 introduces several common classes of convex problems, though I still didn't quite follow the last kind, semidefinite programming. A convex problem has one defining feature: the objective satisfies the conditions of a convex function and the constraints form a convex set, so that a bowl can be constructed instead of a moon. (-_-||)

Finally, two examples show how to cast a concrete problem into the standard convex form. The standard form mainly exists so that off-the-shelf solvers can find the optimum easily. Sometimes, though, off-the-shelf software isn't efficient enough, and you have to exploit the structure of the specific problem and write your own optimization routine.


4. Stochastic Gradient Descent (SGD)

There is no dedicated review handout for this topic, which confused me for a while; the material appears to come from the cs231n course notes instead. If you go read those notes, I suggest also looking at the summary at the bottom of each of the first few chapters.

SGD (stochastic gradient descent) differs from batch gradient descent.

Personally, I think the main point of this review is the score function and loss function introduced in the first part; the SGD material itself is comparatively less important.

The notes first introduce two examples of what not to do for optimization.

The first is random search: you jump to random points in the bowl, record the height each time, and keep a running minimum, updating it whenever you find something lower.

The second is random local search: you start at a random point in the bowl, randomly probe nearby points (say, 1000 tries), move in whichever direction goes lowest, and then repeat the process.

The last one is the main event: move in the direction of the gradient. There are two ways to compute the gradient. The first is a numerical approximation via finite differences: easy to compute but slow and imprecise. The second is the analytic solution via calculus: fast and exact, but easy to get wrong. Of course we use the second in practice, while the first is used to check the correctness of the analytic solution. Once you have the direction, the step size also matters: like the ping-pong ball, if it moves too fast it can fly right out of the bowl.
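The gradient-check idea, using a slow finite-difference gradient to verify a fast analytic one, can be sketched like this (a minimal illustration of my own; cs231n's actual code differs):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite differences: slow, but a good correctness check."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad

# Toy loss: f(x) = sum(x^2); its analytic gradient is 2x.
f = lambda x: np.sum(x ** 2)
analytic = lambda x: 2 * x

x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(f, x)
# The two estimates should agree to several decimal places.
print(np.allclose(num, analytic(x), atol=1e-4))  # True
```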

Knowing the direction and the step size, the next question is what data to base each step on. Using all the data makes each step accurate but inefficient; a better way is mini-batch gradient descent: randomly select a small subset and step in the direction computed from that subset. When the mini-batch shrinks to a single example, you get SGD.
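A minimal mini-batch gradient descent on toy linear-regression data (my own sketch; the data, learning rate, and batch size are illustrative assumptions, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + small noise.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ w_true + 0.01 * rng.normal(size=500)

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=50):
    """Mini-batch gradient descent on mean-squared error.
    batch_size=len(X) is full-batch descent; batch_size=1 is classic SGD."""
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w = minibatch_sgd(X, y)
print(np.allclose(w, w_true, atol=0.05))  # True
```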


5. From Frequency to Meaning

This paper mainly surveys the application of vector space models (VSMs) to semantics.

A VSM pipeline has four main steps: build the appropriate matrix -> linguistic processing (initial processing of the raw material) -> mathematical processing (further processing of the matrix from the previous step) -> comparison (the actual similarity computation).

a) Matrices. Other matrices may exist, but three kinds are the ones mainly used today, and they work well. The first is the term-document matrix: rows represent words, columns represent documents, and that's all there is to it. The second is the word-context matrix: rows again represent words, but columns represent contexts; you can define the scope of the context yourself (e.g. with a window of 2, all words within +/-2 positions of the target word count as its context), so contexts are extracted from the documents. The third, the pair-pattern matrix, is the most complex. A "pair" is two words that co-occur in a "pattern", where a pattern is, as I understand it, a kind of sentence template. For example, in "Xu Zhimo loves Lin Huiyin", "X loves Y" is the pattern and (Xu Zhimo, Lin Huiyin) is the pair. If a pair appears often in a pattern, the relation the pattern expresses is very likely to hold between them; conversely, pairs that appear in the same patterns, such as the "love" pattern, are likely to stand in similar relations.
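A word-context matrix with the window described above can be built in a few lines (my own toy sketch; the corpus and window size are illustrative):

```python
from collections import Counter

def word_context_matrix(tokens, window=2):
    """Count co-occurrences of each word with the words within +/- `window`."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
m = word_context_matrix(tokens, window=2)
print(m[("cat", "sat")])  # 1: "sat" occurs once within 2 words of "cat"
print(m[("the", "cat")])  # 1
```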

Types and tokens: a "type" assigns all occurrences of the same spelling a single index, while "tokens" give each occurrence its own index depending on its context. Obviously the latter can handle a word with more than one sense; the former cannot.

Section 2.7 of the paper is a survey of matrix variants and can be used as a reference.

b) Linguistic processing

As the name suggests, linguistic processing is relatively coarse language-based preprocessing. The pipeline is:

Tokenization -> Normalization (optional) -> Annotation (optional)

Tokenization, simply put, splits text on punctuation and then removes useless high-frequency words (stopwords) such as "the" and "is".
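A minimal tokenizer along these lines (the stop-word list here is my own illustrative choice, not the paper's):

```python
import re

STOPWORDS = {"the", "is", "a", "of", "and"}  # illustrative stop-list only

def tokenize(text):
    """Lowercase, split on non-letter characters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("The cat is on the mat, and the mat is red."))
# ['cat', 'on', 'mat', 'mat', 'red']
```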

Normalization reduces the different inflected forms of a word to its root. This situation is rare in Chinese but common in English: the plural -s, the past -ed, -ing, and so on. The normalization section also introduces the concepts of precision and recall, which confused me at first, so here is an example. Let set A be the documents truly relevant to a query, and set B the documents we claim are relevant to the query; let C = A ∩ B. Then recall = |C|/|A| and precision = |C|/|B|. Normalization tends to improve recall and lower precision.
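The definitions above in code (a toy example with made-up document IDs):

```python
def precision_recall(relevant, retrieved):
    """relevant = set A (truly relevant docs); retrieved = set B (docs we returned)."""
    c = relevant & retrieved          # C = A intersect B
    recall = len(c) / len(relevant)   # |C| / |A|
    precision = len(c) / len(retrieved)  # |C| / |B|
    return precision, recall

A = {1, 2, 3, 4}   # truly relevant documents
B = {3, 4, 5}      # documents we retrieved
p, r = precision_recall(A, B)
print(p, r)  # 0.6666666666666666 0.5
```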

Annotation is roughly the reverse operation of normalization: it marks a word with a specific state, for example tagging whether "program" is being used as a noun or a verb. Annotation tends to improve precision and lower recall.


c) Mathematical processing

Weighting the elements: the raw matrix by itself still doesn't solve problems very well, so next comes mathematical processing. An analogy: suppose you must distinguish two Chinese men by their faces alone. Most Chinese people have black hair, black eyes, and similar noses and ears, so you can't distinguish the two by traits most people share; but if one man is missing an incisor and the other wears earrings and lipstick, telling them apart becomes easy. This is a layman's account of the first sentence of section 4.2: "the idea of weighting is to give more weight to surprising events and less weight to expected events." A surprising event has higher information content than an expected one. An element gets a high weight when the corresponding term is frequent in the corresponding document but rare in the other documents of the corpus.

Mathematically, the sentence above is captured by mutual information from information theory; the paper gives the concrete formula, which I won't repeat here.
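As a sketch of the weighting idea, here is positive pointwise mutual information (PPMI), a common mutual-information weighting for co-occurrence matrices (the counts below are made up):

```python
import numpy as np

def ppmi(counts):
    """Positive PMI weighting of a word-context co-occurrence matrix."""
    total = counts.sum()
    p_ij = counts / total                      # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)      # row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)      # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0               # log(0) -> 0
    return np.maximum(pmi, 0.0)                # clip negative PMI to 0

counts = np.array([[10.0, 0.0],
                   [1.0,  9.0]])
W = ppmi(counts)
print(W[0, 0] > 0.0, W[0, 1] == 0.0)  # True True: surprising pair up-weighted
```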

Smoothing the matrix: the matrix is smoothed using SVD (singular value decomposition). The method is very simple, so I won't repeat it.
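A truncated SVD keeps only the top-k singular values, which is the smoothing step in a nutshell (the matrix below is a toy example of my own):

```python
import numpy as np

def truncated_svd(M, k):
    """Keep only the top-k singular values: a low-rank 'smoothed' reconstruction."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

M = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [0.0, 2.0, 0.0]])
M_smooth = truncated_svd(M, k=2)
print(M_smooth.shape)                         # (4, 3) -- same shape as M
print(np.linalg.matrix_rank(M_smooth) <= 2)   # True -- but rank at most 2
```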


d) Comparing the vectors

You can compare either row vectors or column vectors. Comparing row vectors infers the similarity (or dissimilarity) between words and can be used for word classification and clustering. Comparing column vectors infers the similarity between documents, contexts, or patterns and can be used for information retrieval and document categorization.
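Row (or column) vectors are typically compared with cosine similarity; here is a toy sketch with made-up count vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal ones."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rows of a term-document matrix: word counts per document.
dog = np.array([2.0, 0.0, 1.0])
cat = np.array([3.0, 0.0, 1.0])
car = np.array([0.0, 4.0, 0.0])

print(cosine(dog, cat) > cosine(dog, car))  # True: "dog" is closer to "cat"
```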


e) Three open-source VSM systems

This section describes one open-source VSM project for each kind of matrix, and it can serve as a reference for the course project. The applications chapter points out the practical uses of each VSM and is likewise useful project reference material.


6. Python Tutorial

Having used Python, I found it remarkably handy, almost as simple as pseudocode. Not much to say here: read the tutorial, then practice against the official manuals. Links: Python 2.7.10 manual, NumPy 1.9 manual, SciPy manual, matplotlib manual.


7. Lecture Notes

The notes contain a lot of material beyond the first lecture's video; I'd guess they double as a preview and summary of the second lecture, so they can be read as preparation.
