An RDBMS summary of introduction to Data Science in public class

Last Update:2015-06-03 Source: Internet

Author: User

Tags sqlite sqlite query

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Recently in the learning public lesson Introduction to the Data science

The second part is about relational database and relational algebra (relational databases,relational Algebra)

Which speaks of relational databases.

The relationship between SQL and RA

Select selection, minus σC (r) to indicate the column in query table R that meets the criteria C.
For example σSalary > 40000 (Employee)
Represents "SELECT * from Employee where Salary > 40000"
Projection projection, minus πname , salary (Employee)

means "select Name, Salary from employee;
Set operation, in accordance with R 650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "alt=" \times "style= "Background:url ("/e/u261/lang/zh-cn/images/localimage.png ") no-repeat center;border:1px solid #ddd;"/> S that represents R and S do Cartesian product (all combinations of columns in R and s total)
e.g. J X P
Represents "select * from J,p; If J is a table with 2 rows and 3 columns, p is a table with 4 rows and 5 columns, then the result is a 2*4 row 3+5 column table.
Connect Join, Conditional connection, left connection, right connection

650) this.width=650; "Src="/e/u261/themes/default/images/spacer.gif "Style=" Background:url ("/e/u261/lang/zh-cn/images/localimage.png") no-repeat center;border:1px solid #ddd; "alt=" Spacer.gif "/> 4.1 Conditional connection, symbol r θ s, represents a combination of qualifying θ in RxS, such as "select * from j,p where j.id = p.id" Equals select * from J join P on &NBSP;J.I D = p.id ", INNER join is the same as join.
4.2 Left outer connection r 650) this.width=650; "alt=" \ltimes "src=" http://upload.wikimedia.org/math/d/d/f/ Ddf0bdae0e2fc3d918948fe0c7c8bef9.png "/> S", which represents a combination of the matching conditions in R and S, if there is no corresponding entry in S, its value is considered null, such as "SELECT * from P left OUTER JOIN on p.age = J.age; " Where the left OUTER join equals the LEFT join
650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "style=" Background:url ("/e/u261/lang/zh-cn/ Images/localimage.png ") no-repeat center;border:1px solid #ddd;" alt= "spacer.gif"/>4.3 right outer link, opposite to left outer link.

5. Union symbol R1∪R2, which represents the combination of two tables, which must have the same format, such as select Age from P Union (select age from J); If there are two columns of age in P, and J has two columns of age (2,3), then the result is (three-way), notice that there is a duplicate record in the cell, and the union all can be used to get (1,2,2,3) all results with duplicate records.

6. Intersection symbol R1∩R2, which indicates that R1 is the same record as R2, R1∩R2 = r1– (R1–R2), r1∩r2 = R1 _{θr2 condition is the same as record.}

7. Difference symbol R1-R2, which indicates a record in the R1 that is not found in the R2, select * from P where the age is not in (select Age from J);

SQL optimization

In the course, we talked about SQL is. It means that SQL defines the results we need, not how they work. In fact DB will do some optimizations on SQL such as SQLite's sqlite Query Planner, Mysql optimization optimization, in fact, still need manual optimization

in the homework similarity matrix, we need to calculate the similarity of 22 documents, which is actually a matrix operation.

1) The code is as follows, spents 1m22.042s

Select X.docid,y.docid,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < y . DocId GROUP BY X.docid, Y.docid ORDER by Count ASC;

2) Submit the answer only need, one of the results, time 1m10.919s, you can see here is actually calculated the similarity of all documents intercepted, DB is not optimized.

SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid GROUP by X.docid, Y.docid ORDER by Count ASC) x where x.a= "10080_txt_crude" and x.b= "17 035_txt_earn ";

3) Manually put the above where limit statement into the calculation process, time 0m0.005s

SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid and x.docid= "10080_txt_crude" and y.docid= "17035_txt_earn" group by X.docid, Y.docid OR DER by Count ASC) x;

Put the above example into MySQL SQL 1) execution, run for 10 minutes no results!

Answers to the questions after class

It's uploaded to GitHub.

Https://github.com/trumanz/coursera/tree/master/datasci/assignment_Relation_DB

This article is from the "10305054" blog, please be sure to keep this source http://10315054.blog.51cto.com/10305054/1657686

An RDBMS summary of introduction to Data Science in public class

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More