Recently in the learning public lesson Introduction to the Data science
The second part is about relational database and relational algebra (relational databases,relational Algebra)
Which speaks of relational databases.
Select selection, minus σC (r) to indicate the column in query table R that meets the criteria C.
For example σSalary > 40000 (Employee)
Represents "SELECT * from Employee where Salary > 40000"
-
Projection projection, minus πname , salary (Employee)
means "select Name, Salary from employee;
Set operation, in accordance with R 650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "alt=" \times "style= "Background:url ("/e/u261/lang/zh-cn/images/localimage.png ") no-repeat center;border:1px solid #ddd;"/> S that represents R and S do Cartesian product (all combinations of columns in R and s total)
e.g. J X P
Represents "select * from J,p; If J is a table with 2 rows and 3 columns, p is a table with 4 rows and 5 columns, then the result is a 2*4 row 3+5 column table.
Connect Join, Conditional connection, left connection, right connection
-
650) this.width=650; "Src="/e/u261/themes/default/images/spacer.gif "Style=" Background:url ("/e/u261/lang/zh-cn/images/localimage.png") no-repeat center;border:1px solid #ddd; "alt=" Spacer.gif "/> 4.1 Conditional connection, symbol r θ s, represents a combination of qualifying θ in RxS, such as "select * from j,p where j.id = p.id" Equals select * from J join P on  J.I D = p.id ", INNER join is the same as join.
4.2 Left outer connection r 650) this.width=650; "alt=" \ltimes "src=" http://upload.wikimedia.org/math/d/d/f/ Ddf0bdae0e2fc3d918948fe0c7c8bef9.png "/> S", which represents a combination of the matching conditions in R and S, if there is no corresponding entry in S, its value is considered null, such as "SELECT * from P left OUTER JOIN on p.age = J.age; " Where the left OUTER join equals the LEFT join
650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "style=" Background:url ("/e/u261/lang/zh-cn/ Images/localimage.png ") no-repeat center;border:1px solid #ddd;" alt= "spacer.gif"/>4.3 right outer link, opposite to left outer link.
5. Union symbol R1∪R2, which represents the combination of two tables, which must have the same format, such as select Age from P Union (select age from J); If there are two columns of age in P, and J has two columns of age (2,3), then the result is (three-way), notice that there is a duplicate record in the cell, and the union all can be used to get (1,2,2,3) all results with duplicate records.
6. Intersection symbol R1∩R2, which indicates that R1 is the same record as R2, R1∩R2 = r1– (R1–R2), r1∩r2 = R1 θr2 condition is the same as record.
7. Difference symbol R1-R2, which indicates a record in the R1 that is not found in the R2, select * from P where the age is not in (select Age from J);
SQL optimization
In the course, we talked about SQL is. It means that SQL defines the results we need, not how they work. In fact DB will do some optimizations on SQL such as SQLite's sqlite Query Planner, Mysql optimization optimization, in fact, still need manual optimization
in the homework similarity matrix, we need to calculate the similarity of 22 documents, which is actually a matrix operation.
1) The code is as follows, spents 1m22.042s
Select X.docid,y.docid,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < y . DocId GROUP BY X.docid, Y.docid ORDER by Count ASC;
2) Submit the answer only need, one of the results, time 1m10.919s, you can see here is actually calculated the similarity of all documents intercepted, DB is not optimized.
SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid GROUP by X.docid, Y.docid ORDER by Count ASC) x where x.a= "10080_txt_crude" and x.b= "17 035_txt_earn ";
3) Manually put the above where limit statement into the calculation process, time 0m0.005s
SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid and x.docid= "10080_txt_crude" and y.docid= "17035_txt_earn" group by X.docid, Y.docid OR DER by Count ASC) x;
Put the above example into MySQL SQL 1) execution, run for 10 minutes no results!
Answers to the questions after class
It's uploaded to GitHub.
Https://github.com/trumanz/coursera/tree/master/datasci/assignment_Relation_DB
This article is from the "10305054" blog, please be sure to keep this source http://10315054.blog.51cto.com/10305054/1657686
An RDBMS summary of introduction to Data Science in public class