An RDBMS summary of introduction to Data Science in public class

Source: Internet
Author: User
Tags sqlite sqlite query

Recently in the learning public lesson Introduction to the Data science

The second part is about relational database and relational algebra (relational databases,relational Algebra)

Which speaks of relational databases.


    • The relationship between SQL and RA

  1. Select selection, minus σC (r) to indicate the column in query table R that meets the criteria C.

    For example σSalary > 40000 (Employee)

    Represents "SELECT * from Employee where Salary > 40000"

  2. Projection projection, minus πname , salary (Employee)

    means "select Name, Salary from employee;

  3. Set operation, in accordance with R 650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "alt=" \times "style= "Background:url ("/e/u261/lang/zh-cn/images/localimage.png ") no-repeat center;border:1px solid #ddd;"/> S that represents R and S do Cartesian product (all combinations of columns in R and s total)

    e.g. J X P

    Represents "select * from J,p; If J is a table with 2 rows and 3 columns, p is a table with 4 rows and 5 columns, then the result is a 2*4 row 3+5 column table.



  4. Connect Join, Conditional connection, left connection, right connection

  •  650) this.width=650; "Src="/e/u261/themes/default/images/spacer.gif "Style=" Background:url ("/e/u261/lang/zh-cn/images/localimage.png") no-repeat center;border:1px solid #ddd; "alt=" Spacer.gif "/> 4.1   Conditional connection, symbol r  θ s,  represents a combination of qualifying θ in RxS, such as "select * from j,p where j.id = p.id" Equals  select * from J  join P on  J.I D = p.id ",   INNER join is the same as join.  

  • 4.2 Left outer connection r 650) this.width=650; "alt=" \ltimes "src=" http://upload.wikimedia.org/math/d/d/f/ Ddf0bdae0e2fc3d918948fe0c7c8bef9.png "/> S", which represents a combination of the matching conditions in R and S, if there is no corresponding entry in S, its value is considered null, such as "SELECT * from P left OUTER  JOIN on p.age = J.age; " Where the left OUTER join equals the LEFT join

  • 650) this.width=650; "src="/e/u261/themes/default/images/spacer.gif "style=" Background:url ("/e/u261/lang/zh-cn/ Images/localimage.png ") no-repeat center;border:1px solid #ddd;" alt= "spacer.gif"/>4.3 right outer link, opposite to left outer link.  


5. Union symbol R1∪R2, which represents the combination of two tables, which must have the same format, such as select Age from P Union (select age from J); If there are two columns of age in P, and J has two columns of age (2,3), then the result is (three-way), notice that there is a duplicate record in the cell, and the union all can be used to get (1,2,2,3) all results with duplicate records.

6. Intersection symbol R1∩R2, which indicates that R1 is the same record as R2, R1∩R2 = r1– (R1–R2), r1∩r2 = R1 θr2 condition is the same as record.

7. Difference symbol R1-R2, which indicates a record in the R1 that is not found in the R2, select * from P where the age is not in (select Age from J);

SQL optimization

In the course, we talked about SQL is. It means that SQL defines the results we need, not how they work. In fact DB will do some optimizations on SQL such as SQLite's sqlite Query Planner, Mysql optimization optimization, in fact, still need manual optimization

in the homework similarity matrix, we need to calculate the similarity of 22 documents, which is actually a matrix operation.

1) The code is as follows, spents 1m22.042s


Select X.docid,y.docid,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < y . DocId GROUP BY X.docid, Y.docid ORDER by Count ASC;

2) Submit the answer only need, one of the results, time 1m10.919s, you can see here is actually calculated the similarity of all documents intercepted, DB is not optimized.

SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid GROUP by X.docid, Y.docid ORDER by Count ASC) x where x.a= "10080_txt_crude" and x.b= "17 035_txt_earn ";


3) Manually put the above where limit statement into the calculation process, time 0m0.005s

SELECT * FROM (select X.docid as a, y.docid as B,sum (X.count*y.count) as Count from Frequency X, Frequency y where x.term = Y.term and X.docid < Y.docid and x.docid= "10080_txt_crude" and y.docid= "17035_txt_earn" group by X.docid, Y.docid OR DER by Count ASC) x;

Put the above example into MySQL SQL 1) execution, run for 10 minutes no results!



Answers to the questions after class

It's uploaded to GitHub.

Https://github.com/trumanz/coursera/tree/master/datasci/assignment_Relation_DB



This article is from the "10305054" blog, please be sure to keep this source http://10315054.blog.51cto.com/10305054/1657686

An RDBMS summary of introduction to Data Science in public class

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.