Broad Steps
Set up a new AWS VPC (this step is optional, so you don't have to follow along if you don't want to).
Stanford runs an entire AWS VPC devoted to analytics, which hosts:
- The analytics report, API application, and dashboard application databases,
- The Elastic MapReduce (EMR) cluster,
- The task scheduler (for which we use Jenkins),
- The API servers, and
- The dashboard app servers.
Our data VPC also has a peering connection to our prod VPC, so that the EMR cluster machines can access our production RDS read replica, which is needed for some of the analytics tasks.
Note that none of this is necessary. Everything will work fine as long as you can set up a cluster, the app machines, and the databases, and they can all connect to each other as needed.
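Whatever network layout you choose, the requirement that the machines can reach each other can be checked with a small script. The following is only a sketch: the host names and ports in ENDPOINTS are hypothetical placeholders, not values from Stanford's setup, and a successful TCP connection only shows that the port is reachable, not that credentials are valid.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Hypothetical endpoints; replace them with your own database and API hosts.
ENDPOINTS = {
    "reports-db": ("reports-db.example.internal", 3306),
    "data-api": ("data-api.example.internal", 443),
}


def check_all(endpoints):
    """Map each endpoint name to whether it is reachable from this machine."""
    return {
        name: can_connect(host, port)
        for name, (host, port) in endpoints.items()
    }
```

Running check_all(ENDPOINTS) from a cluster node, the scheduler box, and an app server in turn would show which connections still need a security group or peering change.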
Upload your tracking logs somewhere (like S3) accessible to the Hadoop cluster you'll create.
Tracking logs, in recent releases of edx-platform, are typically located on the app server at /edx/var/log/tracking/tracking.log-+%Y%m%d-%s. At Stanford (and edX), the tracking logs from all our app servers get synced up to a single bucket in S3 (Stanford uses rsync). Whether the logs are pushed by the app servers or periodically synced by some other process, make sure there are no duplicate or missing tracking log files in this bucket, as that would affect the statistical calculations.
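One way to audit the bucket for duplicate or missing files is sketched below. It works on a plain listing of object keys and checksums, however you obtain it. The assumptions are mine, not the article's: that rotated files are named like tracking.log-20150101-1420070400 (optionally gzipped), that each day should produce at least one file, and that two keys with the same checksum hold the same content.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed name of a rotated log: tracking.log-YYYYMMDD-<epoch seconds>[.gz]
LOG_NAME = re.compile(r"tracking\.log-(\d{8})-\d+(?:\.gz)?$")


def find_duplicates(objects):
    """Given (key, checksum) pairs, return groups of keys sharing a checksum."""
    by_checksum = defaultdict(list)
    for key, checksum in objects:
        by_checksum[checksum].append(key)
    return sorted(sorted(keys) for keys in by_checksum.values() if len(keys) > 1)


def find_missing_days(keys, first_day, last_day):
    """Return the YYYYMMDD days in [first_day, last_day] that have no log file."""
    seen = set()
    for key in keys:
        match = LOG_NAME.search(key)
        if match:
            seen.add(match.group(1))
    day = datetime.strptime(first_day, "%Y%m%d")
    end = datetime.strptime(last_day, "%Y%m%d")
    missing = []
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        if stamp not in seen:
            missing.append(stamp)
        day += timedelta(days=1)
    return missing
```

If several app servers write to the same bucket, it makes sense to run find_missing_days once per server prefix, since one server's files would otherwise hide another server's gap.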
Launch an Elastic MapReduce (EMR) Hadoop cluster.
Stanford keeps a long-running cluster around (1 m3.medium master node and 1 m3.medium core node) and sizes the number of task instances up or down with each task run. The article on creating an EMR cluster has more details.
Note that this is somewhat different from edx.org, which provisions a new EMR cluster with every task run, using a custom Ansible module driven by a shell script. Consult the edx-analytics-configuration repo if you are interested in this workflow.
Set up the analytics report, API application, and dashboard application MySQL databases.
It's pretty much standard RDS, but make sure your RDS security groups for the reports database (written to by the code in edx-analytics-pipeline and read by the code in edx-analytics-data-api) allow access by all the master and slave cluster machines (there are security groups associated with EMR-master and EMR-slave that were created for us when we launched an EMR cluster) and by all the data API servers. The data API and dashboard (edx-analytics-dashboard) Django apps also need databases to function, and we just use the same DB server for these 3 databases.
Set up the task scheduler
The reports DB is filled periodically by the Luigi tasks, so a scheduler is needed. We set up a Jenkins box because it provides a nice interface that allows us to schedule jobs periodically (and to view the console output) but also to run them on demand. We did a vanilla sudo apt-get install jenkins on an Ubuntu server. However, edx-analytics-pipeline needs to be checked out and installed on this Jenkins box, because the executable Python script remote-task supplied by the install is what kicks off the Luigi tasks on the EMR cluster.
Set up the tasks themselves, and provide all the parameters and the sundry things needed by the tasks
Task parameters can be supplied in 3 ways: on the command line of the remote-task command; via an overrides.cfg file that lives on the file system of the scheduler Jenkins box and is pointed to by a command-line parameter to remote-task (this is what Stanford does currently); or in an override.cfg kept in another repo, with the repo location supplied by yet another command-line parameter to remote-task.
Sundry things are mainly kept in S3, like MySQL credentials files for the reports database or .jar libraries needed by various tasks.
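Luigi reads INI-style configuration files, so the override file mentioned above can be thought of as a second file layered over the pipeline's defaults. The sketch below illustrates that layering with Python's configparser. The section name, keys, and values are hypothetical, and the "override wins" precedence is an assumption about how such a file is typically applied rather than a description of what remote-task does internally.

```python
import configparser

# Hypothetical defaults, standing in for the config shipped with the pipeline.
DEFAULTS = """
[hypothetical-task]
output_root = s3://example-bucket/output/
n_reduce_tasks = 4
"""

# Hypothetical override file kept on the scheduler box or in a separate repo.
OVERRIDES = """
[hypothetical-task]
n_reduce_tasks = 12
"""


def load_config(default_text, override_text):
    """Parse the defaults, then the overrides; repeated keys take the later value."""
    parser = configparser.ConfigParser()
    parser.read_string(default_text)
    parser.read_string(override_text)
    return parser


config = load_config(DEFAULTS, OVERRIDES)
# Overridden key comes from OVERRIDES; untouched key keeps its default.
reduce_tasks = config.getint("hypothetical-task", "n_reduce_tasks")
output_root = config.get("hypothetical-task", "output_root")
```

Keeping only the values that differ from the defaults in the override file makes it easier to see what is specific to your installation.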
Set up the API app servers (from repo: https://github.com/edx/edx-analytics-data-api)
Once you're able to launch tasks, have them run to completion, and confirm there's data in your reports MySQL DB, you need to set up the data API application servers to serve that data from the reports MySQL DB over a REST API. There are Ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-api) for this, and even a playbook which runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-api.yml), so you don't need to do much except edit the vars files used by the playbook.
The data API app has a self-documenting front page (https:///docs/) that you can use to test that the data is being served correctly.
Set up the Insights app servers (from repo: https://github.com/edx/edx-analytics-dashboard)
Once you confirm the data API is serving up data over REST, you can set up the Insights (dashboard) app, which is responsible for the UX/presentation of the analytics data. This app does not directly interact with the reports database; rather, it makes REST calls to the data API and interprets/displays the JSON returned.
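The request-and-render pattern described above can be sketched in a few lines. This is not Insights code: the function names, the token-style Authorization header, and the record fields used in the usage note are assumptions made for illustration, and the actual endpoints and response shapes should be taken from the data API's own documentation page.

```python
import json
import urllib.request


def fetch_json(url, token, timeout=10):
    """GET a URL with a token header and decode the JSON body.

    The "Token <key>" header format is an assumption about how the
    API is secured; adjust it to match your deployment.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": "Token " + token}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def to_table_rows(records, columns):
    """Turn a list of JSON objects into rows ordered by the given columns.

    Missing fields become None so that the table stays rectangular.
    """
    return [[record.get(column) for column in columns] for record in records]
```

For example, given records with hypothetical fields interval_end and any, to_table_rows(records, ["interval_end", "any"]) yields one row per record, ready to be handed to a template or chart.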
There are Ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-insights) for this, and even a playbook which runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-insights.yml), so you don't need to do much except edit the vars files used by the playbook.
Configure the OpenID Connect (OAuth2) parameters between Insights and edx-platform
The Insights app relies on the edx-platform instance for its authentication/authorization to create a more integrated user experience. In particular, when a user visits the Insights app, the app uses the OpenID Connect protocol to seamlessly create an Insights account that's linked with the user's edx-platform account. The user's course staff privileges are also propagated from edx-platform to Insights, so that users can see analytics data for courses in which they have staff privileges.
This means some configuration is required in edx-platform to add Insights as an OpenID Connect client, and that configuration needs to be in sync with the configuration in the Insights app. See the article for details.
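Because the two sides must agree, a quick comparison before deploying can catch a mistyped value. The sketch below assumes you have copied the relevant values into two dictionaries by hand; the key names (client_id, client_secret, oidc_key, oidc_secret) are hypothetical labels chosen for this example, not the actual setting names in either application.

```python
# Pairs of (edx-platform field, Insights field) that are expected to match.
# Both sets of names are hypothetical labels for this example.
EXPECTED_MATCHES = [
    ("client_id", "oidc_key"),
    ("client_secret", "oidc_secret"),
]


def find_mismatches(platform_client, insights_settings):
    """Return the (platform field, Insights field) pairs whose values differ."""
    mismatches = []
    for platform_field, insights_field in EXPECTED_MATCHES:
        platform_value = platform_client.get(platform_field)
        insights_value = insights_settings.get(insights_field)
        # A missing value on either side also counts as a mismatch.
        if platform_value is None or platform_value != insights_value:
            mismatches.append((platform_field, insights_field))
    return mismatches
```

An empty result means the copied values agree; it does not prove that the login flow works, which still needs to be tested in a browser.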