R script for automatic processing of high-throughput sequencing (RNA-SEQ) data

Source: Internet
Author: User

Feedback method:

    1. Any errors in this article, please correct in the message, you can also send e-mail to [email protected], welcome to communicate;
    2. For any suggestions on the new features, you can also follow the previous step to communicate;

Where this procedure is to be improved:

    1. Think, while running the program, the program will copy itself to the output folder for backup (Current_file_path_getter), but the function is very poor portability, temporarily do not recognize the R-cmd run the script, but through the source ("") and "R-- File= "way to run without problems, but also please have a better way of the cattle, thank you for the first;
    2. It is expected that after each step, the script will automatically send an email to the specified mailbox alert or inform the user; However, the "SENDMAILR" package is less likely to be used in the study;
    3. The script, written smelly long smelly long, many places can be more concise, the reason is so long, mainly personal look at the clear;

Step Description:

    1. This script can be used free and freely;
    2. This script is currently only used for single-ended sequencing;

How to use:

  1. Install the necessary software, this should not be a problem, mainly have TOPHAT,CUFFLINKS,FASTQC, etc.
  2. Create a new working directory and set the "working_directory" of footsteps to the new working directory;
  3. Under working directory, create a new "input_files" folder;
  4. Under the "Input_files" folder, create a new "Sequenced_reads" folder. And will only put all the sequencing data files into the folder;
  5. Based on the sample species, in the "Input_files" folder, create a new name for the species folder (example "Homo_sapiens"), while the script "Specie_name" is set to the name of the species.
  6. Under the name of the species folder (homo_sapiens), create a new folder to hold the Bowtie2 index file "Bowtie2index" and the "genes" folder for storing the gene annotations;
  7. Download reference files and notes for the corresponding species of interest (Reference sequences and Annotations) to the Igenomes home page. The corresponding files in the Igenomes compressed package are placed in the file created in the previous step.
  8. Run the script according to the system you are running and select the appropriate method;
  9. After two minutes of operation, the script will generate "output_files_xxxxxxxx_xxxxxx" folder in the working directory, go to the folder to find "Accepted_hits_bam.txt" file, the following steps to edit (considering the speed of different machines, The best time to edit the file is that the "Tophat_output" folder under the "Output_files_xxxxxxxx_xxxxxx" folder is not empty):
    1. Insert a new line at the beginning of the file, enter: "sample_id group_label" (note, does not include quotation marks, the Middle tab tab is divided;
    2. After the other lines, add a tab tab, and then enter the label (label, for grouping) of the corresponding sample for the row, using the same label for the different repeats in the group;
    3. Save the file.
  10. Wait for the machine to run and finish, just fine;
  11. downstream analysis using Cummerbund;

The code is as follows:

#! /data02/software/r/bin/r## Created by Roger Young, 2015-02-05## nohup/data02/software/r/bin/r--file=~/mrna_workflow. R &>./nohup.log## nohup/data02/software/r/bin/r CMD BATCH ~/mrna_workflow. R &>/nohup.log##-------Those lines should be changed accordingly-----------------------------# # under normal circumstances, You need to change the following lines as needed: Output_folder and output_folder;## directory: The absolute path representing a directory ending with "/", unless otherwise stated! # # Foler: Represents the file name of a folder, ending with "/", unless otherwise stated! # # Path: Represents the absolute path of a file, unless otherwise stated! Debug_flag <-false;## R path:/data02/software/r/bin/r## source ("~/documents/mrna_profiles/mrna_workflow. R ") working_directory <-" ~/documents/mrna_profiles/"; # # If you test under a Linux system, modify it as appropriate. # # C:\Program Files\r\r-3.1.2\bin\i386\r.exe--file=~/mrna_workflow. Rif (debug_flag) {working_directory <-"d:/roger/data/mrna/";} Output_folder <-paste ("Output_files_", Format (Sys.time (), "%y%m%d_%i%m%s"), "/", sep= ""); output_d Irectory <-paste (working_directory, output_folder, Sep = "");d ir.create (path = Output_direcTory, recursive = TRUE, showwarnings = FALSE); # #-------The software catalog used, Need to change according to specific equipment------------------------------------------------fastqc_path <-"~/BIN/FASTQC/FASTQC"; tophat_app_path <-"Tophat", Cufflinks_app_path <-"cufflinks"; Cuffmerge_app_path <-"Cuffmerge"; Cuffdiff_app_path <-" Cuffdiff "; # Parameters used by all those softwares!threads_params <-"-P ";d ebug_flag <-" Debug flag: "; # #--- ----output folder, generally do not need to modify----------------------------------------------------log_folder <-"log/"; Log_directory <- Paste (output_directory, log_folder,sep = "");d ir.create (path = log_directory, recursive = TRUE, showwarnings = FALSE); sin K (file = paste (log_directory, "Log_", Format (Sys.time (), "%y%m%d_%i%m%s"), ". Log", sep= ""), split = TRUE); # #-----Get the path to this script and save it in the output folder (note, you still need to modify to ensure portability)--------------Current_file_path_getter <-function () {Cmdargs        <-Commandargs (trailingonly = FALSE);     Print (Paste ("Cmdargs:", Paste (Commandargs (), collapse = "")));   R_cmd_indicator <-"[[: Blank:]]cmd[[:blank:]]";        cmd_mode_matched <-grep (pattern = r_cmd_indicator, x = Cmdargs);                if (length (cmd_mode_matched) > 0) {}else{file_arg_indicator <-"--file=";                Matched <-grep (pattern = file_arg_indicator, x = Cmdargs);                                      if (length (matched) > 0) {# Rscript return (Normalizepath (the sub (pattern = File_arg_indicator,        Replacement = "", x = cmdargs[matched]));        } else {# ' source ' d via R console return (Normalizepath (Sys.frames () [[1]] $ofile)); }}}current_script_path <-Current_file_path_getter ();p rint (Paste ("Current_script_path:", Current_script_path)) ; File.Copy (from = Current_file_path_getter (), to = Paste (Output_directory, "/rscript.r", Sep = ""), OVERWR Ite = true, Copy.mode = true, Copy.date = TRUE); # #-------Define input files and folders-----------------------------------------------input_folder <-"input_files/"; input_ Directory <-paste (working_directory, input_folder, Sep = ""); Sequenced_reads_folder <-"Sequenced_reads"; # # no "/" Sequenced_reads_directory <-paste (input_directory, Sequenced_reads_folder, Sep = ""); # #-------Genome References-----------------------------------------------------specie_name <-"Homo_ sapiens/", Annotation_file_name <-" GENES.GTF "; Annotation_file_folder <-" genes/"; annotation_index_directory <-paste (input_directory, Specie_name, annotation_file_folder, Sep = ""); annotatio N_index_path <-paste (annotation_index_directory, annotation_file_name, Sep = ""); bowtie_i Ndex_name <-"Genome"; Bowtie_index_folder <-"bowtie2index/"; bowtie_index_directory <-paste (input_ Directory, Specie_name, Bowtie_index_folder , Sep = ""); Bowtie_index_path <-paste (Bowtie_index_directory, bowtie_index_name, Sep = ""); #-------to confirm that the dependent package is properly installed------------------------------------------check_bioconductor_pkg <-function (pkgs) {if (R    Equire (pkgs, character.only = TRUE)) {print (Paste ("******", pkgs, "is loaded correctly", "******"));        } else {print (paste ("******trying to Install", pkgs, "******"));        # # Update R packages Print ("******update r packages******");                Update.packages (ask = FALSE);        Source ("Http://bioconductor.org/biocLite.R");        Bioclite ();         # # Install desired Packages bioclite (PKGS); # # Re-check if (Require (pkgs, character.only = TRUE)) {print (Paste ("******", pkgs, "Installed and loaded!"        , "******"));        } else {Stop (paste ("??????", "Error:could not install", pkgs, "??????")); }}} check_pkg <-function (pkgs) {if (Require (PKGS, CHaracter.only = TRUE) {print (Paste ("******", pkgs, "is loaded correctly", "******"));        } else {print (paste ("******trying to Install", pkgs, "******"));        # # # Update R packages Print ("******update r packages******");        Update.packages (ask = FALSE);        Install.packages (PKGS); # # Re-check if (Require (pkgs, character.only = TRUE)) {print (Paste ("******", pkgs, "Installed and loaded!"        , "******")); } else {Stop (paste ("??????????", "Error:could not install", pkgs, "??????????")        ); }}} # # (Check_pkg ("SENDMAILR")); # # (Check_bioconductor_pkg ("cummerbund"); # #------- Gets the list of sequenced files required for processing------------------------------------------(sequenced_files <-list.files (path = Sequenced_reads_ Directory, pattern = "*[^ (. MD5) | (.                        TXT)]$ ", All.files = True, Full.names = True,       recursive = TRUE); # #-------Sequencing quality assessment--------------------------------(fastqc_output_folder <-"fastqc_output/");(F Astqc_output_directory <-paste (output_directory, fastqc_output_folder,sep = "");(d IR . Create (Path = fastqc_output_directory, recursive = TRUE, showwarnings = FALSE));(sequenced_file_lists <-sequenced_    files);p rint (Paste ("sequenced_file_lists:", sequenced_file_lists)); for (files_to_test in sequenced_file_lists) {    Print (Paste ("Files_to_test:", files_to_test));    Fastqc_params <-"--outdir=";    Fastqc_output_directory;                                              # # Tedious lines for output folder construction temp_output_directory_dot_repaced <-gsub (pattern = "\ \",        Replacement = "_", x = files_to_test);                                                    Fastqc_output_directory <-gsub (pattern = paste (Input_folder,                   Sequenced_reads_folder,                                  "/", sep= ""), # # Specical replacement = paste (output_                                                        folder, Fastqc_output_folder,        Sep = ""), x = temp_output_directory_dot_repaced);        Print (Paste (Debug_flag, "fastqc_output_directory:", fastqc_output_directory));        Dir.create (fastqc_output_directory, recursive = TRUE); # # Build a command to run FASTQC fastqc_cmd <-paste (Fastqc_path, files_to_test, paste (fastqc_params, fast    Qc_output_directory, Sep = ""), Sep = "");    Print (Paste (Debug_flag, "Fasqc_cmd:", Fastqc_cmd));    # #质量评估与后续部分无联系, so run in the background to improve speed! if (!        Debug_flag) {print ("Running fastqc!")     System (Command = fastqc_cmd, wait = FALSE);    } else {print ("in Debug mode!") }}##-------Align the rna-seq reads toThe Genomes--------------------------------# # Map The reads for each sample to the reference genome (Tophat_output_folder & lt;-"tophat_output/");(tophat_output_directory <-paste (output_directory, tophat_out Put_folder, Sep = ""));(d ir.create (path = tophat_output_directory, recursive = TRUE, showwarnings = FALSE));(Cufflinks_ou                                      Tput_folder <-"cufflinks_output/");(cufflinks_output_directory <-paste (output_directory, Cufflinks_output_folder, Sep = ""));(d ir.create (path = cufflinks_output_directory, recursive = TRUE, Showwarni NGS = FALSE));(sequenced_file_lists <-sequenced_files);(cuffmerge_assemblies_file <-"Assemblies.txt") (                                          Cuffmerge_assemblies_file_text <-"");(cuffmerge_assemblies_file_path <-paste (output_directory, Cuffmerge_assemblies_file, Sep = ""));(Accepted_hits_bam_fil E <-"Accepted_hits_bam.txt") (AccePted_hits_bam_file_text <-"");(accepted_hits_bam_file_path <-paste (output_directory, Accepted_hits_bam_file, Sep = "")); # # This section is used to build Cuffdiff ACCEPTED_HITS_BAM_ File_path file; # # This section will need to modify the Accepted_hits_bam_file_path file as required after it finishes running.        For (Sequenced_read-sequenced_file_lists) {print (Paste (Debug_flag, "The file to process:", sequenced_read)); temp_output_directory_dot_repaced <-gsub (pattern = "\ \", Replacemen        T = "_", x = Sequenced_read); Tophat_output_directory <-gsub (pattern = paste (Input_folder, Sequen                                    Ced_reads_folder, "/", sep= ""), # # Specical        Replacement = Paste (Output_folder, Tophat_output_folder,                                                Sep = ""), X = Temp_output_directory_d        ot_repaced); Accepted_hits_bam_file_text <-paste (accepted_hits_bam_file_text, tophat_output _directory, "/", "Accepted_hits.bam", "\ n"        , Sep = "");                                                       Cufflinks_output_directory <-gsub (pattern = paste (Input_folder,                                       Sequenced_reads_folder, "/", sep= ""), # # Specical Replacement = Paste (Output_folder, cufflink                                       S_output_folder, Sep = ""),,        x = temp_output_directory_dot_repaced); Cuffmerge_assembLies_file_text <-paste (cuffmerge_assemblies_file_text, Cufflinks_output_dir Ectory, "/", "TRANSCRIPTS.GTF", "\ n"    , Sep = "");       }write (cuffmerge_assemblies_file_text, file = Cuffmerge_assemblies_file_path); write (Accepted_hits_bam_file_text, FILE = Accepted_hits_bam_file_path); accepted_hits_bam_file_modify_time <-File.info (accepted_hits_bam_file_path # # to send a message to the user informing Accepted_hits_bam_file_path that the file has been established, please modify it; print (Paste ("* * * File:", Accepted_hits_bam_file_path)); Print (Paste ("First, insert the following line (seperated by tab): \" sample_id group_label\ ";")); Print ("The, Append each and lines with a tab, followed by the sample's label (group)") print ("Finally, Save the file!!") number_of_files_had_processed <-0; # #length (sequenced_file_lists);p rint (Paste (Debug_flag, "number_of_files_had_processed:", number_of_files_had_processed)); for (Sequenced_read in sequenced_file_lists) {print (Paste (Debug_flag," the file T        O Process: ", sequenced_read)); temp_output_directory_dot_repaced <-gsub (pattern = "\ \", replacement = "        _ ", x = Sequenced_read); Tophat_output_directory <-gsub (pattern = paste (Input_folder, Sequen                                    Ced_reads_folder, "/", sep= ""), # # Specical                                                        Replacement = Paste (Output_folder, Tophat_output_folder, Sep = ""), X = Temp_output_direc        tory_dot_repaced);        Print (Paste (debug_flag, "Tophat output folder:", Tophat_output_directory)); (dir.create (Path = Tophat_output_directory, recursive = TRUE, showwarnings = FALSE); # #构建运行tophat的命令 (tophat_cmd <-paste (Tophat_app_path, Threads_params, "-G", Annotation_index _path, "-O", Tophat_output_directory, Bowtie_ind    Ex_path, sequenced_read, Sep = ""));        Print (Paste (Debug_flag, "Tophat_cmd:", Tophat_cmd)); if (!    Debug_flag) {System (Command = tophat_cmd, wait = TRUE);                                                       # # # Build the command to run cufflinks cufflinks_output_directory <-gsub (pattern = paste (Input_folder, Sequenced_reads_folder, "/", SE                                                           P= ""), # # Specical replacement = paste (Output_folder,                       Cufflinks_output_folder, Sep = ""),,                 x = temp_output_directory_dot_repaced); Print (Paste (debug_flag, "Cufflinks output directory", Cufflinks_output_directory)) (dir.create (Path = Cufflinks_out        Put_directory, recursive = TRUE, showwarnings = FALSE)); (Cufflinks_cmd <-paste (Cufflinks_app_path, Threads_params, "-O", cufflinks_output_directory  , Paste (Tophat_output_directory, "/", "Accepted_hits.bam", Sep    = ""), Sep = ""));        Print (Paste (Debug_flag, "Cufflinks_cmd:", Cufflinks_cmd));    # #同步运行cufflinks!    number_of_files_had_processed <-number_of_files_had_processed + 1;        Print (Paste (Debug_flag, "Number of files had been handled:", number_of_files_had_processed)); if (!    Debug_flag) {System (Command = cufflinks_cmd, wait = TRUE); }}warnings () # #-------Run cuffmerge to create a single merged transcriptome annotation------(cuffmerge_output_folder <-"cuffmerge_output"); # #没有 "/" (Cuffmerge_output_directory <-paste (output_directory, cuffmerge_output_fol Der, Sep = ""));d ir.create (path = cuffmerge_output_directory, recursive = TRUE, showwarnings = FALSE); Cuffmerge_cmd <-                       Paste (Cuffmerge_app_path, "--REF-GTF", Annotation_index_path, "-O", Cuffmerge_output_directory,                                               "--ref-sequence", Paste (bowtie_index_directory, "Genome.fa",                       Sep = ""), Threads_params, Cuffmerge_assemblies_file_path, Sep = "");p rint (Paste (Debug_flag, "Cuffmerge_cmd:", Cuffmerge_cmd)); Debug_flag) {System (Command = cuffmerge_cmd, wait = TRUE);} # #-------Identify differentially expressed genes and transcripts------(Cuffdiff_output_folder <-"cuffdiff_output/ ");(cuffdiff_output_directory <-paste (output_directory, Cuffdiff_ouTput_folder, Sep = ""));(d ir.create (path = cuffdiff_output_directory, recursive = TRUE, showwarnings = FALSE)); cuffdiff_c                      md <-Paste (cuffdiff_app_path, "-use-sample-sheet", "-O", Cuffdiff_output_directory, "-B", Paste (bowtie_index_directory, "genome.fa", Sep = ""), T                      Hreads_params, "-U", Paste (Cuffmerge_output_directory, "/", "MERGED.GTF", sep= ""), Accepted_hits_bam_file_path, sep= "") Print (Paste (Debug_flag, "Cuffdiff_cmd:", cuffdiff_cmd, Sep = "") if (Debug_flag) {sys.sleep (time = 20);} # # Here, we think the user has correctly modified the file if the Accepted_hits_bam_file_path refers to (identical (File.info (Accepted_hits_bam_file_path), accep    Ted_hits_bam_file_modify_time) {print (Paste ("This file stays unchanged:", Accepted_hits_bam_file_path, Sep = "")); Print ("Cuffdiff is skipped!    Please run it manually!! "); Print (Paste ("The cmd for running Cuffdiff is: ", cuffdiff_cmd, Sep =") ");}    else {print ("Cuffdiff_cmd is running"); if (!    Debug_flag) {System (Command = cuffdiff_cmd, wait = TRUE,); }}print ("done!"); Sink ();p rint ("done!"); if (! Debug_flag) {Quit (save = "no");}

  

Complete!

R script for automatic processing of high-throughput sequencing (RNA-SEQ) data

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.