Example (on my system the full path to the script was needed to run it):

/usr/lib/qiime/bin/split_libraries.py -m map.txt -f mtt.fna -q mtt.qual -o lib1 -r -l 150 -b variable_length -M 6

split_libraries.py

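-m the mapping file (see my earlier post for details)

-f FASTA file(s); separate multiple files with commas

-q quality (.qual) file(s); separate multiple files with commas

-r remove unassigned sequences; the flag is off by default, so they are kept. (Later QIIME 1 releases list -r as deprecated: unassigned reads are dropped unless --retain_unassigned_reads is passed.)

-l minimum sequence length, default 200; shorter sequences are discarded

-L maximum sequence length, default 1000; longer sequences are discarded

-t calculate sequence lengths after the primers and barcodes have been removed; default False (lengths are not calculated that way)

-s minimum average quality score allowed in a read, default 25; reads whose average score is below this value are discarded

-k keep the primer

-B keep the barcode

-b barcode type: hamming_8, golay_12, variable_length (will disable any barcode correction if variable_length is set), or a number giving the barcode length (e.g. -b 4 means a barcode of length 4); default golay_12

-e maximum number of barcode errors allowed, default 1.5

-c disable the search for the nearest matching barcode (barcode correction)

-a maximum number of ambiguous bases allowed, default 6

-H, --max_homopolymer maximum homopolymer run length, default 6

-M maximum number of primer mismatches allowed, default 0

-o output directory

-n sequence ID to assign to the first sequence [default: 1]

--retain_unassigned_reads keep reads that could not be assigned to a sample in the output; by default they are not kept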

-w, --qual_score_window

Enable sliding window test of quality scores. If the average score of a continuous set of w nucleotides falls below the threshold (see -s for default), the sequence is discarded. A good value would be 50. 0 (zero) means no filtering. Must pass a .qual file (see -q parameter) if this functionality is enabled. Default behavior for this function is to truncate the sequence at the beginning of the poor quality window, and test for minimal length (-l parameter) of the resulting sequence. [default: 0]
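
To make the -w / -g description concrete, here is a minimal Python sketch of a sliding-window quality check. It is not QIIME's own code: the function names `first_bad_window` and `apply_window_filter` are hypothetical, and the exact boundary handling (a window fails when its mean is strictly below the threshold) is an assumption based on the help text above.

```python
def first_bad_window(quals, window, threshold):
    """Return the start index of the first window whose mean quality is
    below `threshold`, or None if every window passes.
    Illustrative only; not taken from the QIIME source."""
    if window <= 0 or not quals:
        return None  # a window size of 0 means no window filtering
    window = min(window, len(quals))
    total = sum(quals[:window])
    for start in range(len(quals) - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            total += quals[start + window - 1] - quals[start - 1]
        if total / window < threshold:
            return start
    return None


def apply_window_filter(seq, quals, window=50, threshold=25,
                        min_len=200, discard=False):
    """Return the sequence to keep, or None if the read is dropped.
    Reads with no bad window are returned unchanged (the ordinary
    length filter is assumed to run elsewhere)."""
    bad = first_bad_window(quals, window, threshold)
    if bad is None:
        return seq
    if discard:  # analogous to -g: drop instead of truncating
        return None
    truncated = seq[:bad]  # cut at the beginning of the poor window
    return truncated if len(truncated) >= min_len else None
```

With a read of 15 bases whose last 5 scores are low, a window of 5 and a threshold of 25, the read would be cut at the start of the first failing window and then kept or dropped depending on the minimum length.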

-g, --discard_bad_windows

If the qual_score_window option (-w) is enabled, this will override the default truncation behavior and discard any sequences where a bad window is found. [default: False]

-p, --disable_primers

Disable primer usage when demultiplexing. Should be enabled for unusual circumstances, such as analyzing Sanger sequence data generated with different primers. [default: False]

-z, --reverse_primers

Enable removal of the reverse primer and any subsequent sequence from the end of each read. To enable this, there has to be a “ReversePrimer” column in the mapping file. Primers are required to be in IUPAC format and written in the 5’ to 3’ direction. Valid options are ‘disable’, ‘truncate_only’, and ‘truncate_remove’. ‘truncate_only’ will remove the primer and subsequent sequence data from the output read and will not alter output of sequences where the primer cannot be found. ‘truncate_remove’ will flag sequences where the primer cannot be found to not be written and will record the quantity of such failed sequences in the log file. [default: disable]

--reverse_primer_mismatches

Set number of allowed mismatches for reverse primers (option -z). [default: 0]

-d, --record_qual_scores

Enables recording of quality scores for all sequences that are recorded. If this option is enabled, a file named seqs_filtered.qual will be created in the output directory, and will contain the same sequence IDs in the seqs.fna file and sequence quality scores matching the bases present in the seqs.fna file. [default: False]

-i, --median_length_filtering

Disables minimum and maximum sequence length filtering, and instead calculates the median sequence length and filters the sequences based upon the number of median absolute deviations specified by this parameter. Any sequences with lengths outside the number of deviations will be removed. [default: None]
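
The median-based filter described for -i can be sketched roughly as below. This is an illustration under assumptions, not QIIME's implementation: `mad_length_bounds` and `filter_by_median_length` are hypothetical names, and treating the bounds as inclusive is my own choice.

```python
from statistics import median


def mad_length_bounds(lengths, n_deviations):
    """Return (low, high) length bounds: median +/- n_deviations * MAD,
    where MAD is the median absolute deviation of the lengths."""
    med = median(lengths)
    mad = median([abs(x - med) for x in lengths])
    return med - n_deviations * mad, med + n_deviations * mad


def filter_by_median_length(seqs, n_deviations):
    """Keep sequences whose length lies within the bounds (inclusive,
    which is an assumption of this sketch)."""
    low, high = mad_length_bounds([len(s) for s in seqs], n_deviations)
    return [s for s in seqs if low <= len(s) <= high]
```

For lengths 100, 100, 102, 98, 100, 150 and 40 the median is 100 and the median absolute deviation is 2, so a value of 3 keeps only reads between 94 and 106 bases.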

-j, --added_demultiplex_field

Use -j to add a field to use in the mapping file as an additional demultiplexing option to the barcode. All combinations of barcodes and the values in these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as “plate=R_2008_12_09”. In this case, “plate” would be the column header and “R_2008_12_09” would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as “>FLP3FBN01ELBSX”, where “FLP3FBN01” is generated from the run ID, use “-j run_prefix” and set the run prefix to be used as the data under the column header “run_prefix”. [default: None]

-x, --truncate_ambi_bases

Enable to truncate at the first “N” character encountered in the sequences. This will disable testing for ambiguous bases (-a option) [default: False]
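
The truncation that -x describes amounts to cutting each read at its first “N”. A tiny Python sketch, with a hypothetical function name and assuming upper-case sequences:

```python
def truncate_at_first_n(seq):
    """Return the part of `seq` before its first 'N'; sequences without
    an 'N' are returned unchanged. Illustrative only."""
    idx = seq.find("N")
    return seq if idx == -1 else seq[:idx]
```

A read that starts with “N” would come back empty, so in practice the minimum-length filter (-l) still decides whether the truncated read is kept.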

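Output files:

.fna (seqs.fna): the demultiplexed sequences; each sequence name includes the SampleID from the mapping file

histograms.txt: the number of sequences at each length

split_library_log.txt: a summary of the results after quality filtering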

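1. If there are several samples in several input files, as long as their barcodes in the mapping file are different, you can run:

split_libraries.py -m Mapping_File.txt -f 1.TCA.454Reads.fna,2.TCA.454Reads.fna -q 1.TCA.454Reads.qual,2.TCA.454Reads.qual -o Split_Library_Output_comma_separated/

Alternatively, merge all the sequences into one file first and then process it.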

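2. If the reads come from two separate sequencing runs that reuse the same barcodes, processing each run with the default numbering gives reads from different runs the same sequence IDs. Run the script once per run and offset the numbering of the second run:

split_libraries.py -m Mapping_File.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o Split_Library_Run1_Output/

split_libraries.py -m Mapping_File.txt -f 2.TCA.454Reads.fna -q 2.TCA.454Reads.qual -o Split_Library_Run2_Output/ -n 2000000

cat Split_Library_Run1_Output/seqs.fna Split_Library_Run2_Output/seqs.fna > Combined_seqs.fna

-n sets the starting sequence number; it should be larger than the total number of sequences written by the previous run(s).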
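
One way to pick a safe value for -n is to count the records already written by the earlier run(s). The helper below is a sketch, not part of QIIME; the function names are hypothetical, and it assumes the earlier runs were numbered from 1.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def suggest_start_number(previous_fasta_paths):
    """Return the smallest start number that is larger than the total
    number of sequences in the earlier output files (assumes those
    files were numbered starting at 1)."""
    return sum(count_fasta_records(p) for p in previous_fasta_paths) + 1
```

For example, `suggest_start_number(["Split_Library_Run1_Output/seqs.fna"])` would give a number to pass to -n for the second run; a round value well above it, such as the 2000000 used above, works just as well.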

References:

http://qiime.org/scripts/split_libraries.html

铁汉1990's blog
