
Consensus and Profile

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that 'C' occurs in the jth position, and so on for 'G' and 'T' (see below).
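
The definition above can be sketched in a few lines of Java. This is only a minimal illustration, not the solution given later: the class name `ProfileSketch`, the method `profile`, and the three-string toy input are all hypothetical, and the sketch assumes that every string has the same length and that characters other than A, C, G, T are simply ignored.

```java
import java.util.Arrays;
import java.util.List;

public class ProfileSketch {
    // Row order follows the text: row 0 = A, row 1 = C, row 2 = G, row 3 = T.
    static final String SYMBOLS = "ACGT";

    // Builds the 4 x n profile matrix for a list of equal-length DNA strings.
    static int[][] profile(List<String> strings) {
        int n = strings.get(0).length();
        int[][] p = new int[4][n];
        for (String s : strings) {
            for (int j = 0; j < n; j++) {
                int row = SYMBOLS.indexOf(s.charAt(j));
                if (row >= 0) { // skip anything that is not A, C, G or T
                    p[row][j]++;
                }
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical toy input, not the Rosalind sample dataset.
        int[][] p = profile(Arrays.asList("ACGT", "ACGA", "TCGA"));
        for (int i = 0; i < 4; i++) {
            StringBuilder row = new StringBuilder(SYMBOLS.charAt(i) + ":");
            for (int count : p[i]) {
                row.append(' ').append(count);
            }
            System.out.println(row);
        }
    }
}
```

For the toy input, the A row comes out as `A: 2 0 0 2`, because 'A' appears twice in the first and last columns. Using a fixed `int[4][n]` array instead of a map keeps the row order stable, which matters when the rows must be printed in A, C, G, T order.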

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol at a position, leading to multiple possible consensus strings.
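
As a rough sketch of that column-wise maximum, the following Java snippet derives a consensus string from a profile matrix that has already been computed. The names `ConsensusSketch` and `consensus` are hypothetical, and the tie-breaking rule (keep the earliest symbol in A, C, G, T order) is an assumption; the problem accepts any of the tied symbols. The matrix in `main` is the one shown in the sample output of this post.

```java
public class ConsensusSketch {
    // Row order: row 0 = A, row 1 = C, row 2 = G, row 3 = T.
    static final String SYMBOLS = "ACGT";

    // Picks, for each column, the symbol whose row holds the largest count.
    // Ties are resolved in favour of the earliest row (an arbitrary choice).
    static String consensus(int[][] profile) {
        int n = profile[0].length;
        StringBuilder c = new StringBuilder(n);
        for (int j = 0; j < n; j++) {
            int best = 0;
            for (int i = 1; i < 4; i++) {
                if (profile[i][j] > profile[best][j]) {
                    best = i;
                }
            }
            c.append(SYMBOLS.charAt(best));
        }
        return c.toString();
    }

    public static void main(String[] args) {
        // Profile matrix copied from the sample output in this post.
        int[][] profile = {
            {5, 1, 0, 0, 5, 5, 0, 0}, // A
            {0, 0, 1, 4, 2, 0, 6, 1}, // C
            {1, 1, 6, 3, 0, 1, 0, 0}, // G
            {1, 5, 0, 0, 0, 1, 1, 6}, // T
        };
        System.out.println(consensus(profile)); // prints ATGCAACT
    }
}
```

Because the comparison is strict (`>`), a later row replaces the current best only when its count is strictly larger, so the result is deterministic even when several symbols tie.
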

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Sample input


Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
Sample output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

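The problem gives several nucleic acid sequences. For each base type, we output how many times that base occurs at each position, which is the profile. The base that occurs most often at each position then makes up the consensus string.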


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Consensus_and_Profile {
    public static void main(String[] args) {
        // 1. Read the FASTA file
        List<String> fasta = BufferedReader2("C:/Users/Administrator/Desktop/rosalind_cons.txt", "fasta");
        // 2. Walk through the sequences and build the consensus and the profile
        StringBuilder Consensus = new StringBuilder();
        StringBuilder ProfileA = new StringBuilder();
        StringBuilder ProfileT = new StringBuilder();
        StringBuilder ProfileG = new StringBuilder();
        StringBuilder ProfileC = new StringBuilder();
        ProfileA.append("A: ");
        ProfileT.append("T: ");
        ProfileG.append("G: ");
        ProfileC.append("C: ");
        // Index i walks over the positions (columns)
        for (int i = 0; i < fasta.get(0).length(); i++) {
            Map<String, Integer> maps = new HashMap<>();
            maps.put("A", 0);
            maps.put("T", 0);
            maps.put("C", 0);
            maps.put("G", 0);
            // Index j walks over the sequences
            for (int j = 0; j < fasta.size(); j++) {
                String key = String.valueOf(fasta.get(j).charAt(i));
                if (maps.containsKey(key)) {
                    maps.put(key, maps.get(key) + 1);
                }
            }
            // After counting, find the most frequent base at this position
            int maxvalue = 0;
            String maxKey = null;
            Set<String> keys = maps.keySet();
            for (String key : keys) {
                int value = maps.get(key);
                if (value >= maxvalue) {
                    maxvalue = value;
                    maxKey = key;
                }
                // Build the profile rows
                switch (key) {
                    case "A":
                        ProfileA.append(value + " ");
                        break;
                    case "T":
                        ProfileT.append(value + " ");
                        break;
                    case "G":
                        ProfileG.append(value + " ");
                        break;
                    case "C":
                        ProfileC.append(value + " ");
                        break;
                    default:
                        break;
                }
            }
            Consensus.append(maxKey);
        }
        System.out.println(Consensus);
        System.out.println(ProfileA);
        System.out.println(ProfileC);
        System.out.println(ProfileG);
        System.out.println(ProfileT);
    }

    public static ArrayList<String> BufferedReader2(String path, String choose) {
        // Returns a list: the header lines if choose is "tag", otherwise the sequences.
        ArrayList<String> tag = new ArrayList<>();
        ArrayList<String> fasta = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            StringBuilder sb = new StringBuilder();
            while (line != null) {
                // A FASTA header line starts with '>'
                if (line.startsWith(">")) {
                    tag.add(line);
                    // Store the sequence collected so far, joined without line breaks
                    if (sb.length() != 0) {
                        String seq = sb.toString();
                        fasta.add(seq);
                        sb.delete(0, sb.length()); // clear the StringBuilder
                    }
                } else {
                    sb.append(line); // append this line to the current sequence
                }
                // read next line
                line = reader.readLine();
            }
            // Store the last sequence in the file
            String seq = sb.toString();
            fasta.add(seq);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (choose.equals("tag")) {
            return tag;
        }
        return fasta;
    }
}

