Optimizing RandomAccessFile Read Performance

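Background

Our company's log collector is one I wrote myself rather than an open-source product. Log collection may be a small feature, but doing it well is not easy. A log collector should be stable, controllable, and use as few resources as possible. Because it is deployed on the same servers as the core business services, it is clearly unacceptable for it to drive CPU and memory usage up or to generate heavy disk I/O while it works. It is only an auxiliary function and must not hurt the services it supports.

In production I found that some services produce so many logs that the collector could not keep up, and log reporting lagged noticeably. After optimizing my own code, I wanted to see whether the JDK code I depend on also left room for optimization.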

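Why RandomAccessFile Is Slow

I mainly use RandomAccessFile's readLine() method. Looking at the source, readLine() calls read(), which in turn calls read0(). read0() is a native method, so its implementation is not visible from the Java source, but the Javadoc on read() says that it reads only one byte per call.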

/**
 * Reads a byte of data from this file. The byte is returned as an
 * integer in the range 0 to 255 ({@code 0x00-0x0ff}). This
 * method blocks if no input is yet available.
 * <p>
 * Although {@code RandomAccessFile} is not a subclass of
 * {@code InputStream}, this method behaves in exactly the same
 * way as the {@link InputStream#read()} method of
 * {@code InputStream}.
 *
 * @return     the next byte of data, or {@code -1} if the end of the
 *             file has been reached.
 * @exception  IOException  if an I/O error occurs. Not thrown if
 *                          end-of-file has been reached.
 */
public int read() throws IOException {
    return read0();
}

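As we all know, I/O involves system calls, because operations on I/O devices happen in kernel mode. Every switch between user mode and kernel mode has a cost, and frequent switching adds up to a large overhead. Reading one byte at a time is therefore very slow. A user-space buffer helps: read many bytes into a buffer in one call so that far fewer I/O calls are needed.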
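To make the difference concrete, here is a minimal sketch that counts newline bytes in a file twice: once with one read() call per byte, and once with read(byte[]) filling a buffer. The class name ReadCallCounter, its methods, and the 8192-byte buffer size are my own illustrative choices, not part of the collector. It counts Java-level read calls, on the assumption (true of typical JDK implementations) that each call results in one native read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class ReadCallCounter {

    /** Reads one byte per read() call. Returns {newline count, number of read calls}. */
    static long[] byteAtATime(Path file) throws IOException {
        long lines = 0, calls = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            while (true) {
                int b = raf.read();
                calls++;
                if (b == -1) break;        // end of file
                if (b == '\n') lines++;
            }
        }
        return new long[] {lines, calls};
    }

    /** Reads up to bufferSize bytes per read(byte[]) call. Returns {newline count, number of read calls}. */
    static long[] blockAtATime(Path file, int bufferSize) throws IOException {
        long lines = 0, calls = 0;
        byte[] buf = new byte[bufferSize];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            while (true) {
                int n = raf.read(buf);
                calls++;
                if (n == -1) break;        // end of file
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') lines++;
                }
            }
        }
        return new long[] {lines, calls};
    }

    /** Runs both strategies against a temporary 1000-line file. */
    static long[][] demo() {
        try {
            Path tmp = Files.createTempFile("read-demo", ".log");
            try {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 1000; i++) sb.append("line ").append(i).append('\n');
                Files.write(tmp, sb.toString().getBytes(StandardCharsets.US_ASCII));
                return new long[][] {byteAtATime(tmp), blockAtATime(tmp, 8192)};
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[][] r = demo();
        System.out.println("byte-at-a-time:  lines=" + r[0][0] + ", read calls=" + r[0][1]);
        System.out.println("block-at-a-time: lines=" + r[1][0] + ", read calls=" + r[1][1]);
    }
}
```

Both strategies see the same data; the only thing that changes is how many times the program has to cross into the kernel to get it.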

RandomAccessFile + Buffer

Let's start with a benchmark: read a 6 MB log file line by line and compare RandomAccessFile.readLine() with BufferedReader.readLine(). The test code is as follows:

// PATH is a constant that points to the 6 MB test log file (defined elsewhere).
public static void randomAccessFile() {
    try (RandomAccessFile raf = new RandomAccessFile(PATH, "r")) {
        int count = 0;
        long beginning = System.currentTimeMillis();
        while (raf.readLine() != null) {
            count++;
        }
        long end = System.currentTimeMillis();
        System.out.println("RandomAccessFile, line count: " + count + ", cost: " + (end - beginning) + "ms");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void bufferedReader() {
    try (BufferedReader br = new BufferedReader(new FileReader(PATH))) {
        int count = 0;
        long beginning = System.currentTimeMillis();
        while (br.readLine() != null) {
            count++;
        }
        long end = System.currentTimeMillis();
        System.out.println("BufferedReader, line count: " + count + ", cost: " + (end - beginning) + "ms");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Run	RandomAccessFile	BufferedReader
1	9580 ms	63 ms
2	9379 ms	77 ms
3	9340 ms	66 ms

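The result was striking: with a buffer, reading was more than 100 times faster!

Given that, I decided to rework RandomAccessFile with a buffer. First, we define an abstract class whose abstract methods are the RandomAccessFile methods we need.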

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public abstract class LogReader {

    /**
     * OS default character encoding: System.getProperties().get("sun.jnu.encoding");
     * OS file character encoding: System.getProperties().get("file.encoding");
     * JVM default charset: Charset.defaultCharset();
     */
    public static final String DEFAULT_CHARSET = Charset.defaultCharset().name();

    protected String charsetName;

    // Declared public so that subclasses in other packages can override them.
    public abstract long getFilePointer();

    public abstract void seek(long pos);

    public abstract String readLine();

    /**
     * Grows a byte array by half its length (by at least one byte),
     * copying the existing contents into the new array.
     *
     * @param arr the array to grow
     * @return the larger array
     */
    public byte[] grow(byte[] arr) {
        int len = arr.length;
        int half = len >> 1;
        int growSize = Math.max(half, 1);
        byte[] arrNew = new byte[len + growSize];
        System.arraycopy(arr, 0, arrNew, 0, len);
        return arrNew;
    }

    /**
     * Decodes the first arrPos bytes of arr into a String using charsetName.
     *
     * @param arr    the bytes of one line, without the line terminator
     * @param arrPos the number of valid bytes in arr
     * @return the decoded line
     */
    public String decode(byte[] arr, int arrPos) {
        if (arrPos == 0)
            return ""; // an empty line; null is reserved for end of file
        try {
            return new String(arr, 0, arrPos, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}

Next, create a class BufferedLogReader that extends LogReader and implements Closeable. BufferedLogReader is essentially a buffered wrapper around RandomAccessFile. The code is as follows:

import java.io.*;
import java.nio.charset.Charset;

public class BufferedLogReader extends LogReader implements Closeable {

    public static final int DEFAULT_BUFFER_CAPACITY = 8192;

    private byte[] buffer;
    private int position;    // index of the next unread byte in buffer
    private int limit = -1;  // index of the last valid byte in buffer; -1 means the buffer is empty
    private RandomAccessFile raf;

    public BufferedLogReader(String pathName) {
        init(new File(pathName), DEFAULT_BUFFER_CAPACITY, DEFAULT_CHARSET);
    }

    public BufferedLogReader(String pathName, String charsetName) {
        init(new File(pathName), DEFAULT_BUFFER_CAPACITY, charsetName);
    }

    public BufferedLogReader(String pathName, int bufferCapacity) {
        init(new File(pathName), bufferCapacity, DEFAULT_CHARSET);
    }

    public BufferedLogReader(String pathName, int bufferCapacity, String charsetName) {
        init(new File(pathName), bufferCapacity, charsetName);
    }

    public BufferedLogReader(File file) {
        init(file, DEFAULT_BUFFER_CAPACITY, DEFAULT_CHARSET);
    }

    public BufferedLogReader(File file, String charsetName) {
        init(file, DEFAULT_BUFFER_CAPACITY, charsetName);
    }

    public BufferedLogReader(File file, int bufferCapacity) {
        init(file, bufferCapacity, DEFAULT_CHARSET);
    }

    public BufferedLogReader(File file, int bufferCapacity, String charsetName) {
        init(file, bufferCapacity, charsetName);
    }

    private void init(File file, int bufferCapacity, String charsetName) {
        if (bufferCapacity < 1)
            throw new IllegalArgumentException("bufferCapacity");
        Charset.forName(charsetName); // fail fast if the charset name is illegal or unsupported
        try {
            raf = new RandomAccessFile(file, "r");
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        }
        buffer = new byte[bufferCapacity];
        this.charsetName = charsetName;
    }

    @Override
    public long getFilePointer() {
        try {
            // raf's pointer is already past everything loaded into the buffer,
            // so subtract the bytes that are buffered but not yet consumed.
            return raf.getFilePointer() - (limit - position + 1);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void seek(long pos) {
        try {
            raf.seek(pos);
            // Discard buffered bytes: they belong to the previous position.
            position = 0;
            limit = -1;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String readLine() {
        try {
            if (position > limit) {
                if (!readMore())
                    return null; // end of file
            }
            byte[] arr = new byte[336];
            int arrPos = 0;
            while (position <= limit) {
                byte b = buffer[position++];
                switch (b) {
                    case 10: // '\n': Unix or Linux line separator
                        return decode(arr, arrPos);
                    case 13: // '\r': classic Mac separator, or the first half of Windows "\r\n"
                        if (position > limit) {
                            if (readMore())
                                judgeMacOrWindows();
                        } else
                            judgeMacOrWindows();
                        return decode(arr, arrPos);
                    default: // not a line separator
                        if (arrPos >= arr.length)
                            arr = grow(arr);
                        arr[arrPos++] = b;
                        if (position > limit) {
                            if (!readMore())
                                return decode(arr, arrPos); // last line has no terminator
                        }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return null;
    }

    /** After a '\r', consumes the following '\n' if there is one (Windows "\r\n"). */
    private void judgeMacOrWindows() {
        byte b1 = buffer[position++];
        if (b1 != 10) // classic Mac line separator: the byte belongs to the next line
            position--;
    }

    /** Refills the buffer from the file. Returns false at end of file. */
    private boolean readMore() throws IOException {
        int n = raf.read(buffer);
        position = 0;
        limit = Math.max(n, 0) - 1; // read() returns -1 at end of file; keep limit at -1 so the buffer counts as empty
        return limit >= 0;
    }

    @Override
    public void close() {
        try {
            raf.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

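RandomAccessFile's own readLine() is quite limited. It converts each byte into a char by using the byte as the lower eight bits and setting the upper eight bits to zero, so, as its source and Javadoc make clear, it does not support the full Unicode character set. My readLine() collects the raw bytes of a line and decodes them with the configured charset, so it works with any charset the JVM supports in which the bytes 10 and 13 appear only as line terminators (UTF-8 and GBK, for example, but not UTF-16).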
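The limitation is easy to reproduce. The sketch below writes one UTF-8 line containing Chinese characters, reads it back with RandomAccessFile.readLine(), and then recovers the text by turning the chars back into bytes with ISO-8859-1 and decoding those bytes as UTF-8. The class name ReadLineEncodingDemo and the sample text are illustrative assumptions; the recovery step relies only on the byte-to-char mapping described in the readLine() Javadoc.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class for illustration only.
final class ReadLineEncodingDemo {

    /** Returns {line as produced by readLine(), same line re-decoded as UTF-8}. */
    static String[] readFirstLine(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            String raw = raf.readLine();
            // readLine() put each byte into the low 8 bits of a char, which is exactly
            // what ISO-8859-1 decoding does, so encoding with ISO-8859-1 restores the bytes.
            byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
            String fixed = new String(originalBytes, StandardCharsets.UTF_8);
            return new String[] {raw, fixed};
        }
    }

    /** Writes a sample UTF-8 line to a temporary file and reads it back. */
    static String[] demo() {
        try {
            Path tmp = Files.createTempFile("readline-demo", ".log");
            try {
                // "\u65e5\u5fd7" is the Chinese word for "log"; escapes keep the source ASCII-only.
                Files.write(tmp, "\u65e5\u5fd7 log\n".getBytes(StandardCharsets.UTF_8));
                return readFirstLine(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("readLine() length:  " + r[0].length()); // one char per byte
        System.out.println("re-decoded length:  " + r[1].length());
    }
}
```

The re-decoding trick works, but it allocates an extra byte array and String per line, which is one more reason to decode from the raw bytes directly as BufferedLogReader does.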

The code is written, but how does it perform? The only way to find out is to measure it.

Run	RandomAccessFile	BufferedReader	BufferedLogReader
1	9580 ms	63 ms	96 ms
2	9379 ms	77 ms	90 ms
3	9340 ms	66 ms	77 ms

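The tests show that BufferedLogReader is in the same order of magnitude as BufferedReader, though slightly slower. The main reason is that in decode(), new String() with an explicit charset is a little slower than without one (I measured this myself). Compared with RandomAccessFile, it is still a large improvement.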

RandomAccessFile + Memory Map

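For background on memory mapping, see the article 一文搞懂内存映射(Memory Map)原理 ("Understanding How Memory Mapping Works").

In short, the read system call first copies file data from disk into kernel space and then copies it from kernel space into user space, so two data copies are needed. With memory mapping, the page-fault handler loads file data from disk into pages that the process accesses directly, so only one copy is needed. That is why memory mapping can be more efficient than the read system call. The implementation is as follows: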
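Before the full reader, the core of the technique fits in a few lines. This is a minimal sketch, assuming a file small enough for a single mapping (at most Integer.MAX_VALUE bytes); the class name MappedLineCounter is hypothetical. It maps the whole file read-only and counts newline bytes straight from the mapped buffer, without an intermediate byte array.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only.
final class MappedLineCounter {

    /** Counts '\n' bytes by scanning a read-only mapping of the whole file. */
    static long countLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) return 0;
            // map() takes (mode, start position, length in bytes).
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') lines++;
            }
            return lines;
        }
    }

    /** Counts the lines of a small temporary file. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("mmap-demo", ".log");
            // Deleting a file while it is still mapped can fail on some platforms,
            // so leave cleanup to JVM exit.
            tmp.toFile().deleteOnExit();
            Files.write(tmp, "a\nb\nc\n".getBytes(StandardCharsets.US_ASCII));
            return countLines(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lines: " + demo());
    }
}
```

Note that closing the channel does not unmap the buffer; the mapping stays valid until the buffer is garbage-collected, which is why the full implementation has an explicit cleanup step.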

import sun.misc.Cleaner; // JDK 8 only: this class was moved to an internal package in JDK 9

import java.io.*;
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.security.AccessController;
import java.security.PrivilegedAction;

public class MemoryMapLogReader extends LogReader implements Closeable {

    private int position;    // index of the next unread byte in the mapping
    private int limit = -1;  // number of bytes in the mapping
    private MappedByteBuffer buffer;
    private FileChannel channel;
    private RandomAccessFile raf;

    public MemoryMapLogReader(String pathName) {
        init(new File(pathName), DEFAULT_CHARSET);
    }

    public MemoryMapLogReader(String pathName, String charsetName) {
        init(new File(pathName), charsetName);
    }

    public MemoryMapLogReader(File file) {
        init(file, DEFAULT_CHARSET);
    }

    public MemoryMapLogReader(File file, String charsetName) {
        init(file, charsetName);
    }

    private void init(File file, String charsetName) {
        Charset.forName(charsetName); // fail fast if the charset name is illegal or unsupported
        this.charsetName = charsetName;
        try {
            raf = new RandomAccessFile(file, "r");
            channel = raf.getChannel();
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, raf.getFilePointer(), channel.size());
            limit = buffer.limit();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public long getFilePointer() {
        try {
            return raf.getFilePointer(); // always 0: this method has no effect with memory mapping
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void seek(long pos) {
        try {
            raf.seek(pos); // has no effect on reads with memory mapping
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String readLine() {
        if (position >= limit)
            return null; // end of the mapping
        byte[] arr = new byte[336];
        int arrPos = 0;
        while (true) {
            byte b = buffer.get(position++);
            switch (b) {
                case 10: // '\n': Unix or Linux line separator
                    return decode(arr, arrPos);
                case 13: // '\r': classic Mac separator, or the first half of Windows "\r\n"
                    if (position < limit)
                        judgeMacOrWindows();
                    return decode(arr, arrPos);
                default: // not a line separator
                    if (arrPos >= arr.length)
                        arr = grow(arr);
                    arr[arrPos++] = b;
                    if (position >= limit)
                        return decode(arr, arrPos); // last line has no terminator
            }
        }
    }

    /** After a '\r', consumes the following '\n' if there is one (Windows "\r\n"). */
    private void judgeMacOrWindows() {
        byte b1 = buffer.get(position++);
        if (b1 != 10) // classic Mac line separator: the byte belongs to the next line
            position--;
    }

    @Override
    public void close() throws IOException {
        if (raf != null)
            raf.close();
        if (channel != null)
            channel.close();
        if (buffer != null)
            buffer.clear(); // does not erase any data; it only resets the buffer's position and limit (see its Javadoc)
        clean(); // this is the call that actually releases the mapping
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    private void clean() {
        AccessController.doPrivileged((PrivilegedAction) () -> {
            try {
                Method getCleanerMethod = buffer.getClass().getMethod("cleaner");
                getCleanerMethod.setAccessible(true);
                Cleaner cleaner = (Cleaner) getCleanerMethod.invoke(buffer, new Object[0]);
                cleaner.clean();
            } catch (Exception e) {
                e.printStackTrace();
            }
            return null;
        });
    }
}

Again, practice is the only real test. The measurements are as follows:

Run	RandomAccessFile	BufferedReader	BufferedLogReader	MemoryMapLogReader
1	9580 ms	63 ms	96 ms	72 ms
2	9379 ms	77 ms	90 ms	59 ms
3	9340 ms	66 ms	77 ms	45 ms

In these results, memory mapping is slightly faster than the buffered reader, and I would expect its advantage to become more pronounced as files get larger.

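That said, MemoryMapLogReader does not suit my use case. A log file is not static; it is appended to continuously, which means I would have to keep calling channel.map(FileChannel.MapMode.READ_ONLY, raf.getFilePointer(), channel.size()) to map the new content (with the last argument adjusted to the number of new bytes, since it is a length rather than an end offset). Each mapping would cover only a small amount of data, and frequent remapping may add overhead as well as code complexity.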
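To show what that remapping would involve, here is a rough sketch, not something I run in production. The class TailMapper and its methods are hypothetical. It keeps the offset of the last byte already processed and, on each pass, maps only the region between that offset and the current end of the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only.
final class TailMapper {

    /** Maps the bytes in [offset, current file size). Returns null if nothing new was appended. */
    static MappedByteBuffer mapNewBytes(FileChannel ch, long offset) throws IOException {
        long size = ch.size();
        if (size <= offset) return null;
        // The third argument is the length of the region, not its end position.
        return ch.map(FileChannel.MapMode.READ_ONLY, offset, size - offset);
    }

    /** Copies the remaining bytes out of the buffer and decodes them as UTF-8. */
    static String drain(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reads a file, appends to it, then reads only the appended part. */
    static String[] demo() {
        try {
            Path tmp = Files.createTempFile("tail-demo", ".log");
            tmp.toFile().deleteOnExit(); // deleting a mapped file can fail on some platforms
            Files.write(tmp, "first\n".getBytes(StandardCharsets.UTF_8));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                MappedByteBuffer first = mapNewBytes(ch, 0);
                long offset = first.remaining();      // bytes consumed so far
                String a = drain(first);

                // Simulate the application appending another log line.
                Files.write(tmp, "second\n".getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);

                MappedByteBuffer second = mapNewBytes(ch, offset);
                String b = drain(second);
                return new String[] {a, b};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.print(r[0]);
        System.out.print(r[1]);
    }
}
```

Every pass creates a new mapping that must eventually be released, so for a file that grows a few lines at a time the bookkeeping can easily outweigh the copy that mapping saves.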

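Another problem is that with memory mapping, RandomAccessFile's getFilePointer() becomes useless: it returns 0 no matter how many bytes have been read, because the reads go through the MappedByteBuffer and never move the file pointer. I could count the bytes read myself to provide this feature, but that would again add code complexity.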
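For completeness, tracking the position by hand could look roughly like this. MappedCursor is a hypothetical class, and the sketch assumes a single mapping that starts at a known file offset. It reads with the buffer's relative get(), so the buffer's own position doubles as the count of bytes consumed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only.
final class MappedCursor {

    private final ByteBuffer buffer;  // the mapped region
    private final long baseOffset;    // file offset at which the mapping starts

    MappedCursor(ByteBuffer buffer, long baseOffset) {
        this.buffer = buffer;
        this.baseOffset = baseOffset;
    }

    /** Logical file offset of the next byte to be read. */
    long getFilePointer() {
        return baseOffset + buffer.position();
    }

    /** Moves to an absolute file offset, which must lie inside the mapped region. */
    void seek(long pos) {
        buffer.position(Math.toIntExact(pos - baseOffset));
    }

    /** Returns the next byte as 0..255, or -1 at the end of the mapping. */
    int read() {
        return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
    }

    /** Returns {pointer after the first line, byte read after seek(6), pointer after that read}. */
    static long[] demo() {
        try {
            Path tmp = Files.createTempFile("cursor-demo", ".log");
            tmp.toFile().deleteOnExit(); // deleting a mapped file can fail on some platforms
            Files.write(tmp, "abc\ndef\n".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                MappedCursor c = new MappedCursor(
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()), 0);
                for (int i = 0; i < 4; i++) c.read();    // consume "abc\n"
                long afterFirstLine = c.getFilePointer();
                c.seek(6);
                int b = c.read();                         // the byte at offset 6
                return new long[] {afterFirstLine, b, c.getFilePointer()};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("pointer after first line: " + r[0]);
        System.out.println("byte at offset 6: " + (char) r[1]);
    }
}
```

It is not much code, but combined with remapping the base offset changes on every pass, and that is the extra complexity I wanted to avoid.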

In summary, for my use case I chose BufferedLogReader.