Table of Contents

  • 1. Source code analysis of when job1 is triggered
    • 1. DataSource.getOrInferFileFormatSchema()
    • 2. ParquetFileFormat.inferSchema
      • 1. Simplified code
      • 2. parquetOptions.mergeSchema is false
      • 3. isParquetSchemaRespectSummaries defaults to false
      • 4. filesByType contents
    • 3. ParquetFileFormat.mergeSchemasInParallel
  • 2. Summary of the job1 call
  • 3. spark read summary

This article follows on from the previous one and focuses on the source code analysis of job1.

1. Source code analysis of when job1 is triggered

job1 is produced when the driver continues execution after job0 has finished, so let's first review how job0 came about.


DataFrameReader.load()
DataFrameReader.loadV1Source()
DataSource.resolveRelation()
DataSource.getOrInferFileFormatSchema()
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
InMemoryFileIndex.refresh0()
InMemoryFileIndex.listLeafFiles()
InMemoryFileIndex.bulkListLeafFiles()

The call that produces job1 lives in the DataSource.getOrInferFileFormatSchema() method.
Here is that method again:

1. DataSource.getOrInferFileFormatSchema()

private def getOrInferFileFormatSchema(
    format: FileFormat,
    fileStatusCache: FileStatusCache = NoopCache): (StructType, StructType) = {
  // the operations below are expensive therefore try not to do them if we don't need to, e.g.,
  // in streaming mode, we have already inferred and registered partition columns, we will
  // never have to materialize the lazy val below
  // job0: this is declared as a lazy val, so it is only initialized when it is actually used
  lazy val tempFileIndex = {
    val allPaths = caseInsensitiveOptions.get("path") ++ paths
    val hadoopConf = sparkSession.sessionState.newHadoopConf()
    val globbedPaths = allPaths.toSeq.flatMap { path =>
      val hdfsPath = new Path(path)
      val fs = hdfsPath.getFileSystem(hadoopConf)
      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
      SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
    }.toArray
    // job0: the InMemoryFileIndex object is constructed here, and this is what produces the first job
    new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
  }
  val partitionSchema = if (partitionColumns.isEmpty) {
    // Try to infer partitioning, because no DataSource in the read path provides the partitioning
    // columns properly unless it is a Hive DataSource
    // job0: the lazy tempFileIndex is used for the first time here, which forces the
    // InMemoryFileIndex to be initialized
    combineInferredAndUserSpecifiedPartitionSchema(tempFileIndex)
  } else {
    // maintain old behavior before SPARK-18510. If userSpecifiedSchema is empty used inferred
    // partitioning
    if (userSpecifiedSchema.isEmpty) {
      val inferredPartitions = tempFileIndex.partitionSchema
      inferredPartitions
    } else {
      val partitionFields = partitionColumns.map { partitionColumn =>
        userSpecifiedSchema.flatMap(_.find(c => equality(c.name, partitionColumn))).orElse {
          val inferredPartitions = tempFileIndex.partitionSchema
          val inferredOpt = inferredPartitions.find(p => equality(p.name, partitionColumn))
          if (inferredOpt.isDefined) {
            logDebug(
              s"""Type of partition column: $partitionColumn not found in specified schema
                 |for $format.
                 |User Specified Schema
                 |=====================
                 |${userSpecifiedSchema.orNull}
                 |
                 |Falling back to inferred dataType if it exists.
               """.stripMargin)
          }
          inferredOpt
        }.getOrElse {
          throw new AnalysisException(s"Failed to resolve the schema for $format for " +
            s"the partition column: $partitionColumn. It must be specified manually.")
        }
      }
      StructType(partitionFields)
    }
  }
  // by this point job0 has already finished and returned
  val dataSchema = userSpecifiedSchema.map { schema =>
    StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
  }.orElse {
    // job1 starts from this call
    format.inferSchema(
      sparkSession,
      caseInsensitiveOptions,
      tempFileIndex.allFiles())
  }.getOrElse {
    throw new AnalysisException(
      s"Unable to infer schema for $format. It must be specified manually.")
  }
  // We just print a waring message if the data schema and partition schema have the duplicate
  // columns. This is because we allow users to do so in the previous Spark releases and
  // we have the existing tests for the cases (e.g., `ParquetHadoopFsRelationSuite`).
  // See SPARK-18108 and SPARK-21144 for related discussions.
  try {
    SchemaUtils.checkColumnNameDuplication(
      (dataSchema ++ partitionSchema).map(_.name),
      "in the data schema and the partition schema",
      equality)
  } catch {
    case e: AnalysisException => logWarning(e.getMessage)
  }
  (dataSchema, partitionSchema)
}

This method first triggers job0; once job0 has returned, execution continues with the steps below. job0 is mainly file listing: it collects how many files there are, each file's block information, and so on.
job1's main purpose is to extract the data schema. It is triggered by the snippet below, where execution falls into the orElse {} branch:

    // by this point job0 has already finished and returned
    val dataSchema = userSpecifiedSchema.map { schema =>
      StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
    }.orElse {
      // job1 starts from this call
      format.inferSchema(
        sparkSession,
        caseInsensitiveOptions,
        tempFileIndex.allFiles())
    }.getOrElse {
      throw new AnalysisException(
        s"Unable to infer schema for $format. It must be specified manually.")
    }

The format here is a ParquetFileFormat, so this ends up calling ParquetFileFormat.inferSchema().

2. ParquetFileFormat.inferSchema

  /**
   * When possible, this method should return the schema of the given `files`.  When the format
   * does not support inference, or no valid files are given should return None.  In these cases
   * Spark will require that user specify the schema manually.
   */
  override def inferSchema(
      sparkSession: SparkSession,
      parameters: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    val parquetOptions = new ParquetOptions(parameters, sparkSession.sessionState.conf)

    // Should we merge schemas from all Parquet part-files?
    val shouldMergeSchemas = parquetOptions.mergeSchema

    val mergeRespectSummaries = sparkSession.sessionState.conf.isParquetSchemaRespectSummaries

    val filesByType = splitFiles(files)

    // Sees which file(s) we need to touch in order to figure out the schema.
    //
    // Always tries the summary files first if users don't require a merged schema.  In this case,
    // "_common_metadata" is more preferable than "_metadata" because it doesn't contain row
    // groups information, and could be much smaller for large Parquet files with lots of row
    // groups.  If no summary file is available, falls back to some random part-file.
    //
    // NOTE: Metadata stored in the summary files are merged from all part-files.  However, for
    // user defined key-value metadata (in which we store Spark SQL schema), Parquet doesn't know
    // how to merge them correctly if some key is associated with different values in different
    // part-files.  When this happens, Parquet simply gives up generating the summary file.  This
    // implies that if a summary file presents, then:
    //
    //   1. Either all part-files have exactly the same Spark SQL schema, or
    //   2. Some part-files don't contain Spark SQL schema in the key-value metadata at all (thus
    //      their schemas may differ from each other).
    //
    // Here we tend to be pessimistic and take the second case into account.  Basically this means
    // we can't trust the summary files if users require a merged schema, and must touch all part-
    // files to do the merge.
    val filesToTouch =
      if (shouldMergeSchemas) {
        // Also includes summary files, 'cause there might be empty partition directories.

        // If mergeRespectSummaries config is true, we assume that all part-files are the same for
        // their schema with summary files, so we ignore them when merging schema.
        // If the config is disabled, which is the default setting, we merge all part-files.
        // In this mode, we only need to merge schemas contained in all those summary files.
        // You should enable this configuration only if you are very sure that for the parquet
        // part-files to read there are corresponding summary files containing correct schema.

        // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
        // the ordering of the output columns. There are several things to mention here.
        //
        //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
        //     the first part-file so that the columns of the lexicographically first file show
        //     first.
        //
        //  2. If mergeRespectSummaries config is true, then there should be, at least,
        //     "_metadata"s for all given files, so that we can ensure the columns of
        //     the lexicographically first file show first.
        //
        //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
        //     no guarantee of the output order, since there might not be a summary file for the
        //     lexicographically first file, which ends up putting ahead the columns of
        //     the other files. However, this should be okay since not enabling
        //     shouldMergeSchemas means (assumes) all the files have the same schemas.

        val needMerged: Seq[FileStatus] =
          if (mergeRespectSummaries) {
            Seq.empty
          } else {
            filesByType.data
          }
        needMerged ++ filesByType.metadata ++ filesByType.commonMetadata
      } else {
        // Tries any "_common_metadata" first. Parquet files written by old versions or Parquet
        // don't have this.
        filesByType.commonMetadata.headOption
            // Falls back to "_metadata".
            .orElse(filesByType.metadata.headOption)
            // Summary file(s) not found, the Parquet file is either corrupted, or different part-
            // files contain conflicting user defined metadata (two or more values are associated
            // with a same key in different files).  In either case, we fall back to any of the
            // first part-file, and just assume all schemas are consistent.
            .orElse(filesByType.data.headOption)
            .toSeq
      }
    ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
  }

The method looks long, but most of it is comments. It essentially decides whether the schema of every parquet file needs to be read in order to work out the final schema, and how that decision is made.
Let's simplify the code.

1. Simplified code

override def inferSchema(
    sparkSession: SparkSession,
    parameters: Map[String, String],
    files: Seq[FileStatus]): Option[StructType] = {
  // Should schemas from all the parquet part-files be merged? Defaults to false.
  val parquetOptions = new ParquetOptions(parameters, sparkSession.sessionState.conf)
  val shouldMergeSchemas = parquetOptions.mergeSchema
  // When merging, can the summary files be assumed consistent with all parquet files?
  // Defaults to false (i.e. not consistent); only relevant when shouldMergeSchemas is true.
  val mergeRespectSummaries = sparkSession.sessionState.conf.isParquetSchemaRespectSummaries
  val filesByType = splitFiles(files)
  val filesToTouch =
    // false here
    if (shouldMergeSchemas) {
      val needMerged: Seq[FileStatus] =
        if (mergeRespectSummaries) {
          Seq.empty
        } else {
          filesByType.data
        }
      needMerged ++ filesByType.metadata ++ filesByType.commonMetadata
    } else {
      // execution takes this branch
      filesByType.commonMetadata.headOption
          .orElse(filesByType.metadata.headOption)
          .orElse(filesByType.data.headOption)
          .toSeq
    }
  // and finally ends up calling this
  ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
}

2. parquetOptions.mergeSchema is false

The default value of parquetOptions.mergeSchema is false:

  val PARQUET_SCHEMA_MERGING_ENABLED = buildConf("spark.sql.parquet.mergeSchema")
    .doc("When true, the Parquet data source merges schemas collected from all data files, " +
      "otherwise the schema is picked from the summary file or a random data file " +
      "if no summary file is available.")
    .booleanConf
    .createWithDefault(false)
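
For contrast, here is a minimal spark-shell style sketch of turning the merge on for a single read via the mergeSchema data source option; this is only an illustration and not part of the original test program, and the path is the same placeholder pattern used by that program.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("merge_schema_demo").getOrCreate()

// Per-read override of spark.sql.parquet.mergeSchema: with this option set, shouldMergeSchemas
// becomes true and inferSchema will touch every part-file instead of just the first one.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///user/daily/20200828/*.parquet")
df.printSchema()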

3. isParquetSchemaRespectSummaries defaults to false

The default value of sparkSession.sessionState.conf.isParquetSchemaRespectSummaries is false:

  val PARQUET_SCHEMA_RESPECT_SUMMARIES = buildConf("spark.sql.parquet.respectSummaryFiles")
    .doc("When true, we make assumption that all part-files of Parquet are consistent with " +
      "summary files and we will ignore them when merging schema. Otherwise, if this is " +
      "false, which is the default, we will merge all part-files. This should be considered " +
      "as expert-only option, and shouldn't be enabled before knowing what it means exactly.")
    .booleanConf
    .createWithDefault(false)
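
If you really do trust the summary files, both flags can be flipped on a session before reading; this is a hedged illustration only (the session setup is assumed, not taken from the original program).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("respect_summaries_demo").getOrCreate()

// respectSummaryFiles only matters when mergeSchema is also true; with both enabled, the merge
// reads only the _metadata / _common_metadata summary files and skips the data part-files.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")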

4. filesByType contents

filesByType is a wrapper around all of the files. In this run, the resulting filesToTouch contains:

LocatedFileStatus{path=hdfs://test.com:9000/user/daily/20200828/part-00000-0e0dc5b5-5061-41ca-9fa6-9fb7b3e09e98-c000.snappy.parquet; isDirectory=false; length=193154555; replication=3; blocksize=134217728; modification_time=1599052749676; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

That is, the first of all the files held in filesByType; this single file is then passed as the argument to ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession).
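
To make the filesByType grouping more concrete, below is an illustrative Scala reconstruction of the classification that splitFiles performs; the case class and helper are simplified sketches based on the behaviour described above, not the verbatim Spark source.

import org.apache.hadoop.fs.FileStatus

// Sketch: summary files are recognised purely by name; everything else counts as a data part-file.
case class FileTypes(
    data: Seq[FileStatus],
    metadata: Seq[FileStatus],
    commonMetadata: Seq[FileStatus])

def splitFilesSketch(allFiles: Seq[FileStatus]): FileTypes = {
  // Sorting keeps the lexicographically first part-file at the head, which matters because the
  // non-merging branch of inferSchema falls back to filesByType.data.headOption.
  val leaves = allFiles.sortBy(_.getPath.toString)
  FileTypes(
    data = leaves.filterNot { f =>
      val name = f.getPath.getName
      name == "_metadata" || name == "_common_metadata"
    },
    metadata = leaves.filter(_.getPath.getName == "_metadata"),
    commonMetadata = leaves.filter(_.getPath.getName == "_common_metadata"))
}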

3. ParquetFileFormat.mergeSchemasInParallel

  /**
   * Figures out a merged Parquet schema with a distributed Spark job.
   *
   * Note that locality is not taken into consideration here because:
   *
   *  1. For a single Parquet part-file, in most cases the footer only resides in the last block of
   *     that file.  Thus we only need to retrieve the location of the last block.  However, Hadoop
   *     `FileSystem` only provides API to retrieve locations of all blocks, which can be
   *     potentially expensive.
   *
   *  2. This optimization is mainly useful for S3, where file metadata operations can be pretty
   *     slow.  And basically locality is not available when using S3 (you can't run computation on
   *     S3 nodes).
   */
  def mergeSchemasInParallel(
      filesToTouch: Seq[FileStatus],
      sparkSession: SparkSession): Option[StructType] = {
    val assumeBinaryIsString = sparkSession.sessionState.conf.isParquetBinaryAsString
    val assumeInt96IsTimestamp = sparkSession.sessionState.conf.isParquetINT96AsTimestamp

    val serializedConf = new SerializableConfiguration(sparkSession.sessionState.newHadoopConf())

    // !! HACK ALERT !!
    //
    // Parquet requires `FileStatus`es to read footers.  Here we try to send cached `FileStatus`es
    // to executor side to avoid fetching them again.  However, `FileStatus` is not `Serializable`
    // but only `Writable`.  What makes it worse, for some reason, `FileStatus` doesn't play well
    // with `SerializableWritable[T]` and always causes a weird `IllegalStateException`.  These
    // facts virtually prevents us to serialize `FileStatus`es.
    //
    // Since Parquet only relies on path and length information of those `FileStatus`es to read
    // footers, here we just extract them (which can be easily serialized), send them to executor
    // side, and resemble fake `FileStatus`es there.
    val partialFileStatusInfo = filesToTouch.map(f => (f.getPath.toString, f.getLen))

    // Set the number of partitions to prevent following schema reads from generating many tasks
    // in case of a small number of parquet files.
    val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
      sparkSession.sparkContext.defaultParallelism)

    val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles

    // Issues a Spark job to read Parquet schema in parallel.
    val partiallyMergedSchemas =
      sparkSession
        .sparkContext
        .parallelize(partialFileStatusInfo, numParallelism)
        .mapPartitions { iterator =>
          // Resembles fake `FileStatus`es with serialized path and length information.
          val fakeFileStatuses = iterator.map { case (path, length) =>
            new FileStatus(length, false, 0, 0, 0, 0, null, null, null, new Path(path))
          }.toSeq

          // Reads footers in multi-threaded manner within each task
          val footers =
            ParquetFileFormat.readParquetFootersInParallel(
              serializedConf.value, fakeFileStatuses, ignoreCorruptFiles)

          // Converter used to convert Parquet `MessageType` to Spark SQL `StructType`
          val converter = new ParquetToSparkSchemaConverter(
            assumeBinaryIsString = assumeBinaryIsString,
            assumeInt96IsTimestamp = assumeInt96IsTimestamp)

          if (footers.isEmpty) {
            Iterator.empty
          } else {
            var mergedSchema = ParquetFileFormat.readSchemaFromFooter(footers.head, converter)
            footers.tail.foreach { footer =>
              val schema = ParquetFileFormat.readSchemaFromFooter(footer, converter)
              try {
                mergedSchema = mergedSchema.merge(schema)
              } catch { case cause: SparkException =>
                throw new SparkException(
                  s"Failed merging schema of file ${footer.getFile}:\n${schema.treeString}", cause)
              }
            }
            Iterator.single(mergedSchema)
          }
        }.collect()

    if (partiallyMergedSchemas.isEmpty) {
      None
    } else {
      var finalSchema = partiallyMergedSchemas.head
      partiallyMergedSchemas.tail.foreach { schema =>
        try {
          finalSchema = finalSchema.merge(schema)
        } catch { case cause: SparkException =>
          throw new SparkException(
            s"Failed merging schema:\n${schema.treeString}", cause)
        }
      }
      Some(finalSchema)
    }
  }

ParquetFileFormat.mergeSchemasInParallel takes two parameters: the SparkSession and filesToTouch, a list of files. As mentioned above, by default only one file is actually passed in.
The purpose of this method is to create job1, which parses the schema from the files passed in. The part of the method that becomes job1 is:

    val partialFileStatusInfo = filesToTouch.map(f => (f.getPath.toString, f.getLen))

    val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
      sparkSession.sparkContext.defaultParallelism)

    val ignoreCorruptFiles = sparkSession.sessionState.conf.ignoreCorruptFiles

    val partiallyMergedSchemas =
      sparkSession
        .sparkContext
        .parallelize(partialFileStatusInfo, numParallelism)
        .mapPartitions { iterator =>
          ......
          Iterator.single(mergedSchema)
        }.collect()

The parallelism of job1, numParallelism:

    val numParallelism = Math.min(Math.max(partialFileStatusInfo.size, 1),
      sparkSession.sparkContext.defaultParallelism)

partialFileStatusInfo.size is the size of the argument passed in, which is 1 here.
sparkSession.sparkContext.defaultParallelism returns 1 here; the actual value varies by environment, but it is never less than 1.

So numParallelism is 1, which means job1 has exactly one partition.
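
A quick worked example of that formula; the defaultParallelism of 8 in the second line is an assumed value for a bigger setup, not from this run.

// numParallelism = min(max(number of files to touch, 1), defaultParallelism)
val defaultCase = math.min(math.max(1, 1), 1)    // this article's run: 1 file, local master => 1 partition
val mergedCase  = math.min(math.max(100, 1), 8)  // hypothetical: mergeSchema with 100 part-files => 8 footer-reading tasks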

2. Summary of the job1 call

The method call chain is:


DataFrameReader.load()
DataFrameReader.loadV1Source()
DataSource.resolveRelation()
DataSource.getOrInferFileFormatSchema()
ParquetFileFormat.inferSchema()
ParquetFileFormat.mergeSchemasInParallel()

3. spark read summary

Let's look at our code again.

public class UserProfileTest {
    static String filePath = "hdfs:///user/daily/20200828/*.parquet";

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("local")
                .setAppName("user_profile_test")
                .set(ConfigurationOptions.ES_NODES, "")
                .set(ConfigurationOptions.ES_PORT, "")
                .set(ConfigurationOptions.ES_MAPPING_ID, "uid");

        // The main question: why does this part produce extra jobs?
        SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
        Dataset<Row> userProfileSource = sparkSession.read().parquet(filePath);
        userProfileSource.count();
        userProfileSource.write().parquet("hdfs:///user/daily/result2020082808/");
    }
}

When the number of paths matched by filePath is greater than 32, a separate job is produced to recursively list the files and find all matching file information (including block sizes and so on, which is later used for schema inference and for partitioning the RDD operations).
Once all the files have been found, if the files being read are parquet, a new job is created to parse the corresponding schema.

Based on the analysis above, if we modify the filePath in the code above, for example setting it to:

static String filePath = "hdfs://bj3-dev-search-01.tencn:9000/user/daily/20200828/part-00057-0e0dc5b5-5061-41ca-9fa6-9fb7b3e09e98-c000.snappy.parquet";

then job0 will no longer appear when the program runs.
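
Similarly, job1 only exists because userSpecifiedSchema is empty; if the schema is supplied up front, the userSpecifiedSchema.map {...} branch in getOrInferFileFormatSchema is taken and format.inferSchema is never called. A minimal Scala sketch with invented column names, reusing the same placeholder path:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local").appName("user_profile_test").getOrCreate()

// Invented columns purely for illustration; the real data schema is whatever the parquet files hold.
val userSchema = StructType(Seq(
  StructField("uid", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

// With .schema(...) provided, no parquet footer has to be read up front,
// so the schema-inference job (job1) does not show up.
val df = spark.read
  .schema(userSchema)
  .parquet("hdfs:///user/daily/20200828/*.parquet")
println(df.count())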

