from: https://www.kaggle.com/pavansanagapati/covariate-shift-what-is-it

Covariate Shift – What is it ?

Introduction

You may have heard from various people that data science competitions are a good way to learn data science, but they are not as useful in solving real world data science problems. Why do you think this is the case?

One of the differences lies in the quality of data that has been provided. In Data Science Competitions, the datasets are carefully curated. Usually, a single large dataset is split into train and test file. So, most of the times the train and test have been generated from the same distribution.

But this is not the case when dealing with real world problems, especially when the data has been collected over a long period of time. In such cases, there may be multiple variables / environment changes might have happened during that period. If proper care is not taken then, the training dataset cannot be used to predict anything about the test dataset in a usable manner.

In this kernel, we will see the different types of problems or Dataset Shift that we might encounter in the real world. Specifically, we will be talking in detail about one particular kind of shift in the Dataset (Covariate shift), the existing methods to deal with this kind of shift and an in depth demonstration of a particular method to correct this shift.

Table of Contents

  • 1. What is Dataset Shift?
  • 2. What causes Dataset Shift?
  • 3. Types of Dataset Shift
  • 4. Covariate Shift
  • 5. Identification
  • 6. Treatment
    • 6.1 Dropping of drifting features
    • 6.2 Importance Weight using Density Ratio Estimation

1. What is Dataset Shift?

Every time you participate in a competition, your journey will look quite similar to the one shown in the figure below.Let me explain this with the help of a scenario depicted in the picture below. You are given a train and a test file in a competition. You complete the preprocessing, the feature engineering and the cross validation part on the model created but you do not get the same result as the one you get on the cross-validation. No matter what validation strategy you try, it seems like you are bound to get different results in comparison to the cross validation.

What can be a possible reason for this failure? So, if you carefully notice the first picture, you will find that you did all the manipulation by just looking at the train file. Therefore, you completely ignored the information contained in the test file.

Now take a look back on the second picture, you will notice that the training file contains information about male and females of fairly younger age while the test file contains information about people of older age. Therefore it means that the distribution of data contained in the train and test file is significantly different.

So, if you build your model based on the data set containing information about people having lower age and predict on a data set containing higher values of age, that will definitely give you a low score. The reason is that there will a wide gap in the interest and the activities between these two groups. So your model will fail in these conditions.

This change in the distribution of data contained in train and test file is called dataset shift (or drifting).

2. What causes Dataset Shift?

Try to think some of the examples, where you can encounter the problem of dataset shift.

Basically, in the real world, dataset shift mainly occurs because of the change of environments (popularly called as non-stationary environment), where the environment can be referred as location, time, etc.

Let us consider an example. We collected the sales of various item during the period of July-September. Now your job is to predict the sales during the period of Diwali. The visual representation of sales in the train (blue line) and test (black line) file would be similar to the image shown below.

Clearly, the sales during the time of Diwali would be much higher as compared to routine days. Therefore we can say that it is the situation of dataset shift, which occurred due to change of time period between our train and test file.

But our machine learning algorithms work by ignoring these changes. They presume that the train and test environments match and even if they don’t, it assumes that it makes no difference if the environment changes.

Now take a look back at both of the examples that we discussed above. Is there any difference between them?

Yes, in the first scenario, there was a shift in the age (independent variable or predictor) of the population due to which we were getting wrong predictions. While in the latter one, there was a shift in the sales (target variable) of the items. This brings the next topic to the table – Different types of Dataset shifts.

3. Types of Dataset Shift

Dataset shift could be divided into three types:

  • Shift in the independent variables (Covariate Shift)
  • Shift in the target variable (Prior probability shift)
  • Shift in the relationship between the independent and the target variable (Concept Shift)

In this kernel, we will discuss only covariate shift in this article since the other two topics are still an active research area and there has not been any substantial work to mitigate these problems.

We will also see the methods to identify Covariate shift and the proper measures that can be taken in order to improve the predictions.

4. Covariate Shift

Covariate shift refers to the change in the distribution of the input variables present in the training and the test data. It is the most common type of shift and it is now gaining more attention as nearly every real-world dataset suffers from this problem.

First, let us try to understand how does the change in distribution creates a problem for us. Take a look at the image shown below.

If you carefully notice the image given above, our learning function tries to fit the training data. But here, we can see that the distribution of training and test is different, so predicting using this learned function will definitely give us wrong predictions.

So our first step should be to identify this shift in the distribution. Let’s try and understand it.

5. Identification

Here, I have used a quick and dirty machine learning technique to check whether there is a shift between the training data and the test data.

For this purpose, I have used Sberbank Russian Housing Market dataset.

The basic idea to identify shift

If there exists a shift in the dataset, then on mixing the train and test file, you should still be able to classify an instance of the mixed dataset as train or test with reasonable accuracy. Why?

Because, if the features in both the dataset belong to different distributions then, they should be able to separate the dataset into train and test file significantly.

Let’s try to make it simple. Take a look at the distribution of the feature ‘id’ in both the dataset.

By looking at the above distribution, we can clearly see that after a certain value (=30,473), all the instances will belong to test dataset.

So if we create a dataset which is a mixture of training and test instances, where we have labelled each instance of training data as ‘training’ and test as ‘test’ before mixing.

In this new dataset, if we just look at the feature ‘id’, we can clearly classify any instance that whether it belongs to training data or test data. Therefore, we can conclude that ‘id’ is a drifting feature for this dataset.

So this was fairly easy. But we can’t visualise every variable and check whether it is drifting or not.

For that purpose, let us try to code this in Python as a simple classification problem and identify the drifting features.

Steps to identify drift

The basic steps that we will follow are:

  • Preprocessing: This step involves imputing all missing values and label encoding of all categorical variables.
  • Creating a random sample of your training and test data separately and adding a new feature origin which has value train or test depending on whether the observation comes from the training dataset or the test dataset.
  • Now combine these random samples into a single dataset. Note that the shape of both the samples of training and test dataset should be nearly equal, otherwise it can be a case of an unbalanced dataset.
  • Now create a model taking one feature at a time while having ‘origin’ as the target variable on a part of the dataset (say ~75%).
  • Now predict on the rest part(~25%) of the dataset and calculate the value of AUC-ROC.
  • Now if the value of AUC-ROC for a particular feature is greater than 0.80, we classify that feature as drifting.
  • Note that we generally take 0.80 as the threshold value, but the value can be altered based on the situation.

So that is enough of theory, now let’s code this and find which of the features are drifting in this problem.

Code
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-3adbf147ef87> in <module>()
 18 from sklearn.ensemble import RandomForestClassifier  19 ---> 20 from sklearn.cross_validation import cross_val_score  21  22 from sklearn.preprocessing import LabelEncoder ModuleNotFoundError: No module named 'sklearn.cross_validation'

In [2]:
# Reading the train and test filestrain = pd.read_csv('../input/train.csv') test = pd.read_csv('../input/test.csv') 

In [3]:
train.shape,test.shape 

Out[3]:
((30471, 292), (7662, 291))

In [4]:
train.head()

Out[4]:
  id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq state product_type sub_area area_m raion_popul green_zone_part indust_part children_preschool preschool_quota preschool_education_centers_raion children_school school_quota school_education_centers_raion school_education_centers_top_20_raion hospital_beds_raion healthcare_centers_raion university_top_20_raion sport_objects_raion additional_education_raion culture_objects_top_25 culture_objects_top_25_raion shopping_centers_raion office_raion thermal_power_plant_raion incineration_raion oil_chemistry_raion radiation_raion railroad_terminal_raion big_market_raion nuclear_reactor_raion ... cafe_sum_3000_min_price_avg cafe_sum_3000_max_price_avg cafe_avg_price_3000 cafe_count_3000_na_price cafe_count_3000_price_500 cafe_count_3000_price_1000 cafe_count_3000_price_1500 cafe_count_3000_price_2500 cafe_count_3000_price_4000 cafe_count_3000_price_high big_church_count_3000 church_count_3000 mosque_count_3000 leisure_count_3000 sport_count_3000 market_count_3000 green_part_5000 prom_part_5000 office_count_5000 office_sqm_5000 trc_count_5000 trc_sqm_5000 cafe_count_5000 cafe_sum_5000_min_price_avg cafe_sum_5000_max_price_avg cafe_avg_price_5000 cafe_count_5000_na_price cafe_count_5000_price_500 cafe_count_5000_price_1000 cafe_count_5000_price_1500 cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc
0 1 2011-08-20 43 27.0 4.0 NaN NaN NaN NaN NaN NaN Investment Bibirevo 6.407578e+06 155572 0.189727 0.000070 9576 5001.0 5 10309 11065.0 5 0 240.0 1 0 7 3 no 0 16 1 no no no no no no no ... 639.68 1079.37 859.52 5 21 22 16 3 1 0 2 4 0 0 21 1 13.09 13.31 29 807385 52 4036616 152 708.57 1185.71 947.14 12 39 48 40 9 4 0 13 22 1 0 52 4 5850000
1 2 2011-08-23 34 19.0 3.0 NaN NaN NaN NaN NaN NaN Investment Nagatinskij Zaton 9.589337e+06 115352 0.372602 0.049637 6880 3119.0 5 7759 6237.0 8 0 229.0 1 0 6 1 yes 1 3 0 no no no no no no no ... 631.03 1086.21 858.62 1 11 11 4 2 1 0 1 7 0 6 19 1 10.26 27.47 66 2690465 40 2034942 177 673.81 1148.81 911.31 9 49 65 36 15 3 0 15 29 1 10 66 14 6000000
2 3 2011-08-27 43 29.0 2.0 NaN NaN NaN NaN NaN NaN Investment Tekstil'shhiki 4.808270e+06 101708 0.112560 0.118537 5879 1463.0 4 6207 5580.0 7 0 1183.0 1 0 5 1 no 0 0 1 no no no yes no no no ... 697.44 1192.31 944.87 2 9 17 9 3 1 0 0 11 0 0 20 6 13.69 21.58 43 1478160 35 1572990 122 702.68 1196.43 949.55 10 29 45 25 10 3 0 11 27 0 4 67 10 5700000
3 4 2011-09-01 89 50.0 9.0 NaN NaN NaN NaN NaN NaN Investment Mitino 1.258354e+07 178473 0.194703 0.069753 13087 6839.0 9 13670 17063.0 10 0 NaN 1 0 17 6 no 0 11 4 no no no no no no no ... 718.75 1218.75 968.75 0 5 14 10 3 0 0 1 2 0 0 18 3 14.18 3.89 8 244166 22 942180 61 931.58 1552.63 1242.11 4 7 21 15 11 2 1 4 4 0 0 26 3 13100000
4 5 2011-09-05 77 77.0 4.0 NaN NaN NaN NaN NaN NaN Investment Basmannoe 8.398461e+06 108171 0.015234 0.037316 5706 3240.0 7 6748 7770.0 9 0 562.0 4 2 25 2 no 0 10 93 no no no yes yes no no ... 853.03 1410.45 1131.74 63 266 267 262 149 57 4 70 121 1 40 77 5 8.38 10.92 689 8404624 114 3503058 2283 853.88 1411.45 1132.66 143 566 578 552 319 108 17 135 236 2 91 195 14 16331452

Preprocessing

In [5]:
## Handling missing values
for i in train.columns: if train[i].dtype == 'object': train[i] = train[i].fillna(train[i].mode().iloc[0]) if (train[i].dtype == 'int' or train[i].dtype == 'float'): train[i] = train[i].fillna(np.mean(train[i])) for i in test.columns: if test[i].dtype == 'object': test[i] = test[i].fillna(test[i].mode().iloc[0]) if (test[i].dtype == 'int' or test[i].dtype == 'float'): test[i] = test[i].fillna(np.mean(test[i])) 

In [6]:
## Label    encoding
number = LabelEncoder()
for i in train.columns: if (train[i].dtype == 'object'): train[i] = number.fit_transform(train[i].astype('str')) train[i] = train[i].astype('object') for i in test.columns: if (test[i].dtype == 'object'): test[i] = number.fit_transform(test[i].astype('str')) test[i] = test[i].astype('object') 

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-09f525e50b42> in <module>()
 1 ## Label encoding ----> 2 number = LabelEncoder()  3 for i in train.columns:  4 if (train[i].dtype == 'object'):  5 train[i] = number.fit_transform(train[i].astype('str')) NameError: name 'LabelEncoder' is not defined

In [7]:
# Creating a new feature origin
train['origin'] = 0 test['origin'] = 1 training = train.drop('price_doc',axis=1) #droping target variable 

In [8]:
## Taking sample from training and test data
training = training.sample(7662, random_state=12) testing = test.sample(7000, random_state=11) 

In [9]:
## Combining random samples
combine = training.append(testing) y = combine['origin'] combine.drop('origin',axis=1,inplace=True) 

In [10]:
## Modelling
model = RandomForestClassifier(n_estimators = 50, max_depth = 5,min_samples_leaf = 5) drop_list = [] for i in combine.columns: score = cross_val_score(model,pd.DataFrame(combine[i]),y,cv=2,scoring='roc_auc') if (np.mean(score) > 0.8): drop_list.append(i) print(i,np.mean(score)) 

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-7e6054cbafe1> in <module>()
 3 drop_list = []  4 for i in combine.columns: ----> 5 score = cross_val_score(model,pd.DataFrame(combine[i]),y,cv=2,scoring='roc_auc')  6 if (np.mean(score) > 0.8):  7 drop_list.append(i) NameError: name 'cross_val_score' is not defined

In [11]:
# List   of   drifting   features
drop_list

Out[11]:
[]

Here we have classified nine features as drifting

So, now the important question is how to treat them effectively such that we can improve our predictions.

6. Treatment

There are different techniques by which we can treat these features in order to improve our model. Let us discuss some of them.

  • Dropping of drifting features
  • Importance weight using Density Ratio Estimation So let’s try to understand one by one

6.1 Dropping of drifting features

This method is quite simple, as in this, we basically drop the features which are being classified as drifting. But just give it a thought, that simply dropping features might result in some loss of information.

To deal with this, we have defined a simple rule.

Features having a drift value greater than 0.8 and are not important in our model, we drop them.

So, let’s try this in our problem.

Here, I have used a basic random forest model just to check which features are important.

Code
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-a567de6a5f2a> in <module>()
 4 # Using a basic random forest model with all the features  5 rf = RandomForestRegressor(n_estimators=200, max_depth=6,max_features=10) ----> 6 rf.fit(training.drop('price_doc',axis=1),training['price_doc'])  7 pred = rf.predict(testing)  8 columns = ['price_doc'] /opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)  250  251 # Validate or convert input data --> 252 X = check_array(X, accept_sparse="csc", dtype=DTYPE)  253 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)  254 if sample_weight is not None: /opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)  520 try:  521 warnings.simplefilter('error', ComplexWarning) --> 522 array = np.asarray(array, dtype=dtype, order=order)  523 except ComplexWarning:  524 raise ValueError("Complex data not supported\n" /opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)  499  500 """ --> 501 return array(a, dtype, copy=False, order=order)  502  503 ValueError: could not convert string to float: '2011-08-20'

On submitting this file on Kaggle, we are getting a rmse score of 0.40116 on private leaderboard.

So, let’s check first 20 important features for this model.

Code
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-13-e6b27648cf19> in <module>()
 1 # plotting importances  2 features = training.drop('price_doc',axis=1).columns.values ----> 3 imp = rf.feature_importances_  4 indices = np.argsort(imp)[::-1][:20]  5 #plot /opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py in feature_importances_(self)  373 feature_importances_ : array, shape = [n_features]  374 """ --> 375 check_is_fitted(self, 'estimators_')  376  377 all_importances = Parallel(n_jobs=self.n_jobs, /opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)  940  941 if not all_or_any([hasattr(estimator, attr) for attr in attributes]): --> 942 raise NotFittedError(msg % {'name': type(estimator).__name__})  943  944 NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

Now if we compare the feature importance list and drop_list

In [14]:
drop_list

Out[14]:
[]

Now, if we compare our drop list and feature importance, we will find that the features ‘life_sq’ and ‘kitch_sq’ are common.

So, we will keep these two features in our model, while dropping the rest of the drifting features.

NOTE: Before dropping any feature, just make sure you if there any possibility to create a new feature from it.

Let’s try this and check whether it improves our prediction or not.

Code
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-ad4bb45c1add> in <module>()
 3 drift_test = testing.drop(['id','hospital_beds_raion','cafe_sum_500_min_price_avg','cafe_sum_500_max_price_avg','cafe_avg_price_500'], axis=1)  4 rf = RandomForestRegressor(n_estimators=200, max_depth=6,max_features=10) ----> 5 rf.fit(drift_train.drop('price_doc',axis=1),training['price_doc'])  6 pred = rf.predict(drift_test)  7 columns = ['price_doc'] /opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)  250  251 # Validate or convert input data --> 252 X = check_array(X, accept_sparse="csc", dtype=DTYPE)  253 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)  254 if sample_weight is not None: /opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)  520 try:  521 warnings.simplefilter('error', ComplexWarning) --> 522 array = np.asarray(array, dtype=dtype, order=order)  523 except ComplexWarning:  524 raise ValueError("Complex data not supported\n" /opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)  499  500 """ --> 501 return array(a, dtype, copy=False, order=order)  502  503 ValueError: could not convert string to float: '2011-08-20'

On submission of this file on Kaggle, we got a rmse score of 0.39759 on the private leaderboard.

This means our objective is met .We have successfully improved our performance using this technique.

6.2 Importance weight using Density Ratio Estimation

In this method, the approach to importance estimation would be to first estimate the training and test densities separately and then estimate the importance by taking the ratio of the estimated densities of test and train.

Then these densities act as weights for each instance in the training data.

But giving weights to each instance based on the density ratio could be a rigorous task in higher dimensional data sets. I tried this method on an i3 processor with 12 GB RAM and it took around 48 minutes to calculate the ratio density for a single feature. Also, I could not find any improvement in the score on applying the weights to the training data.

Also scaling this feature for 200 features would be a very time-consuming task.

Therefore, this method is only good up to research papers but the application of this in the real world is still questionable. Also, this is an active area of research.

Conclusion

I hope that now we have a better understanding about drift, how we can identify it and treat it effectively. It has now become a common problem in real world dataset. So we should develop a habit to check this every time while solving problems, and surely it will give us positive results.

转载于:https://www.cnblogs.com/Arborday/p/10888639.html

[转] Covariate shift Internal covariate shift相关推荐

  1. Batch normalization:accelerating deep network training by reducing internal covariate shift的笔记

    说实话,这篇paper看了很久,,到现在对里面的一些东西还不是很好的理解. 下面是我的理解,当同行看到的话,留言交流交流啊!!!!! 这篇文章的中心点:围绕着如何降低  internal covari ...

  2. 读文献——《Batch Normalization Accelerating Deep Network Training by Reducing Internal Covariate Shift》

    在自己阅读文章之前,通过网上大神的解读先了解了一下这篇文章的大意,英文不够好的惭愧... 大佬的文章在https://blog.csdn.net/happynear/article/details/4 ...

  3. 批归一化《Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift》

    批归一化<Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift ...

  4. Batch Normalization + Internal Covariate Shift(论文理解)

    在看压缩神经网络相关的论文中,总是能看见 Batch Normalization,在网上找了很多博客看,但是一直处于一种似懂非懂的状态.于是去找了原论文<Batch Normalization: ...

  5. Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shift 论文笔记

    0 摘要 当前神经网络层之前的神经网络层的参数变化,会引起神经网络每一层输入数据的分布产生变化,这使得训练一个深度神经网络变得复杂.通过设置较小的学习率以及更谨慎的初始化参数减慢了训练,并且由于非线性 ...

  6. 【BN】《Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift》

    ICML-2015 在 CIFAR-10 上的小实验可以参考如下博客: [Keras-Inception v2]CIFAR-10 文章目录 1 Background and Motivation 2 ...

  7. 论文阅读Batch Normalization: Accelerating Deep Network Training byReducing Internal Covariate Shift

    论文阅读Batch Normalization: Accelerating Deep Network Training byReducing Internal Covariate Shift 全文翻译 ...

  8. Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shift(BN)

    internal covariate shift(ics):训练深度神经网络是复杂的,因为在训练过程中,每层的输入分布会随着之前层的参数变化而发生变化.所以训练需要更小的学习速度和careful参数初 ...

  9. 《Batch Normalization Accelerating Deep Network Training by Reducing Internal Covariate Shift》阅读笔记与实现

    今年过年之前,MSRA和Google相继在ImagenNet图像识别数据集上报告他们的效果超越了人类水平,下面将分两期介绍两者的算法细节. 这次先讲Google的这篇<Batch Normali ...

最新文章

  1. New Android Application 的介绍
  2. 《JavaScript面向对象精要》——1.8 原始封装类型
  3. mingw编译wxwidgets
  4. 基于face++的人脸识别(上)
  5. D3js(五):tooltips
  6. 8款最受欢迎的HTML5/CSS3应用及代码
  7. photoshop最全快捷键列表
  8. 推荐一个wpfsliverlight的图表控件
  9. JDI考虑让日本国内部分工厂停工 因苹果公司需求低迷
  10. C语言之字符串探究(四):读越界、写越界
  11. html背景图片垂直居中,css — 定位、背景图、水平垂直居中
  12. Jquery中val、text、html的区别
  13. 容大热敏打印机打印纸张出半截,测试页不出嗡嗡响
  14. 层次分析法步骤及源代码
  15. BZOJ 4568 幸运数字
  16. 2019年终总结与展望
  17. phython入门开始
  18. 腾讯 roomservice php,后台系统搭建记录 - 腾讯Web前端 IMWeb 团队社区 | blog | 团队博客...
  19. 云服务器系统分区,云服务器系统盘可以分区
  20. 微信群满100人后无法扫码进群?你们要的解决方法来了!

热门文章

  1. 国际时区 TimeZone ID列表
  2. 买二手房防骗攻略 ,让我们一起饿死无良中介
  3. 面向 JavaScript 开发人员的 5 大物联网库
  4. html大作业网页代码 化妆品购物商城网站设计——电商类化妆品购物商城(1页) HTML+CSS+JavaScript 关于电商类的HTML网页设计-----化妆品
  5. 三菱FA产品QD62D型高速计数器模块的应用
  6. 安装nodemon包
  7. 点亮ESP32自带的小灯
  8. Mac生成ssh密钥
  9. Linux sort用法
  10. 论坛20071204升级公告