Learning objectives

  1. Select rows from a DataFrame
In [1]:
import pandas as pd
In [2]:
# Data source: https://www.kaggle.com/hesh97/titanicdataset-traincsv/data
train_data = pd.read_csv('./train.csv')
train_data.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

DataFrame slicing

  • On a DataFrame, the [] operator selects columns by default
  • A slice inside [], however, selects rows
In [4]:
train_data[0:5]
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
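The two behaviors of [] can be compared side by side. The following is a minimal sketch on a hypothetical four-row frame (`df`, standing in for `train_data`, with made-up short names), not the Titanic data itself:

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data.
df = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

age = df["Age"]                 # string key: one column, returned as a Series
subset = df[["Name", "Age"]]    # list of keys: several columns, a DataFrame
first_two = df[0:2]             # slice: rows at positions 0 and 1

# A bare integer is treated as a column key, not a row position.
try:
    df[0]
    int_key_failed = False
except KeyError:
    int_key_failed = True

print(first_two)
print("df[0] raised KeyError:", int_key_failed)
```

Because a single integer in [] is looked up as a column name, picking one row needs .loc or .iloc rather than [].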

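Selecting rows

  • A Series selects rows with [], but a DataFrame's [] is designed to select columns by default
  • Use the .loc and .iloc indexers to select rows
    • loc - uses the index labels themselves
    • iloc - uses 0-based integer positions
    • Both accept a comma followed by a column selection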
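The difference between labels and positions only shows up when the index is not 0, 1, 2, ... The sketch below assumes a hypothetical three-row frame `df` whose index starts at 100, so the two indexers need different keys to reach the same row:

```python
import pandas as pd

# Hypothetical toy frame; the index labels deliberately start at 100.
df = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Heikkinen"], "Age": [22.0, 38.0, 26.0]},
    index=[100, 101, 102],
)

by_label = df.loc[101]          # row whose index label is 101
by_position = df.iloc[1]        # second row, counting from 0

# After the comma comes the column: a label for loc, a position for iloc.
age_by_label = df.loc[101, "Age"]
age_by_position = df.iloc[1, 1]

# Label 1 does not exist in this index, so loc cannot find it.
try:
    df.loc[1]
    missing_label_failed = False
except KeyError:
    missing_label_failed = True

print(by_label)
print(age_by_label, age_by_position)
```

Both lookups return the same row here, which is a quick way to check which kind of key a given expression expects.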
In [5]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [6]:
import numpy as np
train_data.index = np.arange(100, 991)
In [8]:
train_data.tail()
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
986 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
987 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
988 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
989 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
990 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [10]:
train_data.loc[986]
# loc looks rows up by the displayed index label
Out[10]:
PassengerId                      887
Survived                           0
Pclass                             2
Name           Montvila, Rev. Juozas
Sex                             male
Age                             27.0
SibSp                              0
Parch                              0
Ticket                        211536
Fare                            13.0
Cabin                            NaN
Embarked                           S
Name: 986, dtype: object
In [14]:
train_data.iloc[890]
# iloc looks rows up by actual position (0-based)
Out[14]:
PassengerId                    891
Survived                         0
Pclass                           3
Name           Dooley, Mr. Patrick
Sex                           male
Age                           32.0
SibSp                            0
Parch                            0
Ticket                      370376
Fare                          7.75
Cabin                          NaN
Embarked                         Q
Name: 990, dtype: object
In [16]:
train_data.loc[[986, 100, 300]]
Out[16]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
986 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
100 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S
300 201 0 3 Vande Walle, Mr. Nestor Cyriel male 28.0 0 0 345770 9.50 NaN S
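Two details of list and slice selection are easy to miss: a list returns rows in the order requested, and a slice through .loc includes its end label while a slice through .iloc excludes its end position. A small sketch on a hypothetical frame `df` (not the Titanic data):

```python
import pandas as pd

# Hypothetical toy frame with index labels 100..103.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]}, index=[100, 101, 102, 103])

# A list of labels returns the rows in the order the labels were given.
picked = df.loc[[103, 100]]

# loc slices by label and includes the end label: 100, 101, 102.
label_slice = df.loc[100:102]

# iloc slices by position and excludes the end position: positions 0 and 1.
position_slice = df.iloc[0:2]

print(list(picked.index))
print(len(label_slice), len(position_slice))
```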

Selecting rows and columns together

In [19]:
train_data.loc[[986, 100, 110, 900], ['Survived', 'Name', 'Sex', 'Age']]
Out[19]:
Survived Name Sex Age
986 0 Montvila, Rev. Juozas male 27.0
100 0 Braund, Mr. Owen Harris male 22.0
110 1 Sandstrom, Miss. Marguerite Rut female 4.0
900 0 Ponesell, Mr. Martin male 34.0
In [21]:
train_data.loc[[986, 100, 110, 900], ['Survived', 'Sex', 'Name', 'Age']]
Out[21]:
Survived Sex Name Age
986 0 male Montvila, Rev. Juozas 27.0
100 0 male Braund, Mr. Owen Harris 22.0
110 1 female Sandstrom, Miss. Marguerite Rut 4.0
900 0 male Ponesell, Mr. Martin 34.0
In [22]:
train_data.columns
Out[22]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [26]:
train_data.iloc[[800, 100, 110, 870], [1, 4, 5]]
# .loc would raise a KeyError here: columns are given by position, and 1, 4, 5 are not column labels
Out[26]:
Survived Sex Age
900 0 male 34.0
200 0 female 28.0
210 0 male 47.0
970 0 male 26.0
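When positions are known but labels are wanted (or the reverse), the index objects themselves can do the translation. This is one possible approach, sketched on a hypothetical three-row frame `df`; `Index.get_indexer` is a standard pandas method, while the data is made up:

```python
import pandas as pd

# Hypothetical toy frame with a non-default index.
df = pd.DataFrame(
    {"Survived": [0, 1, 1], "Sex": ["male", "female", "female"], "Age": [22.0, 38.0, 26.0]},
    index=[100, 101, 102],
)

row_positions = [2, 0]
col_positions = [0, 2]

# Positions to labels: index the Index objects positionally.
row_labels = df.index[row_positions]
col_labels = df.columns[col_positions]

by_position = df.iloc[row_positions, col_positions]
by_label = df.loc[row_labels, col_labels]

# Labels to positions: get_indexer returns the integer location of each label.
recovered = df.columns.get_indexer(["Survived", "Age"])

print(by_position)
print(list(recovered))
```

Either form selects the same cells; the choice usually depends on whether the code should keep working when rows or columns are reordered.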