[Data Analysis Libraries] Python Pandas: Series and DataFrame


23.8 2022. 11. 25. 06:27

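3. Pandas Basics for Data Manipulation and Analysis

01. What is Pandas

02. Series data

03. DataFrame

01. What is Pandas

Pandas: a Python library for efficiently processing and storing structured data

Pandas features

- It is built on top of NumPy, which specializes in array computation.

- It offers easy-to-use, nearly complete support for loading tabular data from files (or URLs).

- It uses two structures, Series and DataFrame, to handle complex tables of heterogeneous data and time series data.

- It provides convenient features for data handling (slicing, dropping/adding missing elements, renaming, merging, etc.) and for visualization.

- By convention, it is imported under the alias pd.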

import pandas as pd

data = pd.Series([1, 2, 3, 4])
print(data)
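
As a rough illustration of the data-handling features listed above (dropping missing elements, renaming, combining), here is a minimal sketch. The variable names and values are made up for the example and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical example values; NaN marks a missing element.
scores = pd.Series([90.0, np.nan, 75.0], index=['a', 'b', 'c'])

cleaned = scores.dropna()                 # drop the missing element
renamed = cleaned.rename({'a': 'alice'})  # rename one index label
extra = pd.Series([60.0], index=['d'])
combined = pd.concat([renamed, extra])    # combine two Series into one

print(combined)
```

Each call returns a new Series, so scores itself is left unchanged.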

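02. Series data

Series: one of the core objects of Pandas.

Series features

- It is built on NumPy's ndarray and has an Index and Values. (Printing the type of a Series's values gives numpy.ndarray.)

- It represents a one-dimensional array with indexing support.

- It consists of an Index and Values, which distinguishes it from a List, which holds only values.

- By default, the Index is generated automatically as 0, 1, 2, 3, ...

- It can hold zero or more values of the same type.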

import pandas as pd

data = pd.Series([1, 2, 3, 4])
print(data)

# index, value
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

print(type(data))  # <class 'pandas.core.series.Series'>

print(data.values)  # [1 2 3 4]  (the attribute is .values, not .value)

print(type(data.values))  # <class 'numpy.ndarray'>
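
To make the difference from a List concrete, the following sketch prints the automatically generated index next to the values, and also builds a Series with zero elements. The variable names are illustrative.

```python
import pandas as pd

values = [1, 2, 3, 4]        # a plain list: values only
data = pd.Series(values)     # a Series: values plus an index

print(data.index)            # the automatically generated index
print(list(data.index))      # [0, 1, 2, 3]
print(data.values.tolist())  # [1, 2, 3, 4]

# A Series may also hold zero elements; the dtype is given explicitly here.
empty = pd.Series([], dtype="float64")
print(len(empty))            # 0
```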

The dtype argument lets you specify the data type.

data = pd.Series([1, 2, 3, 4], dtype="float")
print(data.dtype)  # float64
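
Besides passing dtype at construction time, an existing Series can be converted afterwards with astype. A minimal sketch with illustrative names:

```python
import pandas as pd

ints = pd.Series([1, 2, 3, 4])   # integer dtype inferred from the values
floats = ints.astype("float64")  # returns a new Series; ints is unchanged

print(ints.dtype)
print(floats.dtype)  # float64
```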

You can specify an index and access elements by index.

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
data['c'] = 5  # access by index label to change an element
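
With a custom index, label-based and position-based access are worth telling apart. The sketch below uses .loc for labels and .iloc for positions. It is one way to write these lookups, not the only one.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print(data.loc['c'])      # 3, looked up by label
print(data.iloc[2])       # 3, looked up by position
print(data.loc['b':'c'])  # label slices include both endpoints

data.loc['c'] = 5         # same effect as data['c'] = 5
print(data['c'])          # 5
```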

** Then how do you change the index itself?
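
One possible answer to the question above: rename individual labels with rename, or assign a whole new list to the index attribute. This is a sketch of two common options, with made-up replacement labels.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Option 1: rename selected labels; this returns a new Series.
renamed = data.rename({'a': 'x'})
print(list(renamed.index))  # ['x', 'b', 'c', 'd']

# Option 2: replace the whole index in place; the length must match.
data.index = ['w', 'x', 'y', 'z']
print(list(data.index))     # ['w', 'x', 'y', 'z']
```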

You can create a Series from a Python Dictionary.

population_dict = {
    'china': 141500,
    'japan': 12718,
    'korea': 5180,
    'usa': 32676
}

population = pd.Series(population_dict)
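
When a Series is built from a Dictionary, the keys become the index and the values become the data. A short sketch that checks this and looks values up by key:

```python
import pandas as pd

population_dict = {
    'china': 141500,
    'japan': 12718,
    'korea': 5180,
    'usa': 32676
}
population = pd.Series(population_dict)

print(list(population.index))       # ['china', 'japan', 'korea', 'usa']
print(population['korea'])          # 5180
print(population['japan':'korea'])  # label slice, both ends included
```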

03. DataFrame

DataFrame: data in which several Series are combined to form rows and columns

Creating a DataFrame from Series

gdp_dict = {
    'china': 140925000,
    'japan': 516700000,
    'korea': 169320000,
    'usa': 2041280000,
}

gdp = pd.Series(gdp_dict)

population_dict = {
    'china': 141500,
    'japan': 12718,
    'korea': 5180,
    'usa': 32676
}

population = pd.Series(population_dict)

# Each Series becomes one column; rows are matched by index label.
country = pd.DataFrame({
    'gdp': gdp,
    'population': population
})
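
When Series are combined into a DataFrame, rows are matched by index label rather than by position. The sketch below lists the population values in a different order on purpose, then derives a new column. The column name gdp_per_capita is an illustrative assumption.

```python
import pandas as pd

gdp = pd.Series({'china': 140925000, 'japan': 516700000,
                 'korea': 169320000, 'usa': 2041280000})
# Deliberately given in a different order to show alignment by label.
population = pd.Series({'usa': 32676, 'korea': 5180,
                        'japan': 12718, 'china': 141500})

country = pd.DataFrame({'gdp': gdp, 'population': population})

# Element-wise arithmetic between two columns.
country['gdp_per_capita'] = country['gdp'] / country['population']
print(country)
```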

Creating a DataFrame from a Dictionary

data = {
    'country': ['china', 'japan', 'korea', 'usa'],
    'gdp': [140925000, 516700000, 169320000, 2041280000],
    'population': [141500, 12718, 5180, 32676]
}

country = pd.DataFrame(data)

# If no index is specified, an index of 0, 1, 2, 3... is generated.
# Here the 'country' column is used as the index instead.
country = country.set_index('country')
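
Once a column has been set as the index, columns and rows can be selected by name. A minimal sketch, which also uses reset_index to turn the index back into an ordinary column:

```python
import pandas as pd

data = {
    'country': ['china', 'japan', 'korea', 'usa'],
    'gdp': [140925000, 516700000, 169320000, 2041280000],
    'population': [141500, 12718, 5180, 32676]
}
country = pd.DataFrame(data).set_index('country')

print(country['gdp'])                      # one column, returned as a Series
print(country.loc['korea'])                # one row, also returned as a Series
print(country.loc['korea', 'population'])  # a single value: 5180

restored = country.reset_index()           # index becomes a column again
print(restored.columns.tolist())           # ['country', 'gdp', 'population']
```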

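Summary: Dictionary, Series, DataFrame

Dictionary

dic = {'s1': [1, 2, 3, 4], 's2': [5, 6, 7, 8]}  # {key: value}, example values

Series

series1 = pd.Series(dic['s1'])  # from existing data
series2 = pd.Series([1, 2, 3, 4])  # from a list
series3 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])  # with an explicit index

DataFrame

df1 = pd.DataFrame({
    's1': series1,
    's2': series2
})

df2 = pd.DataFrame(dic)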

Checking DataFrame attributes

print(country.shape)  # (4, 2)
print(country.size)  # 8
print(country.ndim)  # 2
print(country.values)

Naming a DataFrame's index and columns

country.index.name = "Country"  # give the index a name
country.columns.name = "Info"  # give the columns a name

print(country.index)
print(country.columns)
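
To see what these names do, the sketch below sets them on a small two-row DataFrame (a shortened, illustrative version of the table above) and reads them back. The index name is also what the column is called after reset_index.

```python
import pandas as pd

# Shortened example table; only two rows are used for brevity.
country = pd.DataFrame(
    {'gdp': [169320000, 2041280000], 'population': [5180, 32676]},
    index=['korea', 'usa'],
)
country.index.name = "Country"
country.columns.name = "Info"

print(country.index.name)    # Country
print(country.columns.name)  # Info
print(country.reset_index().columns.tolist())  # ['Country', 'gdp', 'population']
```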

Saving and loading a DataFrame

country.to_csv("./country.csv")
country.to_excel("country.xlsx")

country = pd.read_csv("./country.csv")
country = pd.read_excel("country.xlsx")
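
One detail to watch when loading: to_csv writes the index as the first column, and read_csv reads it back as an ordinary column unless index_col is given. The sketch below shows both cases, using an in-memory buffer as a stand-in for the file and a shortened, illustrative table. Note that to_excel and read_excel also need an Excel engine package such as openpyxl to be installed.

```python
import io
import pandas as pd

# Shortened example table with a named index.
country = pd.DataFrame(
    {'gdp': [169320000, 2041280000], 'population': [5180, 32676]},
    index=pd.Index(['korea', 'usa'], name='Country'),
)

buffer = io.StringIO()  # in-memory stand-in for "./country.csv"
country.to_csv(buffer)

buffer.seek(0)
plain = pd.read_csv(buffer)                  # index comes back as a column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column becomes the index

print(plain.columns.tolist())   # ['Country', 'gdp', 'population']
print(restored.index.tolist())  # ['korea', 'usa']
```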
