Pandas Data Structure - DataFrame
A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both row and column indexes, and it can be thought of as a dictionary of Series (sharing the same index).
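To make the "dictionary of Series" picture concrete, here is a minimal sketch. The index labels "x" and "y" and the variable names are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Two Series that share the same index labels (labels are illustrative)
sites = pd.Series(["Google", "Wiki"], index=["x", "y"])
ages = pd.Series([10, 13], index=["x", "y"])

# Each dict key becomes a column; rows are aligned on the shared index
df = pd.DataFrame({"Site": sites, "Age": ages})

# Selecting one column gives back a Series carrying the same index
age_column = df["Age"]
print(type(age_column).__name__)
print(list(age_column.index))
```

Selecting a column with df["Age"] behaves much like looking up a key in a dictionary, which is why the analogy is useful.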
The constructor for a DataFrame is as follows:
pandas.DataFrame(data, index, columns, dtype, copy)
Parameter descriptions:
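- data: The input data (ndarray, Series, list, dict, another DataFrame, etc.).
- index: Row labels, defaulting to RangeIndex (0, 1, 2, ..., n-1) if none are given.
- columns: Column labels, defaulting to RangeIndex (0, 1, 2, ..., n-1) if none are given or can be taken from the data.
- dtype: Data type to force on the data. If omitted, it is inferred.
- copy: Whether to copy the input data. Older pandas versions default to False; recent versions default to None, in which case the behavior depends on the type of data.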
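The parameters above can be seen together in one call. This is a small sketch with made-up labels ("r1", "a", and so on), not a recommended pattern; in practice most of these arguments are left at their defaults.

```python
import numpy as np
import pandas as pd

# A 3x2 integer array used as the input data (values are illustrative)
values = np.array([[1, 2], [3, 4], [5, 6]])

df = pd.DataFrame(
    data=values,
    index=["r1", "r2", "r3"],   # row labels, one per row of `values`
    columns=["a", "b"],         # column labels, one per column of `values`
    dtype=float,                # convert every column to float
    copy=True,                  # do not share memory with `values`
)
print(df)
print(df.dtypes)
```

Because copy=True is passed, later changes to values should not show up in df.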
A Pandas DataFrame is a two-dimensional, labeled data structure, similar to a spreadsheet or a two-dimensional array with row and column labels.
Example - Creating from Lists
import pandas as pd
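data = [['Google', 10], ['tutorialpro', 12], ['Wiki', 13]]
# Build the DataFrame first, then cast only the numeric column to float.
# Passing dtype=float to the constructor also applies to the 'Site' strings,
# which can fail in recent pandas versions.
df = pd.DataFrame(data, columns=['Site', 'Age'])
df['Age'] = df['Age'].astype(float)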
print(df)
Output:
          Site   Age
0       Google  10.0
1  tutorialpro  12.0
2         Wiki  13.0
The following example creates a DataFrame from a dict of equal-length lists; a dict of ndarrays works the same way. All arrays must have the same length. If an index is passed, its length must equal that of the arrays. If no index is passed, the default index is range(n), where n is the length of the arrays.
For ndarrays, refer to: NumPy Ndarray Object
Example - Creating from a dict of lists or ndarrays
import pandas as pd
data = {'Site':['Google', 'tutorialpro', 'Wiki'], 'Age':[10, 12, 13]}
df = pd.DataFrame(data)
print(df)
Output:
          Site  Age
0       Google   10
1  tutorialpro   12
2         Wiki   13
From the above output, it can be seen that a DataFrame is a table with rows and columns.
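The same construction with actual NumPy arrays and an explicit index might look like the sketch below. The index labels "a", "b", "c" and the name mismatch_error are illustrative assumptions; the second part shows that arrays of different lengths are rejected.

```python
import numpy as np
import pandas as pd

sites = np.array(["Google", "tutorialpro", "Wiki"])
ages = np.array([10, 12, 13])

# The index has one label per array element
df = pd.DataFrame({"Site": sites, "Age": ages}, index=["a", "b", "c"])
print(df)

# Arrays of different lengths cannot be combined into one DataFrame
try:
    pd.DataFrame({"Site": sites, "Age": ages[:2]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```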
You can also create a DataFrame from a list of dictionaries (key/value pairs), where each dictionary is one row and the dictionary keys are the column names:
Example - Creating from a List of Dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Output:
   a   b     c
0  1   2   NaN
1  5  10  20.0
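Keys that are missing from a dictionary are filled with NaN. Here the first dictionary has no 'c' key, so row 0 holds NaN in column c.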
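As a rough illustration of how the missing entry can be inspected, the sketch below reuses the same data and calls isna() and dtypes. NaN is a floating-point value, which is why column c is stored as float and prints 20.0 rather than 20.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# Boolean mask: True where a value is missing
missing = df.isna()
print(missing)

# Columns a and b stay integer; c becomes float because it holds NaN
print(df.dtypes)
```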
Pandas can return specific rows using the loc attribute. If no index is set, the first row has index 0, the second has index 1, and so on:
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load data into a DataFrame object
df = pd.DataFrame(data)
# Return the first row
print(df.loc[0])
# Return the second row
print(df.loc[1])
Output:
calories    420
duration     50
Name: 0, dtype: int64
calories    380
duration     40
Name: 1, dtype: int64
Note: The returned result is a Pandas Series.
Multiple rows can also be returned using the [[ ... ]] format, where ... are the row indexes separated by commas:
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load data into a DataFrame object
df = pd.DataFrame(data)
# Return the first and second rows
print(df.loc[[0, 1]])
Output:
   calories  duration
0       420        50
1       380        40
Note: The returned result is a Pandas DataFrame.
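The difference between the two return types can be checked directly. The sketch below uses the same data; the names single and wrapped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

single = df.loc[0]      # one label: the row comes back as a Series
wrapped = df.loc[[0]]   # one-element list: the row comes back as a DataFrame

print(type(single).__name__)
print(type(wrapped).__name__)
print(wrapped.shape)
```

Wrapping a single label in a list is a simple way to keep the result two-dimensional when later code expects a DataFrame.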
We can also specify index values, as shown in the following example:
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load data into a DataFrame object
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
# Return the row with index "day1"
print(df.loc["day1"])
Output:
calories    420
duration     50
Name: day1, dtype: int64
Printing the whole DataFrame shows the custom row labels:
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
Output:
      calories  duration
day1       420        50
day2       380        40
day3       390        45
Pandas can use the loc attribute to return the row that corresponds to a specified index label:
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
# Specify index
print(df.loc["day2"])
Output:
calories    380
duration     40
Name: day2, dtype: int64
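With a labeled index, loc also accepts a list of labels or a label slice. The sketch below is an illustration using the same data; the label "day9" is a deliberately nonexistent label chosen to show what happens on a failed lookup.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns those rows as a DataFrame
subset = df.loc[["day1", "day3"]]
print(subset)

# A label slice includes both endpoints, unlike ordinary Python slicing
span = df.loc["day1":"day2"]
print(span)

# A label that is not in the index raises KeyError
try:
    df.loc["day9"]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc
print(type(lookup_error).__name__)
```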