Easy Tutorial

Pandas Data Structure - DataFrame

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both row and column indexes, and it can be thought of as a dictionary of Series (sharing the same index).
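The "dictionary of Series" view can be made concrete with a short sketch. The column names and values are only illustrative, chosen to match the examples later on this page:

```python
import pandas as pd

# Two Series that share the same default index (0, 1, 2)
site = pd.Series(['Google', 'tutorialpro', 'Wiki'])
age = pd.Series([10, 12, 13])

# Each dictionary key becomes a column name
df = pd.DataFrame({'Site': site, 'Age': age})

# Selecting one column gives back a Series
age_column = df['Age']

print(df)
print(type(age_column).__name__)
```

Each column can hold its own value type: here Site holds strings and Age holds integers.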

The constructor for a DataFrame is as follows:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

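Parameter descriptions:

data: the values to store (for example an ndarray, list, dict, Series, or another DataFrame).
index: the row labels. Defaults to 0, 1, 2, ... when not given.
columns: the column labels. Defaults to 0, 1, 2, ... when the data provides no names.
dtype: a single data type to apply to all columns. By default it is inferred per column.
copy: whether to copy the input data.

A Pandas DataFrame is a two-dimensional, table-like structure, similar to a spreadsheet or an SQL table.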
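As a rough illustration of the constructor arguments, the sketch below passes data, index, columns, and dtype by keyword. The labels and numbers are made up for the example:

```python
import pandas as pd

# data: the values; index: row labels; columns: column labels
df = pd.DataFrame(
    data=[[420, 50], [380, 40]],
    index=['day1', 'day2'],
    columns=['calories', 'duration'],
    dtype='float64',  # every column here is numeric, so the cast succeeds
)

print(df)
print(df.dtypes)
```

Because dtype applies to every column, it is best passed only when all columns can hold that type.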

Example - Creating from Lists

import pandas as pd

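data = [['Google', 10], ['tutorialpro', 12], ['Wiki', 13]]

# dtype=float in the constructor applies to every column and fails on the
# strings in 'Site' in recent pandas versions, so cast only 'Age' instead.
df = pd.DataFrame(data, columns=['Site', 'Age']).astype({'Age': float})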

print(df)

Output:

          Site   Age
0       Google  10.0
1  tutorialpro  12.0
2         Wiki  13.0

The following example creates a DataFrame from a dictionary whose values are lists or ndarrays. All of the lists or ndarrays must have the same length. If an index is passed, its length must equal that of the arrays. If no index is passed, the default index is range(n), where n is the length of the arrays.

For ndarrays, refer to: NumPy Ndarray Object

Example - Creating from a Dictionary of Lists or ndarrays

import pandas as pd

data = {'Site':['Google', 'tutorialpro', 'Wiki'], 'Age':[10, 12, 13]}

df = pd.DataFrame(data)

print(df)

Output:

          Site  Age
0       Google   10
1  tutorialpro   12
2         Wiki   13

From the above output, it can be seen that a DataFrame is a table with rows and columns.
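The example above uses plain lists. A minimal sketch of the same idea with actual NumPy arrays, with and without an explicit index, might look like this (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Two ndarrays of equal length (3)
calories = np.array([420, 380, 390])
duration = np.array([50, 40, 45])

# No index passed: the default index is 0, 1, 2
df_default = pd.DataFrame({'calories': calories, 'duration': duration})

# Index passed: its length (3) matches the arrays
df_labeled = pd.DataFrame(
    {'calories': calories, 'duration': duration},
    index=['day1', 'day2', 'day3'],
)

print(df_default.index.tolist())  # [0, 1, 2]
print(df_labeled.index.tolist())  # ['day1', 'day2', 'day3']
```

Passing arrays of different lengths raises a ValueError.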

You can also create a DataFrame from a list of dictionaries. Each dictionary becomes one row, and the dictionary keys become the column names:

Example - Creating from a List of Dictionaries

import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

print(df)

Output:

   a   b     c
0  1   2   NaN
1  5  10  20.0

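Keys that are missing from a dictionary are filled with NaN. Here the first dictionary has no 'c' key, so row 0 of column c is NaN and the column is stored as floating point.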
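To check where the gaps are, one option is to look at the column types and at isna(). This is a small sketch that reuses the data from the example above:

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# Column 'c' holds a NaN, so pandas stores it as float64;
# 'a' and 'b' have no gaps and stay int64
print(df.dtypes)

# isna() marks the missing cell with True
print(df['c'].isna().tolist())  # [True, False]
```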

Pandas can return specific rows using the loc attribute, which selects rows by index label. If no index is set, the labels are the default integers: the first row is 0, the second is 1, and so on:

Example

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object
df = pd.DataFrame(data)

# Return the first row
print(df.loc[0])
# Return the second row
print(df.loc[1])

Output:

calories    420
duration     50
Name: 0, dtype: int64
calories    380
duration     40
Name: 1, dtype: int64

Note: The returned result is a Pandas Series.

To return several rows, pass a list of row labels to loc, written as loc[[ ... ]] with the labels separated by commas:

Example

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object
df = pd.DataFrame(data)

# Return the first and second rows
print(df.loc[[0, 1]])

Output:

   calories  duration
0       420        50
1       380        40

Note: The returned result is a Pandas DataFrame.
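The difference between the two return types can be checked directly. In this sketch, a single label gives a Series, while a list gives a DataFrame even when it contains only one label:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]     # single label -> Series
row_list = df.loc[[0]]  # list with one label -> DataFrame with one row

print(type(one_row).__name__)   # Series
print(type(row_list).__name__)  # DataFrame
print(row_list.shape)           # (1, 2)
```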

We can also specify index values, as shown in the following example:

Example

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# Return the row with index "day1"
print(df.loc["day1"])

Output:

calories    420
duration     50
Name: day1, dtype: int64
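To see the custom index applied to the whole table, print the DataFrame itself:

Example

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}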

df = pd.DataFrame(data, index=["day1", "day2", "day3"])

print(df)

Output:

       calories  duration
day1        420        50
day2        380        40
day3        390        45

Pandas can use the loc attribute to return a specific row corresponding to the specified index:

Example

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Specify index
print(df.loc["day2"])

Output:

calories    380
duration     40
Name: day2, dtype: int64
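Since loc works with labels, a custom index changes what can be looked up. The following sketch selects two rows by label and shows that the integer 0 is no longer a valid label for this DataFrame:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns those rows as a DataFrame
subset = df.loc[["day1", "day3"]]
print(subset)

# loc matches labels, so the integer 0 is not found in this index
try:
    df.loc[0]
    found = True
except KeyError:
    found = False
print(found)  # False
```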