Real Estate Market Analysis: Uncovering Price Drivers in Mexico and Brazil
An analysis of the real estate markets in Mexico and Brazil to identify the key factors that influence property prices.
This project analyzes an e-commerce dataset to understand and predict product prices in the real estate markets of Mexico and Brazil. The process involves several key stages:
- Data Loading & Cleaning: The dataset is loaded, and missing values in critical fields like `category_name` and `brand_name` are handled. The `category_name` is broken down into three sub-categories for a more granular analysis.
- Exploratory Data Analysis (EDA): The analysis reveals that the `price` variable is highly skewed. To normalize it, a log transformation (`log(price+1)`) is applied. The notebook then examines the distributions of various features, including item condition, shipping status, top brands, and categories.
- Feature Engineering: To prepare the data for modeling, categorical and text-based features are converted into a numerical format:
  - Text: `CountVectorizer` is used for the item `name`, and `TfidfVectorizer` is used for the `item_description`.
  - Categorical: `LabelBinarizer` (similar to one-hot encoding) is applied to the `brand_name` and the newly created sub-categories.
- Modeling: All engineered features are combined into a single sparse matrix. A `Ridge` linear regression model is trained on this data to predict the log-transformed price.
- Evaluation: The model’s performance is measured using the Root Mean Squared Logarithmic Error (RMSLE), a standard metric for this type of regression problem.
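The cleaning, target-transform, and evaluation steps can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the `/` separator, the `"missing"` placeholder, and the helper names `split_category` and `rmsle` are assumptions introduced here.

```python
import numpy as np

def split_category(category_name, sep="/", missing="missing"):
    """Split a category path into exactly three sub-categories.

    Assumption: categories are stored as one string with a separator
    (e.g. "Men/Tops/T-shirts"); the source does not state the format.
    """
    if not isinstance(category_name, str) or not category_name:
        return (missing, missing, missing)
    parts = category_name.split(sep, 2)        # at most three pieces
    parts += [missing] * (3 - len(parts))      # pad short paths
    return tuple(parts)

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error on raw (untransformed) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up prices, for illustration only.
prices = np.array([10.0, 25.0, 300.0])
log_prices = np.log1p(prices)      # log(price + 1), the transformed target
recovered = np.expm1(log_prices)   # inverse transform back to prices
```

Because the model is trained on `log(price+1)`, the ordinary RMSE computed in log space is the same quantity as the RMSLE computed on the original prices.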
In summary, this notebook demonstrates how to build a machine learning model to predict product prices based on features like their name, description, brand, and category.
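A compact sketch of that pipeline, using the scikit-learn classes named above, might look as follows. The toy rows, the variable names, the single sub-category column, and `alpha=1.0` are assumptions made for illustration; the notebook's actual data and hyperparameters are not given in this summary.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelBinarizer

# Made-up toy rows, for illustration only (not from the project's dataset).
names = ["red cotton shirt", "leather wallet", "running shoes",
         "blue denim jeans", "wireless headphones", "canvas backpack"]
descriptions = ["lightly worn red shirt", "brand new brown wallet",
                "used shoes in good shape", "new jeans with tags",
                "noise cancelling headphones", "large backpack for travel"]
brands = ["acme", "missing", "zoom", "acme", "zoom", "missing"]
subcat_1 = ["men", "accessories", "sports", "men", "electronics", "accessories"]
prices = np.array([12.0, 30.0, 45.0, 25.0, 80.0, 35.0])

# Text features: raw counts for the name, TF-IDF for the description.
X_name = CountVectorizer().fit_transform(names)
X_desc = TfidfVectorizer().fit_transform(descriptions)

# Categorical features: one indicator column per class, kept sparse.
X_brand = LabelBinarizer(sparse_output=True).fit_transform(brands)
X_cat = LabelBinarizer(sparse_output=True).fit_transform(subcat_1)

# Combine everything into a single sparse matrix.
X = sparse.hstack([X_name, X_desc, X_brand, X_cat]).tocsr()

# Train Ridge on the log-transformed price.
y_log = np.log1p(prices)
model = Ridge(alpha=1.0)  # alpha is an assumed value
model.fit(X, y_log)

# RMSE in log space equals RMSLE on the original prices.
# Scored on the training rows here only to keep the sketch short.
pred_log = model.predict(X)
rmsle = float(np.sqrt(mean_squared_error(y_log, pred_log)))
pred_prices = np.expm1(pred_log)  # back to the original price scale
```

In practice the vectorizers and binarizers would be fitted on a training split and reused with `transform` on held-out data, so that the reported RMSLE reflects unseen items.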