Skip to content

This EDA introduces two datasets with hotel booking data from a resort hotel (H1) and a city hotel (H2), including 31 variables across 40,060 and 79,330 bookings, respectively. Covering bookings from July 2015 to August 2017, the datasets include both completed and canceled reservations, with all identifying information removed. These datasets

Notifications You must be signed in to change notification settings

ariosramirez/Hotel-booking-demand-challenge

Repository files navigation

Hotel booking demand challenge

To run docker

docker build . -t jupyter 
docker run -it  -p 8888:8888 -v $(pwd):/app jupyter

Open Jupyter on browser and Enjoy Jupyter with widgets ᕙ(`▿´)ᕗ


Challenge description

Let’s use hotel bookings data from Antonio, Almeida, and Nunes (2019) to predict which hotel stays included children and/or babies, based on the other characteristics of the stays such as which hotel the guests stay at, how much they pay, etc.

One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 23 variables describing the 19248 observations of H1 and 30752 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or custumer identification were deleted.

Take-Home Goals

Part 1

During Part I, you should perform an Exploratory Data Analysis highlighting key findings:

  • Data Quality Check: You must check the quality of the given dataset to make an assessment of how appropriate it is for later Data Science tasks. Propose a set of corrective actions over the data if any.
  • Report insights and conclusions: Describe the results obtained during the exploratory analysis and provide conclusions from a business perspective, supported by plots / tables / metrics.
  • Expected:
    • Make at least 10 plots with any ploting library (plotly, matplotlib, seaborn, etc.)
    • Write down the conclusions, in a clear manner, of every plot in this notebook

Part 2

In Part II you should define and train a model to predict which actual hotel stays included children/babies, and which did not:

  • Feature extraction: Indicate some possible candidates of features that could properly describe the hotels, either from the given columns or from their transformations.

    • Expected:
      • Create one scikit-learn pipeline inside a file called pipelines.py
      • Create at least three scikit-learn transformers inside a file called transformers.py and use them inside the pipeline from previous step. This transformers should add new features or clean the original dataframe of this take-home
        • Feature example: Compute "total_nights" feature. This is the sum of stays_in_week_nights + stays_in_weekend_nights
        • Cleaning example: Transform string values. '0' to int type 0
      • Import pipeline and run the transformations inside this notebook
  • Machine Learning modeling: Fit models with the given data. Pay attention to the entire process to avoid missing any crucial step. You could use the children column as target.

    • Expected:
      • Use the dataset with the new features generated to train at least three different machine learning models and generate metrics about their performance.

Part 3

Finally, on Part III you should present the key findings, conclusions and results to non-technical stakeholders.

  • Expected:
    • Create a summary of all the findings in part 1
    • Create an explanation of the features added in part 2
    • Create a summary of the model metrics
    • These explanations should be at high level and understood by a non-technical person
    • You can add all the summaries and explanations at the end of this notebook, it can be done in markdown format or any other external resource like a ppt presentation, pdf document, etc. Whatever works best for you!

Requirements

  • Python 3.x & Pandas 1.x
  • Paying attention to the details and narrative is far way more important than extensive development.
  • Once you complete the assessment, please send a ZIP file of the folder with all the resources used in this work (e.g. Jupyter notebook, Python scripts, text files, images, etc) or share the repository link.
  • Virtualenv, requirements or Conda environment for isolation.
  • Have a final meeting with the team to discuss the work done in this notebook and answer the questions that could arise.
  • Finally, but most important: Have fun!

Nice to have aspects

  • Code versioning with Git (you are free to publish it on your own Github/Bitbucket account!).
  • Show proficiency in Python: By showing good practices in the structure and documentation, usage of several programming paradigms (e.g. imperative, OOP, functional), etc.
  • Shap Model explanability: explain feature importance with the use of shapley values

Key Insights, Findings, and Conclusions

Insights

  • Insight: Families with children tend to book certain room types and have higher special requests, which indicates a preference for more spacious and accommodating environments. This suggests that hotel management should consider these preferences in their service offerings to better cater to families.
  • Insight: There is a consistent pattern of room type changes for families with children, highlighting the need for hotels to potentially revise their room assignment strategies to better meet the needs of these bookings.

Findings

  • Finding: Across all models, the most influential features in predicting whether a booking includes children were total_guests, total_nights, and average_daily_rate. This emphasizes the importance of these features in understanding family bookings.
  • Finding: Feature importance analysis from the RandomForest model highlighted that the most significant predictors for bookings with children are total_guests, total_nights, and average_daily_rate.

Conclusions

  • Conclusion: The XGBClassifier model emerged as the best performing model, slightly outperforming others in terms of AUC-ROC score. This makes it a suitable choice for predicting bookings with children, providing a reliable tool for hotel management to anticipate and cater to the needs of families.
  • Conclusion: Families with children tend to book certain room types and have higher special requests, which are important factors for hotel management to consider in their service offerings.

Future Work

  • Future Work: Further improvement can be achieved by exploring additional features, tuning model hyperparameters, and possibly integrating external data sources to enhance predictive performance. Future research could also investigate other factors influencing family bookings, such as seasonal trends or promotional impacts, to provide even deeper insights.

About

This EDA introduces two datasets with hotel booking data from a resort hotel (H1) and a city hotel (H2), including 31 variables across 40,060 and 79,330 bookings, respectively. Covering bookings from July 2015 to August 2017, the datasets include both completed and canceled reservations, with all identifying information removed. These datasets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages