Projects incl. Enigma, Predicting Consumer Credit Default, Uber Driver Optimization System, Yelp Nearby.
Tue, Sep 27, 2016 @ 06:00 PM   FREE   NYC Data Science Academy, 500 8th Ave, Ste 905
 
EVENT DETAILS

Co-Hosted by NYC Open Data meetup and NYC Data Science Academy


During this event, you will see some of the best capstone projects created by students of NYC Data Science Academy's 12-week Data Science boot camp. One of the projects is ranked #1 on Kaggle.

Presented by NYC Data Science Academy students who finish the 12-week full-time program on Sept 23rd. Join us on Sept 23rd for a full-day career day to hire or learn more about their work! Employers, potential students, and members of our community are all very welcome!

If you are hiring, you will be able to learn about a set of high-quality, fully reproducible machine learning and big data projects and chat with their creators! Please talk with the presenters, since they are all looking for Data Scientist roles now.

If you are interested in our program, you will also have an opportunity to meet our boot camp students and find out more about what it is like to be a student at NYC Data Science Academy and gain an overview of the program.

++++++++++++++++++++++++


Event schedule:


6:00 pm - 6:30 pm  Check in, mingle, enjoy food & drinks

6:30 pm - 8:00 pm  Presentations from 7 groups

15 min each on projects 1, 2, 3

8 min each on projects 4, 5, 6, 7

8:00 pm - 9:00 pm  Network and meet our students


====================

Project 1: Predicting Consumer Credit Default - the Kaggle Challenge

#1 ranked solution on Kaggle




Team Description and Members:

Team Eigenauts is a dynamic team of data scientists who believe in true collaboration and execution excellence, ready to take on whatever challenges come their way. The team is composed of an eclectic mix of a seasoned hands-on executive, a management consultant, a creative problem solver, and a theory-oriented problem solver. They do what it takes to excel in their chosen fields and are relentless in their pursuit of innovative projects that exemplify their passion for working with data.

1. Bernard Ong --- Ambitious executive and data scientist with a track record of driving innovative technology projects and programs to successful implementation. Blends machine learning skills with deep domain knowledge to drive strategy and execution excellence. Background includes managing multimillion-dollar portfolios, turnaround initiatives, and operational development. Champion of process improvements and business solutions, as well as release and delivery management. Skilled in machine learning models and algorithms, predictive analytics, and R and Python development.

2. Emma (Jielei) Zhu --- Just graduated from New York University with a B.A. in Psychology and Computer Science. Emma was able to explore her interests to the fullest by taking classes in Statistics, Psychology, Neuroscience, Algorithms, Machine Learning, Philosophy, and Game Design. She is a savvy programmer with proficiency in Python, MATLAB, R, Java, and C. Using these languages, she has done projects on human perception (publication pending), data visualisations, sentiment analysis, and star rating predictions.

3. Nanda Rajarathinam --- A Senior Data & Analytics professional with 11+ years of experience in the design, development, support, and testing of enterprise applications in Data Warehousing & Business Intelligence engagements. Nanda possesses expertise in data modeling and application development, covering technologies such as Hadoop, MapReduce, HDFS, Hive, Pig, Oozie, Sqoop, Python, and R. Nanda is highly experienced in providing technical and project leadership. He is passionate about supervised and unsupervised machine learning algorithms.

4. Trinity (Miaozhi) Yu --- Miaozhi recently received her Master's degree in Mathematics from New York University. Before that she received a Bachelor's degree in both Mathematics and Statistics with a minor in Physics from UIUC. Her research interests lie in random graphs and mathematical simulation of neuronal networks. During her time at NYCDSA, she built a Shiny app showing NYC condo sales history, which aims to help users find their ideal home and gives information about price trends in each borough of NYC. For her final project, she and her team competed in the Kaggle machine learning competition by utilizing voting/stacking classifiers and a Bayes optimizer on models including GBM, XGBoost, and Random Forest, and hit #1 out of 925 teams.

Project Description:

The team participated in an extremely challenging closed Kaggle competition to predict consumer credit default. Banks continue to look for the best credit scoring algorithm, and one of its tenets is predicting the probability of an individual defaulting on their loan or undergoing financial distress. With this information, banks can make better decisions, and borrowers can do better financial planning to mitigate possible default in the future. This challenge allowed the team to employ the best of machine learning, using ensemble models and algorithms (like XGBoost, Gradient Boosting, Random Forest, Restricted Boltzmann Machine neural networks, and AdaBoost) together with advanced stacking techniques and voting classifiers to accurately predict the probability of default, focusing on highly tuned models. The focus was on delivering quality, not quantity, of models used. Teams were measured and ranked strictly by ROC AUC score across the 925 teams that participated. The team focused not just on the technology but, just as critically, ensured that the right process discipline, teamwork, and leadership were present at all levels. The team used a very disciplined Agile process designed for machine learning to ensure that they executed a multitude of tasks in parallel and in small chunks. They exercised workflow strategies and tactics to fail fast and iterate fast to maximize productivity and results. Using sophisticated Bayesian optimizers, the team was able to garner the best hyperparameter settings, cutting down on manual testing and cross-validation time and supercharging the team's steady climb up the rankings.
The team ultimately achieved the highest AUC score possible, attaining top rankings in the Kaggle challenge through six highly tuned models. With the suite of tools utilized, the synergies and teamwork, and a process that boosted productivity, the team not only reached the top tiers of the scoreboard but smashed through to garner the #1 ranking in the challenge.
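The approach described above can be sketched in miniature. This is an illustrative soft-voting ensemble scored by ROC AUC on synthetic data, using three scikit-learn models as stand-ins; the team's actual models, features, stacking layers, and Bayesian tuning are not reproduced here.

```python
# Hedged sketch of a soft-voting ensemble scored by ROC AUC.
# Models and data are illustrative, not the team's actual setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier,
                              AdaBoostClassifier,
                              VotingClassifier)
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit-default training data
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Soft voting averages each model's predicted default probability,
# which is what a ROC-AUC-scored competition rewards.
ensemble = VotingClassifier(
    estimators=[("gbm", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0))],
    voting="soft")
ensemble.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Soft voting is the simplest member of the stacking/voting family the team used; a full stack would feed these probabilities into a second-level model.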

Capstone Project Blog Link:

http://blog.nycdatascience.com/student-works/kaggle-predict-consumer-credit-default/

-----------

Project 2: How to determine the keys to getting high-star reviews for a restaurant on Yelp

Contributed by Jiaxu Luo, Charles Leung, Danli Zeng, and Samriddhi Shakya.

Blog link:

http://blog.nycdatascience.com/student-works/yelp-dataset-challenge/?preview_id=15161&preview_nonce=f945eccb7b&post_format=standard&_thumbnail_id=-1&preview=true

Project Summary

Our project determines the keys to getting high-star reviews for a restaurant on Yelp by applying predictive machine learning approaches to identify the common attributes of restaurants that have high stars. To help businesses understand more about their customers, we also extracted a wealth of information with natural language processing to get a glimpse of the topics customers favored and complained about. We made a Shiny app for business owners to gain insight from visualization so as to improve business operations and customer analysis.

Restaurant ratings on Yelp are often viewed as a reputation metric for local food service businesses and are contributed by the impressions and opinions customers form after using a product or service. Based on research conducted by Cornell University, in its early stages a business is at the mercy of the subjectivity of its customers. More astonishingly, 27% of new Yelp restaurants failed within the first year, and nearly 60% closed by year three. For start-up business owners, determining the data attributes associated with high-star reviews could serve as a proxy for the factors of restaurant success in general. Based on the data provided by Yelp, we built a Shiny app to help business owners or potential business owners in the Greater Phoenix, AZ metropolitan area identify some of the most important factors that really affect the stars of a business on Yelp.

These important factors could be either inherent attributes of the business, like opening hours or noise level, or subjective factors induced by the customers. The former can be uncovered using a predictive model for the purpose of inference, while the latter can be mined from the review texts, which hold a wealth of information and are more informative.

The data used in this project is part of the Yelp Dataset Challenge, which includes business information, reviews, tips (shorter reviews), user information and check-ins. Business objects list name, location, opening hours, category, average star rating, the number of reviews about the business and a series of attributes like noise level or reservations policy. Review objects list a star rating, the review text, the review date, and the number of votes that the review has received.

We filtered the businesses by category to keep only those (9,427) in the restaurant category and the reviews (622,446) related to those businesses. The texts from those restaurant reviews form the basic corpus of this project.
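The category filter described above amounts to a join between business records and their reviews. A toy sketch on invented records shaped like the dataset's business and review objects (the field names here are assumptions based on the description, not the exact Yelp schema):

```python
# Toy records shaped like Yelp business and review objects
# (field names are assumptions, not the official schema).
businesses = [
    {"business_id": "b1", "name": "Taco Spot", "categories": ["Restaurants", "Mexican"]},
    {"business_id": "b2", "name": "Quick Lube", "categories": ["Automotive"]},
]
reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great tacos!"},
    {"business_id": "b2", "stars": 2, "text": "Slow service."},
]

# Keep only restaurant businesses, then the reviews tied to them;
# those review texts form the project's corpus.
restaurant_ids = {b["business_id"] for b in businesses
                  if "Restaurants" in b["categories"]}
corpus = [r["text"] for r in reviews if r["business_id"] in restaurant_ids]
print(corpus)  # → ['Great tacos!']
```

At the project's scale the same filter runs over 9,427 businesses and 622,446 reviews rather than two of each.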

-----------------
Project 3: Predicting New York City Taxi Demand

Given by Shuo Zhang, Bin Fang, Jingyu Zhang, and Yunrou Gong.

Description: In this capstone project, we present a few different models to predict the number of taxi pickups that will occur at a specific time and location in New York City; these predictions could inform taxi dispatchers (e.g., Uber) and drivers on where to position their taxis. We implemented and evaluated four different regression models: multiple linear regression, ridge regression, random forest, and xgboost. Performing Bayesian optimization, we were able to achieve positive results. Furthermore, we built and evaluated an ensemble model combining two strong models: random forest and xgboost. Our best-performing model, xgboost, achieved a root-mean-square error (RMSE) of 35.01 and a coefficient of determination (R^2) of 0.98, a significant improvement over weaker models such as multiple linear regression, at an RMSE of 125.56 and an R^2 of 0.75. Our prediction for the incoming week is visualized, summarized, and analyzed in a Shiny interactive application, which lets users compare the number of pickups from three different models (random forest, xgboost, ensemble) across different locations in a given time window, and also visualize the trend in the number of pickups over a 24-hour cycle across different locations within New York City.
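The two metrics quoted above, and the two-model ensemble, can be shown on toy numbers. The pickup counts and predictions below are invented, and the ensemble is a plain average of the two models' outputs (the team's actual combination rule is not stated):

```python
# Sketch of RMSE, R^2, and a two-model averaging ensemble.
# All numbers are illustrative, not the project's data.
import math

def rmse(y_true, y_pred):
    # Root-mean-square error: typical size of a prediction miss
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    # Coefficient of determination: share of variance explained
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual   = [120.0, 340.0, 80.0, 210.0]   # pickups per time slot (toy data)
rf_pred  = [110.0, 360.0, 90.0, 200.0]   # hypothetical random forest output
xgb_pred = [125.0, 330.0, 85.0, 215.0]   # hypothetical xgboost output

# Averaging two strong models often reduces variance relative to either alone.
ens_pred = [(a + b) / 2 for a, b in zip(rf_pred, xgb_pred)]
print(round(rmse(actual, ens_pred), 2), round(r_squared(actual, ens_pred), 3))
# → 4.84 0.998
```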

----------

Project 4: Who Wakes Up? Understanding Glasgow Coma Scale Scores and Predicting Recovery

Given by Will Bartlett

I looked at time series patterns of Glasgow Coma Scale (GCS) scores from the Mount Sinai Neurosurgery Department ICU. GCS scores evaluate a patient's responsiveness, usually after traumatic brain injury, and are generally taken at one-hour intervals over the course of a hospital stay. GCS scores are made up of three components: one sub-score for verbal response, one for eye opening, and one for motor response. The project had three goals: evaluate the GCS score as a metric, look for any generalizable patterns relating longitudinal activity to recovery, and determine how far in advance we can predict recovery using only GCS patterns. I found that despite its widespread use in medicine, the GCS score is highly redundant and overly complex; sub-score breakdowns add minimal additional variation to the metric, and there is essentially no independent movement between them. Additionally, the data revealed that most recovery occurs within 25 hours of a patient hitting their minimum score. Certain tree-based models (XGBoost) can predict binary recovery (discharge GCS > 11) with 90% accuracy as early as 12 hours in.
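The two quantities the study leans on, the composite GCS total and the binary recovery label, can be sketched directly. The sub-score ranges are the standard clinical ones and the recovery threshold is the one stated above; the patient record is invented:

```python
# Sketch of the GCS total and the binary recovery label used above.
# Sub-score ranges are standard GCS; the patient record is invented.
def gcs_total(eye, verbal, motor):
    # Standard ranges: eye 1-4, verbal 1-5, motor 1-6, so totals run 3-15
    return eye + verbal + motor

def recovered(discharge_gcs, threshold=11):
    # Binary recovery label as defined in the project: discharge GCS > 11
    return discharge_gcs > threshold

patient = {"eye": 4, "verbal": 4, "motor": 6}  # hypothetical discharge sub-scores
score = gcs_total(**patient)
print(score, recovered(score))  # → 14 True
```

The finding that sub-scores add little information means `gcs_total` carries nearly all the signal the three components do separately.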

---------------

Project 5: Where to go, driver? -- Uber matching system optimization

Given by Christian Holmes and Shuheng (Shawn) Li

Team ubermachine applied Hive and Python Spark to handle over 200 gigabytes of data, in conjunction with Google Maps APIs, weather conditions, traffic conditions, and restaurant information. They built a dynamic system to forecast the demand for for-hire vehicles in New York State to optimize Uber's matching system. Using the Python framework Flask, they turned this project into a recommendation app which helps drivers get more fares and avoid traffic.

------------------

Project 6: Museo: A recommendation system for museum selection

Given by Anne Chen

As someone who loves visiting museums, it takes time to locate the right museums for me. Thus, I built an app that allows users to explore museums based on their preferences, or to filter museums that meet their needs. Data on 1,600 museums were scraped from TripAdvisor, and textual data such as reviews, quotes, and museum descriptions were scored with sentiment analysis. In total, 220 features are used to compute cosine similarity in my recommendation system. Please visit my app here:

http://216.230.228.88:3838/bootcamp006_project/Project5-Capstone/Museo/app/
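The cosine-similarity step at the heart of the recommender above can be shown on tiny made-up feature vectors (the real app compares 220 features per museum; the three-feature vectors and museum names here are illustrative only):

```python
# Sketch of cosine-similarity ranking with invented feature vectors.
import math

def cosine(u, v):
    # Cosine similarity: angle between vectors, ignoring their magnitudes
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

user_pref = [1.0, 0.0, 1.0]  # e.g. likes art and history, not science (toy)
museums = {
    "Art Museum":     [0.9, 0.1, 0.8],
    "Science Center": [0.1, 0.9, 0.2],
}

# Rank museums by similarity to the user's preference vector.
ranked = sorted(museums, key=lambda m: cosine(user_pref, museums[m]), reverse=True)
print(ranked[0])  # → Art Museum
```

In the real app the same ranking runs over all 1,600 scraped museums, with sentiment scores among the 220 features.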

---------------

Project 7: Pump it up, Drill it down: An analysis of Tanzania's Water Projects

Given by Linlin Cheng

There is a water crisis in Tanzania: safe water sources are scarce, and waterborne diseases are prevalent. Thousands of individuals and agencies have stepped in to build water points to help, but how effective are they? This project combines machine learning techniques with data visualization to point out potential causes of malfunctioning projects, identify the likely success of potential projects, and redirect funds to the places where they are in dire need and can be spent most efficiently.

 
 
 
 
© 2024 GarysGuide