Machine learning poses plenty of challenges, from data cleaning and labeling to feature extraction, hyperparameter tuning, and testing.
Here are 10 tools, inspired by this Twitter thread, that make your job as a data scientist easier. We cover 5 commercial products and 5 open-source (GitHub) packages.
1. Obviously AI for no-code AutoML
No-code AI tools like Obviously.AI make it possible to deploy AI in minutes instead of months, allowing data scientists to rapidly build models and experiment, instead of getting stuck in the weeds.
Deloitte's research found that traditional machine learning projects cost anywhere from US$250,000 to a whopping US$20 million.
By drastically decreasing both time-to-value and costs, no-code AutoML is a no-brainer for this list.
2. Labelbox for data labeling
AI is a data-hungry beast. Labelbox is a training data platform for fast labeling and data management, so you can feed your models labeled data. It is especially useful for tasks like image classification, object detection, and segmentation.
Considering the famous statistic that data scientists spend 80% of their time on data wrangling, including data labeling, products like these can be powerful.
Many data scientists spend months on data labeling before ever getting into modeling, so try this out if that rings a bell!
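The "wrangling" in that 80% is concrete, repetitive work. Here is a minimal pandas sketch of what it typically looks like, using a toy dataset with made-up columns:

```python
import pandas as pd
import numpy as np

# Toy dataset with the usual real-world problems: duplicates,
# missing values, and inconsistent text formatting.
df = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Carol"],
    "age": [34, 34, np.nan, 28],
    "label": ["churn", "churn", "stay", None],
})

df["name"] = df["name"].str.strip().str.lower()   # normalize text
df = df.drop_duplicates(subset=["name"])          # drop duplicate records
df["age"] = df["age"].fillna(df["age"].median())  # impute missing ages
df = df.dropna(subset=["label"])                  # unlabeled rows can't train a model

print(df)
```

Multiply this by dozens of columns and millions of rows and the 80% statistic starts to feel conservative.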
3. Apify for data collection
In the real world, AI-ready data is rarely sitting in a clean CSV somewhere. Often we have to go out and get it ourselves, and a powerful way to do that is web scraping.
Apify lets you scrape data from any website, including Facebook, Instagram, Google Search results, and more, to build better models.
If you've ever tried to scrape data manually, you'll know what a headache that can be.
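For contrast, here is what even a trivial hand-rolled scraper looks like using only the Python standard library (parsing a small inline HTML snippet rather than a live site). Services like Apify abstract all of this away, plus the hard parts: pagination, retries, and anti-bot handling.

```python
from html.parser import HTMLParser

# A tiny HTML snippet standing in for a fetched page.
PAGE = """
<ul>
  <li class="product">Widget A</li>
  <li class="product">Widget B</li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> element."""
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())
            self.in_product = False

parser = ProductParser()
parser.feed(PAGE)
print(parser.products)  # ['Widget A', 'Widget B']
```

And this is the easy case: no JavaScript rendering, no rate limits, no layout changes breaking your parser overnight.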
4. Trifacta for data preparation
Another common hurdle on the journey to AI modeling is low data quality.
Trifacta is an end-to-end data preparation solution, including data quality, data transformation, and data pipeline features.
Trifacta boasts over 10,000 customers, so you’ll be in good company.
5. Data Ladder for data quality management
Data Ladder is an incredibly feature-rich data quality management tool, covering data matching, preparation, cleansing, profiling, deduplication, enrichment, and standardization.
To give you a better idea of the richness of features offered, the “data cleansing” solutions include address data cleansing, CRM data cleansing, database cleansing, data migration cleansing, list cleaning, and several more niche solutions.
If you have data cleaning needs, Data Ladder can probably help.
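To see why fuzzy data matching is harder than it sounds, here is a minimal sketch using the standard library's difflib. This is an illustration only, not Data Ladder's actual matching algorithm:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Fuzzy match score between two normalized strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

records = ["Acme Corp.", "ACME Corporation", "Globex Inc", "acme corp"]

# Greedy dedup: keep a record only if it isn't a near-match of one we kept.
deduped = []
for rec in records:
    if all(similarity(rec, kept) < 0.6 for kept in deduped):
        deduped.append(rec)

print(deduped)
```

Notice the threshold: set it too high and "ACME Corporation" survives as a duplicate of "Acme Corp."; set it too low and distinct companies get merged. Production-grade tools tune this per field type (names, addresses, phone numbers), which is exactly the niche Data Ladder fills.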
6. Determined for deep learning model training
Determined is a deep learning model training platform on GitHub that lets you train models faster, run advanced hyperparameter tuning, schedule GPUs intelligently, and track and reproduce your work.
For advanced AI practitioners, it's an easy pick.
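To ground what "hyperparameter tuning" means here, this is a hand-rolled random search over a toy objective. It's a sketch of the concept, not Determined's API, and the objective function is a stand-in for actually training a model:

```python
import random

def objective(lr: float, batch_size: int) -> float:
    """Toy stand-in for validation loss; lower is better.
    A real trial would train a model with these hyperparameters."""
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 10_000

random.seed(0)
best = None
for _ in range(50):
    trial = {
        "lr": 10 ** random.uniform(-4, -1),           # log-uniform learning rate
        "batch_size": random.choice([16, 32, 64, 128]),
    }
    loss = objective(trial["lr"], trial["batch_size"])
    if best is None or loss < best[0]:
        best = (loss, trial)

print(best)
```

Platforms like Determined replace this naive loop with smarter search strategies and, crucially, parallelize the trials across a GPU cluster with early stopping of unpromising runs.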
7. speedrun for experiment management
Speedrun is a lesser-known but powerful tool for experiment management.
It keeps your code organized by handling low-level details like reading configuration files and managing experiment directories.
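A hand-rolled version of those low-level details might look like the following. These are hypothetical helpers, not speedrun's actual API, shown only to make the chore concrete:

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def start_experiment(base_dir: str, config: dict) -> Path:
    """Create a timestamped run directory and snapshot the config there,
    so every run is self-documenting and reproducible."""
    run_dir = Path(base_dir) / datetime.now().strftime("run_%Y%m%d_%H%M%S_%f")
    run_dir.mkdir(parents=True)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    return run_dir

base = tempfile.mkdtemp()  # stand-in for your experiments folder
run_dir = start_experiment(base, {"lr": 0.001, "epochs": 10})
loaded = json.loads((run_dir / "config.json").read_text())
print(loaded)
```

Every project ends up reinventing this boilerplate; speedrun's pitch is that you shouldn't have to.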
8. Joblib for quickly writing code and experiments
Joblib is a set of tools providing lightweight pipelining in Python, including transparent disk caching of functions and simple parallel computing.
Their stated vision is for users to “easily achieve better performance and reproducibility.”
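Both headline features fit in a few lines. Here is a small sketch using joblib's `Memory` cache and `Parallel`/`delayed` helpers (with a trivial toy function standing in for expensive work):

```python
import math
import tempfile
from joblib import Memory, Parallel, delayed

# Transparent disk caching: results are memoized to disk, keyed on arguments.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def expensive(x):
    return math.sqrt(x)

print(expensive(16.0))   # computed on the first call
print(expensive(16.0))   # loaded from the on-disk cache

# Simple parallelism: fan independent calls out across worker processes.
results = Parallel(n_jobs=2)(delayed(math.sqrt)(i ** 2) for i in range(5))
print(results)
```

Because the cache lives on disk, it survives restarts, which is exactly what you want when a feature-extraction step takes an hour.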
9. LabML for monitoring model training on your phone
Modeling is often a time-intensive process, and you probably don’t look forward to sitting in front of your computer all day to monitor training.
LabML lets you monitor PyTorch and TensorFlow model training on your phone.
10. Cortex for deploying to production
Cortex is a platform for deploying, managing, and scaling ML in production. It supports TensorFlow, PyTorch, scikit-learn, and more, offering high availability, scalability, and traffic splitting for A/B testing.
AI can be a struggle to get running from start to finish, but it doesn’t have to be overwhelming. With these tools, you can work smarter, not harder, and deploy models in record time.