A list of hundreds of useful open datasets for data scientists.
- WorldData.AI datasets, the world’s largest external curated data platform, integrates data from all leading global sources.
Data repositories
- Anacode Chinese Web Datastore: a collection of crawled Chinese news and blogs in JSON format.
- AssetMacro, historical data of Macroeconomic Indicators and Market Data.
- Awesome Public Datasets on GitHub, curated by caesar0301.
- AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
- Bitly 1.usa.gov data, anonymized clicks on gov links.
- Canada Open Data, pilot project with many government and geospatial datasets.
- Causality Workbench data repository.
- Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
- Credit Risk Analytics Data: a home equity loans credit data set, mortgage loan level data set, Loss Given Default (LGD) data set and corporate ratings data set.
- Data Source Handbook, A Guide to Public Data, by Pete Warden, O'Reilly (Jan 2011).
- Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
- Data.gov/Education, central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
- DataMarket, visualize the world's economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
- Datamob, public data put to good use.
- Data Planet, the largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases.
- Datasets.co, datasets for data geeks, find and share Machine Learning datasets.
- DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
- DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
- Delve, Data for Evaluating Learning in Valid Experiments.
- EconData, thousands of economic time series, produced by a number of US Government agencies.
- data.world, discover and share cool data, connect with interesting people, and work together to solve problems faster.
- Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
- Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana - the trusted and comprehensive resource for European cultural heritage content.
- FEDSTATS, a comprehensive source of US statistics and more.
- FIMI repository for frequent itemset mining, implementations and datasets.
- Financial Data Finder at OSU, a large catalog of financial data sets.
- GDELT: The Global Data on Events, Location and Tone, described by the Guardian as "a big data history of life, the universe and everything."
- Generated Photos, free dataset with AI-generated photos to help students and teachers with any research.
- GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
- GeoDa Center, geographical and spatial data.
- Google ngrams datasets, text from millions of books scanned by Google.
- Grain Market Research, financial data including stocks, futures, etc.
- HitCompanies Datasets, comprehensive data on a random sample of 10,000 UK companies from HitCompanies, updated automatically using AI/machine learning.
- ICWSM-2009 dataset, containing 44 million blog posts made between August 1st and October 1st, 2008.
- Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
- Investor Links, which includes financial data.
- KDD Cup center, with all data, tasks, and results.
- KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.
- Linking Open Data project, aimed at making data freely available to everyone.
- LoveTheSales data request page, free access to data for editors and academics to mine stats on the retail industry.
- Lyst Fashion Data Trends, tracking 10 million global fashion searches a month, freely accessible to academics as a research resource.
- MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
- ML Data, the data repository of the EU Pascal2 networks.
- NASDAQ Data Store, provides access to market data.
- National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- NetworkRepository: Interactive Data Repository, with many collections of graphs and networks from social science, machine learning, scientific computing, and other areas.
- Open Data Census, assesses the state of open data around the world.
- OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
- Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.
- PubGene(TM) Gene Database and Tools, a database of genomics-related publications.
- Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.
- SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
- Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.
- SourceForge.net Research Data, includes historical and status statistics on approximately 100,000 projects and the activities of over 1 million registered users at the project management web site.
- Sports Statistics, with data for Soccer, NBA, NFL, NHL, and more.
- StatLib, CMU Datasets Archive.
- Vhinny, provides fundamental financial information on the website and in .csv datasets for download.
- UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
- UCR Time Series Data Archive, offering datasets, papers, links, and code.
- UK Open Postcode Geo, UK/British postcodes with easting, northing, latitude, and longitude.
- Web Data Commons, structured data from the Common Crawl, the largest public web corpus.
- Wikiposit, a virtual amalgamation of mostly financial data from many different sites, allowing users to merge data from different sources.
- WorldData.AI, connect your data to many of the 3.5 billion WorldData datasets and improve your data science and machine learning models. Subscribe to KDnuggets to get free access to the Partners plan.
- Yahoo Sandbox datasets: Language, Graph, Ratings, Advertising and Marketing, and Competition.
- Yelp Dataset, a subset of Yelp businesses, reviews, and user data for personal, educational, and academic use.