Metadata-Version: 2.1
Name: magi-dataset
Version: 1.0.4
Summary: Convenient access to massive corpus of GitHub repositories
Home-page: https://github.com/Enoch2090/magi_dataset
Author: Enoch2090
Author-email: ycgu2090@gmail.com
License: GPLv3
Platform: UNKNOWN
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
Requires-Dist: Markdown
Requires-Dist: PyGithub
Requires-Dist: beautifulsoup4
Requires-Dist: deep-translator
Requires-Dist: hn
Requires-Dist: langdetect
Requires-Dist: lxml
Requires-Dist: networkx
Requires-Dist: numpy (>=1.15.4)
Requires-Dist: pandas (>=1.2.0)
Requires-Dist: python-hn
Requires-Dist: requests
Requires-Dist: scipy
Requires-Dist: setuptools
Requires-Dist: spacy
Requires-Dist: tqdm


# MAGI Dataset

## Install
```shell
pip install magi_dataset
```
If you plan to use magi_dataset to crawl data periodically, set the following variable in your environment:

```shell
export GH_TOKEN="Your token"
```

Read [Creating a personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) for more information on creating a GitHub personal access token. If you use the default data without crawling new data, you can safely skip the token.
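
Before starting a crawl, it can help to confirm that the token is visible to Python. The following is a minimal sketch using only the standard library; the helper name `github_token_available` is hypothetical and not part of magi_dataset:

```python
import os

def github_token_available(env=None):
    """Return True if a non-empty GH_TOKEN exists in the given mapping.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return bool(env.get("GH_TOKEN", "").strip())

if not github_token_available():
    print("GH_TOKEN is not set; set it before crawling new data.")
```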

## Usage

The recommended way to use magi_dataset is to run the collection process in chunked mode. First create an empty dataset, then initialize the index from GitHub and collect data:

```python
>>> from magi_dataset import GitHubDataset

>>> github_dataset = GitHubDataset(
...     empty = True
... )

>>> github_dataset.init_repos(fully_initialize=True)
```

Download the default data (not guaranteed to be the newest):

```python
>>> from magi_dataset import GitHubDataset

>>> github_dataset = GitHubDataset(
...     empty = False
... )
```

The default data may be found at [Enoch2090](https://huggingface.co/Enoch2090)/[github_semantic_search](https://huggingface.co/datasets/Enoch2090/github_semantic_search/blob/main/list.json) on HuggingFace. We will update the data periodically.

After the dataset is created, access the data with either a numeric index:

```python
>>> github_dataset[5]
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)
```

Or the full name:

```python
>>> github_dataset['ytdl-org/youtube-dl']
GitHubRepo(name='ytdl-org/youtube-dl', stars=114798, description='Command-line program to download videos from YouTube.com and other video sites', _fully_initialized=True)
```
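
Supporting both lookups usually comes down to a `__getitem__` that dispatches on the key type. The sketch below only illustrates that pattern and is not the actual magi_dataset implementation; `Repo` and `RepoIndex` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Repo:
    # Hypothetical stand-in for a repository record.
    name: str
    stars: int = 0

class RepoIndex:
    """Container that can be indexed by position (int) or by full name (str)."""

    def __init__(self, repos):
        self._repos = list(repos)
        # Secondary lookup table keyed by the "owner/name" string.
        self._by_name = {repo.name: repo for repo in self._repos}

    def __len__(self):
        return len(self._repos)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._repos[key]
        if isinstance(key, str):
            return self._by_name[key]
        raise TypeError(f"unsupported index type: {type(key).__name__}")

index = RepoIndex([Repo("owner/first", 10), Repo("owner/second", 20)])
# Both lookups return the same object.
same = index[1] is index["owner/second"]
```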

You can access the corpus through the `readme` and `hn_comments` attributes of `GitHubRepo` objects:

```python
>>> github_dataset[5].readme[0:100]
'[![Build Status](https://github.com/ytdl-org/youtube-dl/workflows/CI/badge.svg)](https'
```
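
To build one text per repository from these two attributes, a small helper along the following lines may be enough. This is a sketch under assumptions: `repo_corpus` is a hypothetical helper, and it assumes that `readme` and `hn_comments` are strings (or can be converted with `str`), which this README does not state. The demo uses a stand-in object, not a real `GitHubRepo`:

```python
from types import SimpleNamespace

def repo_corpus(repo, sep="\n\n"):
    """Join the non-empty `readme` and `hn_comments` attributes into one string."""
    parts = [getattr(repo, "readme", None), getattr(repo, "hn_comments", None)]
    # Skip missing or empty parts; str() is an assumption about the attribute types.
    return sep.join(str(part) for part in parts if part)

# Stand-in object exposing the same attribute names as GitHubRepo.
demo = SimpleNamespace(readme="# Demo project", hn_comments="Nice tool.")
text = repo_corpus(demo)
```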

## Future Work

- The current idle handler design is primitive; we will switch to async pipelines to reduce the time spent sleeping.
- Elasticsearch database builder
- Pinecone database builder (wrapper only)
- Hash verification of persistence files


