Initial commit
research/utils/s3.md

# How to Download Needed Data from S3

## Install AWS CLI v2

Install AWS CLI v2 by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html.

On Linux x86_64, the installation comes down to:

    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
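
To confirm that the installation worked, one can check that `aws` is on the `PATH` and reports a 2.x version. The following is a minimal sketch; the helper name `check_aws_cli_v2` is hypothetical and not part of the AWS CLI.

```shell
# Hypothetical helper: succeeds only if `aws` is on PATH and reports version 2.x.
check_aws_cli_v2() {
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws not found on PATH" >&2
        return 1
    fi
    # `aws --version` prints a line that starts with "aws-cli/<version>".
    case "$(aws --version 2>&1)" in
        aws-cli/2.*) echo "AWS CLI v2 detected" ;;
        *) echo "aws found, but it is not v2" >&2; return 1 ;;
    esac
}
```

Calling `check_aws_cli_v2` after the install prints a confirmation, or returns a non-zero status if only an older CLI is present.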

## Configure SSO

Run the following command:

```console
aws configure sso
```

Example output:

```
(retrieval_scaling) (base) ➜ retrieval_scaling git:(main) ✗ aws configure sso
SSO session name (Recommended): yichuan
SSO start URL [None]: https://ucberkeley.awsapps.com/start#/
SSO region [None]: us-west-2
SSO registration scopes [sso:account:access]:
Attempting to automatically open the SSO authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:

https://oidc.us-west-2.amazonaws.com/authorize?response_type=code&client_id=i3YtHZTRneXEIApSyvdgSHVzLXdlc3QtMg&redirect_uri=http%3A%2F%2F127.0.0.1%3A37899%2Foauth%2Fcallback&state=5f52320e-0929-4e44-83c7-f6bd9b492010&code_challenge_method=S256&scopes=sso%3Aaccount%3Aaccess&code_challenge=HYnZ4Pc-tqI8CdJb6qEAR0LjI1_UjN-zln26lqJKeL8
The only AWS account available to you is: 976193267581
Using the account ID 976193267581
There are 2 roles available to you.
Using the role name "UCB-FederatedAdmins"
Default client Region [None]:
CLI default output format (json if not specified) [None]:
Profile name [UCB-FederatedAdmins-976193267581]:
To use this profile, specify the profile name using --profile, as shown:

aws sts get-caller-identity --profile UCB-FederatedAdmins-976193267581
```

After configuration, you must include `--profile UCB-FederatedAdmins-976193267581` with each AWS operation to use the SSO credentials.
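
To avoid repeating the flag, the AWS CLI also reads the profile name from the `AWS_PROFILE` environment variable. A small sketch, assuming the profile name shown in the example output:

```shell
# Select the SSO profile for every AWS CLI call made from this shell session.
export AWS_PROFILE=UCB-FederatedAdmins-976193267581

# Later commands then pick up the profile without --profile, for example:
#   aws sts get-caller-identity
```

The setting lasts only for the current shell; an explicit `--profile` on a command still takes precedence.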
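
## Refresh the SSO

If you encounter the error `Error when retrieving token from sso: Token has expired and refresh failed`, the SSO session has expired. Run `aws sso login --profile UCB-FederatedAdmins-976193267581` to renew it; rerunning `aws configure sso` also works, but it repeats the full setup.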
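
The refresh can be wrapped in a small helper that first checks whether the cached credentials still work and only then runs `aws sso login`, which renews the session without repeating the configuration prompts. This is a sketch: the helper name `ensure_sso_login` and the `PROFILE` variable are hypothetical.

```shell
PROFILE=UCB-FederatedAdmins-976193267581

# Hypothetical helper: log in again only when the cached SSO credentials no longer work.
ensure_sso_login() {
    # `aws sts get-caller-identity` fails when credentials are missing or expired.
    if aws sts get-caller-identity --profile "$PROFILE" >/dev/null 2>&1; then
        echo "SSO session is still valid"
    else
        aws sso login --profile "$PROFILE"
    fi
}
```

Running `ensure_sso_login` before a long download avoids starting a transfer with an expired token.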

## Download S3 Data

All data is stored in `s3://retrieval-scaling-out`, which includes 4 directories:

- embeddings/
- examples/
- indices/
- passages/
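
Before downloading, it can help to see what each directory contains and how large it is. The sketch below uses `aws s3 ls`; the helper name `list_prefix` and the `BUCKET` and `PROFILE` variables are hypothetical.

```shell
BUCKET=s3://retrieval-scaling-out
PROFILE=UCB-FederatedAdmins-976193267581

# Hypothetical helper: list every object under one top-level directory,
# with human-readable sizes and a total object count and size at the end.
list_prefix() {
    # $1 is a directory name such as "examples"
    aws s3 ls "$BUCKET/$1/" --profile "$PROFILE" --recursive --human-readable --summarize
}
```

For example, `list_prefix examples` shows the files under `examples/` along with their total size.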

Download the data using the AWS CLI. Copying the whole bucket requires `--recursive`; a single object can be copied directly:

```console
# Entire bucket
aws s3 cp s3://retrieval-scaling-out ~/scaling_out --recursive --profile UCB-FederatedAdmins-976193267581

# A single file
aws s3 cp s3://retrieval-scaling-out/examples/test_c4.jsonl ~/examples/scaling_out --profile UCB-FederatedAdmins-976193267581
```
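
When only one directory is needed, `aws s3 sync` is an alternative to `cp`: it skips files that already exist locally and are unchanged, so an interrupted download can be resumed by rerunning it. A minimal sketch, with the helper name `sync_dir` and the `BUCKET`, `PROFILE`, and `DEST` variables being hypothetical:

```shell
BUCKET=s3://retrieval-scaling-out
PROFILE=UCB-FederatedAdmins-976193267581
DEST="$HOME/scaling_out"

# Hypothetical helper: mirror one top-level directory into $DEST.
# Extra arguments (for example --dryrun) are passed through to `aws s3 sync`.
sync_dir() {
    dir="$1"
    shift
    aws s3 sync "$BUCKET/$dir/" "$DEST/$dir/" --profile "$PROFILE" "$@"
}
```

`sync_dir examples --dryrun` previews the transfer without downloading anything, and `sync_dir examples` performs it.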

### Faster Download Options

To accelerate downloads, you can try the following methods:

Tune the multipart settings. These are configuration values rather than `aws s3 cp` flags, so set them on the profile before running the recursive copy:

```console
aws configure set s3.multipart_threshold 128MB --profile UCB-FederatedAdmins-976193267581
aws configure set s3.multipart_chunksize 512MB --profile UCB-FederatedAdmins-976193267581
```
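
Configure higher concurrency on the same profile (the defaults are 10 concurrent requests and a queue size of 1000):

```console
aws configure set s3.max_concurrent_requests 50 --profile UCB-FederatedAdmins-976193267581
aws configure set s3.max_queue_size 10000 --profile UCB-FederatedAdmins-976193267581
```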

Use S3 Transfer Acceleration, which requires acceleration to be enabled on the bucket:

```console
aws s3 cp s3://retrieval-scaling-out ~/scaling_out --profile UCB-FederatedAdmins-976193267581 --recursive --endpoint-url https://s3-accelerate.amazonaws.com
```

Or use an alternative tool such as `s5cmd`. Quote the wildcard so that the shell passes it to `s5cmd` unexpanded:

```console
pip install s5cmd
s5cmd --profile UCB-FederatedAdmins-976193267581 cp 's3://retrieval-scaling-out/*' ~/scaling_out/
```