No.029 Compact Data Structures for Big Data

Icon

NII Shonan Meeting Seminar 029


P1090816


Buddha


R1102645

Organizers

Kunihiko Sadakane (National Institute of Informatics), Japan

Wing-Kin Sung (National University of Singapore), Singapore

 

Participants

Rajeev Raman University of Leicester
Venkatesh Raman The Institute of Mathematical Sciences
Srinivasa Rao Satti Seoul National University
Ian Munro University of Waterloo
Moshe Lewenstein Bar-Ilan University
Gonzalo Navarro University of Chile
Francisco Claude Universidad Diego Portales / Akori
Travis Gagie University of Helsinki
Simon Gog The University of Melbourne
Jesper Larsson IT University of Copenhagen
Sebastiano Vigna Università degli Studi di Milano
Giuseppe Ottaviano University of Pisa
Ankur Gupta Butler University
Hiroshi Sakamoto Kyutech
Tetsuo Shibuya University of Tokyo
Taku Onodera University of Tokyo
Takuya Akiba University of Tokyo
Hiroki Arimura Hokkaido University
Shin-ichi Minato Hokkaido University
Takuya Kida Hokkaido University
Shuhei Denzumi Hokkaido University
Yasuo Tabei Japan Science and Technology Agency
Koji Tsuda AIST
Anish Shrestha Department of Computational Biology, University of Tokyo
Martin Frith CBRC, AIST
Takeaki Uno NII
Alexander Bowe NII
Atsushi Koike NII
Wing-Kin Sung National University of Singapore
Kunihiko Sadakane NII

Schedule

Arrival Day ? Sep 26

15:00 – 19:00 Check-in

19:00 ‐21:30??Welcome Reception

Day 1 ? Sep 27

7:30 ‐9:00??Breakfast

9:00 ‐10:30??Session 1

Ian Munro Succinct data structures for representing equivalence classes
Rajeev Raman Encodings for top-k and range selection
Moshe Lewenstein Two Dimensional Range Minimum Queries and Fibonacci Lattices

11:00 ‐12:00??Session 2

Sebastiano Vigna Quasi-Succinct Indices
Simon Gog Integer Alphabet-based Self-Indexes at Terabyte Scale

 

12:00 – 14:00 Lunch

14:00 – 15:40 Session 3

Travis Gagie An Alignment-Based Index for Genomic Datasets
Martin Frith Bio-sequence similarity search with spaced suffix arrays and subset suffix arrays
Alexander Bowe Succinct de Bruijn Graphs
Taku Onodera Detecting Superbubbles in Assembly Graphs

 

16:00 – 18:00 Session 4 ? ? Discussions

18:00 – 19:30 Dinner

 

Day 2 ? Sep 28

7:30 ‐9:00??Breakfast

9:00 ‐11:00??Session 5

Srinivasa Rao Satti Selection from Read-Only Memory with Limited Workspace
Venkatesh Raman Improved Selection Algorithms for Integers in Read-only Memory and Restore Models
Ankur Gupta Online Multiselection
Gonzalo Navarro Document Retrieval on General Sequences

11:30 ‐12:30??Session 6

Francisco Claude Adaptive Data Structures for Permutations and Binary Relations
Giuseppe Ottaviano Compressed tries and top-k string completion

12:30 – 14:00 Lunch

14:00 -?16:20 Session 7

Anish Shrestha New Challenges to Processing DNA Data from Modern-day Sequencers
Yasuo Tabei Succinct data structures for scalable similarity search in ChemBioinformatics
Tetsuo Shibuya Fast Indexing Method for Protein 3-D Structure Searching
Jesper Larsson Encoding and modeling for set compression
Takuya Akiba Fast Exact Shortest-Path Distance Queries on Large Networks by Pruned?Landmark Labeling

16:20 – 18:00 Session 8 ? ? ? Discussions

18:00 – 19:30 Dinner

Day 3 ? Sep 29

7:30 ‐9:00??Breakfast

9:00 ‐11:00??Session 9

Hiroshi Sakamoto An application of stream compression
Shirou Maruyama Fully-Online Grammar Compression
Hiroki Arimura Faster Broad-Word Pattern Matching Algorithms for?Regular Expressions and Trees
Takuya Kida Data Compression using Variable-to-Fixed Length Codes

11:00 – 11:30 Group Photo Shooting

11:30 ‐12:30 ?Session 10

Shin-ichi Minato ZDD-Based Representation for Large-Scale Sparse Datasets?and Z-Skip-Links for Fast Traversal

12:30 – 13:30 Lunch

13:30 -?19:00 Excursion to Kamakura

19:00 – 21:30 Banquet

 

Day 4 ? Sep 30

7:30 ‐9:00??Breakfast

( ? ? ? ? ?- 10:00 ?Check-out)

9:00 ‐10:30??Session 11

Shuhei Denzumi DenseZDD: A Fast and Compact Data Structure for Family of Sets &PathSeqBDD:?A DAG Index based on Sequence BDD
Koji Tsuda Enumeration Algorithms and Statistical Significance
Takeaki Uno Similarity based Approach for Compression of Noisy Data

10:30 ‐12:00??Session 12 ? ? Disucssions

12:00 – 13:30 Lunch

13:30 ? ? ? ? ? ? ? Dismiss

 

Overview

Big Data are structured and unstructured datasets whose size is in the order of billions?or trillions. Because of their diversity and size, it is difficult to store, search and?analyze them. This meeting therefore focuses on algorithms and data structures for?efficient manipulation of Big Data. Especially, the meeting is devoted to compact data?structures for managing Big Data.

Typical examples of big data are genomic sequences and gene expression data, Web and?SNS data, sensor data in intelligent transport systems, etc. Traditional data structures?do not scale to handle such data, and therefore we should design new data structures to?handle them.

 

Although the amount of data explodes, the amount of the underlying information inside?the data may not be exploded. It is observed that many big datasets are redundant. In?the Web, many webpages were copies of others. In global positioning system (GPS),?GPS position data change continuously, which can be compressed using differential?encoding. In genomics, although different individuals have different genomes, the?individual genomes have highly similarity. Therefore we can compress such data by?identifying the similar parts. After the data is compressed, other issues are how to?access and search them efficiently. Traditional data structures are not designed to?handle compressed data and they may not manipulate Big Data well because the size of?the data structures exceeds the limit of memory usage, or searching time increases due?to their size. To handle these problems, researchers have worked on developing?compact data structures. Such data structures are also called compressed or succinct?data structures. They are much smaller than standard data structures, while keeping?the same access time to data in theory. However actual performance of such compact?data structures for storing Big Data is unknown or unsatisfactory.

The aim of this workshop is to bring together researchers active in the areas of compact?data structures to exchange ideas for handling Big Data. We will discuss methods for?compressing and storing Big Data. We will also discuss how to design time- and?space-efficient data structures for them Through discussion and sharing knowledge, we?hope to promote collaborations and further improve data structures for Big Data.

Railway Time Table

Narita Express (NEX)

From Tokyo Narita airport (NRT) to the Shonan Village, you can take JR trains. ?Some local trains (Soubu and Yokosuka lines) directly go to Zushi station, where you can take a bus to the village. ?It takes two hours and fifteen minutes from the airport to Zushi station. ?The trains have toilets and first-class cars (GREEN CAR).

Narita Express (NEX) is faster than local trains, but you have to change the train to local (Yokosuka) line. ?You can change trains at Tokyo or Musashi-Kosugi station. ?Trains will depart from the same platform as NEX at the transfer stations. ?Please pay attention to the destination of the trains; you can take trains bound for only Zushi, Yokosuka, or Kurihama?station.

JR Soubu/Yokosuka Line

The following table shows connections of trains and buses. ?Each row represents a journey. ?All trains leave Narita airport station (Terminal 1), then stop at Airport Terminal 2 station. ?They also stop at Tokyo and Musashi-Kosugi. ?However their arrival and departure times are not shown in the table if train change is not necessary. ?Prices are for trains and they do not include the bus fare (340 yen). ?Note that you should buy a train ticket from the airport to Zushi station (2,210 yen) even if you change trains on the way. ?In the table, “rush at Zushi” means that you may not have enough time to catch the bus, and “bad connection” means that you have to wait for more than 30 minutes for the next bus.

You can also find other routes using airport limousine buses etc. using Google maps.

NRT T1 NRT T2 Tokyo Musashi-Kosugi Zushi Shonan
dep. dep. arr. dep. arr. dep. arr. dep. arr. Time Price Remarks
9:02 9:04 11:48 12:02 12:32 3:30 2,210
NEX 10 9:15 9:19 10:38 10:44 11:30 12:02 12:32 3:17 3,670 ? bad connection
NEX 12 9:45 9:48 11:02 11:07 11:48 12:02 12:32 2:47 3,670
9:59 10:02 12:48 13:01 13:31 3:32 2,210
NEX 14 10:15 10:18 11:32 11:37 12:17 13:01 13:31 3:16 3,670 ? bad connection
NEX 16 10:45 10:48 12:02 12:06 12:48 13:01 13:31 2:46 3,670
10:59 11:01 13:48 14:04 14:34 3:35 2,210
NEX 18 11:14 11:18 12:32 12:37 13:19 14:04 14:34 3:20 3,670 ? bad connection
NEX 20 11:45 11:48 12:44 12:53 13:58 14:04 14:34 2:49 3,670 ? rush at Zushi
11:57 12:00 14:48 14:52 15:22 3:25 2,210 ? rush at Zushi
NEX 22 12:19 12:22 13:32 13:37 14:19 14:52 15:22 3:03 3,670
12:27 12:29 15:08 15:38 16:08 3:41 2,210
12:59 13:01 15:42 16:37 17:07 4:08 2,210 ? bad connection
NEX 24 13:14 13:17 14:32 14:37 15:19 15:38 16:08 2:54 3,670
NEX 26 13:45 13:48 15:02 15:06 15:48 16:37 17:07 3:22 3,670 ? bad connection
NEX 28 14:18 14:20 15:14 15:24 16:29 16:37 17:07 2:49 3,670
14:31 14:33 17:09 17:49 18:19 3:48 2,210 ? bad connection
NEX 30 14:44 14:48 16:02 16:11 17:00 17:05 17:35 2:51 3,670 ? rush at Zushi
14:58 15:00 17:42 17:49 18:19 3:21 2,210
NEX 32 15:14 15:18 16:32 16:37 17:22 17:49 18:19 3:05 3,670
NEX 34 15:44 15:47 17:02 17:06 17:48 18:34 19:04 3:20 3,670 ? bad connection
15:57 16:00 18:42 19:02 19:32 3:35 2,210
NEX 36 16:18 16:21 17:38 17:45 18:30 18:34 19:04 2:46 3,670 ? rush at Zushi
16:32 16:35 19:20 19:31 20:01 3:29 2,210
NEX 38 16:44 16:48 18:02 18:06 18:49 19:02 19:32 2:48 3,670
16:57 17:00 19:47 20:20 20:50 3:53 2,210
NEX 40 17:16 17:19 18:37 18:42 19:25 19:31 20:01 2:45 3,670
NEX 42 17:44 17:47 19:06 19:13 19:55 20:20 20:50 3:06 3,670
17:59 18:02 20:48 21:10 21:40 3:41 2,210
NEX 44 18:15 18:19 19:36 19:43 20:25 20:45 21:15 3:00 3,670
18:30 18:33 21:18 21:41 22:11 3:41 2,210
NEX 46 18:48 18:52 20:09 20:16 20:58 21:10 21:40 2:52 3,670
19:02 19:04 21:51 22:17 22:47 3:45 2,210
NEX 48 19:12 19:15 20:32 20:36 21:18 21:41 22:11 2:59 3,670
19:35 19:37 22:13 22:17 22:47 3:12 2,210
NEX 50 19:46 19:49 20:51 21:00 22:00 22:17 22:47 3:01 3,670